GoFuckYourself.com - Adult Webmaster Forum - Tech 4chan /b/, 2 days, 56 GB, 121830 media files.

- Fucking Around & Business Discussion (https://gfy.com/forumdisplay.php?f=26)

- - Tech 4chan /b/, 2 days, 56 GB, 121830 media files. (https://gfy.com/showthread.php?t=1189650)

4chan /b/, 2 days, 56 GB, 121830 media files.

.jpeg, .png, .gif, .webm

I'm starting to realize the power of well-thought-out bots.

Before anyone gets on me about being liberal with scraping and content: I thought it would be awesome to hash the images in each format and build a database of those hashes, then find out which images are popular (they would appear multiple times) and see whether they came from a sponsor of some kind.
Google's image search does something similar, but this could be useful for affiliates and/or programs.
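A rough sketch of the hash-and-count idea, not the actual bot. It assumes the files are already on disk and uses plain SHA-256, so it only catches byte-identical copies; a re-encoded or resized image would need some kind of perceptual hash instead. The names `count_hashes` and `most_common_files` are made up for this example, and it is written for Python 3.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Extensions mentioned in the post, plus .jpg as an assumed variant.
MEDIA_EXTENSIONS = {".jpeg", ".jpg", ".png", ".gif", ".webm"}


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_hashes(root):
    """Count how often each exact file content appears anywhere under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in MEDIA_EXTENSIONS:
            counts[file_sha256(path)] += 1
    return counts


def most_common_files(root, top=10):
    """Return the `top` most repeated digests as (digest, count) pairs."""
    return count_hashes(root).most_common(top)
```

Anything with a count above 1 showed up in more than one place, which is the "popular" signal described above.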

Meaning? Perhaps it would be a good idea to promote them?

This is my bot-sourced, crowd-sourced way of telling what's hot without anyone directly telling me. Weeeeeeeeeee!!!!!!

Simple script. It's brute force, which means that if a thread is seen again it will overwrite all the files again; it's probably pulled 200GB of files. Will modify in the future.
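One possible way to stop the re-downloading, sketched in Python 3 rather than the Python 2 of the bot itself: check whether the file is already on disk before fetching it. `download_if_missing` is a hypothetical helper, not part of the posted script, and the `fetch` parameter only exists so the function can be tried without touching the network.

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless a non-empty file is already there.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # Write to a temporary name first so an interrupted run does not leave
    # a truncated file that would be skipped next time.
    tmp_path = dest_path + ".part"
    fetch(url, tmp_path)
    os.replace(tmp_path, dest_path)
    return True
```

With a check like this, a thread that is seen again on the next cron run would only cost bandwidth for files that were not there before.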

Cool stuff, you are happy that I shared!!!!! :)

Other idea is setting up something where, if you have an image and can't quite figure out the rest of the series, you search its hash and it pulls up a thread that just may have those images.
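The lookup idea could look roughly like this in Python 3. It assumes one folder per thread ID under a common root (the `thread/<id>/<file>` layout the bot writes) and exact SHA-256 matches only; `build_index` and `threads_containing` are names invented for the sketch.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def build_index(thread_root):
    """Map SHA-256 hex digest -> sorted list of thread IDs holding that file."""
    index = defaultdict(set)
    # Expects thread_root/<thread id>/<file>; reads each file fully into memory.
    for path in Path(thread_root).glob("*/*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            index[digest].add(path.parent.name)
    return {digest: sorted(ids) for digest, ids in index.items()}


def threads_containing(index, image_path):
    """Return the thread IDs whose folder holds a byte-identical copy."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    return index.get(digest, [])
```

Given one image from a series, `threads_containing` returns the saved threads it appeared in, and the rest of the series may be sitting in those folders.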

FUCKING AWESOME IDEAS!!!!

http://i.imgur.com/DKfVTRn.png

Lapd.mobi seems to scrape 4chan threads. I searched an image and it came up with it.

Thought you were going to share the bot.
Like your stuff though, intelligent.

Quote:

Originally Posted by johnnyloadproductions (Post 20796042)

that's fucking awesome bro! now how much cp you got on your pc?

Quote:

Originally Posted by clickity click (Post 20796452)

Thought you were going to share the bot.

Here. You can run this on Digital Ocean or any PC in your house; you'll just need to have Python installed and run a cron job.

I have a cron job set up to run every 10 minutes. It parses the front-page threads, then follows them and scrapes the media in those threads.

Set up a cron job:
crontab -e
or
sudo crontab -e

then at the bottom of the file input:
*/10 * * * * python /home/4chan/main.py
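To see what a step value like `*/10` actually expands to, here is a tiny Python 3 sketch. `expand_step_field` is a made-up helper that only understands `*`, a single number, and `*/N`; real cron accepts more forms (ranges, lists) that are ignored here.

```python
def expand_step_field(field, low, high):
    """Expand a cron field of the form '*', 'N' or '*/N' into matching values."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        # The step counts from the lowest value the field allows.
        return list(range(low, high + 1, step))
    return [int(field)]


# Minute field of the entry above: minutes run from 0 to 59.
print(expand_step_field("*/10", 0, 59))  # [0, 10, 20, 30, 40, 50]
```

So the job starts six times an hour, on the hour and every ten minutes after it.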

Code:

# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
0 0 */2 * * python /home/craigslist/Rasp1/final_craigslist.py
*/10 * * * * python /home/4chan/main.py

Code:

# Python 2 script (urllib2, print statements); needs bs4 and lxml installed.
from bs4 import BeautifulSoup
import urllib
import urllib2
import os
import time

start = time.time()


def get_threads(url, board, dir):
    # Fetch the board's front page and collect each thread's reply link.
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read(), "lxml")
    links = soup.find_all("a", attrs={"class": "replylink"})

    for link in links:
        # The href is expected to look like "thread/<id>", so index 1 is the id.
        link_string = link['href']
        thread_id = link_string.split('/')
        print thread_id[1]
        thread_url = "http://boards.4chan.org/" + dir + "/thread/" + thread_id[1]
        thread_response = urllib2.urlopen(thread_url)

        # Media files in the thread are linked from <a class="fileThumb"> tags.
        image_urls = BeautifulSoup(thread_response.read(), "lxml")
        images = image_urls.find_all("a", attrs={"class": "fileThumb"})
        # Change this to the directory you want to save to. This was for a USB drive.
        # dirname() drops the trailing thread id, leaving "/media/4chan".
        directory = os.path.dirname("/media/4chan/" + thread_id[1])

        if not os.path.exists(directory + "/thread/" + thread_id[1]):
            os.makedirs(directory + "/thread/" + thread_id[1])
        for image in images:
            # The href has no scheme, so "http:" is prepended; the part after
            # "/b/" is used as the local file name.
            string = image['href']
            one = string.split('/b/')
            urllib.urlretrieve("http:" + image['href'], directory + "/thread/" + thread_id[1] + "/" + one[1])


prepend = ["boards",]
append = ['b',]

for dir in append:
    for board in prepend:
        print board
        url = "http://{}.4chan.org/{}".format(board, dir)
        print "This is the directory: " + dir
        get_threads(url, board, dir)

# Report how long the whole run took, in seconds.
end = time.time()
print(end - start)

that's a big order of cheese pizza.

I have Py installed, I will try to get it working if I have time.

Quote:

Originally Posted by xXXtesy10 (Post 20796458)

that's fucking awesome bro! now how much cp you got on your pc?

Inb4 the cops are knocking on johnny's door...again.

Quote:

Originally Posted by MrBottomTooth (Post 20797349)

No shit. That's the last place I'd be scraping images from.

I wonder if Johnny was hangin at the local college library scraping /b/ @ 2am via a bought student id :1orglaugh

What happened to the bot?

Quote:

Originally Posted by MrBottomTooth (Post 20797349)

No shit. That's the last place I'd be scraping images from.

:1orglaugh:1orglaugh:1orglaugh:thumbsup

the bro thing to do now is put up a paywall and charge for those... list every adult producer as 2257 page.