03-23-2016, 03:08 PM
johnnyloadproductions
Quote:
Originally Posted by clickity click
Thought you were going to share the bot.
Here. You can run this on DigitalOcean or any PC in your house; you just need Python 2 installed (the script uses urllib2) and a cron job to run it.
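The script also needs BeautifulSoup with the lxml parser, so install those first (assuming you have pip):
pip install beautifulsoup4 lxml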

I have a cron job set up to run every 10 minutes; it parses the front-page threads, then follows each thread and scrapes the media posted in it.

Set up a cron job:
crontab -e
or
sudo crontab -e

then add this line at the bottom of the file:
*/10 * * * * python /home/4chan/main.py

Code:
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
0 0 */2 * * python /home/craigslist/Rasp1/final_craigslist.py
*/10 * * * * python /home/4chan/main.py
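
As the comments in that file note, cron emails any output (including Python errors) to the crontab's owner unless you redirect it. If you'd rather append it to a log file, a line like this works (the log path is just an example):

Code:
*/10 * * * * python /home/4chan/main.py >> /var/log/4chan_scraper.log 2>&1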


Code:
# 4chan front-page scraper (Python 2). Grabs every thread linked from the
# board index and downloads the full-size images posted in each one.
from bs4 import BeautifulSoup
import urllib
import urllib2
import os
import time

start = time.time()


def get_threads(url, board, dir):
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read(), "lxml")

    # "Reply" links look like "thread/123456789/slug"; the second path
    # segment is the numeric thread id.
    links = soup.find_all("a", attrs={"class": "replylink"})

    for link in links:
        link_string = link['href']
        thread_id = link_string.split('/')
        print thread_id[1]
        thread_url = "http://boards.4chan.org/" + dir + "/thread/" + thread_id[1]
        thread_response = urllib2.urlopen(thread_url)

        image_urls = BeautifulSoup(thread_response.read(), "lxml")
        images = image_urls.find_all("a", attrs={"class": "fileThumb"})
        # Change this to the directory you want to save to; this one was a USB drive.
        directory = os.path.dirname("/media/4chan/" + thread_id[1])

        if not os.path.exists(directory + "/thread/" + thread_id[1]):
            os.makedirs(directory + "/thread/" + thread_id[1])
        # Image links are protocol-relative ("//i.4cdn.org/b/<file>.jpg").
        # Note the split on '/b/' is hardcoded to the /b/ board.
        for image in images:
            string = image['href']
            one = string.split('/b/')
            urllib.urlretrieve("http:" + image['href'], directory + "/thread/" + thread_id[1] + "/" + one[1])


# "prepend" is the subdomain, "append" is the board; add more boards to
# the list to scrape more than /b/.
prepend = ["boards",]
append = ['b',]

for dir in append:
    for board in prepend:
        print board
        url = "http://{}.4chan.org/{}".format(board, dir)
        print "This is the directory: " + dir
        get_threads(url, board, dir)

end = time.time()
print(end - start)
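
The script above is Python 2 (urllib2 and print statements). If you're running Python 3, here's a minimal port sketch using only the standard library plus BeautifulSoup; the save path and board list are assumptions, so change them for your own setup:

Code:
# Minimal Python 3 port sketch of the scraper above. SAVE_ROOT and BOARDS
# are assumptions; point them at your own disk and boards.
import os
import time
import urllib.request

from bs4 import BeautifulSoup

SAVE_ROOT = "/media/4chan"  # assumption: wherever you want the images saved
BOARDS = ["b"]


def get_threads(board):
    # Fetch the board index and pull the numeric thread id out of every
    # "Reply" link ("thread/123456789/slug").
    index_url = "http://boards.4chan.org/{}/".format(board)
    with urllib.request.urlopen(index_url) as response:
        soup = BeautifulSoup(response.read(), "lxml")

    for link in soup.find_all("a", attrs={"class": "replylink"}):
        thread_id = link["href"].split("/")[1]
        print(thread_id)

        thread_url = "http://boards.4chan.org/{}/thread/{}".format(board, thread_id)
        with urllib.request.urlopen(thread_url) as thread_response:
            thread_soup = BeautifulSoup(thread_response.read(), "lxml")

        save_dir = os.path.join(SAVE_ROOT, "thread", thread_id)
        os.makedirs(save_dir, exist_ok=True)

        # Full-size image links are protocol-relative
        # ("//i.4cdn.org/b/<file>.jpg"); keep just the filename.
        for image in thread_soup.find_all("a", attrs={"class": "fileThumb"}):
            href = image["href"]
            filename = href.rsplit("/", 1)[-1]
            urllib.request.urlretrieve("http:" + href,
                                       os.path.join(save_dir, filename))


start = time.time()
for board in BOARDS:
    get_threads(board)
print(time.time() - start)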