Quote:
Originally Posted by clickity click
Thought you were going to share the bot.
Here. You can run this on DigitalOcean or any PC in your house; you'll just need Python installed and a cron job.
I have a cron job set up to run every 10 minutes; it parses the front-page threads, then follows each thread and scrapes its media.
Set up a cron job:
crontab -e
or
sudo crontab -e
then add this line at the bottom of the file:
*/10 * * * * python /home/4chan/main.py
Code:
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
0 0 */2 * * python /home/craigslist/Rasp1/final_craigslist.py
*/10 * * * * python /home/4chan/main.py
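One gotcha: cron runs with a minimal environment, so if the job never seems to fire, use an absolute path to the interpreter and redirect output to a log file so you can actually see errors. A variant of the entry above (the /usr/bin/python and log paths are assumptions; check yours with "which python"):
Code:
# same schedule, but with an absolute interpreter path and stdout/stderr logged
*/10 * * * * /usr/bin/python /home/4chan/main.py >> /tmp/4chan_scraper.log 2>&1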
Code:
from bs4 import BeautifulSoup
import urllib
import urllib2
import os
import time

start = time.time()


def get_threads(url, board, dir):
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read(), "lxml")
    links = soup.find_all("a", attrs={"class": "replylink"})
    for link in links:
        link_string = link['href']
        # hrefs look like "thread/123456789/subject"; index 1 is the thread id
        thread_id = link_string.split('/')
        print thread_id[1]
        thread_url = "http://boards.4chan.org/" + dir + "/thread/" + thread_id[1]
        thread_response = urllib2.urlopen(thread_url)
        image_urls = BeautifulSoup(thread_response.read(), "lxml")
        images = image_urls.find_all("a", attrs={"class": "fileThumb"})
        # Change this to the directory you want to save to. This was for a USB drive.
        directory = os.path.dirname("/media/4chan/" + thread_id[1])
        if not os.path.exists(directory + "/thread/" + thread_id[1]):
            os.makedirs(directory + "/thread/" + thread_id[1])
        for image in images:
            string = image['href']
            # split on the board segment so we keep just the filename
            one = string.split('/' + dir + '/')
            urllib.urlretrieve("http:" + image['href'],
                               directory + "/thread/" + thread_id[1] + "/" + one[1])


prepend = ["boards"]
append = ['b']
for dir in append:
    for board in prepend:
        print board
        url = "http://{}.4chan.org/{}".format(board, dir)
        print "This is the directory: " + dir
        get_threads(url, board, dir)

end = time.time()
print(end - start)  # elapsed seconds for this run
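Note the script is Python 2 (urllib2 doesn't exist on Python 3). If your box only has Python 3, here's a rough sketch of the same fetch-and-save loop using the standard library's urllib.request and the same CSS classes; it's untested against the live site, so treat it as a starting point rather than a drop-in replacement:
Code:
# Python 3 sketch of the same scraper; assumes bs4 and lxml are installed
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
import os

def get_threads(url, section):
    soup = BeautifulSoup(urlopen(url).read(), "lxml")
    for link in soup.find_all("a", attrs={"class": "replylink"}):
        # hrefs look like "thread/123456789/subject"; index 1 is the thread id
        thread_id = link["href"].split("/")[1]
        thread_url = "http://boards.4chan.org/{}/thread/{}".format(section, thread_id)
        thread_soup = BeautifulSoup(urlopen(thread_url).read(), "lxml")
        save_dir = os.path.join("/media/4chan/thread", thread_id)
        os.makedirs(save_dir, exist_ok=True)  # no-op if the directory already exists
        for image in thread_soup.find_all("a", attrs={"class": "fileThumb"}):
            # image hrefs are protocol-relative; keep just the filename after the board segment
            filename = image["href"].split("/" + section + "/")[1]
            urlretrieve("http:" + image["href"], os.path.join(save_dir, filename))

get_threads("http://boards.4chan.org/b", "b")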