GoFuckYourself.com - Adult Webmaster Forum

Thread: Tech 4chan /b/, 2 days, 56 GB, 121830 media files. (https://gfy.com/showthread.php?t=1189650)

johnnyloadproductions 03-23-2016 11:18 AM

4chan /b/, 2 days, 56 GB, 121830 media files.
 
.jpeg, .png, .gif, .webm

I'm starting to realize the power of well-thought-out bots.

Before anyone gets on me about being liberal with scraping and content: I thought it would be awesome to hash images in multiple formats and build a database of those hashes, then find out which images are popular (they would appear multiple times) and see whether they came from a sponsor of some kind.
Google's image search works along similar lines, but this could be useful for affiliates and/or programs.
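
A minimal sketch of the hashing idea, written in Python 3 and not taken from the bot itself. The names `file_sha256`, `count_duplicates`, and `MEDIA_SUFFIXES` are made up for illustration. It uses SHA-256 over the raw bytes, so it only counts byte-identical copies; the same picture re-encoded in another format would need a perceptual hash, which is not shown here.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Assumed set of media extensions, based on the formats listed in the post.
MEDIA_SUFFIXES = {".jpeg", ".jpg", ".png", ".gif", ".webm"}


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_duplicates(root):
    """Count how many times each distinct file content appears under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in MEDIA_SUFFIXES:
            counts[file_sha256(path)] += 1
    return counts
```

Something like `count_duplicates("/media/4chan").most_common(20)` would then list the most repeated files, assuming the download root used later in the thread.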

Meaning? Perhaps it's a good idea to promote them?

This is my bot-sourced, crowd-sourced way of telling what's hot without anyone directly telling me. Weeeeeeeeeee!!!!!!

Simple script. It's brute force, which means that if a thread is seen again it will overwrite all the files again; it's probably pulled 200 GB of files. Will modify in the future.
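
One possible way to avoid the re-download, sketched in Python 3 under assumptions: `download_if_missing` is a hypothetical helper, not part of the posted script, and it treats any existing file at the destination path as already complete.

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless a file is already there.

    Returns True if a download happened, False if it was skipped.
    The fetch parameter exists only so the function can be tested
    without touching the network.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # Write to a temporary name first so an interrupted run does not leave
    # a partial file that later runs would mistake for a finished download.
    partial_path = dest_path + ".part"
    fetch(url, partial_path)
    os.replace(partial_path, dest_path)
    return True
```

With this in place, a thread that shows up again on the front page would only cost the requests for files that are new since the last run.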

Cool stuff, you are happy that I shared!!!!! :)

Another idea is setting up something where, if you have an image and can't quite figure out the rest of the series, you search its hash and it pulls up a thread that just may have those images.
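
A rough Python 3 sketch of that lookup, assuming files are saved as `<root>/thread/<thread_id>/<filename>`. The helper names `build_hash_index` and `threads_containing` are hypothetical, and the match is on exact file bytes only.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def build_hash_index(root):
    """Map each file's SHA-256 digest to the set of thread IDs containing it.

    Assumes the layout <root>/thread/<thread_id>/<filename>.
    """
    index = defaultdict(set)
    for path in Path(root).glob("thread/*/*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            index[digest].add(path.parent.name)
    return index


def threads_containing(index, image_path):
    """Return the sorted thread IDs holding a byte-identical copy of image_path."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    return sorted(index.get(digest, set()))
```

For a large archive the index would probably live in a database rather than in memory, but the mapping from hash to thread IDs stays the same.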

FUCKING AWESOME IDEAS!!!!

http://i.imgur.com/DKfVTRn.png

gnawledge 03-23-2016 02:20 PM

Lapd.mobi seems to scrape 4chan threads. I searched for an image and it came up there.

clickity click 03-23-2016 02:35 PM

Thought you were going to share the bot.
Like your stuff though, intelligent.

xXXtesy10 03-23-2016 02:36 PM

Quote:

Originally Posted by johnnyloadproductions (Post 20796042)
.jpeg, .png, .gif, .webm

[...]

that's fucking awesome bro! now how much cp you got on your pc?

johnnyloadproductions 03-23-2016 03:08 PM

Quote:

Originally Posted by clickity click (Post 20796452)
Thought you were going to share the bot.

Here. You can run this on DigitalOcean or any PC in your house; you'll just need Python installed and a cron job to run it.

I have a cron job set up to run every 10 minutes; it parses the front-page threads, then follows them and scrapes the media in those threads.

Set up a cron job:
crontab -e
or
sudo crontab -e

then at the bottom of the file, add:
*/10 * * * * python /home/4chan/main.py

Code:

# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow  command
0 0 */2 * * python /home/craigslist/Rasp1/final_craigslist.py
*/10 * * * * python /home/4chan/main.py



Code:

from bs4 import BeautifulSoup
import urllib
import urllib2
import os

import time

start = time.time()

def get_threads(url, board, dir):


    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read(), "lxml")

    links = soup.find_all("a", attrs={"class": "replylink"})

    for link in links:
        link_string = link['href']
        thread_id = link_string.split('/')
        print thread_id[1]
        thread_url = "http://boards.4chan.org/" + dir + "/thread/" + thread_id[1]
        thread_response = urllib2.urlopen(thread_url)

        image_urls = BeautifulSoup(thread_response.read(), "lxml")
        images = image_urls.find_all("a", attrs={"class": "fileThumb"})
        # Change this to the directory you want to save to. This one was for a USB drive.
        directory = os.path.dirname("/media/4chan/" + thread_id[1])


        if not os.path.exists(directory + "/thread/" + thread_id[1]):
            os.makedirs(directory + "/thread/" + thread_id[1])
        for image in images:
            string = image['href']
            one = string.split('/b/')
            urllib.urlretrieve("http:" + image['href'], directory + "/thread/" + thread_id[1] + "/" + one[1])



prepend = ["boards",]
append = ['b',]

for dir in append:
    for board in prepend:
        print board
        url = "http://{}.4chan.org/{}".format(board, dir)
        print "This is the directory: " + dir
        get_threads(url, board, dir)


end = time.time()
print(end - start)
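
The script above pulls the thread number with `split('/')` and the file name with `split('/b/')`, which ties it to one board. A small Python 3 sketch of a board-agnostic version follows; the helper names are hypothetical, and the href shapes in the docstrings are assumptions inferred from how the script builds its URLs.

```python
import posixpath
from urllib.parse import urlparse


def thread_id_from_href(href):
    """Extract the thread number from a reply link such as 'thread/123456789'.

    Also tolerates a trailing slug, e.g. 'thread/123456789/some-title'.
    """
    parts = href.strip("/").split("/")
    return parts[parts.index("thread") + 1]


def filename_from_href(href):
    """Return the last path component of a media link, whatever the board.

    Works for protocol-relative links of the form '//host/board/file.ext'.
    """
    return posixpath.basename(urlparse(href).path)
```

These could replace the two `split` calls if more boards are ever added to the `append` list.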


ITraffic 03-23-2016 03:12 PM

that's a big order of cheese pizza.

clickity click 03-23-2016 03:23 PM

I have Py installed, I will try to get it working if I have time.

OneHungLo 03-23-2016 11:03 PM

Quote:

Originally Posted by xXXtesy10 (Post 20796458)
that's fucking awesome bro! now how much cp you got on your pc?

Inb4 the cops are knocking on johnny's door...again.

MrBottomTooth 03-23-2016 11:21 PM

No shit. That's the last place I'd be scraping images from.

OneHungLo 03-23-2016 11:35 PM

Quote:

Originally Posted by MrBottomTooth (Post 20797349)
No shit. That's the last place I'd be scraping images from.

I wonder if Johnny was hangin at the local college library scraping /b/ @ 2am via a bought student id :1orglaugh

OneHungLo 03-24-2016 10:47 AM

What happened to the bot?

xXXtesy10 03-24-2016 10:49 AM

Quote:

Originally Posted by MrBottomTooth (Post 20797349)
No shit. That's the last place I'd be scraping images from.

:1orglaugh:1orglaugh:1orglaugh:thumbsup

the bro thing to do now is put up a paywall and charge for those... list every adult producer on the 2257 page.

bns666 03-24-2016 10:51 AM

nice :thumbsup


All times are GMT -7.