4chan /b/, 2 days, 56 GB, 121830 media files.
.jpeg, .png, .gif, .webm
I'm starting to realize the power of well-thought-out bots. Before anyone gets on me about being liberal with scraping and content: I thought it would be awesome to hash images in multiple formats and build a database of those hashes, then find out which images are popular (they would appear multiple times) and see whether they come from a sponsor of some kind. Google does something like this with its image search, but this could be useful for affiliates and/or programs. Meaning? Perhaps those are good ones to promote. This is my bot-sourced, crowd-sourced way of telling what's hot without anyone directly telling me. Weeeeeeeeeee!!!!!!
It's a simple brute-force script, which means that if a thread is seen again it overwrites all the files again; it has probably pulled 200 GB of files in total. I will modify that in the future. Cool stuff, you are happy that I shared!!!!! :)
Another idea is setting up something where, if you have an image and can't quite find the rest of the series, you search the hash and it pulls up a thread that just may have those images. FUCKING AWESOME IDEAS!!!! http://i.imgur.com/DKfVTRn.png |
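For anyone wondering what the hash database idea could look like, here is a minimal sketch, not the actual bot. It assumes a hypothetical folder layout of `media_root/<thread_id>/<filename>`, and the names `file_hash`, `build_index` and `most_popular` are made up for illustration. It uses a plain SHA-256 of the file bytes, so it only matches byte-identical copies; a resized or re-encoded image would need a perceptual hash instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(media_root):
    """Map each content hash to the set of thread folders it appeared in.

    Assumes the hypothetical layout media_root/<thread_id>/<filename>.
    """
    index = defaultdict(set)
    for path in Path(media_root).glob("*/*"):
        if path.is_file():
            index[file_hash(path)].add(path.parent.name)
    return index


def most_popular(index, top=10):
    """Return (hash, thread_count) pairs, most widely posted first."""
    ranked = sorted(index.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(h, len(threads)) for h, threads in ranked[:top]]
```

The same index covers the "find the rest of the series" idea: hash the image you have with `file_hash` and look it up in the index to get the thread folders that contained it.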
Lapd.mobi seems to scrape 4chan threads. I searched for an image and that site came up with it.
|
Thought you were going to share the bot.
Like your stuff though, intelligent. |
I have a cron job set up to run every 10 minutes. It parses the front-page threads, then follows them and scrapes the media in those threads.
To set up the cron job, run crontab -e (or sudo crontab -e), then at the bottom of the file enter:
*/10 * * * * python /home/4chan/main.py
Code:
# Edit this file to introduce tasks to be run by cron.
Code:
from bs4 import BeautifulSoup |
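Since the cron job re-parses the same threads every 10 minutes, one way to stop the overwriting mentioned earlier would be to skip any file that is already on disk. This is only a sketch under assumptions: `save_once` is a made-up helper, and `fetch` stands in for whatever download call the real script uses.

```python
import os
import tempfile


def save_once(dest_path, fetch):
    """Write the file at dest_path only if it is not already on disk.

    `fetch` is a hypothetical zero-argument callable that returns the
    file's bytes; it is only called when a download is actually needed.
    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False
    folder = os.path.dirname(dest_path) or "."
    os.makedirs(folder, exist_ok=True)
    data = fetch()
    # Write to a temp file in the same folder, then rename it into place,
    # so an interrupted run never leaves a half-written file behind.
    fd, tmp_path = tempfile.mkstemp(dir=folder)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    os.replace(tmp_path, dest_path)
    return True
```

With this in place a repeat run costs one existence check per file instead of a full download.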
that's a big order of cheese pizza.
|
I have Python installed; I will try to get it working if I have time.
|
No shit. That's the last place I'd be scraping images from.
|
What happened to the bot?
|
The bro thing to do now is put up a paywall and charge for those... list every adult producer on the 2257 page. |
nice :thumbsup
|