GFY stats, how I gathered, parsed, and queried the data
I had gathered and parsed the data before but redid it recently with a cleaner approach. Here's how I gathered, parsed, and queried the data from GFY.
The first step was gathering all the text from all the threads. I simply used a PHP script that followed redirects and had logic to determine whether a thread was longer than 50 posts. This was the only place where I ran into issues. If a thread was exactly 50 pages long, it would create a spider trap: the bot would keep requesting a new page, based on that logic, and get served the exact same page again. I discovered this when the crawl was already underway, so I simply lowered the maximum limit for following long threads and deleted threads that were labeled as being 199 pages or longer. This removes really long threads and contest threads but should only affect overall fidelity by 1-2%. I used 4 bots running in parallel, each on a separate DigitalOcean VPS. Each bot ran for around 2-3 days; I can't remember exactly. I used a cron job that started the script on reboot. Here's the script I used: Code:
<?php
Each bot harvested somewhere in the 50-55 GB range of uncompressed text files. Afterward I RARed all of them and downloaded them to my local drive. Total scraped: around 200 GB.
Pro tip: If you ever plan on parsing thousands or millions of text documents, make sure to do it on a solid state drive.
GFY doesn't have a clean DOM and even has leftovers from TexasDreams (: Hence the parsing script looks like a mess, with all the try/except blocks. I ran this on a 2010 Intel i7 iMac, one folder at a time; each bot's scrape took about 48 hours, plus or minus 4 hours, to parse. Total time: around 6 days of constant parsing. (The parsing script is in the next post; this post was too long.) Around 50 posts were parsed per second.
Technologies used:
Python 2.7
BeautifulSoup4 (for parsing)
SQLAlchemy (for ORM)
MySQL
For the queries, I just ran the initial ones in Sequel Pro on my Mac.
Highest post count?
Code:
SELECT DISTINCT username, postcount FROM posts ORDER BY postcount DESC;
Highest thread starter
Code:
SELECT username, COUNT(*) AS count FROM threads GROUP BY username ORDER BY count DESC;
Code:
for i in range(2001, 2017):
Here's a link with the MySQL dump; I'll leave it up for a week: GFY parsed DB
http://i.imgur.com/8XJar0b.png
http://i.imgur.com/NgbBrN1.png
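For anyone repeating the crawl, the spider trap described above (the server handing back the same page for every further page number) can be guarded against by fingerprinting each page and stopping on the first repeat. This is only a minimal sketch in Python 3, not the PHP script used for the crawl: `crawl_thread`, `fetch_page`, `fake_fetch`, and the `max_pages` value are all hypothetical names and numbers made up for illustration.

```python
import hashlib


def crawl_thread(fetch_page, max_pages=200):
    """Collect a thread's pages, stopping when the server repeats a page.

    fetch_page is a hypothetical callable: page number -> page text.
    max_pages is an arbitrary safety cap, not the limit used in the crawl.
    """
    seen = set()
    pages = []
    for page_number in range(1, max_pages + 1):
        body = fetch_page(page_number)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            # Same content served again: we walked past the last real page.
            break
        seen.add(digest)
        pages.append(body)
    return pages


def fake_fetch(page_number):
    # Stand-in for an HTTP request: a 3-page thread that keeps serving
    # its last page for any page number beyond the end.
    return "page %d" % min(page_number, 3)


if __name__ == "__main__":
    print(len(crawl_thread(fake_fetch)))
```

On a real forum the raw HTML usually contains things that change on every request (a clock, ad markup), so one would probably hash only the extracted post text rather than the whole page.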
This is the parsing script used on the text files. Like I said, the GFY DOM is a mess, so the script is a custom mess built to handle it. Sometimes the mods remove a user and leave them as a guest, which throws off the results; the script handles that case.
It's full of print statements for debugging (I need to move away from that). I need to abstract away more of my code as well. The parsing script: Code:
from sqlalchemy.ext.automap import automap_base
DB upload is still 20 mins from this post, so just give it a little.
One could bot the message field in the posts table and tally URL popularity. Someone more familiar with big data technologies may want to have a go at it.
Botting freeones and some of the piracy forums would give a good indicator of the popularity of models and programs over time. Would be interesting.
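The URL tally idea could be sketched roughly like this. It's Python 3 with only the standard library, and it assumes the message text has already been pulled out of the posts table into a list of strings; `tally_domains`, `URL_PATTERN`, and `sample_messages` are made-up names for illustration, and the regex is a deliberately loose guess at what links look like in post bodies.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Loose pattern: anything starting with http:// or https:// up to
# whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def tally_domains(messages):
    """Count how often each host name is linked across the messages."""
    counts = Counter()
    for message in messages:
        for url in URL_PATTERN.findall(message):
            # Drop punctuation that often trails a pasted link.
            url = url.rstrip(".,)")
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host:
                counts[host] += 1
    return counts


# Hypothetical sample data standing in for rows of the message field.
sample_messages = [
    "check http://i.imgur.com/a.png and http://www.example.com/page",
    "again: https://example.com/other",
    "no links here",
]

if __name__ == "__main__":
    for host, count in tally_domains(sample_messages).most_common(10):
        print(host, count)
```

For the full dump one would stream rows from MySQL in batches instead of loading every message into memory at once.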
:) Way before my time. Relic of the past.
http://i.imgur.com/UgrJ2yA.png
Love this thread. :thumbsup