GFY stats, how I gathered, parsed, and queried the data

04-18-2016, 02:24 PM   #1   johnnyloadproductions

I had gathered and parsed the data before but redid it recently with a cleaner approach. Here's how I gathered, parsed, and queried the data from GFY.

The first step was gathering the text of every thread. I used a PHP script that followed redirects and had logic to detect whether a thread ran longer than 50 posts. This was the only place I ran into issues: if a thread was exactly 50 pages long, the pagination logic created a spider trap, with the bot requesting the next page and getting served the exact same page again. I discovered this while the crawl was already underway, so I simply lowered the maximum limit for following long threads and deleted any thread labeled as 199 pages or longer. That drops the really long threads and the contest threads, but should only cost about 1-2% of overall fidelity.

I used four bots running in parallel, each on a separate DigitalOcean VPS; each ran for around 2-3 days, I can't remember exactly. A cron job started the script on reboot (a sample crontab entry follows the script). Here's the script I used:

Code:
<?php
/**
 * GFY thread scraper: walks thread IDs in sequence and saves the raw
 * HTML of every page of every thread to disk.
 */

$GFY_THREAD   = "https://gfy.com/showthread.php?t=";
$thread_start = 900001;
$thread_max   = $thread_start + 300000;
$DIRECTORY    = "/home/GFY/threads";

for ($thread_ID = $thread_start; $thread_ID < $thread_max; $thread_ID++)
{
    // Fetch page 1 of the thread, writing the response body straight to a file.
    $fp = fopen($DIRECTORY . "/" . $thread_ID . ".txt", "w");
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $GFY_THREAD . $thread_ID);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_FILE, $fp);
    curl_exec($curl);
    curl_close($curl);
    fclose($fp);

    echo "\n Thread " . $thread_ID;

    /**
     * Pagination loop for threads with more than 49 posts. The hard cap
     * of 200 pages is what drops the very long threads mentioned above.
     */
    $html = file_get_contents($DIRECTORY . "/" . $thread_ID . ".txt");
    if (preg_match('/\btitle="Next Page\b/i', $html))
    {
        for ($page_value = 2; $page_value < 200; $page_value++)
        {
            $page_file = $DIRECTORY . "/" . $thread_ID . "_" . $page_value . ".txt";
            $fp = fopen($page_file, "w");
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $GFY_THREAD . $thread_ID . "&page=" . $page_value);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($curl, CURLOPT_FILE, $fp);
            curl_exec($curl);
            curl_close($curl);
            fclose($fp);

            // Stop when this page no longer links to a next page.
            $html = file_get_contents($page_file);
            if (!preg_match('/\btitle="Next Page\b/i', $html))
            {
                break;
            }
        }
    }

    usleep(500); // note: usleep() takes microseconds, so this is only 0.5 ms between threads
}
?>
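The crontab entry was along these lines (the script path here is hypothetical):

Code:
@reboot /usr/bin/php /home/GFY/scraper.php >> /home/GFY/scrape.log 2>&1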

Each bot harvested somewhere in the 50-55 GB range of uncompressed text files; afterward I RARed them all and downloaded the archives to my local drive.

Total scraped around 200 GB.
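The archive-and-download step was something like this (archive name and destination paths are made up):

Code:
# on each droplet: compress the scrape into 500 MB RAR volumes
rar a -m5 -v500m gfy_threads.rar /home/GFY/threads/

# from the local machine: pull the volumes down
scp user@droplet:/home/GFY/gfy_threads.*.rar /local/backup/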


Pro tip: If you ever plan on parsing thousands or millions of text documents, make sure to do it on a solid state drive.

GFY doesn't have a clean DOM and even has leftovers from TexasDreams (:
Hence the parsing script looks like a mess, and hence all the try/except blocks.

I ran this on a 2010 Intel i7 iMac, one folder at a time; each bot's scrape took about 48 hours, plus or minus 4, to parse.

Total time: around 6 days of constant parsing, at roughly 50 posts parsed per second. (As a sanity check, 20+ million posts at 50 posts/second works out to about 4.6 days of pure parsing, which lines up.)

(parsing script in next post, this post was too long)

Technologies used:
Python 2.7
BeautifulSoup4 (for parsing)
SQLAlchemy for ORM
MySQL
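The whole stack installs with pip; pymysql and lxml are the MySQL driver and HTML parser the scripts assume:

Code:
pip install beautifulsoup4 sqlalchemy pymysql lxml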



For the initial queries I just ran simple SQL in Sequel Pro on my Mac.

Highest Post count?
Code:
SELECT DISTINCT username, postcount FROM posts ORDER BY postcount DESC;
I had to do a lot of manual filtering in Excel, because people kept posting during the scrape and I hardcoded the postcount into each row :p
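A grouped query would probably have saved the Excel pass; a sketch against the same posts table that keeps each user's highest recorded postcount, so mid-scrape duplicates collapse into one row:

Code:
SELECT username, MAX(postcount) AS postcount
FROM posts
GROUP BY username
ORDER BY postcount DESC;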

Highest thread starter
Code:
SELECT username, COUNT(*) as count FROM threads GROUP BY username ORDER BY count DESC;
For the thread and post-count frequency across the board's history I ran a nested query from Python, as follows. It took over an hour to tabulate all the post counts, since there are over 20 million posts to filter through for each query.
Code:
for i in range(2001, 2017):
    for j in range(1, 13):
        thread_activity = "SELECT COUNT(*) as count FROM threads WHERE year = {0} and month = {1}".format(i, j)
        q = connection.execute(thread_activity)

        for r in q:
            print str(r[0])
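In hindsight, a single grouped query would let MySQL make one pass instead of 192 separate scans; a sketch against the same threads table:

Code:
SELECT year, month, COUNT(*) AS count
FROM threads
GROUP BY year, month
ORDER BY year, month;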
I simply used Excel to create the charts.

Here's a link to the MySQL dump; I'll leave it up for a week: GFY parsed DB



04-18-2016, 02:28 PM   #2   johnnyloadproductions
This is the parsing script used on the text files. Like I said, the GFY DOM is a mess, so the script is correspondingly messy to handle it. Sometimes the mods remove a user and their posts show up as guest posts, which funkifies the results; those cases are handled.

It's full of print statements for debugging (I need to move away from that), and I need to abstract away more of my code as well.

The parsing script:

Code:
from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table, Column, Integer, String
from sqlalchemy import insert
from bs4 import BeautifulSoup
import os
import re
import time


# Time the script start
start = time.time()
directory = '/path/to/textfiles'

# file number tracker
i = 1



for file in os.listdir(directory):
    print i
    i = i + 1
    if file.endswith('.txt'):
        threadID= year= month= day= hour= minute= join_month= join_year= post_in_thread= post_number = 0
        user_name= AMorPM= status= location= message = ""



        f = open(directory + '/' + file, 'r')
        threadID = file.split('.')[0]

        soup = BeautifulSoup(f.read(), 'lxml')
        f.close()
        engine = create_engine('mysql+pymysql://user:pass'
                               '@localhost/GFY_2016')



        post_in_thread = 0
        thread_title = ""
        posts = soup.find_all('table', attrs={'id':re.compile('post')})
        for p in posts:



            items = BeautifulSoup(str(p), 'lxml')
            date = items.find('td', attrs={'class':'thead'})
            date_string = BeautifulSoup(str(date)).get_text().strip()
            parsed_date = date_string.split('-')

            try:
                # Gets the month, day, year from the extracted text
                month = parsed_date[0]

                # print "day: " + parsed_date[1]
                day = parsed_date[1]

                parsed_date = parsed_date[2].split(',')
                year = parsed_date[0]


                post_time = parsed_date[1].split(':')
                hour = post_time[0]
                minute = post_time[1].split(' ')[0]
                AMorPM = post_time[1].split(' ')[1]

             
            except:
                pass

            try:
                post_number = items.find('a', attrs={'target':'new'})
                test =  BeautifulSoup(str(post_number))
                post_in_thread = test.get_text()



                # Get the username of the individual
                user_name = items.find('a', attrs={'class':'bigusername'})
                name = BeautifulSoup(str(user_name)).get_text()
                user_name = name
                # print name
            except:
                pass

            try:
                # Get the status of the user, e.g. confirmed or so fucking banned
                status = items.find('div', attrs={'class':'smallfont'})
                status = BeautifulSoup(str(status)).get_text()
                # print status


                # Join date
                join_date = items.find(string=re.compile("Join Date:"))
                join_date = BeautifulSoup(str(join_date)).get_text()
                # print join_date
                join_month = join_date.split(' ')[2]
                join_year = join_date.split(' ')[3]

            except:
                pass


            # Location
            try:
                location = items.find(string=re.compile("Location:"))
                location = BeautifulSoup(str(location)).get_text()
            except:
                pass
                # print "Location: null"


            try:
                posts = items.find(string=re.compile("Posts:"))
                posts = BeautifulSoup(str(posts)).get_text().strip()
                posts = posts.split(' ')[1].replace(',','')
                post_number = posts
            except:
                pass
                # print "Posts: null"

            try:
                # print items
                # print items.find('div', attrs={'id', re.compile('post_message')})
                # print items.find_all(id=re.compile('post_message'))
                message = BeautifulSoup(str(items.find_all(id=re.compile('post_message')))).get_text()

                message = message.replace('\\n','').replace(']', '').replace('[', '').replace('\\r', '')
                # print message
            except:
                pass
                # print "message: null"

            # This code creates a new thread entry if the post is determined to be the first one
            if post_in_thread == '1':

                try:
                    # Select table here and make new thread title
                    title_block = items.find('td', attrs={'class': 'alt1'})
                    thread_title = BeautifulSoup(str(title_block)).find('div', attrs={'class': 'smallfont'})
                    thread_title = re.search('(?<=<strong>)(.*?)(?=</st)', str(thread_title))
                    # print thread_title.group(0)
                    # print "This is the first post"
                    metadata = MetaData()
                    thread = Table('threads', metadata,
                                   Column('threadID', String),
                                   Column('title', String),
                                   Column('username', String),
                                   Column('year', Integer),
                                   Column('month', Integer),
                                   Column('day', Integer),
                                   Column('hour', Integer),
                                   Column('minute', Integer),
                                   Column('AMorPM', String),
                                   )
                    metadata.create_all(engine)

                    # Make sure to add items here that were parsed
                    ins = insert(thread).values(
                        threadID=threadID,
                        title=thread_title.group(0),
                        username=user_name,
                        year=year,
                        month=month,
                        day=day,
                        hour=hour,
                        minute=minute,
                        AMorPM=AMorPM
                        # post_name=title,
                        # post_url=url,
                        # post_content=string,
                    )

                    # insert into database the parsed logic
                    engine.execute(ins)
                    # engine.dispose()
                    # engine = create_engine('mysql+pymysql://user:pass'
                    #                    '@localhost/GFY_2016')
                except:
                    pass

            try:
                # Insert the post itself; this runs for every post in the thread
                metadata = MetaData()
                posts = Table('posts', metadata,
                           Column('threadID', String),
                           Column('username', String),
                           Column('year', Integer),
                           Column('month', Integer),
                           Column('day', Integer),
                           Column('hour', Integer),
                           Column('minute', Integer),
                           Column('AMorPM', String),
                           Column('join_year', Integer),
                           Column('join_month', String),
                           Column('post_in_thread', Integer),
                           Column('postcount', Integer),
                           Column('message', String)

                    )
                metadata.create_all(engine)

                # Make sure to add items here that were parsed
                ins = insert(posts).values(
                    threadID=threadID,
                    username=user_name,
                    year=year,
                    month=month,
                    day=day,
                    hour=hour,
                    minute=minute,
                    AMorPM=AMorPM,
                    join_year=join_year,
                    join_month=join_month,
                    post_in_thread=post_in_thread,
                    postcount=post_number,
                    message=message

                )

                # insert into database the parsed logic
                engine.execute(ins)
            except:
                pass

            # print "\n"
            # connection.close()
        engine.dispose()

print time.time() - start
04-18-2016, 02:29 PM   #3   johnnyloadproductions
The DB upload still has about 20 minutes to go as of this post, so just give it a little.
04-18-2016, 02:32 PM   #4   CaptainHowdy
04-18-2016, 02:40 PM   #5   johnnyloadproductions
One could mine the message field in the posts table and tally URL popularity; someone more familiar with big-data technologies might have a go at it.
Scraping Freeones and some of the piracy forums the same way would be a good indicator of the popularity of models and programs over time. Would be interesting.
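A rough sketch of the URL tally, assuming the posts table from the parsing script above (the connection string and the top-20 cutoff are placeholders):

Code:
import re
from collections import Counter

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/GFY_2016')

# crude host extraction; good enough for a popularity tally
url_re = re.compile(r'https?://([^/\s"\'<>]+)', re.IGNORECASE)

domains = Counter()
for (message,) in engine.execute("SELECT message FROM posts"):
    for host in url_re.findall(message or ''):
        domains[host.lower()] += 1

for host, hits in domains.most_common(20):
    print host, hits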
04-18-2016, 02:42 PM   #6   johnnyloadproductions
Way before my time. Relic of the past.
04-19-2016, 04:58 AM   #7   CaptainHowdy
Quote: Originally Posted by johnnyloadproductions
Way before my time. Relic of the past.
He sure is greatly missed ...
04-19-2016, 05:18 AM   #8   Rob
Love this thread.