GoFuckYourself.com - Adult Webmaster Forum

BLOCKING crawlers whose SE's send you zero traffic (https://gfy.com/showthread.php?t=934427)

rowan 10-20-2009 06:25 PM

BLOCKING crawlers whose SE's send you zero traffic
 
I've been looking at one of my mainstream sites and there are several crawlers (such as cuil.com's) that like to munch on my pages, but the search engine referral traffic they send is virtually (or even absolutely) zero.

There are a lot of pages on this site, so a fair percentage of the access is from crawler bots. (Googlebot hits it 100,000+ times a day, but that figure is way ahead of the others, and G actually sends back referrals...)

I've been thinking about blocking the deadbeat crawlers via robots.txt, but then there's always the question looming - will the search engines they're attached to start sending traffic in the future? Am I going to shoot myself in the foot?

Has anyone contemplated this scenario?
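
For reference, a robots.txt rule that shuts out one crawler while leaving everyone else alone can be sanity-checked before it goes live. This is only a minimal sketch using Python's standard urllib.robotparser. The token "Twiceler" is an assumption for the name cuil's crawler answers to, and the /profile/12345 path is made up, so check the user-agent strings in your own logs first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "Twiceler" is ASSUMED to be the token cuil's
# crawler matches on; confirm it against your own access logs.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # /profile/12345 is a made-up path used only for illustration.
    for agent in ("Twiceler", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "/profile/12345"))
```

Keep in mind robots.txt is advisory: a well-behaved crawler will honor it, while rogue bots and scrapers will ignore it. The block is also easy to reverse if the engine starts sending traffic later.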

Agent 488 10-20-2009 06:29 PM

if you are concerned about bandwidth costs from search engine bots you have bigger issues than the ones you may be contemplating.

beemk 10-20-2009 06:39 PM

Quote:

Originally Posted by Agent 488 (Post 16449499)
if you are concerned about bandwidth costs from search engine bots you have bigger issues than the ones you may be contemplating.

what he said

rowan 10-20-2009 06:57 PM

LOL. Even though the SE bots are actually the majority of the "traffic" loading the site, this issue has nothing to do with bandwidth. It is more about the multiple database accesses and the overall load on the server.

woj 10-20-2009 07:00 PM

cloak the pages for the other bots, who knows might trick them into sending some nice traffic :1orglaugh

Agent 488 10-20-2009 07:02 PM

if you are concerned about server stress from search engine bots you have bigger issues than the ones you may be contemplating.

Quote:

Originally Posted by rowan (Post 16449574)
LOL. Even though the SE bots are actually the majority of the "traffic" loading the site, this issue has nothing to do with bandwidth. It is more about the multiple database accesses and the overall load on the server.


rowan 10-20-2009 07:12 PM

Quote:

Originally Posted by Agent 488 (Post 16449583)
if you are concerned about server stress from search engine bots you have bigger issues than the ones you may be contemplating.

Once again, there are more bots than humans loading this site. Out of ~200k loads per day roughly 60% are identifying themselves as bots, and there's probably 20-25% more that are rogue bots or site scrapers.

It's a profile site that pulls together various interlinking bits and pieces so it is reasonably database intensive.

So I'm not really concerned about server load NOW, more in the future... and really, I just don't get why I should let (e.g.) cuil scrape my site when they return zero traffic...
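
One way to put numbers on "they crawl but return zero traffic" is to tally, per search engine, crawler hits against referred visits from the access log. The sketch below assumes Apache-style combined log lines (referer and user-agent as the last two quoted fields). The ENGINES table, including the "twiceler" token for cuil, is a hypothetical mapping you would have to fill in from your own logs.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping: engine -> (crawler user-agent token, referrer domain).
# Both values are assumptions; adjust them to what your logs actually show.
ENGINES = {
    "google": ("googlebot", "google.com"),
    "cuil": ("twiceler", "cuil.com"),
}

# Last two quoted fields of a combined-format log line: "referer" "user-agent"
LOG_TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')


def tally(lines):
    """Count crawler hits and search referrals per engine."""
    crawls, referrals = Counter(), Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if not match:
            continue  # not a combined-format line, skip it
        agent = match.group("agent").lower()
        host = (urlparse(match.group("referer")).hostname or "").lower()
        for name, (token, domain) in ENGINES.items():
            if token in agent:
                crawls[name] += 1  # request made by the engine's crawler
            elif host == domain or host.endswith("." + domain):
                referrals[name] += 1  # visitor arrived from the engine
    return crawls, referrals
```

Running `tally(open("access.log"))` over a few weeks of logs gives a crawl-to-referral ratio per engine, which makes it easier to decide who counts as a deadbeat.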

uno 10-20-2009 07:17 PM

Why don't you just use some sort of caching to cut down on DB queries?

rowan 10-20-2009 07:25 PM

Quote:

Originally Posted by uno (Post 16449614)
Why don't you just use some sort of caching to cut down on DB queries?

With hundreds of millions of profiles it's pretty much all long tail, so there's not much opportunity to cache anything. A page that a bot accesses may not be accessed by anyone else for hours or even days. DB queries don't fare much better: with some basic profiling I've worked out that caching results would only give a 5-10% reduction in raw queries, and at that level the overhead of caching could become comparable to the benefit it provides...

FWIW I am planning to set up multiple backend servers so that I can simply add more when the load gets too high... I'm really just curious whether anyone's said "fuck you" to a (bona fide but obscure) search engine bot that does nothing but scrape. :pimp
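
The kind of profiling described above can be approximated by replaying a recorded sequence of query keys through a simulated cache and measuring the hit rate. This is a rough sketch, not the method actually used in the post: it assumes an LRU eviction policy and a fixed capacity, both chosen for illustration only.

```python
from collections import OrderedDict


def lru_hit_rate(keys, capacity):
    """Replay keys through an LRU cache of the given size; return hit fraction."""
    cache = OrderedDict()
    hits = total = 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Made-up replay: every key is unique, as in a pure long-tail workload.
    print(lru_hit_rate(range(1000), capacity=100))
```

With a long-tail workload where almost every key is seen once, the hit rate stays near zero no matter how large the cache is, which matches the point that caching buys little here.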

Thurbs 10-20-2009 07:46 PM

just redirect them all to a flat landing page or doorway page.

then if anyone does actually come from their sites, they can choose to press enter ..

