Limiting Google Bot?
Friends,
Is there a way to effectively limit how much and how often Googlebot crawls, other than screwing around with robots.txt and Google Webmaster Tools?
You could work with two robots.txt files, one that allows crawling and another that doesn't, and set up a cron job that runs a script to swap them from time to time (a rough sketch of such a script is below). Ask your host if you have managed hosting; it should be easy to set up. If you have a dedicated server, you could block the IP range for a while whenever you want to.
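To make the swap idea concrete, here is a minimal sketch of the kind of script a cron job could run. The file names, the document root, and the schedule are assumptions for illustration only, not something from this thread:

[CODE]
#!/usr/bin/env python3
"""Alternate between a permissive and a restrictive robots.txt.

Assumed layout (illustrative only):
  /var/www/html/robots.allow.txt     - version that lets Googlebot crawl
  /var/www/html/robots.disallow.txt  - version that blocks crawling
  /var/www/html/robots.txt           - the file actually served
"""
import shutil
from pathlib import Path

WEB_ROOT = Path("/var/www/html")      # assumed document root
ALLOW = WEB_ROOT / "robots.allow.txt"
DISALLOW = WEB_ROOT / "robots.disallow.txt"
LIVE = WEB_ROOT / "robots.txt"
STATE = WEB_ROOT / ".robots_state"    # remembers which copy is currently live


def swap() -> None:
    # Read the last state; default to "allow" on the first run.
    current = STATE.read_text().strip() if STATE.exists() else "allow"
    nxt = "disallow" if current == "allow" else "allow"
    shutil.copyfile(DISALLOW if nxt == "disallow" else ALLOW, LIVE)
    STATE.write_text(nxt)


if __name__ == "__main__":
    swap()
[/CODE]

A crontab entry like 0 */6 * * * /usr/local/bin/swap_robots.py would alternate the file every six hours (both the path and the schedule are placeholders). Keep in mind that Googlebot caches robots.txt, generally for up to a day, so very frequent swaps won't give you fine-grained control.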
You could definitely do what AndrewX says above, but can I ask why you want to avoid the two most logical ways to limit crawl frequency? Do you not have the necessary privileges to access robots.txt or Webmaster Tools?
Sure. You can configure your server to only accept X number of requests per IP, which will effectively rate-limit Googlebot if it has been hitting you with 10+ clients at a time.
Or, if you feel like getting more complex, you can have your app server (PHP, Python, Rails) handle the rate limiting and send a 503 header (indicating a temporary overload) to Googlebot when it's making too many requests (rough sketch below). The downside of either approach is that it might hurt your SEO if you do it badly and Google ends up seeing your site as 'slow' rather than rate-limited.
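As a rough illustration of the app-level approach, here is a minimal per-IP rate limiter written as plain-Python WSGI middleware. The request limit, the window length, the Retry-After value, and the placeholder app are all assumptions to adjust for your setup; behind a proxy you would also need to look at the forwarded-for header instead of REMOTE_ADDR:

[CODE]
import time
from collections import defaultdict, deque
from wsgiref.simple_server import make_server

MAX_REQUESTS = 10    # assumed limit: requests allowed per window per IP
WINDOW_SECONDS = 1   # assumed window length in seconds


def app(environ, start_response):
    """Stand-in application; replace with your real WSGI app."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


class PerIPRateLimiter:
    """Reject clients exceeding MAX_REQUESTS per WINDOW_SECONDS with a 503."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.monotonic()
        recent = self.hits[ip]
        # Drop timestamps that have fallen outside the window.
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= MAX_REQUESTS:
            # 503 plus Retry-After tells the crawler to back off for a while.
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", "30")])
            return [b"Too many requests, try again later.\n"]
        recent.append(now)
        return self.wrapped(environ, start_response)


if __name__ == "__main__":
    make_server("", 8000, PerIPRateLimiter(app)).serve_forever()
[/CODE]

The thresholds here are placeholders; as noted above, setting them too aggressively can make Google treat the site as slow or unavailable rather than simply rate-limited.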
Is this called bandwidth throttling? I can't find a link with info on setting this up. Any more info would be much appreciated.
I have a dedicated server. I use an access list to block the entire Google subnet. I never got much traffic from Google. I wouldn't mind a little more, but I don't want to be extorted into using their tools. I don't have a problem with letting Google in, but man, they have literally sent hundreds of bots hammering me all at one time, and my performance for potential paying customers ends up sucking.
What it's called is 'rate limiting by IP'.
If you're using Nginx, it's provided by the HttpLimitReqModule (an example config is below). More info here: http://serverfault.com/questions/179...-prevent-abuse Basically it lets you make a setting that says no more than X requests per second per IP. If you're using Apache, the module you'd use is called mod_evasive. If you're using Lighttpd, the module you'd use is called ModEvasive.
Using any of those modules may help with your problem, as well as stop people from spidering and ripping your sites quickly. The issue with Google is that if they use many IP addresses, they may get around your rate limit.
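For reference, here is a minimal sketch of what that looks like in Nginx with the limit_req module. The zone name, rate, and burst value are placeholders to tune for your own traffic:

[CODE]
http {
    # Track clients by IP in a 10 MB shared zone, allowing about 2 requests/second each.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        listen 80;

        location / {
            # Queue short bursts of up to 5 extra requests; reject anything beyond that.
            limit_req zone=perip burst=5;
        }
    }
}
[/CODE]

Requests rejected by limit_req get a 503 by default (configurable with limit_req_status), which lines up with the advice above about signalling a temporary overload.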
Friend, thanks! I've got it set up. I'll let you know my progress.
I tried three techniques:
1. mod_evasive
2. mod_limitipconn
3. iptables limits on connections per IP
None were effective. I never found robots.txt to be effective either. I ended up signing up for Google Webmaster Tools just to limit the crawl rate. I feel extorted. I will play around with another solution soon, but thanks for all the info. I learned a lot.