
GoFuckYourself.com - Adult Webmaster Forum (https://gfy.com/index.php)
-   Fucking Around & Business Discussion (https://gfy.com/forumdisplay.php?f=26)
-   -   Robots.txt (https://gfy.com/showthread.php?t=864867)

BigPimpCash 10-27-2008 08:57 AM

Robots.txt
 
I am looking into the SEO of my sites and someone mentioned a robots.txt file that I should have ??? I am of course going to speak to my best friend Mr Google... but wondered if anyone could take 5 mins to explain what it is and what it should consist of... ?

:helpme

Lycanthrope 10-27-2008 09:05 AM

In a nutshell, robots.txt is used to tell spiders / crawlers what NOT to scan.

Not every robot obeys it, however.

User-agent: *
Disallow: /

Would tell ALL robots not to look at any pages.

I recommend NOT using the example above.

The Duck 10-27-2008 09:11 AM

I tell the robots the address to my sitemap, done!

DutchTeenCash 10-27-2008 09:12 AM

Google it and read some stuff; it can do a lot for your sites, for sure.

BigPimpCash 10-27-2008 09:23 AM

Hmmm...
 
Quote:

Originally Posted by DutchTeenCash (Post 14957272)
Google it and read some stuff; it can do a lot for your sites, for sure.

I am currently reading up on it... I am a little confuzzled though... I want bots/crawlers to read my site, so does it matter if I don't have a robots.txt file? It says that if I have a robots.txt file that's empty, they will all take it that it's OK to enter... but by not having one, or having a blank one, does that mean they are less likely to crawl the site?

I understand, however, that some bots are undesirable... where is there a list of the bots that are not wanted?

Someone said that some bots are bad bots that, for example, harvest emails, and others are site rippers... but it appears these are better blocked by rewriting your .htaccess file than by editing the robots.txt file... but again I would ask where you find an up-to-date list of bad bots/site rippers to put into your .htaccess.

TripleXPrint 10-27-2008 09:29 AM

Robots.txt files are pretty much useless unless you're trying to get Google NOT to index a page. Google will crawl your site and index your content without one, so you really don't need it. :2 cents:

Jdoughs 10-27-2008 09:31 AM

Quote:

Originally Posted by BigPimpCash (Post 14957322)
I am currently reading up on it... I am a little confuzzled though... I want bots/crawlers to read my site, so does it matter if I don't have a robots.txt file? It says that if I have a robots.txt file that's empty, they will all take it that it's OK to enter... but by not having one, or having a blank one, does that mean they are less likely to crawl the site?

I understand, however, that some bots are undesirable... where is there a list of the bots that are not wanted?

Someone said that some bots are bad bots that, for example, harvest emails, and others are site rippers... but it appears these are better blocked by rewriting your .htaccess file than by editing the robots.txt file... but again I would ask where you find an up-to-date list of bad bots/site rippers to put into your .htaccess.

If you want them to spider everything, don't add one. If you feel you need one, add a general one allowing all bots; if you have a specific bot problem, ban that one bot. You can also use it to block certain directories.

To allow all robots complete access:

User-agent: *
Disallow:


To exclude all robots from the server:

User-agent: *
Disallow: /

To exclude all robots from parts of a server:

User-agent: *
Disallow: /private/
Disallow: /images-saved/
Disallow: /images-working/

To exclude a single robot from the server (replace BadBot with that bot's actual user-agent name):

User-agent: BadBot
Disallow: /

TheSenator 10-27-2008 09:47 AM

You should hook this up on your site:
http://www.google.com/webmasters/tools/

baddog 10-27-2008 10:18 AM

Use it to keep spiders out of certain parts of your site.

alexchechs 10-27-2008 10:19 AM

They are pretty simple files and can do a lot for your site. I also suggest using an XML sitemap and having Google spider it often.

BigPimpCash 10-27-2008 11:25 AM

Thanks guys...
 
Great advice, keep it coming... I was thinking of an XML sitemap, just not sure how much will be on it since we're a paysite and don't have that many pages :) but if it helps the SEO then I will speak to my guy about it...

How do I get Google to look at it often? :)

Tempest 10-27-2008 02:27 PM

This is what I put on all of my sites

User-agent: Fasterfox
Disallow: /
User-agent: *
Disallow:
Sitemap: http://...../.....txt

The sitemap is just a text listing of all the URLs in UTF-8. Sitemap is a recent addition to the robots.txt spec and the big three all grab it now. If you have lots of sites, doing this is much easier than fucking around with the Google stuff.

http://www.sitemaps.org/protocol.php#informing
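
Such a text sitemap is nothing fancier than one absolute URL per line in a UTF-8 file; for example (example.com here is just a placeholder):

http://www.example.com/
http://www.example.com/tour/
http://www.example.com/join/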

The Fasterfox entry is due to a Firefox plugin that causes "false" hits to your site because it preloads pages.

BigPimpCash 10-27-2008 08:03 PM

Ok...
 
So the general consensus is we should create some form of sitemap... either XML or TXT... and then put a line in robots.txt to make sure that the bots read/crawl the sitemap :)

Does anyone add anything to keep the bad ones out... or is that done mainly via the .htaccess file? Either way, is there a good example for either one that shows which bots should be included? :thumbsup

mona 10-27-2008 08:51 PM

Quote:

Originally Posted by TripleXPrint (Post 14957348)
Robots.txt files are pretty much useless unless you're trying to get Google NOT to index a page. Google will crawl your site and index your content without one, so you really don't need it. :2 cents:

:2 cents:

If you submit your sitemap to Google Webmaster Tools, then they'll spider everything regularly... amongst lots of other cool stuff.

mona 10-27-2008 08:53 PM

Quote:

Originally Posted by BigPimpCash (Post 14960916)
So the general consensus is we should create some form of sitemap... either XML or TXT... and then put a line in robots.txt to make sure that the bots read/crawl the sitemap :)

Does anyone add anything to keep the bad ones out... or is that done mainly via the .htaccess file? Either way, is there a good example for either one that shows which bots should be included? :thumbsup

<:2 cents:> My fave --> www.xml-sitemaps.com </:2 cents:>

d-null 10-27-2008 09:08 PM

I add this to all of my sites' robots.txt:

User-agent: ia_archiver
Disallow: /

That keeps archive.org from spidering and keeping a copy of your site forever.

BigPimpCash 10-28-2008 02:10 AM

Question
 
Quote:

Originally Posted by d-null (Post 14961085)
I add this to all of my sites' robots.txt:

User-agent: ia_archiver
Disallow: /

That keeps archive.org from spidering and keeping a copy of your site forever.

Can I ask why this is a bad thing? Isn't ia_archiver the Alexa one?

NinjaSteve 10-28-2008 08:20 AM

Quote:

Originally Posted by d-null (Post 14961085)
I add this to all of my sites' robots.txt:

User-agent: ia_archiver
Disallow: /

That keeps archive.org from spidering and keeping a copy of your site forever.

Seems to be more than just archive.org, it's Alexa overall:
http://www.alexa.com/site/help/webmasters

BigPimpCash 10-28-2008 10:02 AM

Question
 
Can anyone show me an example of an XML sitemap for a paysite that they use to encourage the spiders to crawl the pages...??? :helpme
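
For reference, the bare-bones sitemaps.org XML format is just an urlset of url/loc entries, roughly like the sketch below; example.com, the date, and the frequency/priority values are placeholders, and only the loc tag is actually required for each URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-10-28</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/tour/</loc>
    <changefreq>monthly</changefreq>
  </url>
</urlset>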

wizzart 10-28-2008 11:50 AM

Quote:

Originally Posted by BigPimpCash (Post 14963143)
Can anyone show me an example of an XML sitemap for a paysite that they use to encourage the spiders to crawl the pages...??? :helpme


Quote:


User-agent: BecomeBot
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: Jetbot/1.0
Disallow: /

User-agent: Jetbot
Disallow: /

User-agent: Teoma
Disallow: /

User-agent: WebVac
Disallow: /

User-agent: Stanford
Disallow: /

User-agent: Stanford CompSciClub
Disallow: /

User-agent: Stanford CompClub
Disallow: /

User-agent: Stanford Spiderboys
Disallow: /

User-agent: scooter
Disallow: /

User-agent: naver
Disallow: /

User-agent: dumbot
Disallow: /

User-agent: Hatena Antenna
Disallow: /

User-agent: grub-client
Disallow: /

User-agent: grub
Disallow: /

User-agent: looksmart
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: Curl
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver/1.6
Disallow: /

User-agent: Alexibot
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: Stanford Comp Sci
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: WebmasterWorldForumBot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: http://www.WebmasterWorld.com bot
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: http://www.SearchEngineWorld.com bot
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: WebEnhancer
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: QueryN Metasearch
Disallow: /

User-agent: Openfind data gathere
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: WebmasterWorld Extractor
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

User-agent: Oracle Ultra Search
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: PerMan
Disallow: /

User-agent: searchpreview
Disallow: /

User-agent: sootle
Disallow: /

User-agent: es
Disallow: /

User-agent: Enterprise_Search/1.0
Disallow: /

User-agent: Enterprise_Search
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: http://Anonymouse.org/_(Unix)
Disallow: /

User-agent: WebRipper
Disallow: /

User-agent: Download_Ninja_7.0
Disallow: /

User-agent: Download_Ninja_2.0
Disallow: /

User-agent: Download_Ninja_5.0
Disallow: /

User-agent: Download_Ninja_3.0
Disallow: /

User-agent: Download_Ninja/4.0
Disallow: /

User-agent: FDM Free Download Manager
Disallow: /

User-agent: *
Disallow: /cgi-bin
:2 cents::2 cents:

ScottXXX 10-28-2008 11:53 AM

Quote:

Originally Posted by kandah (Post 14957265)
I tell the robots the address to my sitemap, done!

lmao....

BigPimpCash 10-28-2008 12:02 PM

Hey
 
Quote:

Originally Posted by wizzart (Post 14963690)
:2 cents::2 cents:

Thanks for that Wizzart... I take it that's what you use to stop bad bots and site rippers...??? Does it work OK in robots.txt, as I was told a lot of them ignore the robots.txt file... and that it's better to put it in the .htaccess file instead???
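
For reference, the usual .htaccess approach on Apache is to flag unwanted user-agent strings and then deny them, which the server enforces even when a bot ignores robots.txt; a rough sketch, using a few names from the list above purely as examples:

# Flag unwanted bots by their User-Agent string (case-insensitive)
SetEnvIfNoCase User-Agent "EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "HTTrack" bad_bot
SetEnvIfNoCase User-Agent "WebZip" bad_bot

# Deny any request flagged above; everyone else is allowed through
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>

Bots can still fake their user-agent, so this is not bulletproof either, but unlike robots.txt it does not depend on the bot choosing to cooperate.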

TripleXPrint 10-28-2008 12:22 PM

Quote:

Originally Posted by mona_klixxx (Post 14961047)
:2 cents:

If you submit your sitemap to Google Webmaster Tools, then they'll spider everything regularly... amongst lots of other cool stuff.

Yep, that's exactly what I use, and I have A LOT of sites ranked at the top of the SERPs for high-ranking keywords. Give me a long-tail keyword and I guarantee you I can have it at the top of Google in less than a week. And that's without using blackhat methods. With blackhat I could get it there in less than 48 hours, but it wouldn't stay up there long. To me optimization is like sex; the longer it lasts, the better it feels. :winkwink:

Kudles 10-28-2008 12:22 PM

Google would help

