GoFuckYourself.com - Adult Webmaster Forum

Google to make robots.txt an Internet standard after 25 years (https://gfy.com/showthread.php?t=1315134)

Bladewire 07-01-2019 05:50 PM

Google to make robots.txt an Internet standard after 25 years
 

Google demanding more free work & expense from people to bend to their fucking will :disgust

Google to make robots.txt an Internet standard after 25 years

The Robots Exclusion Protocol (REP) — better known as robots.txt — allows website owners to exclude web crawlers and other automatic clients from accessing a site. “One of the most basic and critical components of the web,” Google wants to make robots.txt an Internet standard after 25 years.

Despite its prevalence, REP never became an Internet standard, with developers interpreting the “ambiguous de-facto” protocol “somewhat differently over the years.” Additionally, it doesn’t address modern edge cases, with web devs and site owners ultimately still having to worry about implementation today.

On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?
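The two corner cases mentioned here (a leading BOM and an oversized file) can be handled defensively before parsing. The following is only a minimal sketch using Python's standard `urllib.robotparser`, not Google's parser. The `MAX_BYTES` cap, the `load_rules` helper, the `ExampleBot` name, and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical size cap chosen for illustration; real crawlers pick their own limit.
MAX_BYTES = 500 * 1024


def load_rules(raw: bytes) -> RobotFileParser:
    """Parse robots.txt bytes, tolerating a UTF-8 BOM and oversized input."""
    # Ignore everything past the cap instead of parsing a huge file.
    raw = raw[:MAX_BYTES]
    # "utf-8-sig" drops a leading UTF-8 BOM if one is present;
    # errors="replace" tolerates a multi-byte character cut by the cap.
    text = raw.decode("utf-8-sig", errors="replace")
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# A file saved by an editor that prepends a BOM (EF BB BF).
raw = b"\xef\xbb\xbfUser-agent: *\nDisallow: /private/\n"
rules = load_rules(raw)

blocked = rules.can_fetch("ExampleBot", "https://example.com/private/page.html")
allowed = rules.can_fetch("ExampleBot", "https://example.com/index.html")
print(blocked, allowed)
```

Without the BOM-aware decode, a parser may fail to recognize the first `User-agent` line, which is the kind of uncertainty the paragraph above describes.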

To address this, Google — along with the original author of the protocol from 1994, webmasters, and other search engines — has now documented how REP is used on the modern web and submitted it to the IETF.

The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.

The robots.txt standard is currently a draft, with Google requesting comments from developers. The standard will be adjusted as web creators specify “how much information they want to make available to Googlebot, and by extension, eligible to appear in Search.”

This standardization will result in “extra work” for developers that parse robots.txt files, with Google open sourcing the robots.txt parser used in its production systems.

This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.

brassmonkey 07-01-2019 06:00 PM

they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything

trevesty 07-02-2019 05:47 AM

Been running websites for over 15 years and making money from it. Tens of thousands of sites at least...

And every single one of them has had a robots.txt file. I don't see the issue.

Bladewire 07-02-2019 09:23 AM

Quote:

Originally Posted by trevesty (Post 22494065)
And every single one of them has had a robots.txt file. I don't see the issue.

It's not going to be the robots.txt that it's always been.

It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of, and if any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.

brassmonkey 07-02-2019 12:11 PM

Quote:

Originally Posted by Bladewire (Post 22494206)
It's not going to be the robots.txt that it's always been.

It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of, and if any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.

a sitemap is more complex :upsidedow i've had no issues with google, bing, or yandex telling me to change a thing.

Bladewire 07-07-2019 01:07 PM

Quote:

Originally Posted by brassmonkey (Post 22494302)
a sitemap is more complex :upsidedow i've had no issues with google, bing, or yandex telling me to change a thing.

We agree

rowan 07-07-2019 11:37 PM

Funny how Google is going on about making a de-facto protocol a standard, when they explicitly ignore a fairly important (IMHO) de-facto directive: Crawl-delay.

Website: I'm asking you nicely to please limit your fetching to once per 60 seconds.

GoogleBot: No.
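For crawlers that do choose to honor it, reading the directive is straightforward. A minimal sketch with Python's standard `urllib.robotparser`, which exposes `crawl_delay()`. The `ExampleBot` name and the rules shown are assumptions for illustration, and nothing here says how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt asking every crawler to wait 60 seconds between fetches.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the value for the matching group, or None if none is set.
delay = parser.crawl_delay("ExampleBot")
wait_seconds = delay if delay is not None else 0
print(wait_seconds)
```

A polite crawler would sleep for `wait_seconds` between requests to the same host. The directive is only a request, so nothing forces a crawler to read it.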

thommy 07-08-2019 12:15 AM

Quote:

Originally Posted by brassmonkey (Post 22493843)
they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything

I think this is just one reason; the other is that they don't get fined for what they show.

actually Google shows many documents and websites that do not have a robots.txt

now let's imagine a funny example:

a weapon company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without being explicitly asked to - they would also be in trouble.

there is no single INTERNET law, and google works worldwide under the laws of 255 different countries.
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this was a common technique for the past 25 years).

in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.

magneto664 07-08-2019 02:30 AM

Quote:

Originally Posted by thommy (Post 22497550)
a weapon company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without being explicitly asked to - they would also be in trouble.

Every day thousands of other bots scan your website: Ahrefs, Majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come preloaded with default directories, file names, and script paths to probe on your site. If you do not want something to appear on the Internet, you do not upload it to the internet. Simple.

Quote:

Originally Posted by thommy (Post 22497550)
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

If you list in the robots.txt file which files or directories to skip, Google will possibly obey. But for the other bots it will be a gift.

Klen 07-08-2019 02:45 AM

Quote:

Originally Posted by thommy (Post 22497550)
I think this is just one reason; the other is that they don't get fined for what they show.

actually Google shows many documents and websites that do not have a robots.txt

now let's imagine a funny example:

a weapon company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without being explicitly asked to - they would also be in trouble.

there is no single INTERNET law, and google works worldwide under the laws of 255 different countries.
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this was a common technique for the past 25 years).

in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.

The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow it. A better solution, instead of crawlers crawling everything on a website, would be to state explicitly what should be crawled.
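Something close to this opt-in model can already be written with the existing Allow and Disallow rules: allow the few paths that may be crawled and disallow everything else. A minimal sketch checked with Python's standard `urllib.robotparser`. The paths, the `ExampleBot` name, and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Allowlist-style rules: only /public/ may be crawled, everything else is off limits.
# Allow is listed first because Python's parser applies the first matching rule;
# parsers that pick the most specific (longest) match reach the same result.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

public_ok = parser.can_fetch("ExampleBot", "https://example.com/public/a.html")
members_ok = parser.can_fetch("ExampleBot", "https://example.com/members/")
print(public_ok, members_ok)
```

This still depends on the site owner publishing a robots.txt file and on the crawler choosing to obey it. A site with no file at all is treated as fully crawlable by this parser.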

thommy 07-08-2019 02:46 AM

Quote:

Originally Posted by magneto664 (Post 22497577)
Every day thousands of other bots scan your website: Ahrefs, Majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come preloaded with default directories, file names, and script paths to probe on your site. If you do not want something to appear on the Internet, you do not upload it to the internet. Simple.

but these bots are not google. nobody will try to sue them.

I do know how robots.txt works, but the point is that millions who have an internet presence don't.

if google crawls something from their site WITHOUT AN EXPLICIT request to do so, one judge or another may see them as a "victim" and they can sue Google for millions.

this is why it would make sense to make robots.txt THE rule for crawling your site, and sites without robots.txt would not be touched.




Quote:

If you list in the robots.txt file which files or directories to skip, Google will possibly obey. But for the other bots it will be a gift.

as i said - if there are no clear rules for that it will open big doors for lawsuits. and it's not the others who would have to fight them - it would be the one who has the money to pay.

thommy 07-08-2019 02:55 AM

Quote:

Originally Posted by KlenTelaris (Post 22497582)
The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow it. A better solution, instead of crawlers crawling everything on a website, would be to state explicitly what should be crawled.

that is exactly what i meant.

the laws in the various countries are so different that you can not even decide who is a professional who HAS to know it and who is not.

when the internet started nobody ever thought about such things as privacy and permission to crawl a page. it was simply assumed that everyone who posts something on the internet wants others to find it. this has changed a lot in the meantime, and the views on right or wrong in the world are so completely different that everything has to be EXPLICITLY allowed and not just assumed.


All times are GMT -7.