
Digital Security Report: The Battle of the Bots (Google)

In the days before Google (in fact, around 3 BG), AltaVista was the leading search engine available to Internet users. To demonstrate the superior performance of its minicomputers, Digital's AltaVista team decided to index and search the entire Internet. At the time this was something new, and many webmasters did not want these "robot" programs visiting all of the pages on their sites, because of the resulting load on their servers and the associated increase in bandwidth costs.

This led to the Robots Exclusion Standard, established in 1996 to prevent exactly that. A simple text file called robots.txt tells search engines not to visit the directories you specify. Here is a very simple example of a robots.txt file that keeps all search engines out of the /images directory:

User-agent: *
Disallow: /images

By disallowing /images you also implicitly deny any subdirectory below /images, such as /images/logos, as well as any file whose path begins with /images, such as /images.html. Strangely enough, the first draft of the standard did not include an "Allow" directive. It was added later, but without a guarantee of support from all search engines.
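If you want to check how such rules apply, Python's standard library includes a parser for this format. Here is a minimal sketch; the agent name "AnyBot" and the sample paths are made up purely for illustration:

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, as a list of lines
rules = """\
User-agent: *
Disallow: /images
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# /images, everything beneath it, and /images.html are all off limits
print(rp.can_fetch("AnyBot", "/images/logos/logo.png"))  # False
print(rp.can_fetch("AnyBot", "/images.html"))            # False
print(rp.can_fetch("AnyBot", "/about.html"))             # True
```

The match is a simple prefix test against the path, which is exactly why /images.html is blocked along with the directory itself.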

This means that anything not explicitly disallowed is fair game for a web crawler. If you choose to let crawlers access your entire web site, your robots.txt looks as follows:

User-agent: *
Disallow:

When the user-agent is *, the lines that follow apply to all search engine robots. By naming a specific web crawler as the user agent, you can give instructions to that robot alone:

User-agent: Googlebot
Disallow: /google-secrets

Since the first specification was issued, some search engines have extended the protocol. One example is the use of wildcards:

User-agent: Slurp
Disallow: /*.gif$

This tells Yahoo, whose web crawler is known as Slurp, not to index any file on your site that ends with the suffix ".gif". Keep in mind, though, that wildcards are not supported by every search engine.
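Extended rules like the Slurp example can be modelled by translating the pattern into a regular expression: * matches any run of characters, and a trailing $ anchors the rule to the end of the path. This is a hand-rolled sketch (rule_matches is my own helper, not part of any standard library), assuming Google/Yahoo-style wildcard semantics:

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Return True if a robots.txt rule with optional wildcards matches path."""
    anchored = rule.endswith("$")  # a trailing $ pins the rule to the end of the path
    pattern = re.escape(rule.rstrip("$")).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    # Rules match from the start of the path, like a prefix
    return re.match(pattern, path) is not None

print(rule_matches("/*.gif$", "/photos/cat.gif"))         # True
print(rule_matches("/*.gif$", "/photos/cat.gif.html"))    # False
print(rule_matches("/images", "/images/logos/logo.png"))  # True (plain prefix)
```

Without any wildcard characters the helper degrades to the plain prefix matching of the original 1996 standard.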

That is why such lines must be prefaced with the correct user-agent line. You can mix several of the above techniques in one robots.txt file. Here is a theoretical example:

User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /

Computer programs are pretty good at following instructions like these, but for a human brain they quickly become overwhelming, so I strongly encourage you to keep it simple. For us mere mortals, there is a robots.txt analysis tool in Google Webmaster Tools. Highly recommended.
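The combined example can be fed to Python's standard-library parser too, with one caveat worth flagging loudly: urllib.robotparser implements the original standard plus Allow, using plain prefix matching, so a wildcard line like /*.gif$ is treated as a literal path and has no effect here. The agent names and paths below are illustrative:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot may see /foo but nothing else; every other bot only loses /bar
print(rp.can_fetch("Googlebot", "/foo/page.html"))     # True
print(rp.can_fetch("Googlebot", "/baz.html"))          # False
print(rp.can_fetch("SomeOtherBot", "/bar/page.html"))  # False
print(rp.can_fetch("SomeOtherBot", "/baz.html"))       # True
```

Note that the Googlebot section completely replaces the * section for that crawler; the rules are not merged, which is another reason to keep these files simple.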

Another good source of information on the Robots Exclusion Standard is www.robotstxt.org. Many companies are willing to pay large sums for offers that promise to get their sites into the search engines, so going in the opposite direction may seem odd. However, there are smart security reasons to limit how much of your site a search engine may index. My Digital Security Report provides more information on this subject.
