Anatomy Of A Search Engine Crawler

by: Rob Sullivan

When you go to a search engine and perform a search, you may not understand how those results ended up there. Some people think that sites are submitted by hand, while others know that a piece of software finds the pages. This article explains one piece of that puzzle: the search engine crawler.

Today's search engines rely on software packages called spiders or robots. These automated tools crawl the web to discover new pages.

A brief history of search crawlers

The first crawler was the World Wide Web Wanderer, and it appeared in 1993. It was developed at MIT, and its initial purpose was to measure the growth of the web. Soon after, however, an index was generated from the results, effectively creating the first "search engine."

Since then, crawlers have evolved and developed. Initially crawlers were simple creatures, only able to index specific bits of web page data such as meta tags. Soon, however, search engines realized that a truly effective crawler needs to index other information as well, including visible text, alt tags, images and even non-HTML content such as PDFs, word processor documents and more.

How a crawler works

Generally, the crawler gets a list of URLs to visit and store. The crawler doesn't rank the pages; it only goes out and gets copies, which it stores or forwards to the search engine to later index and rank according to various factors.

Search crawlers are also smart enough to follow links they find on pages. They may follow these links as they find them, or store them and visit them later.
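To make that fetch-and-follow loop concrete, here is a minimal sketch in Python using only the standard library. The start URL, the ten-page cap and the in-memory store are illustrative assumptions; a real crawler would also honor robots.txt, throttle itself and hand the stored pages off to a separate indexer.

    # A minimal illustration of the fetch-store-follow loop described above.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        queue = [start_url]    # URLs waiting to be visited
        seen = set(queue)      # URLs already queued, to avoid repeats
        store = {}             # URL -> raw HTML, handed to the indexer later

        while queue and len(store) < max_pages:
            url = queue.pop(0)
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue       # skip pages that fail to load
            store[url] = page  # keep a copy; ranking happens elsewhere

            parser = LinkExtractor()
            parser.feed(page)
            for link in parser.links:
                absolute = urljoin(url, link)        # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)           # follow links found on the page
        return store

    pages = crawl("https://example.com/")
    print("Stored", len(pages), "pages")

A real crawler maintains a vastly larger queue of URLs and runs many of these fetches in parallel, but the basic loop is the same.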

To date there are literally dozens of crawlers regularly indexing the web. Some are specialized crawlers, such as image indexers, while others are more general and therefore better known.

Some of the most well known crawlers include Googlebot (from Google), MSNBot (from MSN) and Slurp (from Yahoo!). There is also the Teoma crawler (from Ask Jeeves), as well as an assortment of crawlers from other engines, such as shopping engines, blog search engines and more.

Generally, when a crawler comes to visit a site, it first requests a file called "robots.txt." This file tells the crawler which files it can request, and which files or directories it's not allowed to visit.

The file can also be used to limit specific spiders' access to any part or all of the site, and to control how often the crawler visits, by limiting its speed or the times when it may visit. (Yahoo!'s Slurp and MSNBot both support the "Crawl-delay" directive, which tells the crawlers to slow down their crawling.)

It's not imperative that a site have a robots.txt file, however; if there isn't such a file, a crawler will assume it is OK to index the entire site.
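If you want to see how a polite crawler might interpret your robots.txt, Python's standard urllib.robotparser module can read the file and answer the "may I fetch this?" and "how long should I wait?" questions. The user-agent name and URLs below are placeholders, and the sample rules in the comment are only an illustration.

    # Example robots.txt content this sketch would react to:
    #   User-agent: *
    #   Disallow: /private/
    #   Crawl-delay: 10
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleBot"    # placeholder crawler name

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                    # fetch and parse the file

    # With the rules above this prints False; if the site has no
    # robots.txt at all, everything is treated as allowed.
    print(rp.can_fetch(USER_AGENT, "https://example.com/private/page.html"))

    # Crawl-delay, where supported, is the number of seconds to wait
    # between requests; crawl_delay() returns None if the directive is absent.
    print(rp.crawl_delay(USER_AGENT))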

Generally, today's crawlers behave like stripped-down versions of web browsers. Some, like Googlebot, see pages much as a text-based web browser called Lynx would. Therefore one of the tools you can use to verify a site is the Lynx browser. By loading the site in Lynx you can see essentially what the crawler "sees." You can then look for errors in the pages, as well as any navigation problems the crawler may run up against.
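If you don't have Lynx handy, a rough text-only approximation of a page can be produced with a few lines of Python. This is only an illustration of the idea of stripping a page down to its visible text; it is not how any particular crawler actually parses pages.

    # Strip the markup from a page and print only its visible text,
    # roughly approximating a text-browser view. Scripts and styles
    # are skipped since a visitor never sees them as text.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TextOnly(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip = False                 # inside <script> or <style>?

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip = True

        def handle_endtag(self, tag):
            if tag in ("script", "style"):
                self._skip = False

        def handle_data(self, data):
            if not self._skip and data.strip():
                self.chunks.append(data.strip())

    page = urlopen("https://example.com/", timeout=10).read().decode("utf-8", "replace")
    parser = TextOnly()
    parser.feed(page)
    print("\n".join(parser.chunks))

If large parts of your content are missing from that output, for example text buried in images or scripts, there is a good chance a crawler is missing it too.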

One other thing you may notice, as you view your web server log reports, is that some crawlers visit many different times and with many different configurations.

Yahoo!'s Slurp, for example, reports many different operating systems in its user-agent strings, from Windows 98 to Windows XP, and many different browsers, from Internet Explorer to Mozilla. MSNBot also works like this, emulating different operating systems and browsers.
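One way to see this for yourself is to scan your server's access log for crawler user-agent strings. The sketch below assumes the common "combined" log format and a hypothetical file named access.log; adjust both to match your own server.

    # Count requests from well-known crawlers in a web server access log.
    # Assumes the "combined" format, where the user agent is the last
    # quoted field on each line.
    import re
    from collections import Counter

    BOT_MARKERS = ("Googlebot", "msnbot", "Slurp", "Teoma")

    counts = Counter()
    with open("access.log") as log:            # hypothetical log file name
        for line in log:
            match = re.search(r'"([^"]*)"\s*$', line)
            if not match:
                continue
            agent = match.group(1).lower()
            for marker in BOT_MARKERS:
                if marker.lower() in agent:
                    counts[marker] += 1

    for bot, hits in counts.most_common():
        print(bot, hits)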

They do this to ensure compatibility; after all, the search engines want to be sure that the majority of their users find a site they can actually use. Therefore, as a design tip, you should test your site against various platforms and browsers as well. You don't have to use the full variety the search engines use, but you should test against Internet Explorer, Netscape and Firefox. You should also try your site on other platforms, such as a Mac or Linux, just to ensure compatibility.

You may also notice, upon reviewing your reports, that crawlers like Googlebot visit repeatedly and request the same page(s) each time. This is common: crawlers want to be sure the site is stable, and they also want to measure how frequently each page changes.

If your site happens to be down temporarily when a crawler visits like this, don't worry. The crawlers are smart enough to leave, come back later and try again. If, however, they continue to find the site down, or slow to respond, they may opt to stay away for longer periods, or index the site more slowly. This can negatively impact your site's performance in the search engines.
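From the crawler's side, that retry behaviour boils down to a simple back-off loop. The sketch below is purely illustrative; the delays and retry count are arbitrary assumptions, not how any particular engine actually schedules its revisits.

    # Illustrative only: back off and retry when a site is slow or down,
    # rather than giving up (or hammering the server) immediately.
    import time
    from urllib.request import urlopen

    def fetch_with_backoff(url, retries=3, base_delay=30):
        delay = base_delay
        for attempt in range(retries):
            try:
                return urlopen(url, timeout=10).read()
            except OSError:          # covers network errors and timeouts
                time.sleep(delay)    # site looks down: wait, then retry
                delay *= 2           # wait longer after each failure
        # Repeated failures: a real crawler would lower this site's crawl
        # rate or postpone it until a much later crawl cycle.
        return None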

As time goes on, we'd expect these spiders to become even more advanced. As new authoring technologies or new indexing options become available, the search crawlers will be adapted to handle them. Remember, the goal of all the search engines is to have the most complete index of files found on the web. This means they want to be able to index more than just web pages.

So as you are designing your site, be sure to keep the crawlers in mind. Don't build your site for crawlers; build it for users. But be sure to test it thoroughly so that the crawlers see what you want them to, without hindrances or roadblocks. Remember: the crawler is a site owner's best friend.