A Student's Guide to the Deep Web

The Internet is a treasure trove of knowledge, especially for students in search of immediate information gratification. However, the ‘Net contains billions of files, and unless you know the exact URL of the one you want, you’re going to have to rely on search engines to help you unearth the info you need.

Search engines are tools that allow you to search for information available on the Web using keywords and search terms. Rather than searching the Web itself, however, you are actually searching the engine's database of files.

Search engines are actually three separate tools in one. The spider is a program that “crawls” through the Web, moving from link to link, looking for new web pages. When the spider finds new sites or files, it adds them to the search engine's index. This index is a searchable database of all the information that the spider has found on the Web. Some engines index every word in each document, while others select certain words. The search engine itself is a piece of software that allows users to search the engine's database. Clearly, an engine's search is only as good as the index it's searching.
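The three parts described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real engine is built: the PAGES dictionary, the a.example addresses, and the function names crawl, build_index, and search are all made up for the example, and the "Web" here is just a handful of in-memory pages.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to (page text, outgoing links).
PAGES = {
    "a.example/home": ("welcome to the physics department",
                       ["a.example/labs", "a.example/staff"]),
    "a.example/labs": ("physics labs and optics research", ["a.example/home"]),
    "a.example/staff": ("staff directory and contact details", []),
    # No page links to this record, so a link-following spider never finds it.
    "a.example/db?record=42": ("optics thesis stored in a database", []),
}

def crawl(start):
    """Spider: follow links breadth-first from a start URL."""
    seen, queue = [], deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.append(url)
        queue.extend(PAGES[url][1])
    return seen

def build_index(urls):
    """Index: map every word to the set of URLs whose text contains it."""
    index = {}
    for url in urls:
        for word in PAGES[url][0].split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Search software: return the URLs that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(crawl("a.example/home"))
print(search(index, "physics"))        # pages the spider reached
print(search(index, "optics thesis"))  # the database record was never indexed
```

The second query comes back empty even though a matching page exists, because the search only ever consults the index, and the spider never reached that page.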

When you run a query using a search engine, you're really only searching the engine's index of what's on the Web, as opposed to the entire Web. No one search engine is capable of indexing everything on the Web - there's just too much information out there! Additionally, many spiders cannot or will not enter databases or index certain types of files. Consequently, search engine queries miss a great deal of information, including breaking news, documents, multimedia files, images, tables, and other data. Collectively, these types of resources are referred to as the deep or invisible Web: they're buried deep in the Web and are invisible to search engines. While many search engines cover some areas of the deep web, most of these resources require special tools to unearth them.

Estimates vary, but the deep web is much larger than the surface web. By some estimates, the deep web holds approximately 500 times as much information as the surface web. This consists of multimedia files, including audio, video, and images; software; documents; dynamically changing content such as breaking news and job postings; and information that's stored in databases, for example, phone book records, legal information, and business data. Clearly, the deep web has something to offer almost any student researcher.

The easiest way to find information on the deep web is to use a specialized search engine. Many search engines index only a very small portion of the deep web; however, some engines target the deep web specifically. If you need to find a piece of information that's likely to be classified as part of the deep web, search engines that focus on such content are your best bet.

Like surface web engines, deep web search engines may also sell advertising in the form of paid listings. They differ in their coverage of deep web content and in the advanced search options they offer. Engines that search the deep web can be classified as first vs. second generation, individual vs. meta, and/or separate vs. collated retrieval, just as with surface web engines. Thus, you'll need to familiarize yourself with the options that are available and gradually add the best engines to your bag of research tricks.

Let's look at two popular deep web search engines for an illustration:

1. Complete Planet (www.completeplanet.com) is a free commercial search engine. It acts as a gateway to other search services, providing links to over 70,000 search sites. For easy browsing, the links are organized by subject into a “browse tree.” You can also search their links by keyword, which will retrieve a relevance-ranked list of results. While they do sell advertising, paid results are clearly labeled as such.

2. Scirus (http://www.scirus.com/srsapp/), in contrast, is more limited in scope. An academic engine, it does not sell advertising or feature paid listings. Rather than trying to provide access to the entire deep web, it focuses on scientific content. Users can search over 167 million scientific web pages, databases, and journals with Scirus. Results can be sorted in several ways, including by relevance and source. Scirus is provided free by Elsevier, a company that also markets databases to individuals and institutions.

Obviously, Scirus is a more scholarly search engine than Complete Planet, and thus is more appropriate for your academic research needs. Well, assuming that you're conducting research for a physics or psychology class, of course! If literature's your thing, perhaps you might want to try out another academic deep web engine, such as the Directory of Open Access Journals (http://www.doaj.org/) or the New York Public Library's holdings (http://www.nypl.org/).

When doing research for a class, you need to be just as discriminating with deep web search engines as you are with other online tools. Always look for an engine's advertising policy, and consider where it gets its funding. Look for non-profit engines that only index information from reputable sources. Search engines with a filter are a plus; for example, Scirus's engine discards non-scientific web sites and relies mainly on information from the top-level domains “.edu” and “.org”.
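To get a feel for what a domain filter does, here is a minimal Python sketch that keeps only results whose host name ends in “.edu” or “.org”. It is an assumption-laden illustration, not a description of how Scirus or any other engine actually filters its results: the ALLOWED_SUFFIXES list, the function names, and the sample URLs (apart from the DOAJ address mentioned earlier) are invented for the example.

```python
from urllib.parse import urlparse

# Assumed whitelist for illustration; a real engine's rules are not public here.
ALLOWED_SUFFIXES = (".edu", ".org")

def is_allowed(url):
    """Return True if the URL's host name ends in an allowed top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_SUFFIXES)

def filter_results(urls):
    """Keep only the results that pass the domain check, preserving order."""
    return [u for u in urls if is_allowed(u)]

results = [
    "http://www.doaj.org/",
    "http://www.example.com/essay",
    "http://physics.example.edu/paper",
]
print(filter_results(results))
```

A filter like this is crude: a “.org” address is no guarantee of quality, so it complements, rather than replaces, your own judgment about a source.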

As with the rest of the Internet, the deep web can be an excellent resource - but only when used with caution!

Kelly Garbato is an author, ePublisher, and small business owner. She recently self-published her first book, “13 Lucky Steps to Writing a Research Paper,” now available at Amazon.com (http://www.amazon.com) or through Peedee Publishing (http://www.peedeepublishing.com).

To learn more about the author, visit her web site at http://www.kellygarbato.com.
