A Timeline of events related to the Deep Web courtesy of Papergirls
1980 Tim Berners-Lee “developed his first hypertext system, “Enquire” for his own use (although unaware of the existence of the term HyperText). With a background in text processing, real-time software and communications, Tim decided that high energy physics needed a networked hypertext system and CERN was an ideal site for the development of wide-area hypertext ideas (CERN).”
1989 Tim Berners-Lee started the WorldWideWeb project at CERN.
1992-09 Arthur Secret at CERN created the first web gateway to a relational database system, RDB (Shestakov 2008-05).
1994 Dr. Jill Ellsworth “first coined the phrase “invisible Web” to refer to information content that was “invisible” to conventional search engines (Bergman 2001 citing Garcia 1996).”
1996 Frank Garcia (1996) claimed that Texas-based university professor Jill H. Ellsworth (d. 2002), an Internet consultant for Fortune 500 companies, coined the term “Invisible Web” in 1996 to refer to websites that are not registered with any search engine. “Ellsworth is co-author with her husband, Matthew V. Ellsworth, of The Internet Business Book (John Wiley & Sons, Inc., 1994), Marketing on the Internet: Multimedia Strategies for the World Wide Web (John Wiley & Sons, Inc.), and Using CompuServe. She has also explored education on the Internet, and contributed chapters on business and education to the massive tome, The Internet Unleashed.”
[S]igns of an unsuccessful or poor site are easily identified, says Jill Ellsworth. “Without picking on any particular sites, I’ll give you a couple of characteristics. It would be a site that’s possibly reasonably designed, but they didn’t bother to register it with any of the search engines. So, no one can find them! You’re hidden. I call that the invisible Web.” Ellsworth also makes reference to the “dead Web,” which no one has visited for a long time, and which hasn’t been regularly updated (Garcia 1996).
1996-12-12 “The first commercial Deep Web tool (although they referred to it as the “Invisible Web”) was @1, announced December 12th, 1996 in partnership with large content providers. According to a December 12th, 1996 press release, @1 started with 5.7 terabytes of content which was estimated to be 30 times the size of the nascent World Wide Web (“America Online to Place AT1 from PLS in Internet Search Area: New AT1 Service Allows AOL Members to Search ‘The Invisible Web’”).” See Choi (2008-01-07).
1996-12-12 “Personal Library Software, Inc. (PLS), the leading supplier of search and retrieval software to the online publishing industry, ushered in the next generation of Internet search engines with the introduction of a new Internet based service, AT1 which combines the best of PLS’s search, agent and database extraction technology to offer publishers and users something they have never had before: the ability to search for content residing in “hidden” databases — those large collections of documents managed by publishers not viewable by Web spiders. AT1 also allows users to create intelligent agents to search newsgroups and websites with E-Mail notification of results (Press release).”
1997 Michael Lesk wrote an unpublished paper entitled “How much information is there in the world?”, in which he estimated that in 1997 the Library of Congress held between 20 terabytes and 3 petabytes. See Choi (2008).
1999-02 Lawrence and Giles (1999) claimed that the publicly indexable World Wide Web (PIW) contained about 800 million pages; the search engine with the largest index, Northern Light, indexed roughly 16% of the publicly indexable World Wide Web; the combined index of 11 large search engines covered (very) roughly 42% of the publicly indexable World Wide Web.
2000-03 c. 43,000–96,000 Deep Web sites existed (Bergman 2001).
2000-07-26 BrightPlanet released a study documenting the Deep Web (a massive storehouse of databases and information that was invisible to search engines in 2000), claiming that the Deep Web was 500 times larger than the indexed Web accessible by most search engines. BrightPlanet researchers also released their direct-query search technology, LexiBot™, which automatically identifies, retrieves, qualifies, and classifies content from Deep Web sites. They listed c. 20,000 searchable Deep Web sites. Because direct-query search technology, unlike most search engines, can access searchable databases, the Invisible Web is not really invisible, just harder to reach. See “BrightPlanet Unveils the ‘Deep’ Web: 500 Times Larger than the Existing Web.”
BrightPlanet “quantified the size and relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000. Our key findings include: Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web; The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web; The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web; More than 200,000 deep Web sites presently exist; Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information — sufficient by themselves to exceed the size of the surface Web forty times; On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public; The deep Web is the largest growing category of new information on the Internet; Deep Web sites tend to be narrower, with deeper content, than conventional surface sites; Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web; Deep Web content is highly relevant to every information need, market, and domain; More than half of the deep Web content resides in topic-specific databases; A full ninety-five per cent of the deep Web is publicly accessible information — not subject to fees or subscriptions (Bergman 2001).”
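The headline ratios in these findings can be rechecked from the figures quoted in the same passage. The short Python sketch below does only that arithmetic; the variable names are illustrative, and the inputs are the rounded numbers quoted from Bergman (2001), not independent measurements.

```python
# Figures as quoted from Bergman (2001); variable names are illustrative only.
deep_tb = 7_500         # terabytes of information in the deep Web
surface_tb = 19         # terabytes of information in the surface Web
deep_docs = 550e9       # individual documents in the deep Web
surface_docs = 1e9      # individual documents in the surface Web
top60_tb = 750          # terabytes held by the sixty largest deep Web sites

size_ratio = deep_tb / surface_tb        # deep vs. surface, by data volume
doc_ratio = deep_docs / surface_docs     # deep vs. surface, by document count
top60_ratio = top60_tb / surface_tb      # sixty largest sites vs. whole surface Web

print(f"by volume:     {size_ratio:.1f}x")
print(f"by documents:  {doc_ratio:.0f}x")
print(f"top 60 sites:  {top60_ratio:.1f}x")
```

By data volume the ratio comes out just under 400, and by document count it is 550, which matches the quoted range of “400 to 550 times”. The sixty largest sites come to roughly 39.5 times the surface Web, which the study rounds to forty.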
2001 AlltheWeb, a public search engine, was launched (AlltheWeb is now owned by Yahoo.com). It was a redesign of Fast (1999-05 to 2001). Fast Search & Transfer is a Microsoft subsidiary.
2000 Shestakov (2008) cites Bergman (2001) as the source for the claim that the term deep Web was coined in 2000. Bergman distinguished the Surface Web from the Deep Web using the metaphor of Surface and Deep water fishing or trawling. Deep Web is preferred over the term Invisible Web.
2000 UC-Berkeley biologist Michael Eisen, Nobel Laureate Harold Varmus and Stanford biochemist Patrick Brown helped start the Public Library of Science. PLoS is a “nonprofit organization of scientists and physicians committed to making the world’s scientific and medical literature a freely available public resource” by encouraging scientists to insist on open-access publishing models rather than being forced to sign over their (often publicly funded) research to expensive scientific journals. Wright (2004) cited Eisen, Varmus and Brown as examples of scientists who are making some areas of the Deep Web more accessible to the public.
2001 Raghavan and Garcia-Molina (2001) “presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from the query interfaces to query a Web form and crawl the deep Web resources (Choi 2008-01-07).”
2002-02 StumbleUpon began to use human crawlers or human-based computation techniques to uncover data on the Deep Web. Human crawlers can find relevant links that algorithmic crawlers miss (Choi 2008-01-07).
2002-12 There were c. 130,000 Deep Web sites (He, Patel, Zhang and Chang 2007, Shestakov 2008).
2003-06-01 Dorner and Curtis (2003-06-01) conducted a survey (data collected from 2002-12 through 2003-04) of librarians in New Zealand to compare common user interface software products supplied by these vendors: Endeavour, ExLibris, Follet, Fretwell-Downing, Innovative Interfaces, MuseGlobal, OCLC, SIRSI, WebFeat and VTLS. MuseSearch, ENCompass, MetaLib, SingleSearch and WebFeat received the highest scores in 2003 (Dorner and Curtis 2003-06-01:2). SingleSearch was noted as having an added cost advantage for libraries since it was open access and open source (Dorner and Curtis 2003-06-01:2). In 2002-2003, successful common user interface software was expected to support formats and protocols other than Z39.50, such as OpenURL, HTTP, SQL, XML, MARC, CrossRef, DOI, EAD, Dublin Core and Telnet (Dorner and Curtis 2003-06-01:8).
2004-04 There were c. 310,000 Deep Web sites (He, Patel, Zhang and Chang 2007, Shestakov 2008).
2004 Between 2000 and 2004 the Deep Web increased in size by 3-7 times (He, Patel, Zhang and Chang 2007, Shestakov 2008).
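The “3-7 times” growth figure is consistent with the site counts given in this timeline for 2000-03 (c. 43,000–96,000) and 2004-04 (c. 310,000). A minimal Python check, using only those rounded counts and illustrative variable names:

```python
# Site counts as given in the timeline entries for 2000-03 and 2004-04.
sites_2000_low, sites_2000_high = 43_000, 96_000
sites_2004 = 310_000

# The smallest growth factor uses the high 2000 estimate; the largest uses the low one.
growth_min = sites_2004 / sites_2000_high
growth_max = sites_2004 / sites_2000_low

print(f"growth 2000 to 2004: {growth_min:.1f}x to {growth_max:.1f}x")
```

The result is roughly 3.2 to 7.2, in line with the cited range.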
2004-03-02 Yahoo announced its Content Acquisition Program, in which users paid for enhanced search coverage by “unlocking” the deep Web (Wright 2004).
2005 Yahoo released Yahoo! Subscriptions which searched a few of the Deep Web’s subscription-only web sites.
2005 Ntoulas et al. (2005) “created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms. Their crawler generated promising results, but the problem is far from being solved. Since a large amount of useful data and information resides in the deep Web, search engines have begun exploring alternative methods to crawl the deep Web (Choi 2008-01-07).”
The search engine Pipl’s crawlers can identify, interact with, and retrieve some information from the deep Web.
Deep Web “search engines like CloserLookSearch and Northern Light create specialty engines by topic to search the deep Web. Because these engines are narrow in their data focus, they are built to access specified deep Web content by topic. These engines can search dynamic or password protected databases that are otherwise closed to search engines (Choi 2008-01-07).”
Google’s “Sitemap and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web (Choi 2008-01-07).”
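To make the “advertise the URLs” idea concrete, here is a minimal Python sketch of how a crawler might read a Sitemap file. It assumes the standard sitemaps.org XML format; the example document, its example.com URLs, and the function name `list_advertised_urls` are hypothetical, and a real crawler would fetch the file over HTTP and handle sitemap index files as well.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap advertising database-backed pages that no surface page links to.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
    <lastmod>2008-01-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73</loc>
  </url>
</urlset>
"""


def list_advertised_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in list_advertised_urls(EXAMPLE_SITEMAP):
    print(url)
```

The point of the sketch is that the server, not the crawler, supplies the list of URLs, so pages reachable only through a query form can still be discovered.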
2007-06 WorldWideScience was created to provide access to the Deep Web. When it began it linked to 12 databases from 10 countries. It is a “science portal developed and maintained by the Office of Scientific and Technical Information (OSTI), an element of the Office of Science within the U.S. Department of Energy. The WorldWideScience Alliance, a partnership consisting of participating member countries provides the governance structure for the WorldWideScience.org portal (RWW).”
2007-07-27 “Indiana University faculty member Javed Mustafa appeared on National Public Radio’s Science Friday, and drawing on information in a published study from University of California, Berkeley entitled “How much information is there?”, estimated that the deep web consists of about 91,000 terabytes. By contrast, the surface web, which is easily reached by search engines, is only about 167 terabytes. The Library of Congress contains about 11 terabytes, for comparison. Mustafa noted that these numbers were a bit dated and were just rough estimates (Choi 2008-01-07).”
2008-05-14 ReadWriteWeb contributor Sarah Perez listed a number of “Digital Image Resources on the Deep Web.”
2008-06 The WorldWideScience portal to the Deep Web linked to 32 national scientific databases and portals from 44 different countries (RWW).
2008 Several “Deep Web directories are under development such as OAIster by the University of Michigan, INFOMINE at the University of California at Riverside and DirectSearch by Gary Price to name a few (Choi 2008-01-07).”
2008-09-22 Infovell launched its research engine for the Deep Web. “Available initially on a subscription basis, Infovell gives users access to hard to find, in-depth, expert information spanning Life Sciences, Medicines, Patents, and other reference categories with more to be added over time.” “Infovell’s research engine will be available beginning September 22 as a premium service for individual researchers and corporations who are seeking more affordable access to expert information. The Company is offering a risk-free trial through its website http://www.infovell.com. Later this year, Infovell will be beta-releasing a free version of its research engine on a limited basis for those individuals who want to search the Deep Web but don’t have the need for some of the advanced features available in the premium version.”
2009 United States “Congressional Representative John Conyers (D-MI) re-introduced a bill (HR801) that essentially would negate the National Institutes of Health (NIH) policy concerning depositing research in Open Access (OA) repositories. The bill goes further than prohibiting open access requirements, however, as the bill also prohibits government agencies from obtaining a license to publicly distribute, perform, or display such work by, for example, placing it on the Internet, and would repeal the longstanding ‘federal purpose’ doctrine, under which all federal agencies that fund the creation of a copyrighted work reserve the ‘royalty-free, nonexclusive right to reproduce, publish, or otherwise use the work’ for any federal purpose. The National Institutes of Health require NIH-funded research to be published in open-access repositories (Doctorow 2009).” HR801 would benefit for-profit science publishers and make it harder to open up the Deep Web. See Doctorow, Cory. 2009-02-16. “Scientific publishers get a law introduced to end free publication of govt-funded research.” Boing Boing.