Web wide crawl with initial seed list and crawler configuration from March 2011. This crawl uses the new HQ software for distributed crawling by Kenji Nagahashi.
What's in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
The seed list for this crawl was Alexa's top 1 million web sites, retrieved close to the crawl start date. We used the Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
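Heritrix applies robots.txt rules internally, but the same check can be illustrated with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not Heritrix's implementation; the user-agent string and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied directly as lines instead of
# being fetched over the network.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A polite crawler skips any URL the rules disallow.
print(rp.can_fetch("example-bot", "http://example.com/page.html"))  # True
print(rp.can_fetch("example-bot", "http://example.com/private/x"))  # False
```

Any URL for which `can_fetch` returns `False` is simply never queued for download.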
However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added to queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed.
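The queue-overflow issue described above can be pictured with a toy URL frontier that enforces a hard budget. This is a deliberately simplified sketch, not the actual HQ or Heritrix data structure; the class and names are hypothetical.

```python
from collections import deque

class BoundedFrontier:
    """Toy URL frontier: once the queue of discovered URLs reaches the
    crawl budget, later links are dropped and never fetched."""

    def __init__(self, budget):
        self.budget = budget   # max URLs we are willing to hold
        self.queue = deque()
        self.seen = set()
        self.dropped = 0       # embedded/linked objects we never got to

    def add(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        if len(self.queue) < self.budget:
            self.queue.append(url)
        else:
            self.dropped += 1  # queue already at the intended crawl size

    def next_url(self):
        return self.queue.popleft() if self.queue else None

frontier = BoundedFrontier(budget=2)
for u in ["http://a.example/", "http://b.example/", "http://c.example/img.png"]:
    frontier.add(u)
print(frontier.dropped)  # 1: the third URL was discovered but never queued
```

In the real crawl the effect was the same in spirit: resource URLs discovered late landed in queues that had already outgrown the planned crawl size, so some embedded objects were never captured.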
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available "warts and all" for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you're hoping to do with it. We may not be able to say "yes" to all requests, since we're just figuring out whether this is a good idea, but everyone will be considered.