Distributed Web Crawling over DHTs
Authors:
Loo, Boon Thau
Cooper, Owen
Krishnamurthy, Sailesh
Technical Report Identifier: CSD-04-1305
2004
CSD-04-1305.pdf
Abstract: In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs among communication overhead, crawl throughput, load balance across both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.
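
To make the idea of using a DHT to distribute crawl work concrete, the following is a minimal sketch of how URLs might be assigned to crawler nodes by hashing a partitioning key onto an identifier ring. It is an illustration only: it does not use the PIER or Bamboo APIs, and the names `Ring`, `ring_id`, and `assign`, as well as the two partitioning keys shown (hostname and full URL), are assumptions introduced here rather than the strategies evaluated in the report.

```python
import hashlib
from bisect import bisect_left
from urllib.parse import urlsplit


def ring_id(key: str) -> int:
    """Map a string to a 160-bit identifier using SHA-1."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical stand-in for a DHT: a sorted ring of node identifiers."""

    def __init__(self, node_names):
        # Each node is placed on the ring at the hash of its name.
        self._points = sorted((ring_id(name), name) for name in node_names)
        self._ids = [point[0] for point in self._points]

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        index = bisect_left(self._ids, ring_id(key)) % len(self._ids)
        return self._points[index][1]


def assign(url: str, ring: Ring, scheme: str = "host") -> str:
    """Pick the crawler node responsible for a URL.

    scheme="host": all URLs of one site go to the same node.
    scheme="url":  each URL is placed independently.
    """
    if scheme == "host":
        key = urlsplit(url).hostname or url
    elif scheme == "url":
        key = url
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return ring.owner(key)


if __name__ == "__main__":
    ring = Ring(["crawler-a", "crawler-b", "crawler-c", "crawler-d"])
    for u in ("http://example.org/index.html", "http://example.org/about.html"):
        print(u, assign(u, ring, "host"), assign(u, ring, "url"))
```

In this sketch, partitioning by hostname sends every page of a site to a single crawler, which makes it easy to limit the request rate against that site but can concentrate work on whichever node owns a large site. Partitioning by full URL spreads pages more evenly across crawlers at the cost of several nodes contacting the same target. This is one way to picture the tension between balancing load on crawlers and on crawl targets that the abstract mentions.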