Virtual machines have been used as honeypots for detecting unknown attacks by several researchers [4,16,17,25,26]. Although honeypots have traditionally been used mostly for detecting attacks against servers, the same principles also apply to client honeypots (e.g., an instrumented browser running on a virtual machine). For example, Moshchuk et al. used client-side techniques to study spyware on the web (by crawling 18 million URLs in May 2005). Their primary focus was not on detecting drive-by downloads, but on finding links to executables labeled as spyware by an adware scanner. Additionally, they sampled URLs for drive-by downloads and showed a decrease over time. However, the fundamental limitation of analyzing the malicious nature of URLs discovered by ``spidering'' is that a crawl can only follow content links, whereas the malicious nature of a page is often determined by the web hosting infrastructure. As such, while the study of Moshchuk et al. provides valuable insights, a truly comprehensive analysis of this problem requires a much more in-depth crawl of the web. As we were able to analyze many billions of URLs, we believe our findings are more representative of the state of the overall problem.
More closely related is the work of Provos et al. and Seifert et al., which raised awareness of the threat posed by drive-by downloads. These works explain how different web page components are used to exploit web browsers and provide an overview of the different exploitation techniques in use today. Wang et al. proposed an approach for detecting exploits against Windows XP when visiting webpages in Internet Explorer. Their approach is capable of detecting zero-day exploits against Windows and can determine which vulnerability is being exploited by exposing Windows systems with different patch levels to dangerous URLs. Their results, on roughly URLs, showed that about of these were dangerous to users.
This paper differs from all of these works in that it offers a far more comprehensive analysis of the different aspects of the problem posed by web-based malware, including an examination of its prevalence, the structure of the distribution networks, and the major driving forces.
Lastly, malware detection via dynamic taint analysis may provide deeper insight into the mechanisms by which malware installs itself and how it operates [10,15,27]. In this work, we are more interested in the structural properties of the distribution sites themselves and in how malware behaves once it has been implanted. Because taint analysis is computationally expensive, we do not employ it; instead, we simply collect the changes made by the malware, which does not require the ability to trace information flow in detail.

Niels Provos 2008-05-13