Reinforcement Learning Based Incremental Web Crawling

Vatsal Agarwal, Innoplexus AG

Abstract: 

Current crawling engines face a challenge in keeping their data up to date. They must keep checking every webpage, forum thread, social media handle, blog, and news source for updates. Some webpages update every few minutes, while others go unchanged for months. We present an evolutionary learning framework for identifying incremental changes on crawled webpages in a prioritized order, doing away with the "tabula rasa" view of learning. Our model learns heuristics based on features of the webpage and the frequency of its updates. It generalizes from past data and creates a prioritization threshold.
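
The abstract does not specify the model, so the following is only a minimal sketch of the general idea, under stated assumptions: each page keeps a running estimate of how likely it is to have changed, the estimate is nudged toward a reward of 1.0 or 0.0 after every revisit, and pages whose estimate meets a threshold are queued for recrawling. The names `PageStats`, `update`, and `prioritize`, the learning rate, the threshold value, and the example URLs are all hypothetical, and the sketch uses only update frequency, not the webpage features the abstract also mentions.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Running estimate of how likely a page is to have changed (hypothetical structure)."""
    change_estimate: float = 0.5  # neutral prior before any observation (assumption)
    visits: int = 0


def update(stats: PageStats, changed: bool, learning_rate: float = 0.3) -> None:
    """Move the estimate toward the observed reward: 1.0 if the page changed, else 0.0."""
    reward = 1.0 if changed else 0.0
    stats.change_estimate += learning_rate * (reward - stats.change_estimate)
    stats.visits += 1


def prioritize(pages: dict[str, PageStats], threshold: float) -> list[str]:
    """Return URLs whose estimate meets the threshold, most change-prone first."""
    selected = [url for url, s in pages.items() if s.change_estimate >= threshold]
    return sorted(selected, key=lambda url: pages[url].change_estimate, reverse=True)


# Illustrative crawl history (made up): True means the page had changed on revisit.
history = {
    "https://example.com/news": [True, True, True, False, True],
    "https://example.com/about": [False, False, False, False, False],
    "https://example.com/forum": [True, False, True, False, False],
}

pages = {url: PageStats() for url in history}
for url, observations in history.items():
    for changed in observations:
        update(pages[url], changed)

# Pages below the threshold are skipped in this crawl cycle.
queue = prioritize(pages, threshold=0.25)
print(queue)
```

With this made-up history, the frequently changing news page ranks first, the forum page second, and the static about page falls below the threshold and is skipped. A fixed threshold is used here for simplicity; the abstract suggests the threshold itself is derived from past data.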

Vatsal Agarwal, Innoplexus AG

Vatsal leads artificial intelligence at Innoplexus AG, building cutting-edge technology for the pharmaceutical and life sciences industries. He works on the life sciences language-processing engine and the domain-wide ontology used in a variety of Innoplexus products & solutions.

Vatsal has more than a decade of experience in data science, software development, and bioinformatics. His primary focus is to bring advancements in artificial intelligence and big data to life sciences and help patients get faster, more efficient treatments. Vatsal has filed over 40 patent applications and written several peer-reviewed publications on machine learning and bioinformatics.

BibTeX
@conference {233775,
author = {Vatsal Agarwal},
title = {Reinforcement Learning Based Incremental Web Crawling},
year = {2019},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = may
}