Estimating the Amount of Script-generated Traffic in a Mixture

Cormac Herley, Microsoft Research

We address the problem of estimating what fraction of the traffic in a mixture is bot-generated. That is, we seek to estimate α when what we receive is α · Clean + (1-α) · Bot, so that α is the clean fraction and 1-α is the bot-generated fraction. This is primarily of interest when bot traffic attempts to masquerade as human-generated (e.g., click fraud or inauthentic social media engagement).

When at least one pair of features is independent in the clean traffic (e.g., the geographic distribution does not vary with time), we show that obtaining an upper bound on α is equivalent to finding the rank-one matrix that maximizes a simple objective function. We give an efficient method for solving this problem and derive the tightness of the bound. When data is limited, error analysis is essential, since the sampled version of a rank-one matrix will not be precisely rank-one. We derive confidence intervals for our estimates that allow us to be confident that we have found a true upper bound.
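As a rough illustration of the rank-one structure (a sketch under stated assumptions, not the method of the paper): if two features are independent in the clean traffic, their joint distribution is the outer product of the two marginals and therefore has rank one. For any candidate rank-one distribution R, the largest a for which Mixed - a · R stays elementwise non-negative is the minimum of Mixed[i][j] / R[i][j]. When R is the true clean distribution, this value is at least α, because Mixed - α · Clean = (1-α) · Bot is non-negative. The helper names below are hypothetical, and the true clean distribution is used as an oracle candidate purely for demonstration; in practice it is unknown and one would have to search over candidates.

```python
import random

def random_distribution(n, rng):
    """Return n strictly positive weights that sum to 1."""
    weights = [rng.random() + 0.05 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def outer(u, v):
    """Joint distribution of two independent features (rank one)."""
    return [[ui * vj for vj in v] for ui in u]

def mixture(alpha, clean, bot):
    """Elementwise alpha * clean + (1 - alpha) * bot."""
    return [[alpha * c + (1 - alpha) * b for c, b in zip(c_row, b_row)]
            for c_row, b_row in zip(clean, bot)]

def max_feasible_alpha(mixed, candidate):
    """Largest a such that mixed - a * candidate >= 0 elementwise."""
    return min(m / r
               for m_row, r_row in zip(mixed, candidate)
               for m, r in zip(m_row, r_row)
               if r > 0)

rng = random.Random(0)
rows, cols = 4, 5

# Clean traffic: the two features are independent, so the joint is rank one.
clean = outer(random_distribution(rows, rng), random_distribution(cols, rng))

# Bot traffic: an arbitrary joint distribution (generically full rank).
flat = random_distribution(rows * cols, rng)
bot = [flat[i * cols:(i + 1) * cols] for i in range(rows)]

true_alpha = 0.3
mixed = mixture(true_alpha, clean, bot)

# Oracle candidate (assumption for illustration): the true clean distribution.
bound = max_feasible_alpha(mixed, clean)
print(f"true alpha = {true_alpha}, upper bound = {bound:.4f}")
```

In this sketch the gap between the bound and α equals (1-α) times the smallest ratio Bot[i][j] / Clean[i][j], so the bound is tight exactly when the bot distribution puts no mass on some cell that the clean distribution occupies.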

We validate our findings empirically. First, using random rank-one and full-rank matrices for the clean and bot distributions respectively, we verify accuracy with Monte Carlo simulations. Second, we examine Twitter (now X) data. Twitter accounts with large followings that were offered for sale on an open marketplace are flagged as having >90% bot followers, while accounts for several academic conferences and well-known researchers are flagged at <20%. We also verify accuracy on Twitter account populations of arbitrary clean/bot composition.
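The sampling issue that motivates the error analysis can be seen in a small simulation. The sketch below is illustrative only: the marginals, sample size, and helper names are assumptions, not values from the experiments. It draws a finite sample from an exactly rank-one joint distribution and compares the largest 2×2 minor of the exact and empirical matrices. A matrix has rank one only if every 2×2 minor vanishes, so a nonzero minor shows that the empirical matrix is no longer rank-one.

```python
import random

def outer(u, v):
    """Joint distribution of two independent features (rank one)."""
    return [[ui * vj for vj in v] for ui in u]

def max_abs_minor(mat):
    """Largest absolute 2x2 minor; zero (up to rounding) for rank one."""
    rows, cols = len(mat), len(mat[0])
    best = 0.0
    for i in range(rows):
        for k in range(i + 1, rows):
            for j in range(cols):
                for l in range(j + 1, cols):
                    minor = mat[i][j] * mat[k][l] - mat[i][l] * mat[k][j]
                    best = max(best, abs(minor))
    return best

def empirical_joint(joint, n_samples, rng):
    """Draw n_samples cells from `joint` and return observed frequencies."""
    rows, cols = len(joint), len(joint[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    weights = [joint[i][j] for i, j in cells]
    counts = [[0] * cols for _ in range(rows)]
    for i, j in rng.choices(cells, weights=weights, k=n_samples):
        counts[i][j] += 1
    return [[c / n_samples for c in row] for row in counts]

rng = random.Random(1)

# Hypothetical marginals for two independent features.
u = [0.5, 0.3, 0.2]
v = [0.4, 0.3, 0.2, 0.1]

exact = outer(u, v)
sampled = empirical_joint(exact, 5000, rng)

exact_minor = max_abs_minor(exact)
sampled_minor = max_abs_minor(sampled)
print(f"largest 2x2 minor, exact:   {exact_minor:.2e}")
print(f"largest 2x2 minor, sampled: {sampled_minor:.2e}")
```

Repeating the draw with larger sample sizes should shrink the sampled minor toward zero, which is the kind of behaviour a Monte Carlo check can quantify.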