Check out the new USENIX Web site. next up previous
Next: Estimating the number of Up: Estimators based on flow Previous: Estimating packet counts


Estimating byte counts

Before discussing how to estimate byte count estimates in flow slices, we show why a simpler solution does not work. We could have the byte counter $ c_b$ in the flow entry just count the total number of bytes in the packets seen once the flow record is created. Just like with the packet counter, we need an additive correction to account for the packets missed before the creation of the entry. We can get an unbiased estimate for the number of packets missed, but not for their total size, because we do not know their sizes. We could assume that the packet sizes are uniform within the flow, but this would lead to systematic biases because they are not. As the proof of shows, storing the size of the sampled packet that led to the creation of the entry would solve the problem because using it to estimate the total number of bytes in the packets not counted does lead to an unbiased estimator. But this would require another entry in the flow record. Instead, we store this information in the byte counter itself by initializing $ c_b$ to $ b_{first}/p$ when the entry is created ($ b_{first}$ is the size in bytes of the sampled packet). Let $ b$ be the number of bytes of the flow at the input of the flow slicing algorithm.

$\displaystyle \widehat{b}=c_b$ (2)

Lemma 2   $ \widehat{b}$ as defined in is an unbiased estimator of $ b$.

Proof: By induction on the number of packets in the flow $ s$. Let $ b_i$ for $ i$ from $ 1$ to $ s$ be the sizes of the individual packets. By definition the number of bytes in the flow is $ b=\sum_{i=1}^sb_i$. For convenience of notation, we index the packet sizes in reverse order, so $ b_1$ will be the size of the last packet and $ b_s$ the size of the first one.

Base case If s=1, the only packet is sampled with probability $ p$ and in that case it is counted $ c_b=b_1/p=b/p$ bytes. With probability $ 1-p$, it is not sampled (and it counts as 0). Thus $ E[c_b]=p\cdot b/p+0=b$.

Inductive step By induction hypothesis, we know that if the first packet is not sampled we are left with the last $ s'=s-1$ packets and $ E[c_b]=b'=b-b_s$. If the first packet gets sampled, we count it as $ b_s/p$ and we count the rest exactly because the flow slice length $ t$ and the inactivity timeout $ t_{inactive}$ are larger than the bin size.

$\displaystyle E[c_b]$ $\displaystyle =$ $\displaystyle p\cdot(b_s/p+b')+(1-p)b'$  
  $\displaystyle =$ $\displaystyle b_s+pb'+(1-p)b'=b_s+b'=b$  

$ \blacksquare$

If we sample packets randomly with probability $ q$ before applying the flow slicing algorithm, we will want to estimate the number of bytes $ B$ at the input of the packet sampling stage. Since $ E[b]=qB$, it is easy to show that $ \widehat{B}=1/q\widehat{b}$ is an unbiased estimator for $ B$.


next up previous
Next: Estimating the number of Up: Estimators based on flow Previous: Estimating packet counts
Ramana Rao Kompella 2005-08-12