For our evaluation on DETER, we used PC3080 systems as the nodes for the Bro cluster. These are 3 GHz, dual-CPU Pentium 4-based Xeon systems with Hyper-Threading enabled, all on a single, high-bandwidth Nortel stack. For the traffic-generating systems, we used PC3000 systems, which are of comparable performance and located on the same Nortel stack.1
Our initial test (Table 1) used a 500x multiplication of the HTTP stream and a single sending node. Going from a single Bro instance (without cluster extensions) to a two-sensor cluster showed the expected per-node slowdown in processing. But we were initially puzzled by the super-linear speedup observed in the HTTP stress test as the number of nodes increased. We believed this was due to interference effects or state exhaustion.
If we take the single stream and multiply it by a factor of 100 before sending it to a single-processor Bro instance, Bro is able to handle 4900 pps before it starts dropping traffic. When we increase the factor to 500, Bro is only able to handle 3600 pps. This appears to be due to interference effects: the many concurrent HTTP analyzers cause state explosion, or incur processing that grows linearly with the number of outstanding streams.
We confirmed this hypothesis by creating a 500x multiplied stream that concatenated the original stream sequentially rather than interleaving the traffic. Bro was able to handle this concatenated stream at only 3600 pps as well. Thus the slowdown is probably the result of state exhaustion or linear processing, due to data structures being populated but not deleted when connections close, rather than cache effects, as cache effects would not cause a slowdown on the concatenated stream.
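The distinction between the two multiplication modes can be sketched as follows. This is a toy model (real tools rewrite actual pcap traces, and the function and field names here are illustrative, not from our tooling): a trace is a list of (timestamp, connection ID, payload) tuples, interleaving replays all copies over the same time window, and concatenation shifts each copy in time so copies play back sequentially.

```python
# Toy model of trace multiplication (illustrative only; real traffic
# generators rewrite pcap files, but the timing structure is the same).
# A "trace" is a list of (timestamp, connection_id, payload) tuples.

def interleave(trace, factor):
    """All copies keep the original timestamps, so `factor` copies of
    every connection are outstanding simultaneously (stresses
    per-connection state)."""
    return sorted(
        (ts, (copy, conn), payload)
        for copy in range(factor)
        for ts, conn, payload in trace
    )

def concatenate(trace, factor):
    """Each copy is shifted past the end of the previous one, so at most
    one copy's connections are live at any moment."""
    duration = max(ts for ts, _, _ in trace) + 1
    return [
        (ts + copy * duration, (copy, conn), payload)
        for copy in range(factor)
        for ts, conn, payload in trace
    ]

trace = [(0.0, 1, "GET /"), (0.5, 1, "200 OK")]
inter = interleave(trace, 3)    # all 3 copies overlap in time
concat = concatenate(trace, 3)  # copies occupy disjoint time windows
```

That the concatenated stream was also limited to 3600 pps is what rules out concurrency-driven cache effects: only state that persists after connections close is common to both replay modes.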
To further narrow down the cause, we generated a detailed profile on a single (no multiplication) stream and a 100x multiplied stream. In this profile, almost all the additional time is spent iterating over data structures. Thus it is clear that some non-constant-time data structure usage causes the slowdown at larger multiplication factors.
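The class of pathology the profile points at can be illustrated with a toy comparison (this is not Bro's actual code): if per-packet processing iterates over all tracked connections, the cost of each packet grows linearly with the amount of outstanding state, whereas a hash lookup keeps it constant.

```python
# Illustration of the pathology the profile suggests (hypothetical
# functions, not Bro internals): per-packet work that scans all tracked
# connections grows with outstanding state; a hash lookup does not.

def process_packets_scan(packets, tracked):
    """O(len(tracked)) work per packet: a linear scan of live state."""
    steps = 0
    for pkt in packets:
        for conn in tracked:      # iteration over the data structure
            steps += 1
            if conn == pkt:
                break
    return steps

def process_packets_hash(packets, tracked):
    """O(1) expected work per packet: a hash-table lookup."""
    table = set(tracked)
    steps = 0
    for pkt in packets:
        _found = pkt in table     # constant-time expected lookup
        steps += 1
    return steps

packets = list(range(1000))
few, many = list(range(100)), list(range(1000))
# Scan cost blows up as tracked state grows; hash cost stays flat.
```

Under this model, a 500x interleaved stream with 5x the live connections of a 100x stream would spend roughly 5x more time per packet in the scan, matching the observed drop from 4900 to 3600 pps in direction if not magnitude.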
To further examine the superlinear effects, we created a second test case in which the duplication factor was scaled linearly with the number of nodes, up to 8 nodes, and the interleaving factor was increased.2 Additionally, we shifted from one sending node to four sending nodes to increase the sending bandwidth.
When the workload per node is scaled up with the number of nodes, the superlinear speedup evaporates and a minor sublinear speedup is observed instead, as the cluster no longer benefits from having smaller per-node data structures. When the workload is not scaled with cluster size, performance again grows superlinearly.
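The arithmetic behind the two regimes is simple enough to state explicitly (the stream counts here are hypothetical, chosen only to illustrate the shape of the effect):

```python
# Illustrative arithmetic with hypothetical numbers: with a fixed total
# workload, per-node state shrinks as nodes are added, so if per-packet
# cost grows with outstanding state, throughput rises superlinearly.
# Scaling the workload with the node count removes that benefit.

def per_node_streams(base_streams, nodes, scale_with_nodes):
    total = base_streams * nodes if scale_with_nodes else base_streams
    return total / nodes

# Fixed total workload: per-node state shrinks with cluster size.
fixed = [per_node_streams(500, n, False) for n in (1, 2, 4, 8)]
# Workload scaled with nodes: per-node state stays constant, so the
# state-size benefit (and the superlinear speedup) disappears.
scaled = [per_node_streams(500, n, True) for n in (1, 2, 4, 8)]
```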
Up to the 31-sensor case, the critical bottleneck was probably in the sensors themselves, as they would drop packets and therefore produce an incorrect analysis. Thus for small and moderate clusters, we appear to have good scalability. Even in the 24-sensor case, the cluster worked reliably up until the sensors started dropping packets.
The 31-sensor case, however, had all sensors reporting zero drops, yet the manager's alarm log did not include all the alarms. This suggests that we have hit a scalability limit in communication or aggregation, either on the manager or the proxy, when using this many sensor nodes. Additionally, the 31-sensor case is unreliable: sometimes it successfully logs all alarms at a much higher data rate, and sometimes it drops alarms. We reported the data rate at which the cluster would reliably log all alarms. We plan to investigate this further.