Basic privacy protection

Scrubbing sensitive fields. Before an alert is sent to the repository, the producer must remove all sensitive information not needed for the collaborative analyses described in section 7, including all content in the Captured_Data, Infected_File, and Outcome fields. A more advanced version of our system may enable privacy-preserving analysis based on commonalities in the Captured_Data field, e.g., the presence of "bad words" associated with a particular virus. Possible techniques include encryption with keyword-specific trapdoors in the manner of [29,5].

The Sensor_Id field may be either re-mapped to a unique persistent pseudonym (e.g., a randomly generated string) that leaks no information about the organization that owns it, or replaced with just the make and model information. The Timestamp field is rounded to the nearest minute. Although this disables fine-grained propagation analyses, it adds uncertainty for attackers staging probe-response attacks.
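The field-level steps above can be illustrated with a minimal sketch. It assumes an alert is modelled as a plain dictionary keyed by the field names used in the text; the helper names (`scrub`, `sensor_pseudonym`, `coarsen_timestamp`) and the in-memory pseudonym table are hypothetical, and the timestamp is rounded to the nearest minute, which agrees with the examples in tables 2 through 4.

```python
import secrets
from datetime import datetime, timedelta

# Fields whose content is removed entirely (names taken from the text).
SCRUBBED_FIELDS = ("Captured_Data", "Infected_File", "Outcome")

# Hypothetical private table: real sensor id -> persistent random pseudonym.
_pseudonyms = {}

def sensor_pseudonym(sensor_id: str) -> str:
    """Return the same random pseudonym every time a given sensor is seen."""
    if sensor_id not in _pseudonyms:
        _pseudonyms[sensor_id] = secrets.token_hex(8)
    return _pseudonyms[sensor_id]

def coarsen_timestamp(ts: datetime) -> datetime:
    """Round to the nearest minute (add 30 s, then drop the seconds)."""
    return (ts + timedelta(seconds=30)).replace(second=0, microsecond=0)

def scrub(alert: dict) -> dict:
    """Return a sanitized copy of the alert; the input is left untouched."""
    clean = dict(alert)
    for field in SCRUBBED_FIELDS:
        clean[field] = "none"
    clean["Sensor_Id"] = sensor_pseudonym(alert["Sensor_Id"])
    clean["Timestamp"] = coarsen_timestamp(alert["Timestamp"])
    return clean
```

Replacing the sensor identifier with its make and model, the other option mentioned above, would need a vendor-specific parsing rule and is not shown.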

Hiding IP addresses. Suppose the attacker controls the repository. He may launch an attack and then attempt to use the alert generated by the victim's sensor to analyze the attack's propagation through the victim's internal network. Therefore, the producer must hide both Source_IP and Dest_IP addresses before releasing the alert to the repository.

Encrypting IP addresses under a key known only to the producer is unacceptable, as it hides too much information. With a semantically secure encryption scheme, encrypting the same IP address twice will produce different ciphertexts, disabling collaborative analysis. Hashing the address using a standard, universally computable hash function such as SHA-1 or MD5, on the other hand, enables dictionary attacks. If the attacker controls the repository, he can target a system on a particular subnet and pre-compute hash values of all possible IP addresses at which sensors may be located or to which he expects the attack to propagate. This is feasible since the address space in question is relatively small: either 256 or 65,536 addresses (potentially even fewer if the attacker can make an educated guess). The attacker verifies his guesses by checking whether the received alert contains any of the pre-computed values.
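To make the size of this dictionary concrete, the following sketch pre-computes the plain hash of every address in a subnet and then inverts an observed value. The choice of SHA-1 over the packed 4-byte address is an assumption made for illustration, as are the function names and the example subnet.

```python
import hashlib
import ipaddress

def plain_hash(ip: str) -> str:
    """Unkeyed SHA-1 of the packed address (assumed encoding)."""
    return hashlib.sha1(ipaddress.ip_address(ip).packed).hexdigest()

def build_dictionary(subnet: str) -> dict:
    """Map hash -> address for every address in the subnet."""
    return {plain_hash(str(ip)): str(ip) for ip in ipaddress.ip_network(subnet)}

# A /24 yields 256 candidates; a /16 would yield 65,536.
table = build_dictionary("172.16.30.0/24")

# A hash value as it might appear in a published alert.
observed = plain_hash("172.16.30.2")
recovered = table.get(observed)
print(len(table), recovered)
```

Because the function is universally computable, the table can be built by anyone, which is exactly why an unkeyed hash gives no protection for addresses drawn from a small, guessable range.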

Our solution strikes a balance between privacy and utility. The producer hashes all IP addresses that belong to his own network using a keyed hash function such as HMAC [3,4] with his secret key. All IP addresses that belong to external networks are hashed using a standard hash function such as SHA-1 [23]. This guarantees privacy for IP addresses on the producer's own network since the attacker cannot verify his guesses without knowing the producer's key. In particular, probe-response fails to yield any useful information. Of course, if these addresses appear in alerts generated by other organizations, then no privacy can be guaranteed.
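A minimal sketch of this hybrid scheme follows, assuming HMAC-SHA1 for the keyed case and SHA-1 for the plain case, both applied to the packed address. The function name `hash_ip`, the `own_networks` parameter, and the input encoding are illustrative assumptions rather than a specification.

```python
import hashlib
import hmac
import ipaddress

def hash_ip(ip: str, own_networks, secret_key: bytes) -> str:
    """Keyed hash for the producer's own addresses, plain hash otherwise."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in own_networks):
        # Internal address: guesses cannot be verified without secret_key.
        digest = hmac.new(secret_key, addr.packed, hashlib.sha1).digest()
    else:
        # External address: every producer computes the same value.
        digest = hashlib.sha1(addr.packed).digest()
    return digest.hex()

# Example: organization A owns 172.16.0.0/16 (hypothetical assignment).
own = [ipaddress.ip_network("172.16.0.0/16")]
internal = hash_ip("172.16.30.2", own, b"producer-A-secret")
external = hash_ip("173.19.33.1", own, b"producer-A-secret")
```

Both branches emit a 20-byte digest, so the output format alone does not reveal which function was applied.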

We pay a price in decreased functionality since alerts about events on the network of organization A that have been generated by A's sensors cannot be compared with the alerts about the same events generated by organization B's sensors. Recall, however, that we are interested in detecting large-scale events. If A is under heavy attack, chances are that it will be detected not only by A's and B's sensors, but also by sensors of C, D, and so on. Because A's network is external to B, C, and D, their alerts will have A's IP addresses hashed using the same standard hash function. This will produce the same value for every occurrence of the same IP address, enabling matching and counting of hash values corresponding to frequently occurring addresses. Intuitively, any subset of participants can match and compare their observations of events happening in someone else's network. The cost of increased privacy is decreased utility because hashing destroys topological information, as discussed in section 7.2. Naturally, an organization can always analyze alerts referring to its own network, since they are all hashed under the organization's own key.
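The matching-and-counting step can be sketched from the repository's point of view: it sees only opaque hash values and counts how many distinct organizations reported each one. The report format (organization, hashed address), the function names, and the example addresses are assumptions made for illustration.

```python
import hashlib
import ipaddress
from collections import Counter

def plain_hash(ip: str) -> str:
    """Unkeyed SHA-1 of the packed address (assumed encoding)."""
    return hashlib.sha1(ipaddress.ip_address(ip).packed).hexdigest()

def frequent_targets(reports, min_reporters):
    """Return hash values reported by at least min_reporters organizations.

    reports: iterable of (organization, hashed address) pairs.
    """
    distinct = set(reports)                      # one vote per organization
    counts = Counter(h for _, h in distinct)
    return {h: n for h, n in counts.items() if n >= min_reporters}

# B, C, and D each observe an attack on the same address in A's network.
victim = plain_hash("192.0.2.10")
reports = [
    ("B", victim), ("C", victim), ("D", victim),
    ("B", victim),                               # duplicate report from B
    ("B", plain_hash("192.0.2.77")),             # unrelated single report
]
hot = frequent_targets(reports, min_reporters=3)
```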

An additional benefit of using keyed hashes for alerts about the organization's own events and plain hashes for other organizations' events is that the attacker cannot feasibly determine which of the two functions was used. Even if the attacker controls the repository and directly receives A's alerts, he cannot tell whether an alert refers to an event in A's or someone else's network. The attacker may still attempt to verify his guesses by pre-computing hashes of expected IP addresses and checking alerts submitted by other organizations, but with hundreds of thousands of alerts per hour and thousands of possible addresses this task is exceedingly hard. Staging a targeted probe-response attack is also more difficult: the probe may never be detected by another organization's sensors, which means that the response is never computed using the plain hash, and the attacker cannot stage a dictionary attack at all. Finally, note that keyed hashes do not require PKI or complicated key management since keys are never exchanged between sites.

Re-keying by the repository. To provide additional protection against a casual observer or an outside attacker when an alert is published, the repository may replace all (hashed) IP addresses with their keyed hashes, using the repository's own private key. This is done on top of hashing by the alert producer, and preserves the ability to compare and match IP addresses for equality, since all second-level hashes use the same key. This additional keyed hashing by the repository defeats all probe-response and dictionary attacks except when the attacker controls the repository itself and all of its keys, in which case we fall back on protection provided by the producer's keyed hashing.
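As a sketch of this second layer, the repository can apply a keyed hash to each already-hashed value before publication. HMAC-SHA1 and the hex encoding of the producer's output are assumptions here, and `rekey` is a hypothetical helper name.

```python
import hashlib
import hmac

def rekey(hashed_ip_hex: str, repository_key: bytes) -> str:
    """Apply the repository's keyed hash on top of a producer's hash."""
    return hmac.new(repository_key, bytes.fromhex(hashed_ip_hex),
                    hashlib.sha1).hexdigest()

# Two producers reporting the same (hashed) address still match after
# re-keying, because the repository uses one key for every alert.
repo_key = b"repository-secret"
from_b = hashlib.sha1(b"\xc0\x00\x02\x0a").hexdigest()   # packed 192.0.2.10
from_c = hashlib.sha1(b"\xc0\x00\x02\x0a").hexdigest()
other = hashlib.sha1(b"\xc0\x00\x02\x0b").hexdigest()    # packed 192.0.2.11

published_b = rekey(from_b, repo_key)
published_c = rekey(from_c, repo_key)
published_other = rekey(other, repo_key)
```

An outside observer who pre-computes plain hashes of candidate addresses can no longer find them in the published data without the repository's key.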

Randomized hot list thresholds. For collaborative detection of high-volume events, it is sufficient for the repository to publish only the hot list of reported alerts that have something in common (e.g., source IP address, port/protocol combination, event id) and whose number exceeds a certain threshold. As described in section 4, this may be vulnerable to a flooding attack, in which the attacker launches a probe and then attempts to force the repository to publish the targeted system's response, if any, by flooding it with "matching" fake alerts based on his guesses of what the real alert looks like.

Our solution is to introduce a slight random variation in the threshold value. For example, if the threshold is $T$, the repository chooses a random value $T'$ between $T_{\min}$ and $T_{\max}$, a narrow range around $T$, and, if $T'$ is exceeded, publishes only $T'$ alerts. If the attacker submits $n$ fake alerts and a hot list of $T'$ alerts is published, the attacker doesn't know if the repository received $n+1$ alerts, including a matching alert from the victim. There is a small risk that some alerts will be lost if their number is too small to trigger publication, but such alerts are not useful for detecting high-volume events.

Delayed alert publication. If the alert data is used only for research on historical trends (see section 7.1), delayed alert publication provides a feasible defense against probe-response attacks. The repository simply publishes the data several weeks or months later, without Timestamp fields. The attacker would not be able to use this data to correlate his probes with the victim's responses.
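A minimal sketch of delayed publication is given below, assuming alerts are dictionaries with a `Timestamp` field holding a `datetime`. The eight-week default is an arbitrary illustrative value within the "several weeks or months" mentioned above, and `releasable` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def releasable(alerts, now, delay=timedelta(weeks=8)):
    """Return alerts older than `delay`, with their Timestamp field removed."""
    published = []
    for alert in alerts:
        if now - alert["Timestamp"] >= delay:
            published.append({k: v for k, v in alert.items() if k != "Timestamp"})
    return published

stored = [
    {"Timestamp": datetime(2003, 9, 3, 1, 3), "Event_ID": "Deny"},
    {"Timestamp": datetime(2003, 12, 30, 0, 0), "Event_ID": "CGI_ATTACK"},
]
batch = releasable(stored, now=datetime(2004, 1, 1))
```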

Examples of basic sanitization for different alert types can be found in tables 2 through 4.

Table 2: Example firewall security alert sanitization.

Field ID	Raw firewall alert	Sanitized firewall alert
Source_IP	172.16.30.2	0x16e9368f
Source_Port	1147	1147
Dest_IP	173.19.33.1	0x78a65237
Dest_Port	135	135
Protocol	6	6
Timestamp	09032003:01:03:10	09032003:01:03:00
Sensor	PIX-4-10060231	PIX
Count	1	1
Event_ID	Deny	Deny
Outcome	none	none
Capture_Data	none	none
Infected_File	none	none

Table 3: Example IDS security alert sanitization.

Field ID	Raw IDS alert	Sanitized IDS alert
Source_IP	172.16.30.49	0xb09956c2
Source_Port	1299	1299
Dest_IP	176.20.22.43	0xd6e79b79
Dest_Port	80	80
Protocol	6	6
Timestamp	10132003:11:41:09	10132003:11:41:00
Sensor	EM-HTTP-90209321	EM-HTTP
Count	1	1
Event_ID	CGI_ATTACK	CGI_ATTACK
Outcome	NO_REPLY	none
Capture_Data	/scripts/.%255c%255c./winnt/system32/cmd.exe?/c+dir	none
Infected_File	none	none

Table 4: Example antivirus security alert sanitization.

Field ID	Raw AV Alert	Sanitized AV alert
Source_IP	none	none
Source_Port	none	none
Dest_IP	176.30.22.11	0xb4ddc807
Dest_Port	none	none
Protocol	none	none
Timestamp	11172003:09:39:00	11172003:09:39:00
Sensor	NORTON-AV-02209302	NORTON-AV
Count	1	1
Event_ID	W32.Sobig.F.Dam	W32.Sobig.F.Dam
Outcome	Left alone	none
Capture_Data	none	none
Infected_File	A0014566.pdf	none