The syncache implementation replaces the per-socket linear chain of incomplete queued connections with a global hashtable, which provides two forms of protection against running out of resources. These are a limit on the the total number of entries in the table, which provides an upper bound on the amount of memory that the syncache takes up, and a limit on the number of entries in a given hash bucket. The latter limit bounds the amount of time that the machine needs to spend searching for a matching entry, as well as limiting replacement of the cache entries to a subset of the entire cache. A global table was chosen instead of a per-socket table as it was felt this would be a more efficient use of system resources. A current implementation restriction that all kernel virtual address space for the memory used at interrupt time must be pre-allocated was also a factor in this decision.
One of the major bottlenecks in the original code was the random drop implementation from the linear list, which did not scale. This bottleneck avoided in the syncache, since the queue is split among hash buckets, which are then treated as FIFO queues instead of using random drop. Another way of viewing this is to consider the original linear list partitioned up into a number of sublists equivalent to the size of the hash table, where choosing a bucket enables us to choose which section of the list to drop. Since the hash distribution across the buckets should be uniform, this is an approximate model of choosing a random list entry to drop.
The hash value is computed on the incoming packet using the source and destination addresses, the source and destination port, and a randomly chosen secret. This value is then used as an index into a hash table, where syncache entries are kept on a linked list in each bucket. The secret is used to perturb the hash value so that an attacker cannot target a specific hash bucket and deny service to a specific machine.
While on the surface it may appear that an attacker could implement a DoS by targeting a hash bucket so that a legitimate connection does not reside on the queue long enough to establish a connection, the risks are migitated by the use of the hash secret. Additionally, since the port number of the connecting machine is used in the hash calculations, a second connection attempt from the client machine tends to result in a second hash bucket chosen, further styming any attempt by an attacker to target a specific bucket.
If the entry is not found in the bucket, a new syncache entry is created and added to the cache. If the new entry would overflow the per-bucket limit, the oldest entry within that bucket is dropped. If the total number of entries in the cache is exceeded, the oldest entry in the cache is dropped. This way, both the memory usage of the syncache and the amount of CPU time needed to search the hash table are bounded. The user is able to control the sizing of these limits via the following loader tunables established at boot time:
[fontfamily=tt, fontsize=\footnotesize] net.inet.tcp.syncache.hashsize net.inet.tcp.syncache.cachelimit net.inet.tcp.syncache.bucketlimit
The cachelimit setting determines the maximum number of syncache entries that may be allocated, and bounds the overall memory usage of the system. hashsize controls the size of the hash table and should be a power of 2. Finally, bucketlimit caps the size of each hash chain, and limits the number of entries that must be searched when looking for a matching SYN entry. However, as the list is handled in FIFO order, an entry must stay on the list for at least one round trip time (RTT) to the remote system in order to successfully establish a connection, so this must be considered when choosing a value for bucketlimit.
There are two additional sysctl parameters of interest:
[fontfamily=tt, fontsize=\footnotesize] net.inet.tcp.syncache.count net.inet.tcp.syncache.rexmtlimit
The first entry is read-only, and indicates how many entries are currently present in the syncache. The second determines how many times a SYN,ACK should be retransmitted to the remote system, and defaults to 3. Three retransmits corresponds to seconds, and the odds are that if a connection cannot be established by then, the user has given up.