In a perfect computing world, workloads are reliable, secure, and use compute resources efficiently. In reality, however, the goals and requirements in each of these areas are often at odds. For example, we could choose to run a particular job type exclusively on dedicated machines for stronger isolation, but that would likely strand resources and increase cost due to fragmentation. We could instead bin-pack as many workloads as possible onto a single machine to maximize utilization (potentially overcommitting resources), but this has implications for the reliability and security guarantees we provide.
Modern workloads vary significantly, for example, in terms of their criticality, which we cover in more detail later. We need to protect mission-critical jobs from others that are less important or that have more relaxed requirements for security and reliability.
Possible scenarios we want to prevent include:
Less sensitive workloads (with a weaker security posture) serving as a vector through which to escalate privileges on a particular machine (e.g., via a kernel vulnerability or a CPU side-channel attack) and then hijack more sensitive workloads running on the same machine.
Running sensitive workloads on machines with a more relaxed security configuration (e.g., more open ACLs to allow for easier debugging, which is essential for experimental and development jobs).
Excessive and opportunistic workloads impacting or even bringing down others ("noisy neighbors").
Therefore, we want to schedule workloads on machines with certain security characteristics and guarantees without requiring dedicated machine pools for each individual job type. We need to do that in a way that does not compromise reliability and uses compute resources efficiently, e.g., maximizing machine utilization via multi-tenancy. Finally, it is essential to minimize the maintenance cost and operational load for engineers, especially in large-scale computing environments.
To meet those requirements, we propose a novel approach: Workload Security Rings (WSR). While the WSR concept is based on our experience within Google's production environment, we believe this general technique will be applicable to other contexts such as Kubernetes and hope that other organizations can benefit from this concept to improve their workload isolation.
WSR splits workloads into security equivalence classes (rings), then isolates and enforces each class at the machine boundary (most rings map to a separate machine pool). In this way, WSR addresses the typical isolation threats: hardware and software exploits, including zero-days, and DoS attacks on sensitive workloads. To compensate for a potential loss in utilization caused by separate machine pools and stricter scheduling constraints, we introduce a new class of hardened jobs that are not sensitive enough to be completely separated from other workloads, but that have controls in place to prevent lateral movement between machines.
Several solutions exist to address the workload isolation issue. Since its inception, Google's Borg cluster management system has relied on Linux chroot jails for security isolation and cgroup-based containers for performance isolation ([1], section 6). Kubernetes follows a similar model and utilizes OS-level virtualization to isolate containers, which offer a weaker isolation boundary than virtual machines that utilize hardware-based virtualization [2].
Such weaker isolation is typically a good trade-off between efficiency and security for multi-tenant environments running trusted workloads. By "trusted workload", we mean one whose binary was verifiably built from peer-reviewed code of attested provenance and that does not process arbitrary untrusted data. Examples of workloads that do process untrusted data include those hosting third-party code or encoding videos. We can achieve trust in our binaries by adopting a mechanism like Binary Authorization for Borg [3]. However, to safely process untrusted data, the workload must be sandboxed, for example using gVisor. Both of these approaches have limitations.
Binary Authorization is great for mature products but significantly reduces velocity during the early stages of developing a new system or when running experiments, e.g., fine-tuning ML model parameters. This means that some workloads will always run without Binary Authorization guarantees. Such jobs increase the risk of an insider, or an adversary who has compromised an insider's machine, running arbitrary code: for example, exploiting a kernel vulnerability to gain local root on a machine. While such a risk may be acceptable in smaller organizations, at Google scale we need an effective way to isolate untrusted jobs from more sensitive workloads.
Running untrusted workloads inside a sandbox is a good way to mitigate security risk, but it comes at the price of increased compute and memory use. Efficiency may drop by up to a double-digit percentage, depending on the workload and the type of sandbox. For example, see the detailed benchmarks in the gVisor Performance Guide [4].
To avoid co-scheduling sensitive services with workloads that are untrusted, or that process untrusted data without a sandbox, one option is to run them on separate machines. This approach is known as node isolation in Kubernetes: it makes it possible to define pools of machines (rings) and to assign workloads to them based on their security requirements. Note that Kubernetes nodes can map to physical or virtual machines; in this paper we focus on the former and refer to this concept as machine isolation. While it is easy to reason about the security properties of this approach, it also has certain downsides.
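Before discussing those downsides, here is a minimal sketch of what such an assignment could look like in Kubernetes, pinning a sensitive workload to a dedicated node pool via a node label plus a matching taint and toleration. The security-ring key, the pod name, and the image are hypothetical choices for illustration, not part of the design described in this paper.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sensitivePod sketches node isolation: the pod is pinned to nodes labeled as
// part of the trusted pool and tolerates the taint that keeps other pods away.
// Nodes in that pool would carry the label security-ring=trusted and a
// matching NoSchedule taint; all of these names are illustrative assumptions.
func sensitivePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sensitive-service"},
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{"security-ring": "trusted"},
			Tolerations: []corev1.Toleration{{
				Key:      "security-ring",
				Operator: corev1.TolerationOpEqual,
				Value:    "trusted",
				Effect:   corev1.TaintEffectNoSchedule,
			}},
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "registry.example.com/sensitive-service:latest",
			}},
		},
	}
}

func main() {
	fmt.Printf("nodeSelector: %v\n", sensitivePod().Spec.NodeSelector)
}
```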
First, machine isolation introduces additional scheduling constraints that result in lower utilization. Note that we still employ machine isolation to protect the very highest sensitivity workloads, where we are willing to trade larger amounts of efficiency for higher security. Fortunately, these tend to have a relatively small resource footprint, and small isolated pools of dedicated machines have negligible impact on overall utilization. However, the cost becomes significant when there are several large isolated machine pools.
Second, as the resource footprint of the workloads in each ring fluctuates, it is essential to be able to repurpose machines. At a minimum, repurposing requires draining all workloads from the machine being transferred, but a reboot or even a reinstall may be required to prevent potential persistence attacks or data leakage between the rings. By reinstall, we mean our most heavyweight machine sanitization process, which guarantees firmware authenticity and integrity [5].
Finally, if separation is done within the same Kubernetes cluster, then a skilled attacker could use the permissions assigned to the kubelet or other pods running on the node to move laterally within the cluster [6]. This can be mitigated by running each ring as a separate cluster, but that increases complexity further, and introduces additional maintenance and management cost.
Workload Security Rings were designed to address the limitations of machine isolation mentioned above. In the simplest case, we define three classes of workloads:
Sensitive: workloads that are critical to the organization (e.g., identity and access management) or that have broad access to sensitive data (e.g., customer data). Those workloads must employ hardening solutions like Binary Authorization.
Hardened: workloads that are not as critical but can be considered trusted, i.e., they are sandboxed, or they have adopted Binary Authorization and do not process untrusted data. In addition, these workloads must adopt controls that prevent lateral movement within the cluster: in particular, they must not be able to start new jobs on other machines, a requirement that does not apply to sensitive workloads.
Unhardened: all the other production jobs, including those running untrusted code.
Note that the definition of a sensitive workload is somewhat arbitrary and depends on the specifics of an organization. Sensitive workloads must use hardening solutions such as Binary Authorization, and not just sandboxing, because they operate on critical data—we need to have confidence that they operate correctly, not just that they do not compromise other workloads.
In contrast, workloads are treated as hardened solely based on their adopted controls, so classification can be automatic and objective. With hardened workloads, we are primarily concerned about their impact on other workloads, so hardening solutions can include sandboxing.
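As an illustration, the sketch below shows what such an automatic classification could look like. The Workload fields and the Classify function are hypothetical names for this paper's criteria, not real Borg or Kubernetes attributes, and sensitivity is taken as an input because it is an organization-specific judgment.

```go
package main

import "fmt"

// Workload captures the controls we check; the field names are assumptions
// made for illustration only.
type Workload struct {
	DeclaredSensitive      bool // e.g. IAM, or broad access to customer data
	BinaryAuthorized       bool // verifiably built from reviewed, attested code
	Sandboxed              bool // e.g. runs under gVisor
	ProcessesUntrustedData bool
	CanCreateRemoteJobs    bool // able to start jobs on other machines
}

// Classify sketches the automatic, objective classification described above.
func Classify(w Workload) string {
	if w.DeclaredSensitive {
		// Policy from the paper: sensitive workloads must also adopt
		// Binary Authorization; that requirement is enforced separately.
		return "sensitive"
	}
	trusted := w.Sandboxed || (w.BinaryAuthorized && !w.ProcessesUntrustedData)
	if trusted && !w.CanCreateRemoteJobs {
		return "hardened"
	}
	return "unhardened"
}

func main() {
	fmt.Println(Classify(Workload{Sandboxed: true})) // "hardened"
}
```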
Given these definitions of workloads, we can split machines into two pools, trusted and untrusted, and introduce the following scheduling constraints (Fig 1):
Sensitive workloads may only ever run on trusted machines.
Unhardened workloads may only ever run on untrusted machines.
Hardened workloads may run everywhere.
While the last list item may seem counterintuitive, it effectively alleviates the utilization drop caused by the additional scheduling constraints. The hardened workloads fill in the cracks introduced by isolating sensitive and unhardened jobs. In addition, the bigger the hardened resource footprint is, the more fluctuations in resource usage of the other two classes can be accommodated without the need to move machines between the pools.
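A minimal sketch of this placement rule, reusing the hypothetical ring names from the classification sketch and the two machine pools, could look as follows; a real scheduler would express this as scheduling constraints rather than a standalone function.

```go
package main

import "fmt"

// CanSchedule encodes the three constraints from Fig 1: sensitive workloads
// only on trusted machines, unhardened workloads only on untrusted machines,
// and hardened workloads anywhere. Illustrative only, not the Borg or
// Kubernetes implementation.
func CanSchedule(ring, pool string) bool {
	switch ring {
	case "sensitive":
		return pool == "trusted"
	case "unhardened":
		return pool == "untrusted"
	case "hardened":
		return true
	default:
		return false // unknown rings are rejected rather than silently placed
	}
}

func main() {
	fmt.Println(CanSchedule("hardened", "trusted"))   // true: fills the cracks
	fmt.Println(CanSchedule("unhardened", "trusted")) // false
}
```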
That said, Workload Security Rings still require automation to periodically rebalance the rings, migrating machines to the pool where the resource usage of jobs pinned to the ring is high relative to the pool’s size. Machine migrations are expensive, so to avoid unnecessary churn, rebalancing should be done based on long-term resource usage. We have found that a weekly cadence is a good choice to account for load differences between day and night, and between business days and weekends.
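As a rough illustration of the rebalancing decision, the sketch below sizes the two pools proportionally to week-long average demand. The function name, its inputs, and the proportional policy are assumptions; real automation would also weigh drain and reinstall cost against scheduling SLOs before moving any machine.

```go
package main

import (
	"fmt"
	"math"
)

// PlanRebalance takes week-long average demand (in machine-equivalents)
// pinned to each pool and the current pool sizes, and returns how many
// machines to migrate into the trusted pool (negative means migrate out).
func PlanRebalance(trustedDemand, untrustedDemand float64, trustedMachines, untrustedMachines int) int {
	total := trustedMachines + untrustedMachines
	demand := trustedDemand + untrustedDemand
	if demand == 0 {
		return 0
	}
	// Size each pool proportionally to its long-term demand.
	wantTrusted := int(math.Round(float64(total) * trustedDemand / demand))
	return wantTrusted - trustedMachines
}

func main() {
	// Trusted demand grew to 60 machine-equivalents out of 100: move 10 machines in.
	fmt.Println(PlanRebalance(60, 40, 50, 50)) // 10
}
```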
Despite rebalancing, a ring may become oversubscribed due to a sudden load spike, putting scheduling SLOs at risk. Site Reliability and Security Engineers may then decide to temporarily lift scheduling constraints to prevent or mitigate an outage. However, this is not a decision to be taken lightly, as it increases the risk of lateral movement between rings.
On the security side, Workload Security Rings give us a strong guarantee that we will never (well, except for the emergency situation in the previous paragraph) co-schedule sensitive workloads with ones that are untrusted, thus protecting them against machine-local privilege escalation attacks such as those based on kernel vulnerabilities or CPU side-channels. While hardened workloads may still potentially be at risk of such a compromise, the ban on remote job creation makes it prohibitively difficult for an attacker to move laterally to machines in the trusted pool (Fig 2).
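The ban on remote job creation can be thought of as an authorization check in the job-creation path. The sketch below is hypothetical: the request fields are invented for illustration and this is not how Borg implements the control.

```go
package main

import (
	"errors"
	"fmt"
)

// JobCreateRequest models a request to start a new job; the fields are
// illustrative assumptions, not a real Borg or Kubernetes API.
type JobCreateRequest struct {
	CallerRing    string // ring of the workload issuing the request
	CallerMachine string
	TargetMachine string
}

// AuthorizeJobCreation sketches the lateral-movement control: hardened
// workloads may not start jobs on machines other than their own.
func AuthorizeJobCreation(r JobCreateRequest) error {
	if r.CallerRing == "hardened" && r.TargetMachine != r.CallerMachine {
		return errors.New("hardened workloads may not create jobs on other machines")
	}
	return nil
}

func main() {
	err := AuthorizeJobCreation(JobCreateRequest{
		CallerRing: "hardened", CallerMachine: "m1", TargetMachine: "m2",
	})
	fmt.Println(err) // denied
}
```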
This design can be generalized to accommodate more classes of workloads. Each will require a new dedicated pool of machines, but as long as the hardened workloads’ resource footprint remains relatively large, it will compensate for stricter scheduling constraints and fluctuations in resource usage in all the other pools. For example, we could introduce an additional pool to differentiate between sensitive jobs that handle confidential information (whether that be user/customer data or cryptographic material) and those that are availability-critical.
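Generalizing the placement rule to more rings could amount to a simple policy table, as in the sketch below. The ring and pool names ("confidential", "availability-critical") are hypothetical examples, not Google's actual configuration.

```go
package main

import "fmt"

// allowedPools generalizes the two-pool rule: each ring lists the machine
// pools it may run on, with "hardened" remaining the filler class that is
// allowed everywhere. The names are illustrative only.
var allowedPools = map[string][]string{
	"confidential":          {"confidential"},
	"availability-critical": {"availability-critical"},
	"hardened":              {"confidential", "availability-critical", "untrusted"},
	"unhardened":            {"untrusted"},
}

// CanSchedule reports whether a workload in the given ring may run in the pool.
func CanSchedule(ring, pool string) bool {
	for _, p := range allowedPools[ring] {
		if p == pool {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanSchedule("hardened", "availability-critical")) // true
	fmt.Println(CanSchedule("confidential", "untrusted"))         // false
}
```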
We examined the need to balance compute requirements between security, reliability, and efficiency. Typical means of workload isolation provide a solution to this tradeoff but have limitations. Specific goals and tradeoffs will vary by organization.
We introduced a novel approach for this problem space: Workload Security Rings (WSR). WSR introduces additional scheduling constraints so that workloads with similar security requirements run on the same machines, forming “rings”, and are never co-scheduled with jobs that have different profiles. To compensate for a potential utilization drop, WSR introduces a class of jobs that can run everywhere and fill the gaps. Those workloads cannot create new jobs, which effectively prevents lateral movement and thus does not jeopardize isolation between the rings.
Today, technical infrastructure changes quickly: the size and shape of compute clusters evolve, workload footprints fluctuate, we have to cope with hardware and software failures, and we need to accommodate maintenance windows. Thus, the rings need to be rebalanced periodically. Thanks to WSR automation, the associated maintenance cost is minimal.
We envision WSR being transparent to its users. They should not need to worry about, or even be exposed to, machine pools, the underlying automation, or the concept of workload security rings. What they care about are security guarantees for their workloads that do not compromise reliability or compute efficiency.
While the authors wrote this paper, many engineers contributed to Workload Security Rings. Aaron Joyner, Ken Stillson, and Rainer Wolafka defined the original concept. Michał Czapiński, Robert Obryk, and Theodoros Kalamatianos provided the technical design. Subsequent contributors include Adam Hraska, Balázs Kinszler, David Castro González de Vega, Endre Hirling, Hang Jin, Jakub Warmuz, Jordi Duran, Mathieu Thoretton, Nikolett Lehel, Paweł Jasiak, Sahil Shekhawat, and Tom Chestna.
Finally, the authors would like to thank Jon McCune, Massimiliano Poletto, Pankaj Rohatgi, Sangeetha Alagappan, and Theodoros Kalamatianos for reviewing this paper and providing invaluable feedback and suggestions.