Untangling the Cloud

A Principled Method for Grouping Cloud Resources

November 2, 2022

Tutorial

Authors:

Article shepherded by:

Laura Nolan

In this article, I’ll explain how to draw technical borders to divide your cloud resources into groupings, such as Google Cloud projects, AWS accounts, or Azure resource groups; and how to make the infrastructure groupings follow the logical boundaries of your systems and your organization. Towards the end, I will present Ferent, a new open-source analysis tool that I developed for measuring inter-project coupling in Google Cloud Platform.

An Example Cloud Architecture

Consider an e-commerce business with several microservices running in AWS. Smaller development groups tend to put everything in a single AWS account, but this is a large development organization that has split its resources into seven accounts.

The Users account manages authentication, as well as user information such as addresses. All other accounts call it to confirm the identity of users and to look up user information. Customers order through the Orders account, which invokes the Warehouse to place the order; the Warehouse needs to invoke the Supply Chain for just-in-time manufacturing. The order is then sent through the Shipping system. A Front-end account supports ordering, and shows users the status of their order as it is shipped.

Problematic cloud architecture

This initial architecture is a bit tangled. The dependency arrow to Users is two-way, which is wrong: Most parts of the system need knowledge of the Users, but there is no reason for User code to hold information about how products are shipped, how they are held in the warehouse, or how they are ordered.

Now, if Warehouse systems change, the User systems might need changing. Yet Users are a basic infrastructural element that should be kept as stable as possible.

Likewise, Shipping and Supply Chain are coupled, though they should not be. These two elements do not form a cohesive unit: They have nothing to do with each other, since only the Warehouse needs to know how products are supplied to the warehouse.
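Two-way arrows like the one at Users can be spotted mechanically once the account dependencies are written down as edges. The following is a minimal Python sketch. The edge list is an assumption inferred from the description above (the diagram itself is not reproduced here), and `mutual_dependencies` is a hypothetical helper name.

```python
# Hypothetical edge list for the problematic architecture: (caller, callee).
# The exact arrows are assumptions inferred from the prose, not from the diagram.
PROBLEMATIC_EDGES = {
    ("Front-end", "Orders"),
    ("Front-end", "Users"),
    ("Orders", "Users"),
    ("Orders", "Warehouse"),
    ("Warehouse", "Users"),
    ("Warehouse", "Supply Chain"),
    ("Users", "Warehouse"),        # the unwanted reverse arrow out of Users
    ("Shipping", "Users"),
    ("Shipping", "Supply Chain"),  # unrelated systems coupled together
}


def mutual_dependencies(edges):
    """Return every pair of accounts that depend on each other, sorted."""
    pairs = set()
    for source, target in edges:
        if source != target and (target, source) in edges:
            pairs.add(tuple(sorted((source, target))))
    return sorted(pairs)


print(mutual_dependencies(PROBLEMATIC_EDGES))
```

With these assumed edges, the only pair reported is Users and Warehouse, which is the two-way arrow described above.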

Suppose the organization now cleans up its design.

Better cloud architecture

This architecture is a Directed Acyclic Graph (DAG). The Users mechanism provides support for all others, and so has incoming arrows and no outgoing ones; ideally it is stable and changes infrequently. The Front-end is dependent on the business-logic layers (Orders, Warehouse, Shipping, and Supply Chain), but nothing is dependent on it. This allows the UX team to rapidly evolve the app and even do A/B tests.

Tangled dependencies force teams to spend a lot of development effort worrying about how changes will impact other projects, and this becomes even worse when dependencies are not well understood and enforced.

Drawing the Borders

Lumping

Placing all resources in one shared account, project, or resource group will make it harder to manage access, analyze costs, and manage resources.

For all but the smallest of organizations, it is important to adhere to the principle of least privilege: users should only be able to access what is necessary in order to do their jobs. The easiest way to implement this in a cloud environment is to group resources systematically. We can create a cloud environment dedicated to a specific application, which will be accessible only for that application’s team. This matching of the security scope to the application scope creates flexibility: for example, adding a VM to the application’s environment immediately and automatically makes it available to all team members, based on their existing roles.

Likewise, for cost analysis, it makes no sense to deal with each VM, pipeline, or bucket in isolation. If each application or business unit’s resources are grouped together, it becomes trivial to attribute each expense to its business cost center. This avoids the need for toilsome tagging and labeling of resources in shared accounts, projects, or resource groups.

System management also benefits from having correct boundaries. For example, backing up all the data sources supporting a certain application, or applying a patch to all VMs, is much easier if they are grouped together.

Of course, accounts (or GCP projects and other equivalents) are not the only way to segment: for example, you could use namespaces on a single shared Kubernetes cluster. However, microservices often have associated databases, queues, and other cloud resources. Separate groupings make for cleanly uncoupled closed units, so that integrations are only possible where explicitly granted, least-necessary-privilege is the default, and costs are simple to track.

Splitting

However, going to the other extreme — splitting up the resources into hundreds of tiny groupings — is also counterproductive, because the account, project, or group boundaries will not be meaningful.

(As an aside, note that separate units are often used for development phases like dev, testing, staging and production. This is a good way to fully decouple these stages, but is an orthogonal concern to the logical grouping of an application’s resources.)

It makes sense to right-size resource groupings, but how do you draw the borders? At the top level, Conway’s Law is your guide: the architecture will follow the organization structure, like it or not. Yet teams that have their own resources may still need to group their resources into smaller units, if their infrastructure is complex enough; this also makes it easier to handle when a team grows large enough to split.

Principles for Grouping Cloud Resources

Low coupling, high cohesion

When grouping cloud resources we can borrow a fundamental principle from software engineering: aim for low coupling and high cohesion.

Low coupling means a minimum of dependencies between groups. High cohesion means that each resource should be relevant to other resources in that group and communicate with them as needed.

An AWS account, for example, has high coupling if it has dependencies on many other accounts, and many accounts depend on it. Changes to one account can easily break another, so this is to be avoided. If resources inside an account have little relevance or connection to each other, then it displays low cohesion: for example, if lambdas and other compute resources never access databases in the same account.
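As a rough illustration, both properties can be estimated from a list of resource-to-resource accesses. The Python sketch below is an assumption-laden example: the resource names, the `account_metrics` helper, and the use of "share of accesses that stay inside the account" as a stand-in for cohesion are all hypothetical choices made for illustration, not definitions from this article.

```python
def account_metrics(accesses, account_of):
    """Summarize coupling and (crudely) cohesion per account.

    accesses:   iterable of (source_resource, target_resource) pairs
    account_of: dict mapping each resource to the account that holds it
    """
    metrics = {
        account: {"depends_on": set(), "used_by": set(), "internal": 0, "total": 0}
        for account in set(account_of.values())
    }
    for source, target in accesses:
        src_acct, dst_acct = account_of[source], account_of[target]
        metrics[src_acct]["total"] += 1
        if src_acct == dst_acct:
            # Access stays inside the account: evidence of cohesion.
            metrics[src_acct]["internal"] += 1
        else:
            # Access crosses an account boundary: evidence of coupling.
            metrics[src_acct]["depends_on"].add(dst_acct)
            metrics[dst_acct]["used_by"].add(src_acct)
    return metrics


# Hypothetical resources, for illustration only.
account_of = {
    "orders-lambda": "Orders",
    "orders-db": "Orders",
    "users-api": "Users",
    "users-db": "Users",
}
accesses = [
    ("orders-lambda", "orders-db"),
    ("orders-lambda", "users-api"),
    ("users-api", "users-db"),
]
metrics = account_metrics(accesses, account_of)
for account, m in sorted(metrics.items()):
    print(account, sorted(m["depends_on"]), sorted(m["used_by"]), m["internal"], m["total"])
```

An account with many entries in both `depends_on` and `used_by` is highly coupled; an account whose `internal` count is near zero relative to `total` shows the low cohesion described above.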

No cycles

Dependencies between groupings should form a directed acyclic graph. If the dependencies form cycles of two or more groupings, such as X → Y → X, then they become a brittle, tightly-coupled unit: Changes in any of the elements in the cycle may break all others.
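Checking for cycles is easy to automate. The sketch below uses a standard depth-first search over a dependency map and returns one offending cycle if it finds any; `find_cycle` and the sample graphs are hypothetical names and data chosen for illustration.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    graph maps each grouping to the groupings it depends on.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # nodes on the current depth-first search path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle({"X": ["Y"], "Y": ["X"]}))
print(find_cycle({"Front-end": ["Orders"], "Orders": ["Users"], "Users": []}))
```

The first call reports the cycle X → Y → X; the second graph is acyclic, so the function returns `None`.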

Stability

High-level, application-layer systems tend to change often. These tend to depend on many other systems, while few others should depend on them, since frequent application changes may break anything that depends on them.

On the other hand, infrastructural systems tend to change infrequently. Other systems depend on infrastructure, while infrastructure dependencies are kept to a minimum. This should result in infrastructure that is stable and rarely impacted by changes to other systems.
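This relationship can be put into numbers with the instability metric that JDepend reports: I = Ce / (Ca + Ce), where Ca (afferent coupling) counts incoming dependencies and Ce (efferent coupling) counts outgoing ones. A minimal Python sketch follows; the counts in the example are hypothetical, and treating an isolated grouping as stable is an assumption of this sketch.

```python
def instability(afferent, efferent):
    """JDepend-style instability: I = Ce / (Ca + Ce).

    0.0 means maximally stable (only depended upon);
    1.0 means maximally unstable (only depends on others).
    """
    total = afferent + efferent
    if total == 0:
        # Isolated grouping: treated as stable here (an assumption of this sketch).
        return 0.0
    return efferent / total


# Hypothetical (Ca, Ce) counts for three accounts.
for name, (ca, ce) in {"Users": (5, 0), "Orders": (1, 3), "Front-end": (0, 4)}.items():
    print(name, instability(ca, ce))
```

Under these assumed counts, an infrastructural account such as Users scores 0.0, an application-layer account such as Front-end scores 1.0, and Orders falls in between at 0.75.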

This pattern of layering your dependencies and avoiding cycles also makes it easier to bootstrap your environment if necessary (such as for disaster response).
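When the dependencies form a DAG, a bootstrap order falls out of a topological sort: each grouping is brought up only after everything it depends on. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the `bootstrap_order` helper and the sample dependency map are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def bootstrap_order(depends_on):
    """Return groupings ordered so that dependencies come before dependents.

    depends_on maps each grouping to the set of groupings it depends on.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes of the detected cycle.
        raise ValueError(f"cannot bootstrap, cycle found: {err.args[1]}") from err


# Hypothetical dependency map, for illustration only.
depends_on = {
    "Front-end": {"Orders", "Users"},
    "Orders": {"Warehouse", "Users"},
    "Warehouse": {"Users"},
    "Users": set(),
}
order = bootstrap_order(depends_on)
print(order)
```

In this example Users comes up first and Front-end last. With a cyclic graph there is no valid order at all, which is one practical reason to keep cycles out.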

Ferent: A Tool for Analyzing Coupling in Cloud Resources

JDepend is a venerable tool for analyzing “instability” in Java programs, based on the graph of imports between classes. Inspired by JDepend, Ferent is a command-line tool, coded in Clojure, that calculates coupling metrics between Google Cloud projects. For each project, it can calculate coupling based on the count of service accounts (a special type of GCP account representing non-human users) in other projects that are granted a role in this project. The idea is that the other project “knows about” this project and is impacted if this project changes.
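To make the idea concrete, here is a minimal Python sketch of that count. It is not Ferent's implementation (Ferent is written in Clojure); it assumes the IAM policy has already been fetched as JSON, for example with `gcloud projects get-iam-policy PROJECT_ID --format=json`, and it only recognizes user-managed service accounts of the form `NAME@PROJECT_ID.iam.gserviceaccount.com`. The project IDs, account names, and the `foreign_service_accounts` helper are hypothetical.

```python
import re

# Matches user-managed service accounts: NAME@PROJECT_ID.iam.gserviceaccount.com
# Google-managed service agents are not handled specially in this sketch.
SA_MEMBER = re.compile(r"^serviceAccount:[^@]+@([^.]+)\.iam\.gserviceaccount\.com$")


def foreign_service_accounts(project_id, policy):
    """Return service accounts from other projects that hold a role in project_id."""
    foreign = set()
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            match = SA_MEMBER.match(member)
            if match and match.group(1) != project_id:
                foreign.add(member)
    return foreign


# Hypothetical IAM policy for a project called "warehouse-prod".
policy = {
    "bindings": [
        {
            "role": "roles/pubsub.publisher",
            "members": [
                "serviceAccount:orders-svc@orders-prod.iam.gserviceaccount.com",
                "serviceAccount:wh-svc@warehouse-prod.iam.gserviceaccount.com",
                "user:alice@example.com",
            ],
        },
        {
            "role": "roles/viewer",
            "members": ["serviceAccount:orders-svc@orders-prod.iam.gserviceaccount.com"],
        },
    ]
}
incoming = foreign_service_accounts("warehouse-prod", policy)
print(len(incoming), sorted(incoming))
```

Here one service account from another project holds roles in `warehouse-prod`, so the incoming coupling count for that project is 1. The same account appearing in two bindings is counted once.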

Defining coupling

It is reasonable to track coupling using service account role bindings if we assume that:

Internal integrations are authenticated (i.e., internal endpoints are not left open for unauthenticated use).
Inter-service authentication is done with service accounts. This is a best practice.
There are separate service accounts per service. This is also a best practice, but even if you use one combined service account, Ferent will still pick up that there is an inter-project coupling, just not the precise extent.

The metrics are not perfect, as there could be couplings not reflected in service accounts. For example, two projects might use a shared database or messaging queue in a third project, so that a format change in either project would affect the other.

The Untangled Cloud

You can use Ferent to understand and improve your architecture by minimizing coupling, eliminating cycles, and reducing your infrastructure project dependencies. The tool is most likely to come in useful when you have a number of projects tangled together, with far too many service accounts granted cross-project access.

If you carefully plan your architecture from the start, following the principles of low coupling and high cohesion, acyclic dependency graphs, and an instability metric that matches how often each system changes, you will never need to use Ferent.

Last updated February 8, 2023

Authors:

Joshua is a senior cloud architect at DoiT International, a fully-remote cloud service provider where he advises tech companies about their cloud infrastructure: providing expertise on architecture and design and troubleshooting elusive failures.
