Atalay Kutlay, Akamai
Maintaining a large hypervisor fleet at a cloud provider requires prioritizing critical infrastructure updates while limiting customer disruption. Software updates often require hypervisor restarts, triggering potentially disruptive migrations of guest VMs. Planning maintenance is complex for SREs, who must balance customer disruption, datacenter capacity, and timely completion of updates. Traditional batch-based rollout strategies often fail to account for workload distribution, leading to uneven impact and significant manual intervention.
This talk presents a production system that uses optimization-based scheduling to plan VM migrations and host updates together. The model decides when each action should occur, ensuring that only a limited number of hosts are updated at once and that no customer experiences excessive simultaneous VM migrations, while still completing fleet-wide updates efficiently. The resulting schedule is reviewed by SREs and executed automatically by the control plane, reducing manual planning effort and increasing efficiency.
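
To make the two constraints concrete, here is a minimal Python sketch of a scheduler that respects them. It is not the production model from the talk: it swaps the optimization model for a simple greedy first-fit heuristic, and every name in it (`schedule_updates`, `max_hosts_per_slot`, `max_migrations_per_customer`, the host and customer labels) is a hypothetical chosen for illustration. It also assumes that updating a host migrates all of its guest VMs within the same time slot.

```python
from collections import Counter, defaultdict


def schedule_updates(hosts, max_hosts_per_slot, max_migrations_per_customer):
    """Greedy first-fit sketch (an assumption, not the talk's optimizer).

    hosts maps a host name to the list of customers owning its guest VMs
    (one entry per VM). Returns a dict mapping host name to time slot.
    """
    if max_hosts_per_slot < 1:
        raise ValueError("max_hosts_per_slot must be at least 1")

    hosts_in_slot = Counter()                   # slot -> hosts updated in it
    customer_load = defaultdict(Counter)        # slot -> customer -> migrations
    schedule = {}

    # Place hosts with the most VMs first; they are the hardest to fit.
    for host in sorted(hosts, key=lambda h: -len(hosts[h])):
        demand = Counter(hosts[host])           # migrations per customer
        if any(n > max_migrations_per_customer for n in demand.values()):
            raise ValueError(f"{host} alone exceeds the per-customer limit")

        slot = 0
        while True:
            load = customer_load[slot]
            host_ok = hosts_in_slot[slot] < max_hosts_per_slot
            customer_ok = all(
                load[c] + n <= max_migrations_per_customer
                for c, n in demand.items()
            )
            if host_ok and customer_ok:
                break
            slot += 1                           # try the next time slot

        schedule[host] = slot
        hosts_in_slot[slot] += 1
        customer_load[slot].update(demand)
    return schedule


# Hypothetical fleet: four hosts, VMs labeled by owning customer.
fleet = {
    "h1": ["acme", "acme"],
    "h2": ["acme", "globex"],
    "h3": ["globex"],
    "h4": ["initech"],
}
plan = schedule_updates(fleet, max_hosts_per_slot=2,
                        max_migrations_per_customer=2)
print(plan)
```

In this example, `h1` and `h2` cannot share a slot because together they would migrate three `acme` VMs at once, so the heuristic pushes `h2` to the next slot. A greedy pass like this gives no optimality guarantee; an optimization model can instead minimize the total number of slots subject to the same limits.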

Atalay Kutlay is a Senior Software Engineer at Akamai. He enjoys solving mathematical optimization problems, including virtual machine placement, scheduling, and capacity planning.

author = {Atalay Kutlay},
title = {Keeping a Hypervisor Fleet Up to Date with Minimal Customer Disruption},
year = {2026},
address = {Seattle, WA},
publisher = {USENIX Association},
month = mar
}
