Autoscaling is common, as a compromise between a system’s spend and its throughput.
Suppose I have some service that is set to autoscale with a 40% CPU target. If average CPU drops much below 40%, instances are killed off; if it climbs much above 40%, new instances are added.
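In code, that control loop might look roughly like the sketch below. It’s a toy model, not any particular platform’s implementation: real autoscalers differ in the exact rule (ceiling vs. rounding, tolerance bands, cooldown timers), and the 40% target is just the number from the example.

```python
def desired_instances(current_instances: int, avg_cpu: float, target: float = 0.40) -> int:
    """Toy proportional autoscaling rule: pick the group size that would bring
    average CPU back to the target, then snap to a whole number of instances."""
    ideal = current_instances * (avg_cpu / target)  # fractional "right size"
    # round() gives a small dead band: the group only resizes once it's roughly
    # half an instance's worth of work off target.
    return max(1, round(ideal))
```

Nothing surprising yet, but note that the result is snapped to whole instances; that detail matters later.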
Suppose that our service is split into multiple distinct autoscaling groups. Maybe we’re running independent deployments in multiple zones, or independent deployments of multiple versions.
Also, suppose that each individual instance (VM, pod, whatever) receives an equal share of the load, regardless of which group it belongs to. EG a load balancer spreads RPS evenly across all instances, or items are dequeued evenly between consumers.
Here’s how it can go wonky.
For example’s sake, let’s say we’re running a service in 2 different zones (X and Y), behind a load balancer. Autoscaling for each zone is handled separately. That makes sense, right? Isolation of the failure domain.
We start with 2 instances in each zone. The group in zone X receives SLIGHTLY more traffic for some reason. As a result, it scales up. The group in Y is still within its CPU bounds, and does not scale up.
Now X has 3 instances, while Y has 2. Since the load balancer spreads traffic per instance, X now receives 3/5 of the total instead of half. As traffic grows, every instance gets a little busier, but X accumulates extra work faster in absolute terms. At some point, X has a whole instance’s worth of extra work and scales up again. Y, which has fewer instances and therefore less total traffic, will not yet have reached that point.
The problem quickly cascades. Say we have 100 instances in X and 10 in Y. A 1% traffic increase will easily trigger a scale-up in X - that’s an additional instance’s worth of work! Y probably won’t scale up, as it only has an additional 10% of an instance’s worth of work to do. Crucially, as X gets bigger, it will also “steal” a bit of Y’s traffic. Not only does X scale up more, it also gets proportionately more traffic than Y compared to before the scale-up… which means it’s even more prone to further scale-ups.
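To see the cascade in numbers, here’s a toy simulation of the two-zone setup: one copy of the rounding rule above per zone, a load balancer that spreads traffic strictly per instance, and total traffic growing 5% per step. All the specifics (one “load unit” of capacity per instance, the growth rate, starting at 3 vs. 2 instances) are made-up illustration, not measurements.

```python
TARGET = 0.40                        # CPU target for both zones' autoscalers
GROWTH = 1.05                        # total traffic grows 5% per step

groups = {"X": 3, "Y": 2}            # X already scaled once after the initial blip
total_load = TARGET * sum(groups.values())   # every instance starts exactly on target

for step in range(12):
    total_load *= GROWTH
    per_instance = total_load / sum(groups.values())  # load balancer is per-instance
    for zone, n in groups.items():
        avg_cpu = per_instance       # capacity is 1.0 load unit, so CPU == load
        # Each zone's autoscaler acts alone: resize proportionally, snap to whole instances.
        groups[zone] = max(1, round(n * avg_cpu / TARGET))
    print(f"step {step:2d}  load={total_load:.2f}  X={groups['X']}  Y={groups['Y']}")
```

Running this, traffic grows by about 80% over a dozen steps, X climbs from 3 instances to 7, and Y never leaves 2: X keeps crossing its own (lower) rounding threshold before Y can accumulate a full instance’s worth of extra work, scales up, and absorbs the growth.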
What went wrong here? Autoscaling works in discrete units - EG creating/deleting VMs of fixed size. A large group crosses the “one more instance” boundary at a much smaller relative change than a small group does (a 1% rise is a whole extra instance for the 100-instance group, but only a tenth of one for the 10-instance group). So as soon as anything becomes even slightly imbalanced, the discrete math bites us.
At a high level, there are 2 ways to avoid this problem:
- Divide load between groups regardless of group size (e.g. a fixed split; see the sketch after this list)
- Centralize autoscaling decisions (UPDATE: my PinnedDeployment project is an example of this approach, for running multiple versions of a service in Kubernetes)
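To illustrate the first option, here’s what a fixed split would look like in the toy model above; the 50/50 weights are an arbitrary assumption standing in for, say, static weighted routing at the load balancer.

```python
FIXED_SHARE = {"X": 0.5, "Y": 0.5}   # assumption: static weighted routing per zone
TARGET = 0.40

def autoscale_step(groups: dict[str, int], total_load: float) -> dict[str, int]:
    """One autoscaling step with traffic split by fixed per-zone weights."""
    new_sizes = {}
    for zone, n in groups.items():
        zone_load = total_load * FIXED_SHARE[zone]  # share no longer tracks group size
        avg_cpu = zone_load / n                     # capacity is 1.0 load unit per instance
        new_sizes[zone] = max(1, round(n * avg_cpu / TARGET))
    return new_sizes
```

Each zone’s CPU now depends only on its own share of traffic and its own size, so one group scaling up no longer shifts load away from the other.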
Both approaches come with their own caveats, but that’s another story.