This post is part of the Datacenter scheduling series. In it I'll be covering Omega, a paper published by Google back in 2013 about their work to improve their internal container orchestrator.
Google runs mixed workloads in production for better utilization and efficiency, and it is the responsibility of Google's production job scheduler to decide where and when jobs and tasks get launched. However, the scheduler has become more complicated and harder to change as clusters and workloads keep growing. Omega is a new design that aims to enable a scheduling model that lets their scheduling implementation scale much more simply.
There are two common types of scheduler architecture today. One is monolithic: a single centralized scheduler holds the full state of the cluster and makes all scheduling decisions. The other is two-level: a resource manager offers partial resources of the cluster to separate scheduler "frameworks", and each framework makes local decisions over the resources it was offered. The problem with a monolithic scheduler is that the centralized scheduler can become a bottleneck and grow complex to maintain as resource demands and scheduling policies scale. The problem with a two-level scheduler is that a job desiring optimal placement cannot achieve it, since no framework has visibility of the entire cluster.
Also, as workloads and compute become more heterogeneous, both the quality and the latency of placement decisions matter. For example, if a short interactive job takes more than a few seconds to schedule, the scheduling already costs more time than the job itself will take to run. The ability to add and maintain these policies over the long term is important.
Omega borrows from the database community and introduces a new kind of scheduler architecture which is neither two-level nor monolithic. It is a shared-state architecture: there are multiple schedulers, just like in two-level scheduling, but there is no central allocator, and each scheduler has access to the full state of the cluster instead of a partitioned subset. The full state is frequently synced to each scheduler; the scheduler makes a local placement decision against its copy and then attempts to commit the change back to the shared copy with an atomic commit. If there are conflicting updates, the scheduler's transaction is aborted and it must retry with a fresh cluster state. The central component's only responsibilities are to persist state, validate requests, and attempt commits. There is no explicit fairness mechanism; the design relies on each scheduler enforcing local limits and on after-the-fact monitoring.
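To make the transaction flow above concrete, here is a minimal sketch of shared-state scheduling with optimistic concurrency control. All names (`CellState`, `schedule`, the per-machine version counters, CPU as the only resource) are my own illustration, not the paper's actual implementation, which tracks much richer state and supports incremental transactions.

```python
import copy

class CellState:
    """Hypothetical shared cluster state: free CPU per machine, plus a
    version counter per machine used to detect conflicting commits."""
    def __init__(self, machines):
        self.free_cpu = dict(machines)           # machine -> free CPU units
        self.version = {m: 0 for m in machines}  # bumped on every commit

    def snapshot(self):
        # Each scheduler periodically syncs a private copy of the full state.
        return copy.deepcopy(self)

    def commit(self, machine, cpu, seen_version):
        # Atomic commit: succeeds only if no other scheduler changed this
        # machine since the snapshot was taken (optimistic concurrency).
        if self.version[machine] != seen_version:
            return False                         # conflict: caller must retry
        if self.free_cpu[machine] < cpu:
            return False                         # no longer fits
        self.free_cpu[machine] -= cpu
        self.version[machine] += 1
        return True

def schedule(shared, cpu_needed, max_retries=5):
    """A scheduler makes a local placement decision against its snapshot,
    then tries to commit; on conflict it re-syncs and retries."""
    for _ in range(max_retries):
        local = shared.snapshot()
        # Local policy: first machine that fits. Real schedulers can apply
        # arbitrary, possibly expensive, policies here without blocking others.
        for m, free in local.free_cpu.items():
            if free >= cpu_needed:
                if shared.commit(m, cpu_needed, local.version[m]):
                    return m
                break                            # stale snapshot: retry
    return None                                  # gave up after repeated conflicts

cell = CellState({"m1": 4, "m2": 8})
print(schedule(cell, 6))   # -> m2 (only machine with enough free CPU)
print(schedule(cell, 3))   # -> m1
```

The key property is that the expensive part, the placement decision, happens entirely against the scheduler's local copy; only the cheap conflict check is serialized at the shared state.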
An interesting aspect of this design is that if a particular scheduler becomes the bottleneck, multiple instances of that scheduler can be launched with load balancing between them. This works until conflicts and the cost of synchronizing cluster state to every scheduler become the bottleneck, which can happen with hundreds of schedulers running. For the goal of scaling to tens of schedulers, the paper's validation through simulation seems sufficient.
Interestingly, a few years after Omega was published, the Borg paper stated that Borg's single-scheduler policy and monolithic architecture had not been a scalability problem, and Omega was never really adopted within Google.
We do see from the Borg paper that parts of Omega's design made it into Borg's single scheduler (separate threads working in parallel on shared state), though not the fully distributed architecture the paper described. The Omega paper also influenced other schedulers when it comes to optimistically scheduling workloads (Mesos, Kubernetes, Nomad, Amazon ECS).
I believe there are more learnings we can extract from databases that apply to schedulers, and I hope to see more experiments like Omega soon.