This paper is from SoCC 2012 and comes from CMU.
As compute resources (cloud or on-prem) become more heterogeneous, applications’ resource and scheduling needs are becoming more diverse as well. For example, deep learning with TensorFlow most likely runs best on GPU instances, and Spark jobs prefer to be scheduled next to where their data resides. Most schedulers handle this by letting users specify hard constraints. However, as in the examples above, these preferences are often not mandatory; ignoring them just results in less efficient execution. Treating preferred-but-not-required scheduling needs (soft constraints) as hard constraints can lead to low utilization (idle compute cannot be used), while ignoring them altogether also lowers efficiency.
Alsched provides the ability to express soft constraints when scheduling a job. This is a challenge in itself, as soft constraints can get quite complex once different allocation tradeoffs and fallbacks are involved. For example, a user might strongly prefer colocating tasks in the same rack, but accept being scheduled anywhere else if that is not possible. To solve this, Alsched uses composable utility functions, where each utility function maps a resource allocation decision to a utility value using different utility primitives. One example primitive is `linear n Choose k`, where the utility value grows linearly up to a maximum at k (e.g., the preference for GPU instances grows linearly up to 4 instances, and getting more than 4 adds no further utility).
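To make the primitive idea concrete, here is a minimal Python sketch. It is my own illustration and not code from the paper: the function names, the representation of an allocation as a set of machine names, and the all-or-nothing `n_choose_k` variant shown for contrast are all assumptions.

```python
from typing import Callable, FrozenSet, Set

# Assumption: an allocation is just the set of machine names handed to a job.
Allocation = Set[str]
UtilityFn = Callable[[Allocation], float]


def n_choose_k(eq_set: FrozenSet[str], k: int, value: float) -> UtilityFn:
    """All-or-nothing primitive (hypothetical): `value` once at least k
    machines from eq_set are allocated, otherwise 0."""
    def utility(alloc: Allocation) -> float:
        return value if len(alloc & eq_set) >= k else 0.0
    return utility


def linear_n_choose_k(eq_set: FrozenSet[str], k: int, value: float) -> UtilityFn:
    """Linear primitive (hypothetical): utility grows linearly with the number
    of machines from eq_set, and is capped once k of them are allocated."""
    def utility(alloc: Allocation) -> float:
        got = min(len(alloc & eq_set), k)
        return value * got / k
    return utility


gpus = frozenset({"g1", "g2", "g3", "g4", "g5", "g6"})
prefer_gpus = linear_n_choose_k(gpus, k=4, value=8.0)

half = prefer_gpus({"g1", "g2"})                      # 2 of 4 wanted GPUs
capped = prefer_gpus({"g1", "g2", "g3", "g4", "g5"})  # a 5th GPU adds nothing
none = prefer_gpus({"c1"})                            # no GPU machine at all
```

With these numbers, two GPUs are worth half the maximum utility, and anything past four is worth the same as exactly four.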
Users can then compose utility functions with operators like Min, Max, Sum, etc. This is quite powerful, as users can express either simple or more sophisticated needs.
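As a rough sketch of what composition could look like, the snippet below encodes the rack example: colocating both tasks in one rack is worth the most, and landing anywhere is an acceptable fallback. The operator names (`max_of`, `min_of`, `sum_of`), the utility values, and the machine names are made up for illustration and are not the paper's syntax.

```python
from typing import Callable, FrozenSet, Set

Allocation = Set[str]
UtilityFn = Callable[[Allocation], float]


def n_choose_k(eq_set: FrozenSet[str], k: int, value: float) -> UtilityFn:
    """Hypothetical leaf: `value` if at least k machines from eq_set are allocated."""
    return lambda alloc: value if len(alloc & eq_set) >= k else 0.0


def max_of(*children: UtilityFn) -> UtilityFn:
    """Any one child being satisfied is enough; take the best of them."""
    return lambda alloc: max(c(alloc) for c in children)


def min_of(*children: UtilityFn) -> UtilityFn:
    """All children must be satisfied; the weakest one bounds the utility."""
    return lambda alloc: min(c(alloc) for c in children)


def sum_of(*children: UtilityFn) -> UtilityFn:
    """Independent preferences whose utilities add up."""
    return lambda alloc: sum(c(alloc) for c in children)


rack1 = frozenset({"r1a", "r1b"})
rack2 = frozenset({"r2a", "r2b"})
anywhere = rack1 | rack2

# Two tasks: same rack is worth 10, any two machines are worth 4.
prefer_same_rack = max_of(
    n_choose_k(rack1, 2, 10.0),
    n_choose_k(rack2, 2, 10.0),
    n_choose_k(anywhere, 2, 4.0),
)

colocated = prefer_same_rack({"r1a", "r1b"})
split = prefer_same_rack({"r1a", "r2a"})
too_few = prefer_same_rack({"r1a"})
```

Because every operator returns another utility function, the result is a tree that can be nested as deeply as the user's preferences require.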
The Alsched scheduler takes the composed utility functions as input along with a job, and every time it needs to allocate resources it scores the candidate allocations, computing the best one either optimally or greedily.
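The difference between the two modes can be sketched as follows. This is only my reading of "optimally or greedily", not the paper's algorithm: the exhaustive search simply scores every subset of free machines, the greedy one adds one machine at a time, and all names here are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable

UtilityFn = Callable[[FrozenSet[str]], float]


def linear_n_choose_k(eq_set: FrozenSet[str], k: int, value: float) -> UtilityFn:
    """Hypothetical primitive: utility grows linearly up to k machines from eq_set."""
    return lambda alloc: value * min(len(alloc & eq_set), k) / k


def allocate_exhaustive(free: Iterable[str], size: int, utility: UtilityFn) -> FrozenSet[str]:
    """Score every subset of `size` free machines and keep the best one.
    Exponential in the cluster size, so only usable for tiny examples."""
    machines = sorted(free)
    size = min(size, len(machines))
    candidates = (frozenset(c) for c in combinations(machines, size))
    return max(candidates, key=utility)


def allocate_greedy(free: Iterable[str], size: int, utility: UtilityFn) -> FrozenSet[str]:
    """Repeatedly add the machine that yields the highest utility so far.
    Cheap, but it can miss the optimum for all-or-nothing utility functions."""
    chosen: set = set()
    remaining = set(free)
    while len(chosen) < size and remaining:
        # Sorting makes tie-breaking deterministic (first name wins).
        pick = max(sorted(remaining), key=lambda m: utility(frozenset(chosen | {m})))
        chosen.add(pick)
        remaining.remove(pick)
    return frozenset(chosen)


gpus = frozenset({"g1", "g2"})
free_machines = {"c1", "c2", "g1", "g2"}
prefer_gpus = linear_n_choose_k(gpus, k=2, value=10.0)

optimal = allocate_exhaustive(free_machines, 2, prefer_gpus)
greedy = allocate_greedy(free_machines, 2, prefer_gpus)
```

Both strategies agree here because the linear primitive rewards each additional GPU. With a threshold-style utility the greedy version sees no gain from the first machine and may pick arbitrarily, which is the kind of tradeoff one accepts in exchange for speed.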
In their evaluation, they compared hard constraints only, no constraints, and soft constraints, using jobs that either perform much better when colocated or can simply be run in parallel. Their results show that hard constraints give the slowest runtime when the speedup from locality doesn’t matter much, since the job has to wait for scarce resources to become available. In this setup, where the opportunity for optimal placement is scarce, soft constraints weigh both resource availability and potential speedup, and the biggest difference in their experiments is a 10x reduction in runtime.
For future work, the authors note that it’s hard to expect an average user to construct a good tree of composed utility functions. They would like to see trees generated automatically from the application, while power users keep the ability to define a custom one.
Constraints, as described in this paper, are an essential way for schedulers to balance resource availability and efficiency. However, I’ve also seen that in practice it is hard for users to choose the right constraints, especially when applications and resources change. The ability to generate declarative soft constraints (with tradeoffs) and have multiple soft constraints work together would be an interesting exercise.
Mesos allows users to codify this logic in their framework, and perhaps this could also be a way to create shared framework/scheduler logic that simplifies frameworks too.