Serving DNNs Like Clockwork (Clockwork) - OSDI 2020
Context - inference serving system: requests carry latency deadlines, and many models are served on shared workers.
Challenge - lack of data/control separation makes latency unpredictable with respect to deadlines: workers intermix data ops (inference) with control ops (moving weights into/out of memory, booting VMs, etc.).
Key idea: PREDICTABLE INFERENCE VIA CONTROL <-> DATA SEPARATION
(1) Control plane / data plane separation :
- Worker executes only inferences
- Worker executes inference for ONLY ONE model at a given point in time - running multiple inferences concurrently reportedly increases throughput by only ~25%, and even less when inferences are batched.
(2) Control node does shaping - under the predictable worker model, the control node knows how long each inference takes, so it can predict which requests will miss their deadline and drop them up front instead of wasting work.
(3) All weight load/unload instructions also carry a deadline.
(4) Two queues in the data plane - one for load/unload requests, the other for inference requests - so everything stays predictable. If loads are in flight, the control node can reject requests that would miss their deadline because of the load.
(5) Inference requests also have a latest start time = deadline - exec-time. A request can be dropped automatically if, at scheduling time, current time > deadline - exec-time.
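Points (2) and (5) amount to a simple admission check at the control node. A minimal sketch (hypothetical names, not the paper's actual code) of that check, assuming the controller tracks an estimate of the worker's queued work:

```python
import time

def admit(request_deadline, exec_time_estimate, pending_work, now=None):
    """Admit an inference request only if it can still meet its deadline.

    request_deadline:   absolute deadline of the request (seconds).
    exec_time_estimate: predicted inference time (predictable because the
                        worker runs only inference, one model at a time).
    pending_work:       estimated seconds of work already queued on the worker.
    """
    now = time.monotonic() if now is None else now
    # Point (5): the request must START by deadline - exec_time.
    latest_start = request_deadline - exec_time_estimate
    # Point (2): queued work delays the start; drop if it pushes us past it.
    return now + pending_work <= latest_start
```

Because worker execution is deterministic, this check is the only place a request can be rejected for timing reasons; nothing downstream surprises the controller.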
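The data-plane structure from points (1), (4), and (5) can be sketched as a toy worker with two queues; all names here are illustrative, not Clockwork's API:

```python
from collections import deque

class ToyWorker:
    """Toy data plane: separate queues for weight movement and inference,
    one action at a time, stale inferences dropped (points 1, 4, 5)."""

    def __init__(self):
        self.load_queue = deque()   # (model_id, deadline) weight-load actions
        self.infer_queue = deque()  # (model_id, latest_start, fn) inferences
        self.resident = set()       # models currently in worker memory
        self.completed = []

    def step(self, now):
        """Run exactly one action, so loads and inferences never interleave."""
        if self.load_queue:
            model_id, deadline = self.load_queue.popleft()
            if now <= deadline:          # point (3): loads have deadlines too
                self.resident.add(model_id)
            return
        while self.infer_queue:
            model_id, latest_start, fn = self.infer_queue.popleft()
            if now > latest_start:       # point (5): would miss deadline, drop
                continue
            if model_id in self.resident:
                self.completed.append(fn())
            return                       # point (1): one model's inference only
```

The single-action `step` is what makes latency predictable: the controller can compute exactly when each queued item will run.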