Serving DNNs Like Clockwork - OSDI 2020



Context - An inference serving system: requests have latency deadlines, and workers serve multiple models.

Challenge - Lack of data/control separation causes unpredictability w.r.t. latency deadlines: workers intermix data ops (inference) with control ops (moving weights into/out of memory, booting VMs, etc.).

Key idea: PREDICTABILITY OF INFERENCE VIA CONTROL <-> DATA SEPARATION

(1) Control plane / data plane separation : 

- Worker executes only inferences

- Worker executes inference for ONLY ONE model at a given point in time - apparently concurrent inferences increase throughput by only ~25%, and even less when inferences are batched.

(2) Control node does shaping - under the predictable-worker model, the control node knows how long each inference takes. So it can drop requests that would miss their deadline and thus waste work.
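A minimal sketch of this admission-control idea, assuming the controller keeps a per-model execution-time estimate (the `Request` type, `EXEC_TIME_MS` table, and model names below are illustrative, not from the paper's codebase):

```python
from dataclasses import dataclass

@dataclass
class Request:
    model: str
    arrival_ms: float
    deadline_ms: float  # absolute deadline

# Hypothetical per-model execution-time estimates the controller maintains.
EXEC_TIME_MS = {"resnet50": 12.0, "bert-base": 45.0}

def admit(req: Request, now_ms: float, queue_delay_ms: float) -> bool:
    """Drop a request as soon as we can predict it will miss its deadline,
    so a worker never wastes time on doomed work."""
    est_finish = now_ms + queue_delay_ms + EXEC_TIME_MS[req.model]
    return est_finish <= req.deadline_ms

req = Request("resnet50", arrival_ms=0.0, deadline_ms=20.0)
print(admit(req, now_ms=0.0, queue_delay_ms=5.0))   # 0 + 5 + 12 = 17 <= 20 -> True
print(admit(req, now_ms=0.0, queue_delay_ms=10.0))  # 0 + 10 + 12 = 22 > 20 -> False
```

The point is that the check is cheap and exact only because worker execution time is predictable; with noisy workers, `est_finish` would be a distribution, not a number.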

(3) All weight load/unload instructions also carry a deadline.

(4) Two queues in the data plane - one for load/unload requests and one for inference requests - so everything stays predictable. If loads are in flight, the control node can reject inference requests that would miss their deadline because of the load.
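The two-queue worker can be sketched roughly as below; the queue names, the tuple-based op encoding, and the drain order (loads before inferences) are my own illustrative assumptions:

```python
from collections import deque

load_queue = deque()    # weight load/unload commands from the controller
infer_queue = deque()   # inference requests, one model at a time

def worker_step(now_ms, exec_time_ms, loaded):
    """One scheduling step. Control ops (loads/unloads) are drained first,
    so the controller always knows whether a model's weights are resident
    before it admits inference work against that model."""
    if load_queue:
        op = load_queue.popleft()          # e.g. ("load", "resnet50")
        if op[0] == "load":
            loaded.add(op[1])
        else:
            loaded.discard(op[1])
        return ("control", op)
    if infer_queue:
        model, deadline = infer_queue.popleft()
        # Run only if weights are resident and the deadline is still feasible.
        if model in loaded and now_ms + exec_time_ms[model] <= deadline:
            return ("infer", model)
        return ("dropped", model)
    return ("idle", None)

loaded = set()
load_queue.append(("load", "resnet50"))
infer_queue.append(("resnet50", 20.0))
print(worker_step(0.0, {"resnet50": 12.0}, loaded))  # -> ('control', ('load', 'resnet50'))
print(worker_step(0.0, {"resnet50": 12.0}, loaded))  # -> ('infer', 'resnet50')
```

Keeping the two kinds of work in separate queues is what lets the controller reason about them independently: a pending load delays inferences by a known amount, so infeasible inferences can be rejected up front instead of timing out on the worker.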

(5) Inference requests also have a latest start time = deadline - exec_time. A request is automatically dropped if, at scheduling time, now > deadline - exec_time.
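The latest-start-time rule reduces to a one-line feasibility check (a sketch; function names are illustrative):

```python
def latest_start(deadline_ms: float, exec_time_ms: float) -> float:
    """Latest moment an inference may start and still meet its deadline."""
    return deadline_ms - exec_time_ms

def should_run(now_ms: float, deadline_ms: float, exec_time_ms: float) -> bool:
    """Drop at scheduling time instead of running uselessly past the deadline."""
    return now_ms <= latest_start(deadline_ms, exec_time_ms)

print(should_run(5.0, deadline_ms=20.0, exec_time_ms=12.0))  # True  (5 <= 20 - 12 = 8)
print(should_run(9.0, deadline_ms=20.0, exec_time_ms=12.0))  # False (9 > 8)
```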

  

 


