Posts

GPipe - Combining Data and Pipeline Parallelism

There are 2 main ways of achieving model parallelism:
1. Split each layer across nodes (horizontal model parallelism) - e.g. FSDP.
2. Assign different layers to different nodes (vertical model parallelism) - pipeline parallelism.

The main challenge with approach 2 is the way DNN training works: each stage waits on its neighbours' forward activations and backward gradients, so the pipeline has bubbles => low efficiency. The only advantage over single-node execution is that the last step, the weight update, is parallelized.

GPipe key idea: add a dose of data parallelism by splitting each mini-batch into micro-batches; the micro-batches fill the computation space-time diagram, so we get fewer bubbles (see the sketch below).

GPipe details:
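A minimal sketch of the bubble-fraction arithmetic behind "more micro-batches => fewer bubbles". The (K-1)/(M+K-1) estimate is the standard GPipe-style analysis rather than anything quoted from this post, and the stage/micro-batch counts below are illustrative.

```python
# Rough bubble-overhead arithmetic for a GPipe-style schedule.
# With K pipeline stages and M micro-batches, each stage is busy for M
# time slots out of roughly M + K - 1 total slots, so the idle ("bubble")
# fraction is about (K - 1) / (M + K - 1).

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate fraction of the space-time diagram wasted as bubbles."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

if __name__ == "__main__":
    K = 4  # illustrative number of pipeline stages
    for M in (1, 4, 8, 32):  # micro-batches per mini-batch
        print(f"K={K}, M={M:>2} -> bubble fraction ~ {bubble_fraction(K, M):.2f}")
    # M=1 behaves like naive pipeline parallelism (~75% idle for K=4);
    # larger M amortizes the fill/drain phases and shrinks the bubbles.
```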

PyTorch DDP / FSDP Overview

BACKGROUND: What is stored in GPU memory?
a) Parameter weights
b) Their gradients (first derivatives)
c) Optimizer state (e.g. for the Adam optimizer)

So what is actually stored in GPU memory? Assuming FP16 mixed precision and Ψ parameters:
- Parameter memory: 2Ψ bytes (2 bytes per param)
- Gradient memory: 2Ψ bytes (2 bytes per gradient)
- Optimizer state, all stored as FP32:
  - Parameter copy: 4Ψ bytes (the optimizer always keeps a full-precision copy)
  - Momentum copy: 4Ψ bytes
  - Variance copy: 4Ψ bytes
  So the optimizer state totals 12Ψ bytes, in general KΨ bytes.

As an equation, for Ψ = # of params and K = 12:
total memory = [2 (weights) + 2 (gradients) + K] * Ψ bytes = 16Ψ bytes.

This memory goes down as we shard the different state components into N_d shards (e.g. N_d = 64) - the ZeRO-style sharding that FSDP implements. A rough calculator is sketched below.

NCCL: a word on NCCL - the NVIDIA Collective Communication Library - this is like a map-red...
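A back-of-the-envelope calculator for the formula above. The staged sharding semantics (shard optimizer state, then also gradients, then also parameters) follow my reading of the ZeRO/FSDP design rather than this post, and the 7.5B-parameter / 64-device example is just illustrative.

```python
# Back-of-the-envelope memory for mixed-precision Adam training:
# 2*psi (FP16 weights) + 2*psi (FP16 grads) + K*psi (FP32 optimizer state), K = 12.
# ZeRO/FSDP-style sharding divides some of these terms across N_d devices.

def training_memory_gb(psi: float, K: int = 12, n_d: int = 1, stage: int = 0) -> float:
    """Per-device memory in GB.

    stage 0: no sharding (plain DDP replica)
    stage 1: shard optimizer state across n_d devices
    stage 2: also shard gradients
    stage 3: also shard parameters (full sharding)
    """
    params, grads, opt = 2 * psi, 2 * psi, K * psi
    if stage >= 1:
        opt /= n_d
    if stage >= 2:
        grads /= n_d
    if stage >= 3:
        params /= n_d
    return (params + grads + opt) / 1e9

if __name__ == "__main__":
    psi, n_d = 7.5e9, 64  # e.g. a 7.5B-parameter model on 64 devices
    for stage in range(4):
        print(f"stage {stage}: {training_memory_gb(psi, n_d=n_d, stage=stage):.1f} GB per device")
    # stage 0 gives 16*psi = 120 GB per device; full sharding gives ~1.9 GB.
```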

Serving DNNs Like Clockwork - OSDI 2020

Clockwork - OSDI 2020

Context: an inference serving system - requests have deadlines, and there are multiple models.

Challenge: lack of data / control separation, hence unpredictability w.r.t. latency deadlines - workers intermix data ops (inference) with control ops (moving weights into/out of memory, booting VMs, etc.).

Key idea: PREDICTABILITY OF INFERENCE VIA CONTROL <-> DATA SEPARATION

(1) Control plane / data plane separation:
- A worker executes only inferences.
- A worker executes inference for ONLY ONE model at a given point in time - apparently running multiple inferences concurrently increases throughput by only ~25%, and even less if the inferences are batched.

(2) The control node does shaping: under the predictable-worker model, the control node knows how long each inference takes, so it can estimate completion times and drop requests that will miss their deadline and would thus waste work (see the sketch below).

(3) All memory load / unload instructions for weights also have a deadline.

(4) 2 queues in the data plane - one for load/unloa...
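A hedged sketch of idea (2) - not Clockwork's actual code - showing how a controller with a predictable worker model can estimate each request's completion time and drop requests predicted to miss their deadline. The model names, latencies, and the `Request` / `Controller` shapes are invented for illustration.

```python
# Sketch of controller-side admission control under a predictable worker model:
# the controller knows per-model inference latency, tracks when the worker
# frees up, and drops requests that are predicted to miss their deadline.
from dataclasses import dataclass

# Hypothetical, measured-offline latencies per model (ms); not from the paper.
PREDICTED_EXEC_MS = {"resnet50": 5.0, "bert-base": 12.0}

@dataclass
class Request:
    model: str
    arrival_ms: float
    deadline_ms: float  # absolute deadline

class Controller:
    def __init__(self):
        self.worker_free_at_ms = 0.0  # when the (single) worker becomes idle

    def schedule(self, req: Request) -> bool:
        """Admit the request if it can finish by its deadline, else drop it."""
        start = max(req.arrival_ms, self.worker_free_at_ms)
        finish = start + PREDICTED_EXEC_MS[req.model]
        if finish > req.deadline_ms:
            return False  # predicted to miss its deadline -> drop, don't waste work
        self.worker_free_at_ms = finish  # reserve the worker for this request
        return True

if __name__ == "__main__":
    ctrl = Controller()
    reqs = [
        Request("resnet50", arrival_ms=0.0, deadline_ms=20.0),
        Request("bert-base", arrival_ms=1.0, deadline_ms=10.0),  # too tight -> dropped
        Request("resnet50", arrival_ms=2.0, deadline_ms=30.0),
    ]
    for r in reqs:
        print(r.model, "admitted" if ctrl.schedule(r) else "dropped")
```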