Systems Paper Summaries

Posts

PyTorch DDP / FSDP Overview

January 12, 2024

BACKGROUND: What is stored in GPU memory?
a) Parameter weights
b) Their gradients (first derivatives)
c) Optimizer state, e.g. for the Adam optimizer

So what is actually stored in GPU memory? Assuming FP16 and x parameters:
Parameter memory: 2*x bytes (2 bytes per param)
Gradient memory: 2*x bytes (2 bytes per gradient)
Optimizer state (all stored as FP32):
- Parameter copy: 4*x bytes (4 bytes per param; the optimizer always stores in full precision)
- Momentum copy: 4*x bytes (m above)
- Variance copy: 4*x bytes
So the optimizer state totals 12*x bytes, in general K*x.

The picture below captures this as an equation. For 𝛙 = # of params and K = 12, total memory = [2 (weights) + 2 (gradients) + K] * 𝛙 bytes. The picture also shows how the memory goes down as we shard different state components into N_d shards (e.g. N_d = 64).

NCCL
A word on NCCL, the NVIDIA Collective Communication Library -- this is like a map-red...
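The memory accounting from the background part of this excerpt can be sketched in a few lines of Python. This is only an illustration: the function name, its flags, and the choice of which state components to shard are assumptions (they follow the ZeRO-style split of optimizer state, gradients, and parameters), and the numbers cover model state only, not activations or buffers.

```python
def per_gpu_memory_bytes(num_params, n_shards=1, k=12,
                         shard_optimizer=False,
                         shard_gradients=False,
                         shard_parameters=False):
    """Estimate per-GPU model-state memory for mixed-precision training.

    Hypothetical helper: 2 bytes/param for FP16 weights, 2 bytes/param for
    FP16 gradients, and k bytes/param of FP32 optimizer state (k = 12 for
    Adam: parameter copy + momentum + variance, 4 bytes each).
    """
    weights = 2 * num_params
    gradients = 2 * num_params
    optimizer = k * num_params

    # Each sharded component is split evenly across n_shards GPUs.
    if shard_optimizer:
        optimizer /= n_shards
    if shard_gradients:
        gradients /= n_shards
    if shard_parameters:
        weights /= n_shards

    return weights + gradients + optimizer


if __name__ == "__main__":
    psi = 7.5e9   # illustrative parameter count, not taken from the post
    n_d = 64      # shard count used as the example in the text
    configs = [
        ("no sharding", {}),
        ("optimizer state sharded", {"shard_optimizer": True}),
        ("+ gradients sharded", {"shard_optimizer": True,
                                 "shard_gradients": True}),
        ("+ parameters sharded", {"shard_optimizer": True,
                                  "shard_gradients": True,
                                  "shard_parameters": True}),
    ]
    for label, flags in configs:
        gb = per_gpu_memory_bytes(psi, n_shards=n_d, **flags) / 1e9
        print(f"{label}: {gb:.2f} GB")
```

With no sharding the result reduces to (2 + 2 + K) * 𝛙 bytes, matching the equation in the text; with every component sharded it is that total divided by N_d.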

Serving DNNs Like Clockwork - OSDI 2020

January 10, 2024

Clockwork - OSDI 2020

Context: Inference serving system; requests have a deadline; multiple models.

Challenge: Lack of data / control separation, hence unpredictability w.r.t. latency deadlines. Workers intermix data ops (inference) with control ops (moving weights into/out of memory, booting VMs, etc.).

Key ideas: PREDICTABILITY OF INFERENCE BY CONTROL <-> DATA SEPARATION
(1) Control plane / data plane separation:
- Worker executes only inferences
- Worker executes inference for ONLY ONE model at a given point in time. Apparently running multiple inferences at once increases throughput by only 25%, even less if inferences are batched.
(2) Control node does shaping: under the predictable worker model, the control node knows how long an inference takes, so it can estimate which requests will miss their deadline and drop them instead of wasting work on them.
(3) All memory weight load / unload instructions also have a deadline.
(4) 2 queues in data plane - one for load/unloa...
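Idea (2) in this excerpt, dropping requests that are predicted to miss their deadline, can be sketched as follows. This is a toy model and not Clockwork's actual scheduler: the `Request` type, the `admit` function, the per-model latency table, and the earliest-deadline-first ordering are all assumptions made for illustration. It only relies on the two properties stated in the notes: a worker runs one inference at a time, and its execution time is predictable.

```python
from dataclasses import dataclass


@dataclass
class Request:
    model: str          # hypothetical model identifier
    deadline_ms: float  # absolute deadline on the controller's clock


def admit(requests, exec_ms, now_ms=0.0):
    """Split requests into (admitted, dropped) for a single worker.

    exec_ms maps a model name to its (assumed known and stable) inference
    latency. Requests are considered earliest deadline first, which is an
    assumption of this sketch, and the worker runs one inference at a time.
    """
    worker_free_at = now_ms
    admitted, dropped = [], []
    for req in sorted(requests, key=lambda r: r.deadline_ms):
        predicted_finish = worker_free_at + exec_ms[req.model]
        if predicted_finish <= req.deadline_ms:
            admitted.append(req)
            worker_free_at = predicted_finish
        else:
            # Predicted to miss its deadline: drop now rather than waste work.
            dropped.append(req)
    return admitted, dropped


if __name__ == "__main__":
    latencies = {"a": 10.0, "b": 20.0}  # illustrative latencies only
    queue = [Request("a", 15.0), Request("b", 25.0), Request("a", 35.0)]
    ok, late = admit(queue, latencies)
    print("admitted:", ok)
    print("dropped:", late)
```

The point of the sketch is that the drop decision needs nothing but a latency table and a clock, which is only reliable when workers behave predictably.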
