GPipe - Combining Data and Pipeline Parallelism
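There are 2 main ways of achieving model parallelism:

1. Split each layer across nodes (horizontal model parallelism), e.g. FSDP, which shards each layer's parameters across nodes.
2. Assign different layers to different nodes (vertical model parallelism), i.e. pipeline parallelism.

The main challenge with approach 2 comes from the way DNN training works: each layer needs the previous layer's output in the forward pass and the next layer's gradient in the backward pass, so with a single batch only one node is active at a time. Bubbles in the pipeline => low efficiency. The only advantage over single-node execution is that the last step, the weight update, is parallelized.

GPipe key idea

By adding a dose of data parallelism (splitting each mini-batch into micro-batches), we get fewer bubbles in the computation space-time diagram.

GPipe Details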
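The effect of micro-batches on bubbles can be illustrated with a toy space-time diagram. The sketch below is not GPipe's implementation. It assumes K pipeline stages and M micro-batches, a forward and a backward step that each take one time slot per stage, no communication cost, and all forward passes finishing before any backward pass starts. The names `gpipe_schedule` and `bubble_fraction` are made up for this illustration.

```python
from fractions import Fraction


def gpipe_schedule(num_stages, num_microbatches):
    """Build a toy space-time grid: grid[stage][t] is 'F<m>', 'B<m>', or None.

    None marks an idle slot (a bubble). Assumes unit-time forward and
    backward steps and no communication cost (simplifying assumptions).
    """
    K, M = num_stages, num_microbatches
    fwd_len = M + K - 1          # slots needed to push M micro-batches through K stages
    total = 2 * fwd_len          # forward phase followed by backward phase
    grid = [[None] * total for _ in range(K)]
    for k in range(K):
        for m in range(M):
            # Forward: micro-batch m reaches stage k at time m + k.
            grid[k][m + k] = f"F{m}"
            # Backward: starts at the last stage and flows back toward stage 0,
            # processing micro-batches in reverse order.
            grid[k][fwd_len + (M - 1 - m) + (K - 1 - k)] = f"B{m}"
    return grid


def bubble_fraction(grid):
    """Fraction of (stage, time) slots in which a stage sits idle."""
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return Fraction(idle, slots)


if __name__ == "__main__":
    K = 4
    for M in (1, 4, 8, 32):
        print(f"K={K} M={M:>2} bubble fraction = {bubble_fraction(gpipe_schedule(K, M))}")
```

Under these assumptions the idle fraction works out to (K - 1) / (M + K - 1): with 4 stages it is 3/4 for a single batch (M = 1) and drops to 3/11 with 8 micro-batches, which is the "fewer bubbles" effect described above. Real backward passes usually cost more than forward passes, so actual numbers will differ.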