GPipe - Combining Data and Pipeline Parallelism
There are two main ways of achieving model parallelism:
- split each layer across nodes (horizontal model parallelism) - FSDP
- assign different layers to different nodes (vertical model parallelism) - pipeline parallelism
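The main challenge with approach 2 comes from the way DNN training works: each layer's forward pass needs the output of the previous layer, and each layer's backward pass needs the gradient from the next layer, so at any moment only one node is doing useful work while the others wait.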
Bubbles in pipeline => low efficiency.
The only advantage over single-node execution is that the last step, the weight update, is parallelized across nodes.
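To make the bubbles concrete, here is a small Python sketch of the space-time diagram for one mini-batch going through a naive pipeline. It is only an illustration under simplifying assumptions: the function names are made up for this post, and every forward and backward step is assumed to take exactly one time slot.

```python
def naive_pipeline_schedule(num_stages):
    """Space-time grid for ONE mini-batch: grid[stage][t] is 'F', 'B' or '.' (idle).

    Assumption: forward and backward each take one time slot per stage.
    """
    total_steps = 2 * num_stages
    grid = [["."] * total_steps for _ in range(num_stages)]
    for s in range(num_stages):
        grid[s][s] = "F"                    # forward reaches stage s at time s
        grid[s][total_steps - 1 - s] = "B"  # backward visits stages in reverse order
    return grid


def utilization(grid):
    """Fraction of (stage, time) slots in which a stage is doing work."""
    busy = sum(cell != "." for row in grid for cell in row)
    return busy / (len(grid) * len(grid[0]))


if __name__ == "__main__":
    grid = naive_pipeline_schedule(4)
    for s, row in enumerate(grid):
        print(f"stage {s}: {' '.join(row)}")
    print("utilization:", utilization(grid))
```

Under these assumptions each of the K stages is busy for 2 out of 2K time slots, so utilization is 1/K and the rest of the diagram is bubbles.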
GPipe key idea
By adding a dose of data parallelism (splitting each mini-batch into micro-batches that flow through the pipeline back to back), we now get fewer bubbles in the computation space-time diagram.
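A rough way to see the effect of micro-batches is to build the schedule and count the idle slots. The sketch below is an idealized model, not GPipe's actual implementation: the function names are hypothetical, every forward and backward step is assumed to cost one time slot, all forward passes are assumed to finish before any backward pass starts, and the backward passes are assumed to run over micro-batches in reverse order.

```python
def gpipe_schedule(num_stages, num_microbatches):
    """Idealized micro-batched schedule: grid[stage][t] is 'F<m>', 'B<m>' or '.'.

    Assumptions: unit-time forward/backward steps, all forwards first,
    then backwards with micro-batches in reverse order.
    """
    K, M = num_stages, num_microbatches
    half = M + K - 1  # time slots needed for the forward (or backward) phase
    grid = [["."] * (2 * half) for _ in range(K)]
    for s in range(K):
        for m in range(M):
            # micro-batch m reaches stage s after s hops, m slots behind micro-batch 0
            grid[s][s + m] = f"F{m}"
            # backward starts at the last stage with the last micro-batch
            grid[s][half + (K - 1 - s) + (M - 1 - m)] = f"B{m}"
    return grid


def bubble_fraction(grid):
    """Fraction of (stage, time) slots in which a stage sits idle."""
    idle = sum(cell == "." for row in grid for cell in row)
    return idle / (len(grid) * len(grid[0]))


if __name__ == "__main__":
    for row in gpipe_schedule(4, 4):
        print(" ".join(f"{cell:>3}" for cell in row))
    for m in (1, 4, 32):
        print(f"M={m}: bubble fraction = {bubble_fraction(gpipe_schedule(4, m)):.3f}")
```

In this model the bubble fraction works out to (K - 1) / (M + K - 1) for K stages and M micro-batches, so it shrinks as M grows relative to K. With M = 1 it falls back to the naive pipeline, where each stage is idle (K - 1)/K of the time.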
GPipe Details