Microsoft researchers in the Systems Research Group, along with students and colleagues from Carnegie Mellon University and Stanford University, have proposed a new way to parallelize DNN training. The system, called PipeDream, achieves up to 5.3 times faster training time than traditional approaches across a range of models.
DNN training happens in iterations of forward and backward pass computations. In each iteration, the training loop processes a minibatch of input data and performs an update to the model parameters. The most common approach to parallelize DNN training is a method called data parallelism, which partitions input data across workers (accelerators).
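As a point of reference, the sketch below shows what one such training iteration looks like in PyTorch; the tiny model and random minibatch are illustrative placeholders, not anything from PipeDream. Data parallelism simply replicates this loop on every worker and averages (all-reduces) the gradients before the parameter update.

```python
# A minimal sketch of one DNN training iteration in PyTorch (not PipeDream's code).
# Data parallelism replicates this loop on every worker and all-reduces the
# gradients before the optimizer step; this is the single-worker version.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(inputs, targets):
    optimizer.zero_grad()
    outputs = model(inputs)           # forward pass
    loss = loss_fn(outputs, targets)  # loss on this minibatch
    loss.backward()                   # backward pass: compute gradients
    optimizer.step()                  # update model parameters
    return loss.item()

# One iteration on a random minibatch of 32 examples.
loss = train_step(torch.randn(32, 784), torch.randint(0, 10, (32,)))
```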
Unfortunately, despite performance optimizations that speed up data parallelism, it can suffer from high communication costs at large scale when training on cloud infrastructure, since every worker must exchange gradients for the full model each iteration. Moreover, as GPU compute speeds continue to increase rapidly, training will become increasingly communication-bound across all models.
Deep neural networks (DNNs) have enabled tremendous progress across a range of applications, including image classification, translation, language modeling, and video captioning. However, DNN training is extremely time-consuming and requires efficient parallelization across multiple accelerators.
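To get a feel for that cost, here is a hypothetical back-of-the-envelope estimate; the 100-million-parameter model size is an assumption for illustration, not a figure from the paper.

```python
# Rough, hypothetical estimate of per-iteration communication in data-parallel
# training: every worker exchanges (all-reduces) a full copy of the gradients.
params = 100_000_000   # assumed model size, not from the paper
bytes_per_grad = 4     # fp32 gradients
grad_volume_mb = params * bytes_per_grad / 1e6
print(f"~{grad_volume_mb:.0f} MB of gradients all-reduced per iteration per worker")
# Pipeline parallelism instead sends only the boundary activations and gradients
# between adjacent stages, which is typically much smaller.
```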
PipeDream, a system developed as part of Microsoft Research’s Project Fiddle, introduces pipeline parallelism, a new way to parallelize DNN training by combining traditional intra-batch parallelism (model and data parallelism) with inter-batch parallelism (pipelining).
PipeDream revisits model parallelism as a means to improve performance, rather than for its traditional motivation of fitting large models whose working sets exceed a single accelerator's memory. It pipelines multiple inputs through the model to overcome the hardware efficiency limitations of model-parallel training. In a general pipeline-parallel setup, the model's layers are split across stages, and each stage can be replicated and run data-parallel.
Multiple minibatches are injected into the pipeline to keep it full in steady state. In most cases, pipeline-parallel training communicates far less data than data-parallel training, since it only needs to send the activations and gradients at the boundary between two adjacent stages. In steady state, all workers are kept busy, with none of the pipeline stalls seen in model-parallel training.
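The toy PyTorch sketch below illustrates this point for an assumed two-stage split of a small model (the layer sizes and split point are hypothetical, not PipeDream's partitioning): only the boundary activation travels forward between the stages, and only its gradient travels back.

```python
# Hypothetical two-stage split of a small model, run in one process for clarity.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())  # would live on worker 0
stage1 = nn.Sequential(nn.Linear(256, 10))               # would live on worker 1
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))

# Forward: only the 32x256 boundary activation crosses the stage boundary,
# instead of the full set of gradients exchanged in data-parallel training.
act = stage0(x)
boundary = act.detach().requires_grad_(True)  # stands in for a send/recv
out = stage1(boundary)
loss = loss_fn(out, y)

# Backward: stage 1 produces the gradient of the boundary activation, which is
# the only tensor sent back to stage 0.
loss.backward()
act.backward(boundary.grad)
```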
As DNNs do not always divide evenly among the available workers, PipeDream may decide to use data parallelism for some stages: multiple workers can be assigned to a given stage, processing different minibatches in parallel. PipeDream uses a scheduling algorithm called 1F1B (one forward, one backward) to keep the hardware fully utilized while achieving semantics similar to data parallelism.
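The snippet below is a toy reconstruction of the shape of such a schedule, not PipeDream's actual scheduler; it assumes each stage first runs enough warm-up forward passes to fill the pipeline and then strictly alternates one backward and one forward pass until the remaining backward passes drain.

```python
def one_f_one_b(stage, num_stages, num_minibatches):
    """Toy 1F1B-style schedule for one pipeline stage (0-indexed).

    F<i> = forward pass of minibatch i, B<i> = backward pass of minibatch i.
    The warm-up count (num_stages - stage) is an assumption for illustration.
    """
    schedule = []
    warmup = min(num_stages - stage, num_minibatches)  # fill the pipeline
    fwd = bwd = 0
    for _ in range(warmup):                 # startup phase: forwards only
        schedule.append(f"F{fwd}"); fwd += 1
    while bwd < num_minibatches:            # steady state: alternate 1B, 1F
        schedule.append(f"B{bwd}"); bwd += 1
        if fwd < num_minibatches:
            schedule.append(f"F{fwd}"); fwd += 1
    return schedule

# Print the schedule for a hypothetical 4-stage pipeline and 6 minibatches.
for s in range(4):
    print(f"stage {s}:", " ".join(one_f_one_b(s, 4, 6)))
```

Printing the schedule shows the last stage alternating immediately (F0 B0 F1 B1 ...) while earlier stages run proportionally more warm-up forwards, so that in steady state every stage is doing useful work on every step.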
PipeDream is built on PyTorch (an earlier version used Caffe). Its evaluation, encompassing many combinations of DNN models, datasets, and hardware configurations, confirms the training-time benefits of PipeDream's pipeline parallelism.
Compared to data-parallel training, PipeDream reaches a high target accuracy on multi-GPU machines up to 5.3 times faster for image classification tasks, up to 3.1 times faster for machine translation tasks, 4.3 times faster for language modeling tasks, and 3 times faster for video captioning models. PipeDream is also 2.6 to 15 times faster than model parallelism and up to 1.9 times faster than hybrid parallelism.