Frameworks and libraries for distributed training
Overview from Lambda Labs
- multi-GPU with a parameter server: reduce and broadcast are done on the CPU
- multi-GPU all-reduce on a single node, using NCCL
- asynchronous distributed SGD
- synchronous distributed SGD
- multiple parameter servers
- ring all-reduce distributed training (see the sketch after this list)
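The ring all-reduce item deserves a closer look: each worker exchanges fixed-size gradient chunks only with its two ring neighbours, first in a reduce-scatter pass and then in an all-gather pass, so per-worker traffic stays roughly constant as workers are added. Below is a minimal single-process NumPy simulation of that schedule, just to show the chunk bookkeeping; the function name and worker layout are illustrative, and real implementations (NCCL, Horovod) run the transfers between GPUs and overlap them with backpropagation.

```python
import numpy as np

def ring_allreduce(grads):
    """Sum the per-worker gradient vectors in `grads` with the ring schedule:
    a reduce-scatter pass followed by an all-gather pass, each taking n-1 steps."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i owns the fully summed chunk (i+1) % n.
    for step in range(n - 1):
        # Collect all messages first to mimic simultaneous sends around the ring.
        msgs = [((i + 1) % n, (i - step) % n, chunks[i][(i - step) % n].copy())
                for i in range(n)]
        for dst, c, data in msgs:
            chunks[dst][c] += data

    # All-gather: circulate the summed chunks until every worker holds all of them.
    for step in range(n - 1):
        msgs = [((i + 1) % n, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                for i in range(n)]
        for dst, c, data in msgs:
            chunks[dst][c] = data

    return [np.concatenate(ch) for ch in chunks]

# 4 simulated workers, each with its own gradient vector.
grads = [np.arange(8) + 10 * w for w in range(4)]
for result in ring_allreduce(grads):
    print(result)  # every worker ends up with the same element-wise sum
```

Each worker sends and receives 2*(n-1) chunks of roughly 1/n of the gradient size, which is why the ring variant scales better than funnelling all gradients through a single parameter server.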
Low level
GPUDirect from NVIDIA
- 2019, GPUDirect Storage: direct transfers from/to NVMe devices
- 2013, GPUDirect RDMA: direct transfers from/to the network adapter
- 2011, GPU Peer-to-Peer: high-speed DMA between GPUs (peer-access check sketched below)
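As a small illustration of the Peer-to-Peer item, the CUDA runtime lets you query whether one GPU can directly access another GPU's memory over PCIe/NVLink before enabling peer DMA. A minimal sketch using CuPy as a Python wrapper around the CUDA runtime API (CuPy is not part of the original notes, just a convenient way to call the runtime):

```python
import cupy as cp

# List every ordered pair of local GPUs and whether GPUDirect Peer-to-Peer
# (direct DMA between their memories) is possible. This only queries
# capability; it does not enable peer access or transfer any data.
num_gpus = cp.cuda.runtime.getDeviceCount()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src == dst:
            continue
        ok = cp.cuda.runtime.deviceCanAccessPeer(src, dst)
        print(f"GPU {src} -> GPU {dst}: peer access {'available' if ok else 'unavailable'}")
```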
Frameworks
Apache Spark
TensorFlow
Horovod (minimal tf.keras example sketched below)
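Horovod ties the items above together: synchronous data-parallel SGD where gradients are averaged with ring all-reduce (NCCL on GPUs), exposed as a thin wrapper around the framework's optimizer. A minimal sketch with tf.keras, assuming Horovod is built with TensorFlow support and the job is launched with something like horovodrun -np 4 python train.py; the model and dataset are placeholders, and a real job would also shard the data per worker.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single local GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers before every update.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x.reshape(-1, 784).astype('float32') / 255.0

model.fit(
    x, y,
    batch_size=64,
    epochs=1,
    # Broadcast rank 0's initial weights so every worker starts identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```

The two Horovod-specific pieces are DistributedOptimizer, which all-reduces gradients before every update, and the broadcast callback, which keeps the workers consistent at the start of training.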