Frameworks and libraries for distributed training
Overview from Lambda Labs
- multi-GPU with a parameter server: reduce and broadcast are done on the CPU
- multi-GPU all-reduce on a single node, using NCCL
- asynchronous distributed SGD
- synchronous distributed SGD
- multiple parameter servers
- ring all-reduce distributed training (see the sketch after this list)
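The ring all-reduce item deserves a closer look: each worker exchanges fixed-size gradient chunks only with its two ring neighbours, first in a reduce-scatter pass and then in an all-gather pass, so per-worker traffic stays roughly constant as workers are added. Below is a minimal single-process NumPy simulation of that schedule, just to show the chunk bookkeeping; the function name and worker layout are illustrative, and real implementations (NCCL, Horovod) run the transfers between GPUs and overlap them with backpropagation.

```python
import numpy as np

def ring_allreduce(grads):
    """Sum the per-worker gradient vectors in `grads` with the ring schedule:
    a reduce-scatter pass followed by an all-gather pass, each taking n-1 steps."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i owns the fully summed chunk (i+1) % n.
    for step in range(n - 1):
        # Collect all messages first to mimic simultaneous sends around the ring.
        msgs = [((i + 1) % n, (i - step) % n, chunks[i][(i - step) % n].copy())
                for i in range(n)]
        for dst, c, data in msgs:
            chunks[dst][c] += data

    # All-gather: circulate the summed chunks until every worker holds all of them.
    for step in range(n - 1):
        msgs = [((i + 1) % n, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                for i in range(n)]
        for dst, c, data in msgs:
            chunks[dst][c] = data

    return [np.concatenate(ch) for ch in chunks]

# 4 simulated workers, each with its own gradient vector.
grads = [np.arange(8) + 10 * w for w in range(4)]
for result in ring_allreduce(grads):
    print(result)  # every worker ends up with the same element-wise sum
```

Each worker sends and receives 2*(n-1) chunks of roughly 1/n of the gradient size, which is why the ring variant scales better than funnelling all gradients through a single parameter server.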
Low level
GPUDirect from NVIDIA
- 2019, GPUDirect Storage: direct transfers from/to NVMe devices
- 2013, GPUDirect RDMA: direct transfers from/to the network adapter
- 2011, GPU Peer-to-Peer: high-speed DMA between GPUs (peer-access check sketched below)
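As a small illustration of the Peer-to-Peer item, the CUDA runtime lets you query whether one GPU can directly access another GPU's memory over PCIe/NVLink before enabling peer DMA. A minimal sketch using CuPy as a Python wrapper around the CUDA runtime API (CuPy is not part of the original notes, just a convenient way to call the runtime):

```python
import cupy as cp

# List every ordered pair of local GPUs and whether GPUDirect Peer-to-Peer
# (direct DMA between their memories) is possible. This only queries
# capability; it does not enable peer access or transfer any data.
num_gpus = cp.cuda.runtime.getDeviceCount()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src == dst:
            continue
        ok = cp.cuda.runtime.deviceCanAccessPeer(src, dst)
        print(f"GPU {src} -> GPU {dst}: peer access {'available' if ok else 'unavailable'}")
```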
Frameworks
Apache Spark
TensorFlow
Horovod (minimal tf.keras example sketched below)
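Horovod ties the items above together: synchronous data-parallel SGD where gradients are averaged with ring all-reduce (NCCL on GPUs), exposed as a thin wrapper around the framework's optimizer. A minimal sketch with tf.keras, assuming Horovod is built with TensorFlow support and the job is launched with something like horovodrun -np 4 python train.py; the model and dataset are placeholders, and a real job would also shard the data per worker.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single local GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers before every update.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x.reshape(-1, 784).astype('float32') / 255.0

model.fit(
    x, y,
    batch_size=64,
    epochs=1,
    # Broadcast rank 0's initial weights so every worker starts identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```

The two Horovod-specific pieces are DistributedOptimizer, which all-reduces gradients before every update, and the broadcast callback, which keeps the workers consistent at the start of training.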