The primary objective of this repository is to introduce EventGraD, a novel communication algorithm
based on event-triggered communication that reduces communication in parallel machine learning.
A preliminary paper with further details can be found here. An extended version with mathematical
formulations, theoretical proofs of convergence, and newer results can be found here.
Please see /dmnist/event/ for the EventGraD code on MNIST and /dcifar10/event
for the EventGraD code on CIFAR-10.
The secondary objective of this repository is to serve as a starting point for implementing
parallel/distributed machine learning using PyTorch C++ (LibTorch) and MPI. Apart from
EventGraD, other popular distributed algorithms such as AllReduce-based training
(/dmnist/cent/) and decentralized training with neighbors (/dmnist/decent/)
are covered. The AllReduce-based training code was contributed to the pytorch/examples
repository through this pull request.