Distributed PyTorch Trainers
Runs distributed PyTorch trainer nodes using OpenMPI, on a single host or across multiple machines.
Running a Distributed PyTorch Trainer with OpenMPI on a Single Machine
Nextmini is designed to facilitate distributed machine learning training. We now show a simple example of training an MNIST model between multiple docker containers using PyTorch's own distributed data parallel framework and OpenMPI. All docker containers will be launched on the same physical machine (Linux or macOS).
Before starting to build the docker image, it is recommended to start from a clean slate:
```
docker system prune -a
```

This will remove all stopped containers, all networks not used by at least one container, all images without at least one associated container, and all build cache. If you also wish to remove all existing volumes, run:
```
docker system prune -a --volumes -f
```

To build and run the Docker image in this example, run the following in the examples/pytorch directory:
```
docker compose build && docker compose up
```

This will start four Nextmini dataplane nodes with OpenMPI installed and connect them to a single Nextmini controller. To start training, open another terminal and attach to node1 with:
```
docker exec -it node1 /bin/bash
```

Once logged into node1, we can run a simple mpirun session with OpenMPI:
```
mpirun --allow-run-as-root -np 4 echo hello world
```

We can also run a Python script using uv:
```
mpirun --allow-run-as-root -np 4 -H 10.0.0.1:1,10.0.0.2:1,10.0.0.3:1,10.0.0.4:1 -x MASTER_ADDR=node1 -x PATH -bind-to none -map-by :OVERSUBSCRIBE uv run test.py
```

We should see four Hello World! lines printed once the Python packages have been downloaded and installed.
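The contents of test.py are not reproduced here; a minimal sketch that would print one greeting per rank might look like the following (reading Open MPI's rank environment variable is an assumption about how the script is written):

```python
import os

# Under mpirun, Open MPI exports each process's rank as
# OMPI_COMM_WORLD_RANK; outside mpirun the variable is absent
# and we default to rank 0.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
print(f"Hello World! from rank {rank}")
```

With `-np 4`, each of the four launched processes sees its own rank and prints its own line.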
Finally, we can start distributed training with PyTorch:
```
sh train_lenet5.sh
```

This starts a training session for a LeNet-5 model on the MNIST dataset, trained across four nodes for 10 epochs, each node running in its own Docker container.
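The internals of train_lenet5.sh are not shown here. One common pattern for running PyTorch DDP under mpirun, a sketch rather than the script's actual contents, is to map Open MPI's environment variables onto the ones torch.distributed's env:// rendezvous expects:

```python
import os

def mpi_to_torch_env(default_master="node1", default_port="29500"):
    """Translate Open MPI's rank variables into the RANK/WORLD_SIZE/
    MASTER_ADDR/MASTER_PORT variables that torch.distributed reads
    for its env:// init method. The defaults here are illustrative."""
    mapping = {
        "RANK": os.environ.get("OMPI_COMM_WORLD_RANK", "0"),
        "WORLD_SIZE": os.environ.get("OMPI_COMM_WORLD_SIZE", "1"),
        "MASTER_ADDR": os.environ.get("MASTER_ADDR", default_master),
        "MASTER_PORT": os.environ.get("MASTER_PORT", default_port),
    }
    os.environ.update(mapping)
    return mapping

# A trainer would then typically call:
# torch.distributed.init_process_group(backend="gloo", init_method="env://")
```

This is why the mpirun invocation above forwards `-x MASTER_ADDR=node1`: every rank needs to agree on where the rendezvous endpoint lives.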
Optional: Stream training metrics through the Python dataplane API
When you want to push tensors or scalar metrics directly into the Nextmini dataplane from the trainers, use the Python bindings described in Python API:
- Build and install the `nextmini_py` wheel (`maturin build --release -m python-api/Cargo.toml; pip install target/wheels/nextmini_py-*.whl`).
- In your trainer script, add a telemetry hook that instantiates `nextmini_py.Dataplane("/abs/path/node-config.toml")`, builds a `PacketView` from each tensor, and sends it with `send_to_node(dst_node_id=...)`.
- Run the training job (for example `python examples/pytorch/gpt2.py --num-epochs 1`).
The current `examples/pytorch/*.py` files do not include this telemetry hook by default, so add it explicitly where needed. On the destination node you can mirror the setup with another Python worker and call `rx.recv(timeout_ms=2000)` to consume metrics. The bindings reuse the same routing tables as the Rust dataplane, so multicast fan-out and QoS policies apply automatically.
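As a concrete illustration, a telemetry hook along these lines could be added to a trainer. The `nextmini_py.Dataplane`, `PacketView`, and `send_to_node(dst_node_id=...)` names come from the Python API steps above; the `pack_metrics` helper, its payload layout, and the exact argument order of `send_to_node` are illustrative assumptions, not part of the bindings:

```python
import json
import struct

def pack_metrics(step, metrics):
    """Serialize scalar training metrics into a length-prefixed JSON
    payload suitable for wrapping in a PacketView. The layout is an
    illustrative choice, not a Nextmini wire format."""
    body = json.dumps({"step": step, **metrics}).encode("utf-8")
    return struct.pack(">I", len(body)) + body

def send_metrics(dataplane, packet_view_cls, dst_node_id, step, metrics):
    # dataplane: a nextmini_py.Dataplane("/abs/path/node-config.toml")
    # packet_view_cls: nextmini_py.PacketView (constructor usage assumed)
    view = packet_view_cls(pack_metrics(step, metrics))
    # Argument order assumed; the docs only state send_to_node(dst_node_id=...)
    dataplane.send_to_node(view, dst_node_id=dst_node_id)
```

A trainer loop would call `send_metrics(...)` every N steps, and the receiving worker would decode the length prefix before `json.loads` on the body.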
Running a Distributed PyTorch Trainer across Multiple Machines
Warning: The following instructions have not been verified to work correctly.
Before starting, make sure all the containers are stopped and removed.
```
docker rm -f $(docker ps -aq)
```

Then remove all Nextmini-related networks, for example nextmini_network:
```
docker network rm nextmini_network
```

Before running this example, at least three Linux machines (or virtual machine instances) need to be set up with Ubuntu 24.04: one controller instance, one Docker Swarm manager, and multiple worker instances. Docker needs to be pre-installed with sudo privileges. It is suggested to move the Docker data directory off the root partition, which usually has little disk space. You can refer to Arbutus Cloud Deployment for setup guidance.
Step 1
On the controller instance, build the controller and postgres images:
```
cd nextmini/examples/pytorch
docker build -t nextmini_controller_pytorch -f ../../controller/Dockerfile ../../
docker pull postgres:alpine
```

Then, add the following to controller-config.toml to ensure a successful connection to the controller:
```toml
[db]
user = "pgusr"
password = "pgpwrd"
host = "postgres"
database = "nextmini"
port = "5432"
```

The controller and postgres services can then be started with:
```
docker compose -f controller-swarm.yml build; docker compose -f controller-swarm.yml up
# docker compose -f controller-swarm.yml build --no-cache; docker compose -f controller-swarm.yml up
```

On the manager instance, <CONTROLLER_IP> in dataplane-swarm.yml should be updated accordingly.
Step 2
Build the PyTorch base image on all manager and worker instances:
```
cd nextmini/
docker build -t nextmini_datapath_pytorch -f ./examples/pytorch/Dockerfile .
# docker build --no-cache --pull -t nextmini_datapath_pytorch -f ./examples/pytorch/Dockerfile .
```

Step 3
On the manager instance, start the docker swarm:
```
docker swarm init --advertise-addr <Manager IP>
```

On all worker instances, join the swarm network using the swarm token printed by the init command:
```
docker swarm join --token SWMTKN-1-xxxx <SWARM_MANAGER_IP>:<port>
```

On the manager instance, you can check the status of the nodes with:
```
docker node ls
```

Step 4
After all workers have joined the swarm, deploy the services from the manager instance:
```
cd examples/pytorch
docker stack deploy -c dataplane-swarm.yml nextmini
```

To check the status of the services, use:
```
docker service ls
```

Step 5
On the manager instance, find the container ID for node1 and attach to it:
```
docker ps -a
docker exec -it <containerID> /bin/bash
```

Once logged into node1, we can run a simple mpirun session with OpenMPI:
```
mpirun --allow-run-as-root -np 4 -H 10.0.0.1:1,10.0.0.2:1,10.0.0.3:1,10.0.0.4:1 echo hello world
```

We can also run a Python script using uv:
```
mpirun --allow-run-as-root -np 4 -H 10.0.0.1:1,10.0.0.2:1,10.0.0.3:1,10.0.0.4:1 -x MASTER_ADDR=node1 -x PATH -bind-to none -map-by slot uv run test.py
```

We should see four Hello World! lines printed once the Python packages have been downloaded and installed.
Finally, we can start distributed training with PyTorch:
```
# Train LeNet-5 with
sh train_lenet5.sh
# Train GPT-2 with
sh train_gpt2.sh
# Train ResNet with
sh train_resnet.sh
# Train VGG16 with
sh train_vgg16.sh
```

To train different variants of ResNet, simply change the --type command line argument in train_resnet.sh on the manager instance.
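The ResNet trainer script is not reproduced here; the --type flag presumably reaches an argparse definition along these lines (the variant names and default below are assumptions for illustration):

```python
import argparse

def build_parser():
    # --type selects the ResNet variant to train; the exact set of
    # choices and the default are illustrative assumptions.
    parser = argparse.ArgumentParser(description="ResNet trainer (sketch)")
    parser.add_argument("--type", default="resnet18",
                        choices=["resnet18", "resnet34", "resnet50"],
                        help="ResNet variant to train")
    return parser

args = build_parser().parse_args(["--type", "resnet50"])
```

Editing the value passed to --type inside train_resnet.sh then selects which variant each worker trains.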
Clean up
To clean up the dataplane worker nodes, use the command below:
```
docker stack rm nextmini
```

To clean up the controller & db VM instance in DigitalOcean, use the command:
```
docker compose -f controller-swarm.yml down
```