
Distributed PyTorch Trainers on Sim, Boston and Arbutus

Runs distributed PyTorch trainer workflows across Sim, Boston, and Arbutus nodes.

Before starting, make sure all the containers are stopped and removed.

docker rm -f $(docker ps -aq)

Also remove all Nextmini-related networks, for example nextmini_network:

docker network rm nextmini_network

Then add "examples/sba-swarm/ring-emu" to /nextmini/Cargo.toml.

Before running this example, set up at least three Linux machines (or virtual machine instances) with Ubuntu 24.04: one controller instance, one Docker Swarm manager, and one or more worker instances. Docker must be pre-installed, and you need sudo privileges. It is recommended to move the Docker data directory off the root filesystem, which usually sits on a small disk partition. You can refer to Arbutus Cloud Deployment for setup guidance.
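The data-directory move mentioned above can be sketched as follows. The target path /mnt/docker-data is only an example; pick any partition with enough space:

```shell
# Sketch, assuming systemd manages Docker; /mnt/docker-data is an
# example path, not a Nextmini requirement.
sudo systemctl stop docker
sudo mkdir -p /mnt/docker-data
sudo rsync -aP /var/lib/docker/ /mnt/docker-data/
# Point the daemon at the new data root:
echo '{ "data-root": "/mnt/docker-data" }' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker
docker info --format '{{ .DockerRootDir }}'   # should now print the new path
```

Once `docker info` reports the new root, the old /var/lib/docker contents can be deleted to reclaim space.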

Step 1

On the controller instance, build controller and postgres image:

cd nextmini/examples/sba-swarm
docker build -t nextmini_controller_pytorch -f ../../controller/Dockerfile ../../
docker pull postgres:alpine

Then add the following to controller-config.toml so the controller can connect to the database:

[db]
user = "pgusr"
password = "pgpwrd"
host = "postgres"
database = "nextmini"
port = "5432"

Controller and postgres services can be started by:

docker compose -f controller-swarm.yml build; docker compose -f controller-swarm.yml up
# docker compose -f controller-swarm.yml build --no-cache; docker compose -f controller-swarm.yml up

On the manager instance, replace <CONTROLLER_IP> in dataplane-swarm.yml with the IP address of the controller instance.
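One way to do the substitution, assuming the manifest literally contains the <CONTROLLER_IP> placeholder (the address 10.20.0.5 below is illustrative):

```shell
# Replace the placeholder in place; 10.20.0.5 is an example address.
sed -i 's/<CONTROLLER_IP>/10.20.0.5/g' dataplane-swarm.yml
# Verify that no placeholder remains:
grep -q 'CONTROLLER_IP' dataplane-swarm.yml || echo "placeholder replaced"
```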

Step 2

Build the PyTorch base image on all manager and worker instances:

cd nextmini/
docker build -t nextmini_datapath_pytorch -f ./examples/pytorch/Dockerfile .
# docker build --no-cache --pull -t nextmini_datapath_pytorch -f ./examples/pytorch/Dockerfile .

Step 3

On the manager instance, start the docker swarm:

docker swarm init --advertise-addr <Manager IP>

On all worker instances, join the swarm using the join token printed by the init command (it can be re-printed on the manager with docker swarm join-token worker):

docker swarm join --token SWMTKN-1-xxxx <SWARM_MANAGER_IP>:<port>

On the manager instance, you can check the status of nodes by:

docker node ls

Step 4

After all workers have joined the swarm, deploy the services from the manager instance:

cd examples/sba-swarm
docker stack deploy -c dataplane-swarm.yml nextmini

To check the status of services, use:

docker service ls

Step 5

On the manager instance, find the container ID for node1 by:

docker ps -a
docker exec -it <containerID> /bin/bash

Once logged into node1, we can run a simple mpirun session with OpenMPI:

mpirun --allow-run-as-root -np 2 -H 10.0.0.1:1,10.0.0.2:1 -x MASTER_ADDR=node1 -x PATH -bind-to none -map-by :OVERSUBSCRIBE uv run test.py

We should see two Hello World! lines printed once the Python packages have been downloaded and installed.

Finally, we can start distributed training with PyTorch:

# Train Lenet5 with
sh train_lenet5.sh

# Train GPT2 with
sh train_gpt2.sh

# Train Resnet with
sh train_resnet.sh

# Train VGG16 with
sh train_vgg16.sh

To train different variants of ResNet, simply change the --type command-line argument in train_resnet.sh on the manager instance.
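The edit can also be made non-interactively. The sketch below assumes the script passes a flag of the form --type resnet18; the variant names are assumptions, so check train_resnet.sh for the ones the script actually supports:

```shell
# Hypothetical: swap whatever variant follows --type for resnet50.
sed -i 's/--type [a-z0-9]*/--type resnet50/' train_resnet.sh
grep -- '--type' train_resnet.sh   # confirm the new variant is in place
```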

Optional: Emit metrics via the Python dataplane API

If you want these SBA scenarios to stream intermediate loss/activation tensors through Nextmini (instead of relying solely on TUN delivery), follow the steps in Python API:

  1. Install the nextmini_py wheel on the swarm nodes.
  2. Add a small telemetry hook in your trainer script that instantiates nextmini_py.Dataplane("/abs/path/node-config.toml").
  3. Build nextmini_py.PacketView objects and publish metrics with send_to_node(dst_node_id=...).

A companion receiver (launched on another trainer or analytics node) can call rx.recv() to ingest the payloads for dashboards or adaptive schedulers.

Clean up

To clean up the dataplane worker nodes, use the command below:

docker stack rm nextmini

To stop the controller and database services on the controller instance, use the command:

docker compose -f controller-swarm.yml down
