Distributed PyTorch Trainers on Sim, Boston and Arbutus
Runs distributed PyTorch trainer workflows across Sim, Boston, and Arbutus nodes.
Before starting, make sure all containers are stopped and removed:

```shell
docker rm -f $(docker ps -aq)
```

Also remove all Nextmini-related networks, for example, nextmini_network:

```shell
docker network rm nextmini_network
```

Then add "examples/sba-swarm/ring-emu" to /nextmini/Cargo.toml.
Before running this example, at least three Linux machines (or virtual machine instances) need to be set up with Ubuntu 24.04: one controller instance, one Docker Swarm manager, and one or more worker instances. Docker needs to be pre-installed with sudo privileges. It is suggested that the Docker data directory be moved out of the root filesystem, which usually has a small disk partition. You can refer to Arbutus Cloud Deployment for setup guidance.
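One common way to relocate Docker's storage off the root partition is to set `data-root` in `/etc/docker/daemon.json` (the target path below is an assumption; pick any mount with ample space, and restart the Docker daemon afterwards):

```json
{
  "data-root": "/mnt/docker-data"
}
```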
Step 1
On the controller instance, build the controller and postgres images:

```shell
cd nextmini/examples/sba-swarm
docker build -t nextmini_controller_pytorch -f ../../controller/Dockerfile ../../
docker pull postgres:alpine
```

Then, add the following to controller-config.toml to ensure a successful connection to the controller:
```toml
[db]
user = "pgusr"
password = "pgpwrd"
host = "postgres"
database = "nextmini"
port = "5432"
```

The controller and postgres services can be started by:
```shell
docker compose -f controller-swarm.yml build; docker compose -f controller-swarm.yml up
# Or, to rebuild from scratch:
# docker compose -f controller-swarm.yml build --no-cache; docker compose -f controller-swarm.yml up
```

On the manager instance, <CONTROLLER_IP> in dataplane-swarm.yml should be updated to the controller instance's IP address.
Step 2
Build the pytorch base image on all manager and worker instances:
```shell
cd nextmini/
docker build -t nextmini_datapath_pytorch -f ./examples/pytorch/Dockerfile .
# Or, to rebuild from scratch:
# docker build --no-cache --pull -t nextmini_datapath_pytorch -f ./examples/pytorch/Dockerfile .
```

Step 3
On the manager instance, start the docker swarm:

```shell
docker swarm init --advertise-addr <Manager IP>
```

On all worker instances, join the swarm using the join token printed by the init command:

```shell
docker swarm join --token SWMTKN-1-xxxx <SWARM_MANAGER_IP>:<port>
```

On the manager instance, you can check the status of the nodes with:

```shell
docker node ls
```

Step 4
After all workers have joined the swarm, deploy the services on the manager instance:

```shell
cd examples/sba-swarm
docker stack deploy -c dataplane-swarm.yml nextmini
```

To check the status of the services, use:

```shell
docker service ls
```

Step 5
On the manager instance, find the container ID for node1 and open a shell in it:

```shell
docker ps -a
docker exec -it <containerID> /bin/bash
```

Once logged into node1, we can run a simple mpirun session with OpenMPI:

```shell
mpirun --allow-run-as-root -np 2 -H 10.0.0.1:1,10.0.0.2:1 -x MASTER_ADDR=node1 -x PATH -bind-to none -map-by :OVERSUBSCRIBE uv run test.py
```

We should see two Hello World! messages printed after the Python packages are downloaded and installed.
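The contents of test.py are not shown in this guide; a minimal script in the same spirit could read the rank that OpenMPI exposes through its environment variables (OMPI_COMM_WORLD_RANK / OMPI_COMM_WORLD_SIZE), falling back to a single-process default when run outside mpirun:

```python
import os

# Under mpirun, OpenMPI sets OMPI_COMM_WORLD_RANK for each process;
# outside mpirun we default to rank 0 of a world of size 1.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
print(f"Hello World! from rank {rank} of {size}")
```

With `-np 2` as in the command above, each of the two ranks prints one line.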
Finally, we can start distributed training with PyTorch:

```shell
# Train Lenet5 with
sh train_lenet5.sh
# Train GPT2 with
sh train_gpt2.sh
# Train Resnet with
sh train_resnet.sh
# Train VGG16 with
sh train_vgg16.sh
```

To train different variants of resnet, simply change the --type command-line argument in train_resnet.sh on the manager instance.
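Distributed data-parallel training like the scripts above relies on an all-reduce to average gradients across nodes at each step. For intuition, here is a minimal single-process simulation of the ring all-reduce algorithm (pure Python; the node/chunk layout is illustrative, not the scripts' actual implementation):

```python
def ring_all_reduce(values):
    """Simulate ring all-reduce: values[i][c] is node i's chunk c.
    Returns per-node data where every chunk holds the global sum."""
    n = len(values)
    data = [list(v) for v in values]
    # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) % n
    # to its ring neighbor, which accumulates it. After n - 1 steps,
    # node i owns the complete sum of chunk (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] += val
    # Phase 2: all-gather. Each node forwards the completed chunk it
    # holds around the ring; after n - 1 steps every node has every
    # fully summed chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data

# Three nodes, each holding a 3-chunk "gradient".
grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(grads))  # every node ends with [111, 222, 333]
```

Each node sends only 2(n-1)/n of its data regardless of ring size, which is why this pattern scales well for gradient synchronization.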
Optional: Emit metrics via the Python dataplane API
If you want these SBA scenarios to stream intermediate loss/activation tensors through Nextmini (instead of relying solely on TUN delivery), follow the steps in Python API:

- Install the `nextmini_py` wheel on the swarm nodes.
- Add a small telemetry hook in your trainer script that instantiates `nextmini_py.Dataplane("/abs/path/node-config.toml")`.
- Build `nextmini_py.PacketView` objects and publish metrics with `send_to_node(dst_node_id=...)`.

A companion receiver (launched on another trainer or analytics node) can call `rx.recv()` to ingest the payloads for dashboards or adaptive schedulers.
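To make the hook pattern concrete without the wheel installed, here is a sketch in which a local queue stands in for the dataplane; real code would replace `send_to_node` with `nextmini_py.Dataplane`'s send path and `transport.get()` with `rx.recv()` (the Nextmini names come from the steps above and are not verified signatures):

```python
import json
import queue

# Stand-in for the Nextmini dataplane: a local queue plays the role of
# send_to_node()/recv() so the hook pattern can be shown end to end.
transport = queue.Queue()

def send_to_node(dst_node_id, payload: bytes):
    transport.put((dst_node_id, payload))

def telemetry_hook(step, loss, dst_node_id=2):
    # In a real trainer this would build a nextmini_py.PacketView;
    # here we just serialize the metric as JSON bytes.
    send_to_node(dst_node_id, json.dumps({"step": step, "loss": loss}).encode())

# Trainer side: emit one metric per "training step".
for step, loss in enumerate([0.9, 0.5, 0.2]):
    telemetry_hook(step, loss)

# Receiver side (would run on another trainer or analytics node).
dst, payload = transport.get()
print(dst, json.loads(payload))
```

The key design point is that the hook stays decoupled from the training loop: it only serializes and hands off, so swapping the queue for the real dataplane does not touch the trainer logic.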
Clean up
To clean up the dataplane worker nodes, use the command below:

```shell
docker stack rm nextmini
```

To clean up the controller & db VM instance in DigitalOcean, use the command:

```shell
docker compose -f controller-swarm.yml down
```