# Distributed ring all-reduce on Sim, Boston, and Arbutus

Sets up distributed ring all-reduce workloads across the Sim, Boston, and Arbutus environments.
Before starting, make sure all containers are stopped and removed:

```shell
docker rm -f $(docker ps -aq)
```

Then remove all Nextmini-related networks, for example `nextmini_network`:

```shell
docker network rm nextmini_network
```

Before running this example, at least three Linux machines (or virtual machine instances) need to be set up with Ubuntu 24.04: one controller instance, one Docker Swarm manager, and multiple worker instances. Docker must be pre-installed, and you need sudo privileges. It is suggested that the Docker data directory be moved off the root partition, which usually has little disk space. You can refer to Arbutus Cloud Deployment for setup guidance.
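As a sketch of how the Docker data directory can be relocated, Docker's `data-root` option in `/etc/docker/daemon.json` points the daemon at a different partition (the path `/data/docker` below is only an example; pick a mount point with ample space):

```json
{
  "data-root": "/data/docker"
}
```

After editing the file, stop Docker, copy the old directory contents to the new location, and restart the daemon (e.g. `sudo systemctl restart docker`). You can confirm the change with `docker info | grep 'Docker Root Dir'`.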
## Step 1

On the controller instance, build the controller and postgres images:

```shell
cd nextmini/examples/sba-swarm
docker build -t nextmini_controller_pytorch -f ../../controller/Dockerfile ../../
docker pull postgres:alpine
```

Then, add the following to `controller-config.toml` to ensure a successful connection to the controller:
```toml
[db]
user = "pgusr"
password = "pgpwrd"
host = "postgres"
database = "nextmini"
port = "5432"
```

The controller and postgres services can then be started by:
```shell
docker compose -f controller-swarm.yml build; docker compose -f controller-swarm.yml up
# To force a clean rebuild:
# docker compose -f controller-swarm.yml build --no-cache; docker compose -f controller-swarm.yml up
```

## Monitoring Dashboard
To monitor network flows and system status, you can start the dashboard on the controller instance:
```shell
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Navigate to the monitor directory and run the dashboard
cd nextmini/tools/monitor
uv run dashboard.py
```

On the manager instance, `<CONTROLLER_IP>` in `dataplane-swarm.yml` should be updated accordingly.
## Step 2

Build the sba-swarm base image on all manager and worker instances:

```shell
cd nextmini/
docker build -t nextmini_datapath_pytorch -f ./examples/sba-swarm/Dockerfile .
# To force a clean rebuild:
# docker build --no-cache --pull -t nextmini_datapath_pytorch -f ./examples/sba-swarm/Dockerfile .
```

## Step 3
On the manager instance, start the Docker swarm:

```shell
docker swarm init --advertise-addr <MANAGER_IP>
```

On all worker instances, join the swarm using the token printed by the command above:

```shell
docker swarm join --token SWMTKN-1-xxxx <SWARM_MANAGER_IP>:<port>
```

On the manager instance, you can check the status of the nodes with:

```shell
docker node ls
```

## Step 4
After all workers have joined the swarm, deploy the services on the manager instance:

```shell
cd examples/sba-swarm
docker stack deploy -c dataplane-swarm.yml nextmini
```

To check the status of the services, use:

```shell
docker service ls
```

## Step 5
On the manager instance, find the container ID for node1 and open a shell in it:

```shell
docker ps -a
docker exec -it <containerID> /bin/bash
```

Inside the container, enter the ring-emu directory and run the ring all-reduce launcher:

```shell
cd /var/nextmini/ring-emu && uv run launch_ring.py --ring ring.txt --bin /var/nextmini/ringallreduce --no-copy --remote-dir /var/nextmini/ring-emu --len 1048 --init rank --reps 5 --verify
```

## Ring All-Reduce Launcher
A Python-based SSH launcher for distributed ring all-reduce operations across multiple nodes.
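For background, the communication pattern the ringallreduce binary implements can be sketched in pure Python. The single-process simulation below illustrates the reduce-scatter and all-gather phases over n ranks; it is an illustration of the data movement only, not the actual binary's implementation:

```python
# Single-process simulation of the ring all-reduce algorithm
# (reduce-scatter followed by all-gather). Illustrative only - this is
# NOT the ringallreduce binary shipped in /var/nextmini.

def ring_allreduce(tensors):
    """All-reduce (sum) a list of equal-length vectors, one per rank."""
    n = len(tensors)                       # number of ranks in the ring
    length = len(tensors[0])
    assert length % n == 0, "for simplicity, length must divide by n"
    chunk = length // n
    bufs = [list(t) for t in tensors]      # each rank's working buffer

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which accumulates it into its own buffer.
    for step in range(n - 1):
        sends = []                         # snapshot: sends are "simultaneous"
        for r in range(n):
            c = (r - step) % n
            sends.append((c, bufs[r][c * chunk:(c + 1) * chunk]))
        for r in range(n):
            c, data = sends[(r - 1) % n]   # receive from the left neighbour
            for i, v in enumerate(data):
                bufs[r][c * chunk + i] += v

    # After reduce-scatter, rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: all-gather - circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, bufs[r][c * chunk:(c + 1) * chunk]))
        for r in range(n):
            c, data = sends[(r - 1) % n]
            bufs[r][c * chunk:(c + 1) * chunk] = data
    return bufs

if __name__ == "__main__":
    # three ranks with --init rank style tensors: every element becomes 0+1+2
    print(ring_allreduce([[0, 0, 0], [1, 1, 1], [2, 2, 2]]))
```

Each rank transmits roughly 2(n-1)/n of the tensor in total, which is why the ring algorithm scales well as the node count grows.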
### Usage

#### Basic Usage

Launch a ring all-reduce operation across 2 nodes:
```shell
uv run launch_ring.py \
  --ring ring.txt \
  --bin /var/nextmini/ringallreduce \
  --no-copy \
  --remote-dir /var/nextmini/ring-emu \
  --len 1048576 \
  --init rank \
  --reps 10 \
  --verify
```

**Note:** Omit `--no-copy` if you want the launcher to copy the binary and ring file to the remote nodes; in that case you also do not need to mount `ring.txt` in `dataplane-swarm.yml`.
### Command Line Options

- `--ring <FILE>`: Path to ring file with one IP:port per line (bind addresses)
- `--bin <PATH>`: Local path to the compiled ringallreduce binary
- `--remote-dir <DIR>`: Remote working directory where ring.txt is located (use `/var/nextmini/ring-emu` in Docker containers)
- `--ssh-hosts <FILE>`: Optional file with SSH targets (one per line, user@host) in rank order
- `--ssh-port <PORT>`: SSH port (default: 22)
- `--len <N>`: Tensor length in elements (default: 1024)
- `--init <PATTERN>`: Initialization pattern: rank, ones, or random (default: rank)
- `--reps <N>`: Number of repetitions (default: 1)
- `--verify`: Enable verification after all-reduce
- `--no-copy`: Skip copying the binary/ring file (assume they are already present remotely)
- `--remote-ring-name <NAME>`: Filename for the ring file on the remote side (default: ring.txt)
- `--strict-host-key-checking`: Enable StrictHostKeyChecking (off by default)
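With a sum all-reduce, every element of the result should equal the sum of the per-rank initial values, which is what `--verify` can check against. The sketch below assumes `--init rank` fills each rank's tensor with its rank index and `--init ones` fills it with 1; these semantics are assumptions for illustration, not taken from the launcher's source:

```python
def expected_value(init, n_ranks):
    """Expected per-element value after a sum all-reduce across n_ranks.

    Assumed semantics (illustrative): --init rank fills each rank's tensor
    with its rank index; --init ones fills it with 1.
    """
    if init == "rank":
        return n_ranks * (n_ranks - 1) // 2   # 0 + 1 + ... + (n - 1)
    if init == "ones":
        return n_ranks
    raise ValueError("no closed-form expected value for --init random")
```

For example, under these assumptions a 4-node run with `--init rank` should produce a tensor of all 6s.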
### Ring File Format

The ring file should contain one IP:port per line:

```
10.0.0.1:9000
10.0.0.2:9000
```

### Directory Structure
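A small helper (hypothetical, not part of the launcher) can sanity-check this format before a run:

```python
def parse_ring_file(text):
    """Parse ring-file text into a list of (host, port) tuples in rank order."""
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # tolerate blanks and comments
        host, sep, port = line.rpartition(":")
        if not sep or not port.isdigit():
            raise ValueError(f"line {lineno}: expected IP:port, got {raw!r}")
        entries.append((host, int(port)))
    if len(entries) < 2:
        raise ValueError("a ring needs at least two nodes")
    return entries
```

Running it over the example above yields `[("10.0.0.1", 9000), ("10.0.0.2", 9000)]`, with rank assigned by line order.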
```
/var/nextmini/
├── pyproject.toml              # PyTorch dependencies (for training scripts)
├── ringallreduce               # Ring all-reduce binary
├── lenet5.py, gpt2.py, etc.    # Training scripts
└── ring-emu/                   # Ring launcher (isolated environment)
    ├── pyproject.toml          # Minimal config, no external dependencies
    ├── launch_ring.py          # Launcher script
    ├── ring.txt                # Ring topology file
    └── README.md               # This file
```

The ring-emu directory has its own pyproject.toml with no external dependencies, ensuring clean execution without warnings from parent project dependencies.
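For reference, a dependency-free pyproject.toml might look like the following; the actual contents of ring-emu's file are not shown in this document, so treat this as an illustrative assumption:

```toml
[project]
name = "ring-emu"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = []
```

With an empty `dependencies` list, `uv run launch_ring.py` resolves against this project alone rather than the parent project's PyTorch environment.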
### Examples

All examples assume you're in the ring-emu directory:

```shell
cd /var/nextmini/ring-emu
```

#### Small test with verification
```shell
uv run launch_ring.py \
  --ring ring.txt \
  --bin /var/nextmini/ringallreduce \
  --no-copy \
  --remote-dir /var/nextmini/ring-emu \
  --len 1048 \
  --init rank \
  --reps 5 \
  --verify
```

#### Large tensor without verification
```shell
uv run launch_ring.py \
  --ring ring.txt \
  --bin /var/nextmini/ringallreduce \
  --no-copy \
  --remote-dir /var/nextmini/ring-emu \
  --len 10485760 \
  --init ones \
  --reps 100
```

#### Using different SSH hosts
If your ring bind addresses differ from SSH endpoints:
```shell
uv run launch_ring.py \
  --ring ring_bind.txt \
  --ssh-hosts ssh_hosts.txt \
  --bin /var/nextmini/ringallreduce \
  --remote-dir /var/nextmini/ring-emu \
  --len 1048576 \
  --verify
```

#### Quick one-liner for testing
```shell
cd /var/nextmini/ring-emu && uv run launch_ring.py --ring ring.txt --bin /var/nextmini/ringallreduce --no-copy --remote-dir /var/nextmini/ring-emu --len 1048 --init rank --reps 5 --verify
```

### Troubleshooting
#### Port already in use

If you see "Address in use" errors, kill any existing processes:

```shell
pkill -9 ringallreduce
ssh 10.0.0.2 "pkill -9 ringallreduce"
```

#### SSH connection issues
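With more than two nodes, typing one `ssh … pkill` per host gets tedious. This hypothetical helper (not part of the launcher) derives the cleanup commands from the ring file so you can review them before running:

```python
def cleanup_commands(ring_text, binary="ringallreduce"):
    """Build one 'ssh <host> pkill' command per line of a ring file."""
    cmds = []
    for raw in ring_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        host = line.rsplit(":", 1)[0]     # drop the :port suffix
        cmds.append(f'ssh {host} "pkill -9 {binary}"')
    return cmds

if __name__ == "__main__":
    for cmd in cleanup_commands("10.0.0.1:9000\n10.0.0.2:9000"):
        print(cmd)
```

Printing (rather than executing) the commands keeps a human in the loop; pipe the output to `sh` only once you have checked it.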
Ensure SSH keys are properly configured and the nodes are reachable:

```shell
ssh-keyscan -H 10.0.0.1 >> ~/.ssh/known_hosts
ssh-keyscan -H 10.0.0.2 >> ~/.ssh/known_hosts
```

#### Binary not found
Make sure the binary exists and is executable:

```shell
ls -la /var/nextmini/ringallreduce
chmod +x /var/nextmini/ringallreduce
```

#### Ring file not found
If you see "failed to read ring file" errors, ensure you're using the correct `--remote-dir`:

```shell
# Correct: points to the ring-emu directory where ring.txt is located
--remote-dir /var/nextmini/ring-emu

# Incorrect: ring.txt is not in /var/nextmini directly
--remote-dir /var/nextmini
```

## Clean up
To clean up the dataplane worker nodes, use:

```shell
docker stack rm nextmini
```

To clean up the controller & db VM instance in DigitalOcean, use:

```shell
docker compose -f controller-swarm.yml down
```