Slurm Docker Cluster

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

🏁 Getting Started

To get up and running with Slurm in Docker, make sure you have the following tools installed:

  • Docker
  • Docker Compose
  • Git and Make (used by the commands below)

Clone the repository:

git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster

🔢 Choosing Your Slurm Version

This project supports multiple Slurm versions. To select your version, copy .env.example to .env and set SLURM_VERSION:

cp .env.example .env
# Edit .env and set:
SLURM_VERSION=25.05.3   # Latest stable (default)
# Or:
SLURM_VERSION=24.11.6   # Previous stable release

Supported versions: 25.05.x, 24.11.x
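
The same edit can be scripted, for example in CI. The following is a minimal sketch rather than the project's own tooling: the function name set_slurm_version is hypothetical, and it assumes .env.example sits in the current directory and that only the 25.05.x and 24.11.x series listed above should be accepted.

```shell
# Hypothetical helper: write SLURM_VERSION into .env, creating .env from
# .env.example on first use. Runs in a subshell so its variables stay local.
set_slurm_version() (
    version=$1
    case "$version" in
        *[!0-9.]*)           # anything other than digits and dots
            echo "invalid Slurm version: $version" >&2; exit 1 ;;
        25.05.?*|24.11.?*)   # the two supported series
            ;;
        *)
            echo "unsupported Slurm version: $version" >&2; exit 1 ;;
    esac

    [ -f .env ] || cp .env.example .env || exit 1

    if grep -q '^SLURM_VERSION=' .env; then
        # Rewrite the existing assignment through a temp file, which avoids
        # the differences between GNU and BSD "sed -i".
        sed "s/^SLURM_VERSION=.*/SLURM_VERSION=$version/" .env > .env.tmp &&
            mv .env.tmp .env
    else
        echo "SLURM_VERSION=$version" >> .env
    fi
)

# Usage:
#   set_slurm_version 24.11.6
```

Rejecting unknown series before touching .env keeps a typo from triggering a build that the project does not support.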

🚀 Quick Start (Using Make)

The easiest way to get started is using the provided Makefile:

# Build and start the cluster
make up

# Run tests to verify everything works
make test

# View cluster status
make status

See all available commands:

make help

📦 Containers and Volumes

This setup consists of the following containers:

  • mysql: Stores job and cluster data.
  • slurmdbd: Manages the Slurm database.
  • slurmctld: The Slurm controller responsible for job and resource management.
  • slurmrestd: REST API daemon for HTTP/JSON access to the cluster.
  • c1, c2: Compute nodes (running slurmd).

Persistent Volumes:

  • etc_munge: Mounted to /etc/munge - Authentication keys
  • etc_slurm: Mounted to /etc/slurm - Configuration files (allows live editing)
  • slurm_jobdir: Mounted to /data - Job files shared across all nodes
  • var_lib_mysql: Mounted to /var/lib/mysql - Database persistence
  • var_log_slurm: Mounted to /var/log/slurm - Log files

🛠️ Building and Starting the Cluster

Building

The easiest way to build and start the cluster is using Make:

# Build images with default version (25.05.3)
make build

# Or build and start in one command
make up

To build a different version, update SLURM_VERSION in .env:

make set-version VER=24.11.6

# Build
make build

Alternatively, use Docker Compose directly:

docker compose build

Starting

Start the cluster in detached mode:

make up

Check cluster status:

make status

View logs:

make logs

Note: The cluster automatically registers itself with SlurmDBD on first startup. Wait about 15-20 seconds after starting for all services to become healthy and auto-register.
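
Rather than sleeping for a fixed time, a script can poll until the controller responds. This is a minimal sketch under a few assumptions: wait_for is a hypothetical helper, the controller container is named slurmctld as elsewhere in this README, and a successful sinfo is used only as a rough proxy for readiness (it shows that slurmctld is answering, not that registration with SlurmDBD has finished).

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the given number of attempts is used up.
wait_for() (
    attempts=$1
    shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            exit 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "gave up after $attempts attempts: $*" >&2
    exit 1
)

# Usage: allow up to 30 seconds for the controller to answer.
#   wait_for 30 docker exec slurmctld sinfo
```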

🖥️ Using the Cluster

Accessing the Controller

Open a shell in the Slurm controller:

make shell
# Or: docker exec -it slurmctld bash

Check cluster status:

[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      2   idle c[1-2]

Submitting Jobs

The /data directory is shared across all nodes for job files:

[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="hostname"
Submitted batch job 2
[root@slurmctld data]# cat slurm-2.out
c1
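
For anything longer than a one-liner, a script file is usually easier to manage than --wrap. The sketch below is illustrative only: the helper write_hello_job and the file name hello.sh are hypothetical, and it assumes bash is available on the compute nodes. The #SBATCH options used (--job-name, --nodes, --output with the %j job ID placeholder) are standard sbatch options.

```shell
# Hypothetical helper: write a small batch script into the given directory.
# On the cluster, pass /data so every node can see the script and its output.
write_hello_job() {
    cat > "$1/hello.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --output=hello-%j.out
echo "running on $(hostname)"
EOF
}

# Usage, inside the controller shell:
#   write_hello_job /data
#   cd /data
#   sbatch --wait hello.sh   # --wait blocks until the job has finished
#   cat hello-*.out
```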

Running Example Jobs

Use the included example scripts:

make run-examples

This runs sample jobs including simple hostname tests, CPU-intensive workloads, multi-node jobs, and more.

🔄 Cluster Management

Stop the cluster (keeps data):

make down

Restart the cluster:

make up

Complete cleanup (removes all data and volumes):

make clean

For more workflows including configuration updates, version switching, and testing, see the Common Workflows section below.

⚙️ Advanced Configuration

Live Configuration Updates

With the etc_slurm volume mounted, you can modify configurations without rebuilding:

Method 1 - Direct editing (persists across restarts):

docker exec -it slurmctld vi /etc/slurm/slurm.conf
make reload-slurm

Method 2 - Push changes from config/ directory:

# Edit config files locally in config/25.05/ or config/common/
vi config/25.05/slurm.conf

# Push to containers (automatically detects version from .env)
make update-slurm FILES="slurm.conf"

# Or update multiple files
make update-slurm FILES="slurm.conf slurmdbd.conf"

Method 3 - Rebuild image with new configs:

# For permanent changes
vi config/25.05/slurm.conf
make rebuild

This makes it easy to add/remove nodes or test new configuration settings dynamically.
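
Before pushing a local file with Method 2, it can help to see how it differs from what the running controller currently has. A minimal sketch, assuming the controller container is named slurmctld and the live files sit under /etc/slurm as described in the volumes list; diff_slurm_conf and the DOCKER override variable are hypothetical names introduced for this example.

```shell
# Docker command to use; overridable (e.g. DOCKER=podman) as an assumption
# of this sketch, not a feature of the project.
DOCKER=${DOCKER:-docker}

# Hypothetical helper: compare a local config file with the copy of the same
# name in the running controller. Exit status follows diff (0 = identical).
# In the output, "<" lines come from the container and ">" lines are local.
diff_slurm_conf() (
    local_file=$1
    name=$(basename "$local_file")
    "$DOCKER" exec slurmctld cat "/etc/slurm/$name" | diff - "$local_file"
)

# Usage:
#   diff_slurm_conf config/25.05/slurm.conf
```

If the container is not running, the comparison is made against empty input, so the whole local file shows up as a difference.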

📖 Common Workflows

Using Make (Recommended)

First-time Setup:

# Build and start cluster
make up

# Verify everything is working
make test

# Check cluster status
make status

Daily Development:

# View logs
make logs

# Open shell in controller
make shell

# Inside shell:
cd /data
sbatch --wrap="hostname"
squeue

Testing Changes:

# After editing config files
make down
make start
make test

Cleanup:

# Stop cluster (keeps data)
make down

# Complete cleanup (removes all data)
make clean

Example: Running Test Jobs

# Start cluster
make start

# Copy example jobs to cluster
docker cp examples/jobs slurmctld:/data/

# Submit a simple job
docker exec slurmctld bash -c "cd /data/jobs && sbatch simple_hostname.sh"

# Submit a multi-node job
docker exec slurmctld bash -c "cd /data/jobs && sbatch multi_node.sh"

# Watch job queue
docker exec slurmctld squeue

# View job outputs
docker exec slurmctld bash -c "ls -lh /data/jobs/*.out"
docker exec slurmctld bash -c "cat /data/jobs/hostname_test_*.out"
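
The output files only exist once the jobs have run, so a script that submits and then reads results needs to wait in between. One way to do that is to poll squeue until it lists nothing. This is a sketch under stated assumptions: wait_for_empty_queue and the SQUEUE override variable are hypothetical, and it relies on squeue's default behaviour of listing only pending, running, and completing jobs.

```shell
# Command used to query the queue. Set SQUEUE=squeue when already inside
# the controller shell.
SQUEUE=${SQUEUE:-"docker exec slurmctld squeue"}

# Hypothetical helper: poll once per second until squeue lists no jobs,
# giving up after the given number of polls (default 60).
wait_for_empty_queue() (
    polls=${1:-60}
    while [ "$polls" -gt 0 ]; do
        # --noheader leaves only job rows, so empty output means no jobs.
        # $SQUEUE is left unquoted on purpose so it splits into words.
        out=$($SQUEUE --noheader) || exit 1
        if [ -z "$out" ]; then
            exit 0
        fi
        polls=$((polls - 1))
        sleep 1
    done
    echo "jobs still queued" >&2
    exit 1
)

# Usage:
#   wait_for_empty_queue 120 && docker exec slurmctld bash -c "ls -lh /data/jobs/*.out"
```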

Example: Testing Different Slurm Versions

# Check current version
make version

# Build all supported versions
make build-all

# Test a specific version
make test-version VER=24.11.6

# Test all versions (comprehensive)
make test-all

# Switch to a different version and use it
make set-version VER=24.11.6
make rebuild
make test

Example: Development Workflow

# Morning: Start cluster
make start

# Work on features, test locally
make test

# Check logs if issues arise
make logs

# Evening: Stop cluster
make down

# Next day: Quick restart
make start

Makefile Commands Reference

Command                          Description
make help                        Show all available commands
make build                       Build Docker images
make up                          Start containers
make down                        Stop containers
make clean                       Remove containers and volumes
make logs                        Show container logs
make test                        Run test suite
make status                      Show cluster status
make shell                       Open shell in slurmctld
make update-slurm FILES="..."    Update config files from config/ directory
make reload-slurm                Reload Slurm config without restart

Multi-Version Commands

make version                     Show current Slurm version
make set-version VER=24.11.6     Set Slurm version in .env
make build-all                   Build all supported versions
make test-version VER=24.11.6    Test a specific version
make test-all                    Test all supported versions

🤝 Contributing

Contributions from the community are welcome! If you want to add features, fix bugs, or improve documentation:

  1. Fork this repo.
  2. Create a new branch: git checkout -b feature/your-feature.
  3. Submit a pull request.

📄 License

This project is licensed under the MIT License.
