Running Jobs with SLURM
Using a computer cluster with other users means sharing resources. SLURM (Simple Linux Utility for Resource Management) is a commonly used job scheduler: you submit your jobs to a queue, and SLURM allocates resources to run them as they become available.
The documentation on using SLURM for Spartan is quite comprehensive and can be found here.
Checking the status of your jobs
squeue
You can view information about jobs in the SLURM queue with the squeue command. View the help message with squeue --usage or the manual with man squeue.
# List all jobs in the queue
squeue
# List all jobs for account UOM0041
squeue --account=UOM0041
# List all jobs in the queue for user jchung in long format
squeue -l -u jchung
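You can also control which columns squeue prints with --format; for example (a sketch, using jchung as a placeholder username):
# Show job ID, name, state, and elapsed time for your jobs
squeue -u jchung --format="%.10i %.20j %.10T %.12M"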
If you’re using Spartan, you can also use the showq command.
scontrol show job
If you want information regarding a specific job ID, you can use scontrol.
scontrol show job <job-id>
sacct
You can check the status of recently finished jobs with sacct.
sacct
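sacct can also report on a specific job with a chosen set of fields; a sketch (this field list is just one reasonable choice):
# Show selected accounting fields for a recently finished job
sacct -j <job-id> --format=JobID,JobName,State,Elapsed,MaxRSS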
sinfo
You can also view the status of the nodes in the cluster with sinfo.
# Show status of all nodes
sinfo -Nel
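sinfo can also filter nodes by state; for example:
# List only the idle nodes
sinfo --states=idle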
If you want more information about a specific node, you can use scontrol.
# View information on the master node
scontrol show node master
Running your jobs
sbatch
Most of your jobs will be submitted to SLURM via sbatch. The commands you want to run need to be written in a script (a plain-text file that we’ll discuss further below), saved to a location, then submitted using sbatch.
# Print the help message from sbatch
sbatch --help
# Submit your script by specifying the name of your script
sbatch my-script.sh
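Options given to sbatch on the command line override the matching #SBATCH directives inside the script, which is handy for one-off changes; a sketch (the script name and time limit are illustrative):
# Override the script's time limit for this submission only
sbatch --time=0-2:0:0 my-script.sh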
sinteractive
You can use the sinteractive command to run your job in an interactive session. When SLURM allocates your job resources, you will be provided with an interactive terminal session. It is recommended to use sinteractive in conjunction with a terminal multiplexer such as GNU Screen so the job won’t terminate if you disconnect from the server (a minimal workflow is sketched after the examples below).
# Print the help message for sinteractive
sinteractive --help
# Submit a job with the default parameters
sinteractive
# Submit a job with 4 CPUs, 16 GB memory, and wall time of 1 day
sinteractive --ntasks=1 --cpus-per-task=4 --mem=16384 --time=1-0:0:0
Note that the memory amount is specified in MB.
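A minimal GNU Screen workflow around sinteractive might look like this (the session name is illustrative):
# Start a named screen session on the login node
screen -S myjob
# Request the interactive job from inside the screen session
sinteractive --ntasks=1 --cpus-per-task=4 --mem=16384 --time=1-0:0:0
# Detach with Ctrl-a d; later, reattach to the same session with:
screen -r myjob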
scancel
You can cancel a running job or a job in the queue with scancel.
scancel <job-id>
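scancel can also act on several jobs at once; for example (the username and job name are placeholders):
# Cancel all of your queued and running jobs
scancel -u jchung
# Cancel jobs by name
scancel --name=denovo_map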
Writing a SLURM script
For beginners, I recommend using a job script generator. If you’re using one of the PEARG clusters on the Nectar cloud (i.e. mozzie or rescue), you can ignore the “Project ID” and “Modules” fields.
Here’s an example of a simple SLURM script running on the mozzie server.
#!/bin/bash
#SBATCH --job-name=denovo_map
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# Run the Stacks denovo_map.pl pipeline with 8 threads (-T 8)
denovo_map.pl \
-m 3 -M 2 -n 1 -T 8 -b 1 -S \
-o denovo_map_m3_M2_n1_2017-11-09 \
-s ../processed_radtags/sample-1.fastq.gz \
-s ../processed_radtags/sample-2.fastq.gz \
-X "populations: --vcf"
The first line of the script must specify the interpreter the script will be executed with, such as bash or sh. Keep this as #!/bin/bash unless you have reason to change it.
Each line starting with #SBATCH is an option that SLURM’s sbatch command uses. You can view all available options with sbatch -h or by viewing the man page with man sbatch.
The #SBATCH options you’ll most likely use are:
- --job-name=XXX: You should always specify a name for your job.
- --nodes=1: In most cases, you should request one node so all the requested CPUs are on the same node.
- --ntasks=1: In most cases, you’ll be running one task per job.
- --cpus-per-task=X: The number of CPUs to request.
You can also direct your stdout and stderr into defined files:
- --output=my-file-%j.out
- --error=my-file-%j.err
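In a job script these look like the lines below; SLURM replaces %j with the job ID, so each run writes to its own pair of files (the filename prefix is illustrative):
#SBATCH --output=denovo_map-%j.out
#SBATCH --error=denovo_map-%j.err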
If you’re using Spartan, you’ll also need to specify memory in MB with:
- --mem=XXXXX for jobs using multiple CPUs, or
- --mem-per-cpu=XXXX for single-CPU jobs.
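A sketch of the two styles (the values are illustrative, and the two blocks are alternatives, not meant for the same script):
# Multi-CPU job: request the total memory for the whole job
#SBATCH --cpus-per-task=8
#SBATCH --mem=32768
# Single-CPU job: request memory per CPU instead
#SBATCH --mem-per-cpu=4096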
With Spartan, you’ll also need to specify the partition and a time limit:
- -p main: The partition is called ‘main’.
- --time=D-HH:MM:SS: The time limit for the job. If the job exceeds this limit, it is automatically terminated.
Here’s an example of a SLURM script for Spartan’s physical partition.
#!/bin/bash
# Partition for the job:
#SBATCH -p physical
# Account to run the job:
#SBATCH --account=punim0395
# Multithreaded (SMP) job: must run on one node
#SBATCH --nodes=1
# The name of the job:
#SBATCH --job-name="test-job"
# Maximum number of tasks/CPU cores used by the job:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# The amount of memory in megabytes per process in the job:
#SBATCH --mem=32768
# The maximum running time of the job in days-hours:mins:sec
#SBATCH --time=0-1:0:00
# Run the job from your home directory:
cd $HOME
# The job command(s):
sleep 10
When using Spartan, don’t forget to module load the software you need into your environment.
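For example, with the environment modules system (the module name and version here are hypothetical; use module avail to see what’s actually installed):
# Search for modules matching a name
module avail stacks
# Load a module into your environment (hypothetical name/version)
module load Stacks/2.0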