SLURM

SLURM is a job scheduling and resource management system used at Davidson College to facilitate the use of Research Computing resources. SLURM organizes computational work into bundles of resources and code to be executed using those resources. These bundles are called jobs. You can create a job that utilizes available nodes to run your program.

Basic Usage

Submitting Jobs with sbatch

To run your programs through SLURM, you first create a job-request (by writing a simple script) and then submit it. Once SLURM receives your job-request, it allocates the necessary resources to execute your job.

You can create your job-request as a bash script and then submit the request using the sbatch [script name] command.

Example

Create two new files named "test_script.sh" and "python_script.py" in your current directory.
"test_script.sh" should have the following content:

#!/bin/bash

### The following are the job-request specifications:

#SBATCH --job-name "Test Job"   ## Name of the job  
#SBATCH --ntasks 1              ## Number of tasks for MPI workers
#SBATCH --cpus-per-task 1       ## Number of CPUs to be used for each task
#SBATCH --partition basic       ## The partition the job will run on
#SBATCH --account public        ## The group name the job will be associated with

### The following are the command(s) you wish to run:

python3 python_script.py

"python_script.py" should have the following content:

print("running a python script")

Once you save the files, submit the job-request by running the sbatch test_script.sh command.
After successfully submitting a job, you will receive a response including the job-id of the submitted job: Submitted batch job [job-id].

Note

Here are some of the most commonly used commands:

  • squeue --me lists all the jobs you have submitted
  • scancel [job-id] cancels the job with the given job-id
  • sacct displays accounting information for your jobs, including their current state
  • sstat [job-id] displays status information (memory/CPU usage, number of steps, etc.) for a running job
  • scrontab submits recurring jobs
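
For example, a typical check on a job right after submission looks like this (job-id illustrative):

squeue --me          ## list your pending and running jobs
sacct -j [job-id]    ## show accounting information for one job
scancel [job-id]     ## cancel the job if it is no longer needed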

Warning

Be mindful of others while running heavy computations; a resource-intensive job may impact the workflows of other users.

Once a job completes, a text file named slurm-[job-id].out containing the output of your commands is created in the directory from which you submitted the job.
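
For example, you could inspect the output of the test job above with (job-id illustrative):

cat slurm-[job-id].out     ## should print: running a python script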

Submitting Jobs with salloc

Batch jobs submitted with the sbatch command do not allow interactive user input. You can use the salloc command for interactive job-requests instead. This command requires you to specify the desired resources through options such as --nodes, --mem, and --ntasks. Once the resources are allocated, a shell will be provided to you. You can run scripts inside this shell and debug/test them immediately.

Example

To allocate 1 node with 4G of memory and 1 task, you can use the salloc --nodes=1 --mem=4G --ntasks=1 command. SLURM will allocate the resources and then provide a shell. The output will look something like this:

 anscott:compute0:~$ salloc --nodes=1 --mem=4G --ntasks=1
 salloc: Granted job allocation 102176
 salloc: Waiting for resource configuration
 salloc: Nodes gpu0 are ready for job
 bash-4.4$
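
Inside this shell, you can launch programs on the allocated resources directly, for example using the python_script.py from earlier:

srun python3 python_script.py     ## runs as a task within the allocation
exit                              ## releases the allocation when you are done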

Warning

Once you are done using the salloc command, do not forget to release the resources by exiting the shell with the exit command.

Warning

If your terminal session dies, a job submitted with sbatch will keep running, unlike a job started within salloc. For this reason, it is recommended that you submit jobs using the sbatch command.

Job Specifications

SLURM provides various ways to customize your job-request file. You can give a job-request a total time limit or a custom name, among many other options. To make these customizations, edit your job-request file by adding options.

Example

You can add a total run time limit to the job-request, which will stop the job after the given time. For this purpose, the --time option can be used; it accepts formats such as [MM:SS] and [HH:MM:SS].
Add the #SBATCH --time 05:00 option below the other options in "test_script.sh" to limit the job's execution to 5 minutes.

#!/bin/bash

#SBATCH --job-name "Test Job"  
#SBATCH --ntasks=1  
#SBATCH --cpus-per-task=1  
#SBATCH --partition basic  
#SBATCH --account public  
#SBATCH --time 05:00

python3 python_script.py

The complete list of options is available on the official SLURM sbatch docs page.
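
For illustration, a few other commonly used options (the values here are only examples):

#SBATCH --mem 8G                      ## total memory per node
#SBATCH --output result_%j.out        ## custom output file; %j expands to the job-id
#SBATCH --mail-type END               ## send an email when the job finishes
#SBATCH --mail-user [your email]      ## address the notification is sent to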

Warning

There are limitations on the amount of resources you can use with your job-requests. If you add a job-specification such as --mem 100G, which requests 100 GB of memory per node, these limits may override your specification.
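
To check the limits that apply to a partition, you can query SLURM directly:

scontrol show partition basic     ## prints limits such as MaxTime and MaxMemPerNode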

Further Information

  • You can display which partitions are available for job-requests using the sinfo command. The default partition is named "basic". If you wish to use other partitions, you can reach out to ti@davidson.edu with a request.
  • To have access to the GPUs, you must have a faculty member support your request.
  • For troubleshooting, official SLURM documentation is a good place to start.

Advanced Usage

SLURM executes programs on the Davidson College Research Computing Nodes, and there are various ways to utilize the resources these Compute Nodes provide. Depending on the program you are running, you can allocate resources to improve its execution time.

Jobs and Tasks

Fundamentally, a job consists of two things: an allotment of resources and snippets of code to execute using that allotment. A task, on the other hand, is a subset of a job that is assigned to a single compute node for execution.

Advanced Scheduling

When you submit a job-request to SLURM, you should consider how many tasks you want to run in parallel and which resources should be assigned to each task. Several sbatch options control these allocations; the sections below run through the most important ones with examples.

Single Threaded Jobs

By default, when a job is submitted to SLURM with sbatch, SLURM allocates the requested resources and runs the script as a single task on a single CPU.

Example

In your current directory, create two files: one named bash_script.sh and the other python_task.py. The bash_script.sh file will serve as your job-request file, which contains the job specifications. These include options such as:

  • The job name associated with the job-request (--job-name),
  • Which account and partition it uses (--account and --partition),
  • How many tasks it will run in total (--ntasks),
  • How many nodes it will run on (--nodes),
  • How much memory is allocated per CPU (--mem-per-cpu), and
  • How many CPUs will be allocated per task (--cpus-per-task).

Copy the following into your bash_script.sh file:

#!/bin/bash

### Job Specifications

#SBATCH --job-name="Single Threaded, Single Task Job"  # job name
#SBATCH --ntasks=1                                     # number of tasks in the script
#SBATCH --nodes=1                                      # number of nodes the job will allocate
#SBATCH --mem-per-cpu=4G                               # memory allocated per CPU
#SBATCH --cpus-per-task=1                              # number of CPUs allocated per task
#SBATCH --partition=basic                              # name of the partition this job will run on
#SBATCH --account=public                               # name of the account using resources for the job

### The following are the command(s) you wish to run:

python3 python_task.py

Copy the following into your python_task.py file:

 print("Executing a single threaded job.")

You can submit the job-request using the sbatch bash_script.sh command.

Multi-Task Jobs

Each program you run in SLURM can be considered a task. If you have multiple programs, you can list each of them with its own srun command in your job-submission file. You can scale the resource allocation using the --ntasks option, which ensures that each task is allocated the same amount of resources. In addition, you can run the programs in parallel by launching them in the background.

Example

Change the contents of bash_script.sh with the following:

#!/bin/bash

#SBATCH --job-name="Single Threaded, Multi-Task Job"  
#SBATCH --ntasks=2  
#SBATCH --nodes=1  
#SBATCH --mem-per-cpu=4G  
#SBATCH --cpus-per-task=1  
#SBATCH --partition=basic  
#SBATCH --account=public

### The following are the tasks that will be launched

srun python3 ./python_task.py &     ## Task 1
srun python3 ./python_task.py &     ## Task 2
                                    ## The "&" sign runs the program in the background.
wait                                ## "wait" pauses the script execution until
                                    ## previously submitted tasks are completed

In this job-request file, we run python_task.py twice as two tasks. The job allocates one node (--nodes=1), and each task is given 4 GB of memory per CPU (--mem-per-cpu=4G) and one CPU (--cpus-per-task=1). If you had another task, it would be appropriate to change the number of tasks (--ntasks) to 3.

Each submitted task runs in the background, enabled by the "&" at the end of the command. This means that both tasks run in parallel, and the job completes only after both task executions have finished.
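
If you have many identical tasks, writing one srun line per task gets tedious. A loop achieves the same effect; this is a minimal sketch, assuming --ntasks in the job-request matches the loop count:

for i in 1 2; do
    srun --ntasks=1 python3 ./python_task.py &  ## launch each task in the background
done
wait                                            ## block until every task finishes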

Customized Multi-Task Jobs

When a task is executed using sbatch, it inherits the specifications made in the job-request file. You can change the specifications for an individual task by adding options to its srun command in the job-request file.

Example

#!/bin/bash

#SBATCH --job-name="Single Threaded, Multi-Task Job"  
#SBATCH --ntasks=2  
#SBATCH --nodes=1  
#SBATCH --mem-per-cpu=4G  
#SBATCH --cpus-per-task=1  
#SBATCH --partition=basic  
#SBATCH --account=public

srun --mem-per-cpu=8G python3 ./long_python_task.py &  ## Task 1
srun python3 ./short_python_task.py &                  ## Task 2
wait

In this job-request, there are two tasks with different execution times. Assume Task 1 takes a long time to complete while Task 2 takes a shorter time. You can increase the memory allocated to Task 1 to shorten its execution time. In this case, Task 1 will inherit every option other than memory per CPU (--mem-per-cpu) and will be allocated 8 GB of memory, while Task 2 will only be allocated 4 GB of memory.

Multithreaded Jobs

If your program happens to support multithreading, you can also improve the execution time of a task by assigning it additional cores. This allows parts of the task to run concurrently.

Example

Replace the contents of the bash_script.sh with the following:

#!/bin/bash

#SBATCH --job-name="Single Multithreaded-Task Job"
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=2
#SBATCH --partition=basic
#SBATCH --account=public

srun python3 python_script.py

Replace the contents of the python_script.py with the following:

import threading

# Define the function threads will execute
def test_func():
    print("Current thread is:", threading.get_ident())

# Create two threads, t1 and t2
t1 = threading.Thread(target=test_func)
t2 = threading.Thread(target=test_func)

# Start the threads to execute the function
t1.start()
t2.start()

# Wait for threads to complete their execution
t1.join()
t2.join()

print("Completed the execution of script")

In this example, the python_script.py script spawns two threads that execute the same function. In the job-request script, you can specify how many CPUs will be allocated for each task (--cpus-per-task=2). This ensures that 2 CPUs are available throughout the execution of the task.
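
When --cpus-per-task is set, SLURM exports the value to the job as the SLURM_CPUS_PER_TASK environment variable, so a script can size its worker pool to match the allocation. For programs that honor OpenMP settings, a common pattern in the job-request file is:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  ## match thread count to allocated CPUs
srun python3 python_script.py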

Warning

The relationship between execution time and the number of cores is not strictly linear. As you increase the number of cores, you may not observe a proportional reduction in the completion time.

GPUs

The Davidson College Research Computing Nodes also have access to several GPUs (Graphical Processing Units) to accelerate the computations of certain programs. Currently, the available GPUs are quadro_rtx_5000 and rtx_a6000.
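
To see which nodes host which GPUs, you can list each node's generic resources (GRES) with sinfo:

sinfo -o "%N %G"     ## node names alongside their generic resources, including GPUs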

Warning

In order to have access to use the GPUs, you must have faculty sponsorship. Contact T&I to request access to the GPUs.

You can specify the GPU you want to use in your job-request file using the --gres=gpu:[gpu-name] option.

Example

One of the libraries that can make use of the GPUs is the PyTorch library in Python.

Replace the contents of the bash_script.sh with the following:

#!/bin/bash
#SBATCH --job-name="GPU Task"
#SBATCH --account=public
#SBATCH --partition=basic
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --gres=gpu:quadro_rtx_5000                  # GPU allocated for the job

module purge
module load foss PyTorch/1.12.1-CUDA-11.7.0         # Load the required modules 
python3 gpu_pytorch.py                              # Run the Python script

Create a new file named gpu_pytorch.py with the following content:

## Import the time and pytorch libraries
from time import perf_counter
import torch

N = 500
quadro = torch.device('cuda')
example_tensor = torch.randn(N, N, dtype=torch.float64, device=quadro) ## Create an N x N tensor of standard normal values on the GPU
t0 = perf_counter()                                                    ## Start a timer
u, s, v = torch.svd(example_tensor)                                    ## Compute the SVD (singular value decomposition) of the tensor
elapsed_time = perf_counter() - t0

print("Execution time: ", elapsed_time)
print("Result: ", torch.sum(s).cpu().numpy())

This script will run a basic PyTorch program that utilizes the GPU selected in the bash_script.sh file.

Note

You can only utilize the GPU if your program is specifically designed to use GPU resources. To check whether your program is actually using the GPU allocated for your job, use the nvidia-smi command.
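
One simple approach is to add the command to your job script just before your program runs, so its report lands in the job's output file:

nvidia-smi     ## prints the allocated GPU and the processes currently using it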

There are two ways to allocate GPUs through SLURM. You can either allocate a whole GPU using the --gres=gpu:[gpu-name] option, or allocate part of a GPU using the --gres=shard:[gpu-name]:[sharding-number] option.
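
For example, a shard request in a job-request file might look like this (the shard count is illustrative and depends on how the cluster is configured):

#SBATCH --gres=shard:quadro_rtx_5000:2     ## request 2 shards of a quadro_rtx_5000 GPU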