SLURM
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler designed for high-performance computing clusters. It is widely used in research, academia, and industry to efficiently manage and allocate computing resources such as CPUs, GPUs, memory, and storage for running various types of jobs and tasks. Slurm helps optimize resource utilization, minimizes job conflicts, and provides a flexible framework for distributing workloads across a cluster of machines. It offers features like job prioritization, fair sharing of resources, job dependencies, and real-time monitoring, making it an essential tool for orchestrating complex computational workflows in diverse fields.
Availability
| Software | Module Load Command |
| --- | --- |
| slurm | `module load wulver` |
Please note that the `wulver` module is already loaded when a user logs in to the cluster. If you use the `module purge` command, make sure to use `module load wulver` in the SLURM script to load SLURM.
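For example, a job script that clears the environment should reload the module before using any SLURM commands (a minimal sketch):

module purge          # removes all loaded modules, including wulver
module load wulver    # restores the SLURM environment provided by the wulver module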
Application Information, Documentation
The documentation of SLURM is available at SLURM manual.
Managing and Monitoring Jobs
SLURM has numerous tools for monitoring jobs. Below are a few to get started. More documentation is available on the SLURM website.
The most common commands are:
- List all current jobs: `squeue`
- Delete a job: `scancel [job_id]`
- Submit a batch job: `sbatch [submit script]`
- Run a command: `srun <slurm options> <command name>`
SLURM User Commands
| Task | Command |
| --- | --- |
| Job submission | `sbatch [script_file]` |
| Job deletion | `scancel [job_id]` |
| Job status by job | `squeue -j [job_id]` |
| Job status by user | `squeue -u [user_name]` |
| Job hold | `scontrol hold [job_id]` |
| Job release | `scontrol release [job_id]` |
| List enqueued jobs | `squeue` |
| List nodes | `sinfo -N` or `scontrol show nodes` |
| Cluster status | `sinfo` |
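For example, to hold a queued job so it will not be scheduled, release it later, or cancel it (the job ID 635 is illustrative):

squeue -u $LOGNAME      # list your jobs and find the ID in the JOBID column
scontrol hold 635       # keep job 635 in the queue without letting it start
scontrol release 635    # make job 635 eligible for scheduling again
scancel 635             # or cancel job 635 entirely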
Using SLURM on Wulver
On Wulver, SLURM submissions have additional requirements, intended to share resources more fairly without impinging on investor/owner rights to computational resources. All jobs must now be charged to a PI (Principal Investigator) group account.
- To specify the account, use `--account=PI_ucid`. You can pass `--account` either as an `sbatch` command-line option or as an `#SBATCH` parameter (see the example below). If you don't know the PI's UCID, run `sacctmgr show user $LOGNAME`; the PI's UCID is listed under the `Def Acct` column.
[ab1234@login01 ~]$ sacctmgr show user $LOGNAME
User Def Acct Admin
---------- ---------- ---------
ab1234 xy1234 None
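For example, using the PI UCID xy1234 shown above, the account can be given either on the command line or inside the job script (submit.sh is the example script used later on this page):

sbatch --account=xy1234 submit.sh    # as an sbatch command-line option
#SBATCH --account=xy1234             # or as a directive inside the job script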
| Partition | Nodes | Cores/Node | CPU | GPU | Memory | Service Unit (SU) Charge |
| --- | --- | --- | --- | --- | --- | --- |
| `--partition=general` | 100 | 128 | 2.5 GHz AMD EPYC 7763 (2) | N/A | 512 GB | 1 SU per CPU per hour |
| `--partition=debug` | 1 | 4 | 2.5 GHz AMD EPYC 7763 (2) | N/A | 512 GB | No charge; must be used with `--qos=debug` |
| `--partition=gpu` | 25 | 128 | 2.0 GHz AMD EPYC 7713 (2) | NVIDIA A100 GPUs (4) | 512 GB | 3 SU per hour per GPU node |
| `--partition=bigmem` | 2 | 128 | 2.5 GHz AMD EPYC 7763 (2) | N/A | 2 TB | 1.5 SU per CPU per hour |
| QoS | Purpose | Rules | Wall Time Limit (hours) | Valid Users |
| --- | --- | --- | --- | --- |
| `--qos=standard` | Normal jobs, similar to Lochness "public" access | SU charges based on node type (see the partition table above); jobs can be preempted by queued high-QoS jobs | 72 | Everyone |
| `--qos=low` | Free access, no SU charge | Jobs can be preempted by queued high- or standard-QoS jobs | 72 | Everyone |
| `--qos=high` | Only available to owners/investors | Highest-priority jobs, no SU charges | 72 | Owner/investor PI groups |
| `--qos=debug` | Intended for debugging and testing jobs | No SU charges; maximum of 4 CPUs allowed; must be used with `--partition=debug` | 8 | Everyone |
SUs are charged when `--qos=standard` is used on the SLURM job. It's important to regularly check SU usage so that users are aware of their consumption and can switch to `--qos=low` to avoid exhausting all allocated SUs. Users can check their quota using the `quota_info UCID` command.
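For example, under `--qos=standard` an 8-core job on the `general` partition that runs for 10 hours is charged 8 CPUs × 10 hours × 1 SU per CPU per hour = 80 SUs. To run the same job without consuming SUs, at the cost of possible preemption, change the QoS directive in the job script:

#SBATCH --qos=low    # no SU charge; the job may be preempted by standard- or high-QoS jobs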
Example SLURM Scripts
Submitting Jobs on CPU Nodes
Sample Job Script to use: submit.sh
#!/bin/bash -l
#SBATCH --job-name=job_name
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --account=PI_ucid # Replace PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
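The #SBATCH directives above only request resources; the commands that do the actual work follow them in the same file. A minimal sketch, assuming a hypothetical executable my_program in the submission directory:

module purge
module load wulver    # ensure the SLURM environment is available
srun ./my_program     # hypothetical executable; launches one task per allocated core (run serial programs without srun)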
- Here, the job requests 1 node with 8 cores on the `general` partition with `qos=standard`. Please note that the memory you can request depends on the number of cores you are requesting.
- As per the policy, users can request up to 4 GB of memory per core; therefore the `--mem-per-cpu` flag is used to specify the memory requirement.
- In the above script, `--time` indicates the wall time, which specifies the maximum amount of time that a job is allowed to run. The maximum allowable wall time depends on the SLURM QoS, which you can find in the QoS table above.
- To submit the job, use `sbatch submit.sh`, where `submit.sh` is the job script. Once the job has been submitted, it enters the queue and is executed based on priority-based scheduling.
- To check the status of the job, use `squeue -u $LOGNAME` and you should see the following:
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    635   general job_name     ucid  R    00:02:19      1 n0088
  Here, `ST` stands for the status of the job. You may see the status `ST` as `PD`, which means the job is pending and has not yet been assigned resources. How long that takes depends on the number of users using the partition and the resources requested by the job. Once the job starts, you will see the output file with a `.out` extension. If the job produces any errors, you can check the details of the error in the file with the `.err` extension.
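Once a job has finished it no longer appears in `squeue`, but you can still review it with SLURM's accounting command `sacct`, for example (using the illustrative job ID 635 from above):

sacct -j 635 --format=JobID,JobName,Partition,Elapsed,State,MaxRSS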
Submitting Jobs on GPU Nodes
To submit jobs on GPU nodes, you can use the following SLURM script.
Sample Job Script to use: gpu_submit.sh
#!/bin/bash -l
#SBATCH --job-name=gpu_job
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=gpu
#SBATCH --qos=standard
#SBATCH --account=PI_ucid # Replace PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:2
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
This will request 2 GPUs per node on the `gpu` partition.
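As with the CPU example, the commands that run the work follow the directives. A minimal sketch, assuming a hypothetical GPU-enabled script train.py and that any modules your application needs are already loaded:

module purge
module load wulver    # ensure the SLURM environment is available
nvidia-smi            # optional: confirm the two allocated A100 GPUs are visible
python train.py       # hypothetical GPU-enabled script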
Submitting Jobs with the debug QoS
The "debug" QoS in Slurm is intended for debugging and testing jobs. It usually provides a shorter queue wait time and quicker job turnaround. Jobs submitted with the "debug" QoS have access to a limited set of resources (Only 4 CPUS on Wulver), making it suitable for rapid testing and debugging of applications without tying up cluster resources for extended periods.
Sample Job Script to use: debug_submit.sh
#!/bin/bash -l
#SBATCH --job-name=debug
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=debug
#SBATCH --qos=debug
#SBATCH --account=PI_ucid # Replace PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=7:59:00 # D-HH:MM:SS, Maximum allowable Wall Time 8 hours
#SBATCH --mem-per-cpu=4000M
Interactive Session on a Compute Node
Interactive sessions are useful for tasks that require direct interaction with the compute node's resources and software environment. To start an interactive session on a compute node, use one of the following commands after logging into Wulver:
srun -p general -n 1 --ntasks-per-node=8 --qos=standard --account=PI_ucid --mem-per-cpu=2G --time=59:00 --pty bash
srun -p gpu -n 1 --ntasks-per-node=8 --qos=standard --account=PI_ucid --mem-per-cpu=2G --gres=gpu:2 --time=59:00 --pty bash
srun -p debug -n 1 --ntasks-per-node=4 --qos=debug --account=PI_ucid --mem-per-cpu=2G --time=59:00 --pty bash
Replace `PI_ucid` with the PI's NJIT UCID.
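For example, an interactive debug session might look like the following (the prompts, account, and node name are illustrative); type exit to end the session and release the allocation:

[ab1234@login01 ~]$ srun -p debug -n 1 --ntasks-per-node=4 --qos=debug --account=xy1234 --mem-per-cpu=2G --time=59:00 --pty bash
[ab1234@n0088 ~]$ module load wulver
[ab1234@n0088 ~]$ exit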
Warning
Login nodes are not designed for running computationally intensive jobs. You can use the head node to edit and manage your files, or to run small-scale interactive jobs; CPU usage on the head node is limited per user. Therefore, for serious computing, either submit the job using the `sbatch` command or start an interactive session on a compute node.
Note
Please note that if you are using GPUs, check whether your script is parallelized. If your script is not parallelized and only depends on the GPU, then you don't need to request more cores per node; in that case use `--ntasks-per-node=1`, which requests 1 CPU per GPU. Keep in mind that requesting multiple cores on GPU nodes may result in unnecessary CPU-hour charges. Additionally, this practice makes service unit accounting significantly easier. A sketch is shown below.
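For example, a job that runs a single GPU-bound process could request one GPU and a single CPU task (a minimal sketch; adjust the GPU count to what your code actually uses):

#SBATCH --partition=gpu
#SBATCH --qos=standard
#SBATCH --gres=gpu:1          # one GPU
#SBATCH --ntasks-per-node=1   # one CPU task to drive the GPU; avoids unnecessary CPU-hour charges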