Running a SLURM job array

Overview
Teaching: 30 min
Exercises: 60 min

Questions
How do we execute a task in parallel?

Objectives
Write a batch script for a single task.
Implement --array and $SLURM_ARRAY_TASK_ID in the batch script.
Submit and monitor the job script.
Often we need to run a script across many input files or samples, or we need to run a parameter sweep to determine the best values for a model. Rather than painstakingly submitting a batch job for every iteration, we can use a SLURM job array to simplify the task.
If you disconnected, log back in to the cluster.
[you@laptop:~]$ ssh SUNetID@login.farmshare.stanford.edu
An Illustrative Example
For our example job array, we are going to read an input text file. Each row of the text file represents a unique sample. We will process each row separately, in its own job array step, and each array step will write its results to its own output text file.
To begin, use nano to create a text file called input.txt and enter the table below (you can copy and paste):
[SUNetID@rice-02:~]$ nano input.txt
SampleID SampleName NCats NDogs
001 Henry 0 2
002 Rob 1 1
003 Harmony 3 0
004 Nevin 0 0
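If you prefer not to type the table into an editor, a heredoc creates the same file in one step (an optional shortcut; the lesson itself uses nano):

```shell
# Write input.txt in one command; the quoted 'EOF' delimiter
# prevents the shell from expanding anything inside the table.
cat > input.txt << 'EOF'
SampleID SampleName NCats NDogs
001 Henry 0 2
002 Rob 1 1
003 Harmony 3 0
004 Nevin 0 0
EOF
```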
Our task is to add the number of cats NCats and the number of dogs NDogs for each sample to determine the total number of pets per sample. For each sample, we will create an output text file called sample-<SampleID>.txt. The output text file will contain a line of text as follows:
<SampleName> has a total of <NCats + NDogs> pets.
1. Write and Test an SBATCH Script for a Single Sample
Write a batch script for a single iteration of your workflow. We want to make sure it runs as expected before submitting potentially thousands of copies of our job to the scheduler. This is often where the most work needs to be done, so this is the longest section of the example (even though no parallelization is happening here!).
[SUNetID@rice-02:~]$ nano jobtest.sbatch
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=500m
#SBATCH --time=00:01:00
# Specify the input text file
input=$HOME/input.txt
# Select a single row to test
test_row=1
# Extract the SampleID
sample_id=$(awk -v i=$test_row '$1==i {print $1}' $input)
# Extract the SampleName
sample_name=$(awk -v i=$test_row '$1==i {print $2}' $input)
# Extract NCats
ncats=$(awk -v i=$test_row '$1==i {print $3}' $input)
# Extract NDogs
ndogs=$(awk -v i=$test_row '$1==i {print $4}' $input)
# Add ncats and ndogs to get npets
npets=$((ncats + ndogs))
# Specify output text filename
output=$HOME/sample-${sample_id}.txt
# Write to output file
echo "$sample_name has a total of $npets pets." >> $output
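Before submitting, you can sanity-check the awk extraction directly in your login shell. The snippet below recreates input.txt so it runs standalone, then pulls the fields for row 1 exactly as the batch script does:

```shell
# Recreate the sample table so this check runs on its own.
cat > input.txt << 'EOF'
SampleID SampleName NCats NDogs
001 Henry 0 2
002 Rob 1 1
003 Harmony 3 0
004 Nevin 0 0
EOF

# Extract each field for the test row, just like jobtest.sbatch.
test_row=1
sample_id=$(awk -v i=$test_row '$1==i {print $1}' input.txt)
sample_name=$(awk -v i=$test_row '$1==i {print $2}' input.txt)
ncats=$(awk -v i=$test_row '$1==i {print $3}' input.txt)
ndogs=$(awk -v i=$test_row '$1==i {print $4}' input.txt)
echo "id=$sample_id name=$sample_name pets=$((ncats + ndogs))"
```

Note that awk compares $1 to i numerically here, so the test row 1 matches SampleID 001 while the header row is skipped.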
Now we can submit our test sbatch script.
[SUNetID@rice-02:~]$ sbatch jobtest.sbatch
We can check the status of our job with squeue.
[SUNetID@rice-02:~]$ squeue --me
And then when the job is complete, we can verify that we get the expected output.
[SUNetID@rice-02:~]$ cat sample-001.txt
Henry has a total of 2 pets.
2. Set the --array SBATCH Directive
The SBATCH directive --array tells the scheduler how many copies of your code should run, or rather, how many job array steps there should be. In the case of our example, we have four samples, so --array=1-4.
- Create a copy of jobtest.sbatch and name it jobarray.sbatch.
- Add #SBATCH --array=1-4 to our list of SBATCH directives.
[SUNetID@rice-02:~]$ cp jobtest.sbatch jobarray.sbatch
[SUNetID@rice-02:~]$ nano jobarray.sbatch
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=500m
#SBATCH --time=00:01:00
#SBATCH --array=1-4
# Specify the input text file
input=$HOME/input.txt
# Select a single row to test
test_row=1
...
3. Use the $SLURM_ARRAY_TASK_ID Variable
Much like the iterator of a for loop, the $SLURM_ARRAY_TASK_ID variable is used to handle individual tasks or job array steps. In our example, where #SBATCH --array=1-4, we will have four separate array tasks corresponding to our four samples. For the first array task, where we process the first sample, $SLURM_ARRAY_TASK_ID will be set to 1; in the second array task, $SLURM_ARRAY_TASK_ID will equal 2; and so on.
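The for-loop analogy can be made literal. Outside of SLURM, you can simulate the four array tasks by looping over the task IDs yourself (a local sketch for intuition only; on the cluster, SLURM sets $SLURM_ARRAY_TASK_ID for each task automatically):

```shell
# Recreate the input table so the sketch runs standalone.
cat > input.txt << 'EOF'
SampleID SampleName NCats NDogs
001 Henry 0 2
002 Rob 1 1
003 Harmony 3 0
004 Nevin 0 0
EOF

# Each loop iteration plays the role of one array task.
for SLURM_ARRAY_TASK_ID in 1 2 3 4; do
    sample_id=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $1}' input.txt)
    sample_name=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $2}' input.txt)
    ncats=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $3}' input.txt)
    ndogs=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $4}' input.txt)
    echo "$sample_name has a total of $((ncats + ndogs)) pets." > sample-${sample_id}.txt
done
cat sample-*.txt
```

The difference on the cluster is that the four "iterations" run as independent jobs, potentially at the same time on different nodes.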
In our original test of a single task, we created the variable test_row and set it to 1. We then used test_row to extract variables from a single row of input.txt. Using test_row in this way was sort of like setting SLURM_ARRAY_TASK_ID=1.
In our production job, we will remove the line creating the test_row variable. We will then replace all instances of $test_row with $SLURM_ARRAY_TASK_ID.
[SUNetID@rice-02:~]$ nano jobarray.sbatch
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=500m
#SBATCH --time=00:01:00
#SBATCH --array=1-4
# Specify the input text file
input=$HOME/input.txt
# Extract the SampleID
sample_id=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $1}' $input)
# Extract the SampleName
sample_name=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $2}' $input)
# Extract NCats
ncats=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $3}' $input)
# Extract NDogs
ndogs=$(awk -v i=$SLURM_ARRAY_TASK_ID '$1==i {print $4}' $input)
# Add ncats and ndogs to get npets
npets=$((ncats + ndogs))
# Specify output text filename
output=$HOME/sample-${sample_id}.txt
# Write to output file
echo "$sample_name has a total of $npets pets." >> $output
4. Submit the Job Array
Now we can submit our job array to the scheduler. We only have to run the sbatch
command once, and SLURM will handle the creation of all the individual array tasks.
[SUNetID@rice-02:~]$ sbatch jobarray.sbatch
Submitted batch job 277394
When you submit the job array, you will receive a main job ID.
[SUNetID@rice-02:~]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
277394_1 normal jobarray SUNetID R 0:06 1 wheat-01
277394_2 normal jobarray SUNetID R 0:06 1 wheat-01
277394_3 normal jobarray SUNetID R 0:06 1 wheat-01
277394_4 normal jobarray SUNetID R 0:06 1 wheat-01
Each individual array task also receives its own ID of the form <jobID>_<taskID>, as shown in the JOBID column above.
Advanced Job Array Options
- %N: By default, SLURM will try to run all of your array tasks at once. If you are trying to run thousands of tasks, you will probably run into job submission limits. You can use the %N suffix to limit the number of simultaneous tasks. For example, #SBATCH --array=1-100%10 will submit 100 total array tasks but allow only 10 to run at a time.
- Select array steps: You can specify particular array steps by changing the value of --array. --array=5 will submit only array task 5. --array=1,6,9 will submit array tasks 1, 6, and 9. --array=0-100:10 will submit array tasks 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.
Key Points
Parallel programming allows applications to take advantage of parallel hardware.
The queuing system facilitates executing parallel tasks.