Running Jobs on the HSE University Supercomputer

Basic rules for performing calculations on the supercomputer

On the “login server” (the head server, sms), you may run only operations that do not require significant computing resources: compiling, building, test launches, etc.
All computing tasks should be performed only on computing nodes.
All user tasks are performed through the Slurm job management system.

How to perform calculations on the supercomputer?

When connected to the HSE University supercomputer, you are connected to the “login server”, and not to the computing nodes.
The login server is a shared resource that all users have access to. From this server, users start tasks, view the queue status, and manage their files. Only operations that do not require significant computing resources are allowed on it: compiling, building, test launches, etc.
To run the calculations themselves, use the Slurm task management system through the scripts and commands described below. They tell the supercomputer where to run the task, what resources it needs and how long it will take. Slurm will automatically run the task on the appropriate nodes or queue it if the requested resources are currently unavailable.

Basic commands needed to work with the task queue

sbatch - runs the task in "batch" mode (with the ability to fully prepare the environment, etc.); this is the preferred way.
srun - allows you to perform a task in interactive mode, usually used for debugging.
squeue - display all tasks in the queue.
mj - display the status of only your tasks.
mj --exp - display the expected start time of your pending tasks.
scancel <job_id> - cancel your task with the specified ID.

Task creation

The task consists of two parts: requesting resources and execution steps. The resource request indicates the required number of processor cores and graphics accelerators, expected execution time, memory and disk space, etc. The execution steps describe the tasks and programs that need to be run.
Tasks can be launched in “batch” mode (sbatch) with the ability to prepare the environment and load the necessary modules, as well as in “interactive” mode (srun), with the output of the calculation results on the screen.
Tasks in the queue are started for calculation according to priorities. In some cases, even if there are free resources, the task will not be able to occupy them if there is a higher priority task that is waiting for additional cores. One way to increase the priority of your task is to reduce the maximum possible time to complete it using the --time option.

Storing Calculations

To keep calculations and their results organized, it is recommended to group them into directories by the software used to run them, for example: python, MATLAB, etc. Then create a subdirectory for each specific task in the corresponding directory, for example, my_numpy_task in the python directory. In my_numpy_task, place the my_numpy_task.py script and the my_numpy_task.sbatch file that starts the calculation with sbatch. When sbatch my_numpy_task.sbatch is executed from that directory, the calculation results will also be written to it.
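
The layout described above can be prepared with a few lines of Python. This is only a minimal sketch: the helper name create_task_dir and its parameters are hypothetical and are not part of the supercomputer's tooling.

```python
from pathlib import Path


def create_task_dir(root, category, task, script_ext=".py"):
    """Create <root>/<category>/<task>/ with empty script and sbatch files.

    Existing files are left untouched. Returns the two file paths.
    """
    task_dir = Path(root) / category / task
    task_dir.mkdir(parents=True, exist_ok=True)
    script = task_dir / f"{task}{script_ext}"
    sbatch_file = task_dir / f"{task}.sbatch"
    for path in (script, sbatch_file):
        path.touch(exist_ok=True)
    return script, sbatch_file
```

For example, create_task_dir(".", "python", "my_numpy_task") would produce python/my_numpy_task/my_numpy_task.py and python/my_numpy_task/my_numpy_task.sbatch in the current directory.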

Running a task in batch using sbatch

A cluster is used most effectively in the batch mode of task execution. This requires a script file in which the resources are requested, the environment is prepared, and the necessary programs are then executed. While the task is running, its output and errors are saved to disk for further processing. Any task can be run in batch mode if it does not require a response from the user during operation. Most scientific software has special command-line options for working in batch mode.

The general startup syntax is as follows: sbatch [options] script_name.sbatch or sbatch [options] --wrap="start command"

Here script_name.sbatch is a script file that describes the required resources and the steps for completing the task. The general syntax of a script passed to sbatch is as follows:

#!/bin/bash
#SBATCH <sbatch key> <value>
...
#SBATCH <sbatch key> <value>
<user commands>
...
<user commands>

Options for sbatch can be passed both on the command line when invoking the command, and in the script itself (#SBATCH ...).

Recommended script template for sbatch:

#!/bin/bash
#SBATCH --job-name={task}            # Job name
#SBATCH --error={task}-%j.err        # File for error output
#SBATCH --output={task}-%j.log       # File for standard output
#SBATCH --time={12:00:00}            # Maximum execution time
#SBATCH --ntasks={16}                # Number of MPI processes
#SBATCH --nodes={1}                  # Required number of nodes
#SBATCH --gpus={4}                   # Required number of GPUs
#SBATCH --cpus-per-task={2}          # Number of CPU cores per task

module load {modules}                # Load a module
srun {options} {command}             # Calculation

Do not forget to change the parameters indicated in {curly brackets} to yours.
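
If many similar scripts are needed, the placeholders of the template can also be filled in programmatically. The following is a minimal sketch, assuming a hypothetical helper render_sbatch; it only builds the script text and does not submit anything.

```python
def render_sbatch(job_name, time_limit, command, ntasks=1, nodes=1,
                  cpus_per_task=1, gpus=0, modules=()):
    """Return the text of an sbatch script built from the template above."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --error={job_name}-%j.err",
        f"#SBATCH --output={job_name}-%j.log",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
    ]
    if gpus:
        # Request GPUs only when they are actually needed.
        lines.append(f"#SBATCH --gpus={gpus}")
    lines.append("")
    lines.extend(f"module load {module}" for module in modules)
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file such as my_task.sbatch and then submitted with sbatch in the usual way.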

For example, to run a python calculation on 4 CPUs and 2 GPUs, which requires preparing the environment, you need to write the following script:

#!/bin/bash
#SBATCH --job-name=my_numpy_task            # Job name
#SBATCH --error=my_numpy_task-%j.err        # File for error output
#SBATCH --output=my_numpy_task-%j.log       # File for standard output
#SBATCH --time=1:00:00                      # Maximum execution time
#SBATCH --cpus-per-task=4                   # Number of CPUs per task
#SBATCH --gpus=2                            # Required number of GPUs

module load Python/Anaconda_v11.2021        # Load Anaconda module
srun python my_numpy_task.py                # Performing a calculation

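The program's output will be written to the file my_numpy_task-<job_id>.log, and error messages to my_numpy_task-<job_id>.err, where <job_id> is the job ID that Slurm substitutes for %j.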

If no preparatory actions are needed to run the task, you can skip writing a script and pass the command directly through the --wrap option, for example:

sbatch -n 1 -c 2 --wrap="/home/testuser/my_program arg1 arg2"

Running Jobs in Array Mode

To run a large number of similar tasks that differ, for example, only in their input parameters, you can use the job array feature. Job arrays are submitted using the --array parameter. An example sbatch script using job arrays is shown below:

#!/bin/bash
#SBATCH --job-name=my_array_task            # Job name
#SBATCH --error=my_array_task-%A-%a.err     # Error output file
#SBATCH --output=my_array_task-%A-%a.log    # Result output file
#SBATCH --array=1-20                        # Task array with indices from 1 to 20

module load Python/Anaconda_v11.2021        # Load Anaconda module
idx=$SLURM_ARRAY_TASK_ID                    # Store the current task ID in the variable idx
echo "I am array job number $idx"           # Display the task ID
python my_array_task.py -i input_$idx.dat   # Run the computation with the input file input_$idx.dat

In this example, one master job will be created, and an array of 20 sub-tasks will be generated. The ID of each sub-task is stored in the variable $SLURM_ARRAY_TASK_ID, which can then be conveniently used to specify the input file.

The output of each sub-task will be written to a separate file named my_array_task-%A-%a.log, where %A is the master job ID and %a is the sub-task ID.
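
On the Python side, my_array_task.py has to find out which input file belongs to its sub-task. The sketch below is one possible way to do this, not the actual contents of that script: the function name resolve_input is hypothetical, and the fallback to SLURM_ARRAY_TASK_ID is an assumption added for illustration.

```python
import argparse
import os


def resolve_input(argv=None, env=None):
    """Return the input file name for this array sub-task.

    An explicit -i argument wins; otherwise the name is derived from
    SLURM_ARRAY_TASK_ID, which Slurm sets for every array sub-task.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", default=None)
    args = parser.parse_args(argv)
    if args.input is not None:
        return args.input
    task_id = env.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        raise SystemExit("no -i argument and SLURM_ARRAY_TASK_ID is not set")
    return f"input_{task_id}.dat"
```

Called without arguments inside the script, resolve_input() reads the real command line and environment; the argv and env parameters exist only so that the function can be tried outside of Slurm.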

For a detailed guide on using job arrays, refer to the official SLURM documentation.

Running a task interactively using srun

srun, unlike sbatch, runs the task interactively, i.e. you can redirect input and output or interact with the program directly through the terminal.
Resources are allocated in the same way as for sbatch, except that the queued request is for the resources rather than for the calculation itself. Once the resources have been allocated, the program starts running interactively.
An important feature of srun is its ability to manage the pool of MPI processes automatically. For example, to run an MPI program with 20 processes interactively, you can run the command srun -n 20 ./my_mpi_task
* srun can be executed inside sbatch (it can be useful for MPI tasks).

Queue Selection for Task Execution

In the Slurm job management system, several queues have been created for the convenience of users. These queues help distribute the resources of the supercomputer so that users can run stable tasks or quickly test code without long waiting times.

The following queues are available:

normal - the default queue for all users (recommended).
test - a queue for short debugging tasks. It duplicates the normal queue but tasks in it have a higher priority when starting. It does not preempt tasks in the normal queue but allows smaller fragments of the computing field to be automatically reserved for larger tasks.
gpu-ef-quick - a preemptive queue for debugging and educational tasks on E and F nodes (GPU A100 and H100). If a task for the E or F nodes appears in the higher-priority normal queue, tasks from the gpu-ef-quick queue will be automatically preempted (interrupted) and placed back in the waiting queue. Interactive tasks, such as Jupyter and salloc, will be canceled without restarting when preempted. E and F nodes (GPU A100 and H100) are not available to students according to the decision of the NTU SKC, but this queue will allow student computations when expensive GPUs are idle.
cpu-e-quick - a preemptive queue for fast CPU computations on GPU nodes of type E (non-profiled workload for GPU nodes). If a task for GPU nodes type E appears in the higher-priority normal queue, tasks from the cpu-e-quick queue will be automatically preempted and placed back in the queue. Not all tasks are preempted, only those that prevent specific GPUs from being utilized.

To submit tasks to a specific queue, you must use the -p option. For example, sbatch -p gpu-ef-quick my_task.

Selecting a Project for Task Execution

Users who are members of multiple active projects must explicitly specify the project in which the task will be executed. To do this, the --account <Project_ID> or -A <Project_ID> parameter must be specified in the sbatch script or any task execution command (sbatch/srun).

For example, you can add the parameter when submitting a job using sbatch -A proj_1 train.sbatch or specify it directly in the sbatch script:

#!/bin/bash
#SBATCH --job-name=my_task                  # Job name
#SBATCH --account=proj_1                    # Project identifier
#SBATCH --time=1:00:00                      # Maximum runtime
#SBATCH --cpus-per-task=4                   # Number of CPUs per task
#SBATCH --nodes=1                           # Number of nodes required
#SBATCH --gpus=2                            # Number of GPUs required

module load Python/Anaconda_v11.2021        # Load Anaconda module
python train.py                             # Run the computation

The list of projects the user is a member of can be viewed in the supercomputer console using the mp command.
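
When jobs are submitted from a helper program rather than typed by hand, the queue and project options can be assembled as an argument list. This is a minimal sketch; sbatch_command is a hypothetical helper that only builds the command and does not run it.

```python
def sbatch_command(script, partition=None, account=None, extra=()):
    """Return the sbatch command line as a list of arguments."""
    command = ["sbatch"]
    if partition is not None:
        command += ["-p", partition]   # queue to submit to
    if account is not None:
        command += ["-A", account]     # project the task belongs to
    command += list(extra)             # any further sbatch options
    command.append(script)
    return command
```

For example, sbatch_command("train.sbatch", account="proj_1") gives the same command as sbatch -A proj_1 train.sbatch. On the login server, the list could then be passed to subprocess.run.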

Automatic restart of tasks

The Slurm task scheduler supports automatic task restart. This function is useful for software that uses a checkpoint mechanism during calculation, for example, GROMACS or ANSYS. The task will be restarted automatically in the event of a node failure or after maintenance. To disable automatic restart, add the --no-requeue option when queuing the task:

sbatch --job-name=myjob --time=5-0:0 --no-requeue --wrap="/home/testuser/simple_program"
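
Automatic restart is only useful if the program can continue from where it stopped. For your own Python code, a checkpoint mechanism could look roughly like the sketch below. It is an illustration under assumptions: run_with_checkpoint, the JSON checkpoint format and the stop_after parameter (which merely simulates an interruption) are all hypothetical.

```python
import json
import os
from pathlib import Path


def run_with_checkpoint(total_steps, checkpoint, step_fn, stop_after=None):
    """Run step_fn(i) for i in range(total_steps), resuming from a checkpoint.

    Returns True when all steps are done, False if the run was interrupted.
    stop_after simulates an interruption after that many steps in this run.
    """
    checkpoint = Path(checkpoint)
    start = 0
    if checkpoint.exists():
        # A previous run left a checkpoint: continue from the saved step.
        start = json.loads(checkpoint.read_text())["next_step"]
    done_now = 0
    for i in range(start, total_steps):
        if stop_after is not None and done_now >= stop_after:
            return False
        step_fn(i)
        # Write to a temporary file first, then rename it, so that a crash
        # cannot leave a half-written checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"next_step": i + 1}))
        os.replace(tmp, checkpoint)
        done_now += 1
    return True
```

When the task is restarted, the same script is simply started again and picks up at the step recorded in the checkpoint file.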

Task priority

The task queue on the supercomputer is built on the basis of priorities: the higher the priority of a task, the closer it is to the beginning of the queue. The task at the very beginning of the queue reserves the system resources it requires as they become available. That is, even if there are free computing nodes, they cannot be occupied if there is a higher-priority task that is waiting for additional resources.
There is an exception to this rule – if the execution of task X with a lower priority does not affect the start time of task Y with a higher priority, then task X can be launched before Y. This scheduling method is supported by the backfill scheduling plugin. One way to increase the priority of your task is to reduce the maximum possible time to complete it using the --time and --deadline options.
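
The backfill rule can be illustrated with a deliberately simplified check. This is not Slurm's actual algorithm, only a sketch of the idea for a single waiting task: can_backfill and its parameters are hypothetical, and all times are in minutes.

```python
def can_backfill(job_nodes, job_time_limit, free_nodes,
                 time_until_reserved_start):
    """Simplified backfill test for one lower-priority task.

    The task may start now only if it fits into the currently free nodes
    and its time limit ends before the higher-priority task is due to start.
    """
    fits = job_nodes <= free_nodes
    finishes_in_time = job_time_limit <= time_until_reserved_start
    return fits and finishes_in_time
```

With 4 free nodes and a higher-priority task due to start in 120 minutes, a 2-node task with --time=60 passes the check, while the same task with --time=180 does not. This is why a shorter time limit can let a task start earlier.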

How to Determine the Estimated Start Date of a Task in the Queue

The scheduler performs multi-criteria optimization of the task flow, so the start date of tasks requiring different amounts of resources or having different time limits for execution will vary. The estimated start time is automatically calculated only after the task has been placed in the queue.

Beforehand, it is recommended to check the current cluster load using the freenodes command. To speed up the task start, you can adjust the task size according to the available resources and then submit it to the queue. Tasks that fit the available resources will be executed before larger ones if they have a shorter time limit.

The estimated start date and time of tasks in the queue can be viewed using the command: squeue --start.

Selecting the type of computing node

The HSE supercomputer uses five types of computing nodes (type_a, type_b, type_c, type_d, type_e). By default, tasks are started on any free node according to the priority: d -> a -> c -> b -> e. Only nodes that satisfy the requested resources are selected (for example, type_d will not be considered in the case of a GPU request).
To force the selection of the node type on which the task should run, use the option --constraint=<node_type> or -C <node_type>. A disjunction (logical OR) can also be used: --constraint="type_c|type_d" (quotation marks are required). In this case, both type_c and type_d nodes can be allocated for the task.

When all the allocated nodes must be of the same type, but the type itself can vary, list the types in square brackets: --constraint="[type_a|type_b]". In this case, all the selected nodes will be either only type_a or only type_b.

The list of computing node types can be displayed with the nodetypes command.
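
The two forms of the constraint value differ only in the square brackets, which a small helper makes explicit. This is a minimal sketch; the function name constraint and the same_type parameter are hypothetical.

```python
def constraint(node_types, same_type=False):
    """Build a value for --constraint from a list of node type names.

    same_type=False gives "type_c|type_d" (any mix of the listed types);
    same_type=True gives "[type_a|type_b]" (all nodes of one listed type).
    """
    if not node_types:
        raise ValueError("at least one node type is required")
    joined = "|".join(node_types)
    return f"[{joined}]" if same_type else joined
```

When the value is typed in a shell it must still be quoted, because the | character would otherwise be interpreted as a pipe.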

Notification of Task Status Change via E-mail

To receive notifications about task status changes via e-mail, add the --mail-type and --mail-user parameters to the sbatch command.

Parameter format:

--mail-user=example@hse.ru
--mail-type=ALL

Instead of ALL, specific event types can be given, for example BEGIN, END or FAIL (see the sbatch documentation for the full list).

Main options for sbatch and srun

-n <number> (aka --ntasks=<number>) - the number of running processes;
-c <number> (aka --cpus-per-task=<number>) - the number of CPU cores for each process;
-G <number> (aka --gpus=<number>) - the number of GPUs;
-N <number> (aka --nodes=<number>) - the minimum number of nodes allocated for the task;
--ntasks-per-node=<number> - the number of processes on each node (for example, "-n 4 --ntasks-per-node=2" means that 2 nodes are needed to run these 4 processes);
--gpus-per-task=<number> - the number of GPUs for each running process;
--gpus-per-node=<number> - the number of GPUs for each allocated node;
--cpus-per-gpu=<number> - the number of CPUs for each allocated GPU;

Additional information

sbatch -N 2 -G 8 – run the task on 2 nodes and 8 GPUs (on each node 4 GPUs - it is impossible to allocate more than 8 GPUs for 2 nodes).
sbatch -N 2 -G 4 --gpus-per-node=2 – run the task on 2 nodes and 4 GPUs, using 2 GPUs on each node.
sbatch -N 2 --gpus-per-node=4 – run the task on 2 nodes, using all 4 GPUs on each node.
sbatch -n 4 --gpus-per-task=1 – run the task with 4 processes, for each of which 1 GPU is allocated (4 GPUs in total).
sbatch -N 4 --ntasks-per-node=2 --gpus-per-task=2 – run the task on 4 nodes, 2 processes on each node, each of which uses 2 GPUs (16 GPUs in total).
sbatch -G 4 --cpus-per-gpu=1 – run the task using 4 GPUs and 4 CPUs (1 CPU for each allocated GPU).
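
The GPU totals in the examples above follow from simple arithmetic, which the sketch below reproduces. It covers only the option combinations listed here and is not a model of how Slurm itself resolves resource requests; total_gpus is a hypothetical helper.

```python
def total_gpus(nodes=None, gpus=None, gpus_per_node=None,
               ntasks=None, ntasks_per_node=None, gpus_per_task=None):
    """Return the total number of GPUs implied by a set of options."""
    if gpus is not None:
        return gpus                            # -G gives the total directly
    if gpus_per_node is not None and nodes is not None:
        return nodes * gpus_per_node           # -N with --gpus-per-node
    if gpus_per_task is not None:
        if (ntasks is None and nodes is not None
                and ntasks_per_node is not None):
            ntasks = nodes * ntasks_per_node   # -N with --ntasks-per-node
        if ntasks is not None:
            return ntasks * gpus_per_task      # -n with --gpus-per-task
    raise ValueError("not enough options to determine the GPU count")
```

For instance, total_gpus(nodes=4, ntasks_per_node=2, gpus_per_task=2) corresponds to the example with 4 nodes, 2 processes per node and 2 GPUs per process.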

--time=<time> - the maximum time allowed for the task (a bare number is interpreted as minutes); it can be specified in the following formats:
[minutes | minutes:seconds | hours:minutes:seconds | days-hours | days-hours:minutes | days-hours:minutes:seconds]
By default, the maximum task execution time is limited to 24 hours. If the task requires more time, then it is necessary to use the --time parameter in the script file.
--partition=<queue name>: the queue in which the task will be assigned (by default, “normal”);
--job-name=<task name>: the name of the task in the queue;
--deadline=<run time>: limit on the run time after which the task will be forced to complete;
--output=<file name>: the file where the output will be recorded during the execution of the task;
--wrap=<command line>: task launch line.
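
The --time formats listed above are easy to confuse, so it can help to check what a given value means. The following sketch converts a --time value to seconds according to that list; slurm_time_to_seconds is a hypothetical helper, not a Slurm utility.

```python
def slurm_time_to_seconds(value):
    """Convert a --time value to seconds.

    Accepted forms: minutes, minutes:seconds, hours:minutes:seconds,
    days-hours, days-hours:minutes, days-hours:minutes:seconds.
    """
    days = 0
    if "-" in value:
        # After "days-" the fields are hours[:minutes[:seconds]].
        day_part, rest = value.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unsupported time format: {value!r}")
        hours, minutes, seconds = (fields + [0, 0])[:3]
    else:
        fields = [int(f) for f in value.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unsupported time format: {value!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

For example, "90" is 90 minutes, "1:00:00" is one hour, and "5-0:0" is five days.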

* Detailed information on using the Slurm scheduler is available on the official website.

Examples of sbatch script files

1. Starting a simple task using sbatch with a prepared script file: sbatch myscript.sh, where the contents of myscript.sh are:

#!/bin/bash
#SBATCH --job-name=myjob # The name of the task in the queue
#SBATCH --time=5-0:0 # Maximum execution time (5 days)
#SBATCH --output=myjob.slurm.log # Output the result to myjob.slurm.log
#SBATCH --cpus-per-task=8 # Performing calculations on 8 CPU cores
#SBATCH --nodes=1 # All allocated cores must be on 1 node

/home/testuser/simple_program # Launch software

2. Starting a simple task using sbatch with the options given on the command line (similar to the previous one, but without the explicit CPU and node requests and with the job ID in the log file name):

sbatch --partition=normal --job-name=myjob --time=5-0:0 --output=myjob.%j.log --wrap="/home/testuser/simple_program"

3. Starting the terminal with bash with access to 1 GPU and 4 CPUs:

srun --pty --partition gpu-1 --ntasks 1 --cpus-per-task 4 --gpus=1 bash

4. Launching a task on 8 cores on one node:

sbatch --cpus-per-task=8 one_node_program

5. Starting the MPI task on 2 nodes on 24 cores with a time limit of 5 minutes (myscript.sh):

#!/bin/bash
#SBATCH --job-name=mpi_job_test # Name of the task in the queue
#SBATCH --ntasks=24 # Number of MPI processes (MPI ranks)
#SBATCH --cpus-per-task=1 # Number of cores per process
#SBATCH --nodes=2 # Number of nodes for the task

#SBATCH --ntasks-per-node=12 # Number of processes on each node
#SBATCH --time=00:05:00 # Limitation of the task execution time (days-hours:min:sec)
#SBATCH --output=mpi_test_%j.log # Path to the output file relative to the working directory

echo "Date = $(date)"
echo "Hostname = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

# Loading the necessary environment variables
module load INTEL/parallel_studio_xe_2020_ce

srun /home/testuser/mpi_program

6. Sequential execution of several parallel tasks. The tasks will be performed on the same cores, one after another. This option is useful when each next task depends on the data received from the previous one.

#!/bin/bash
#SBATCH --ntasks=24 # Number of MPI processes
#SBATCH --cpus-per-task=1 # Number of cores per process
module purge # Clean environment variables

# Loading the necessary environment variables
module load INTEL/parallel_studio_xe_2020_ce

# Sequential launch of tasks
srun ./a.out > task1_output 2>&1
srun ./b.out > task2_output 2>&1
srun ./c.out > task3_output 2>&1

7. Simultaneous execution of several parallel tasks. Note the wait command at the end of the script: it makes the script wait for all background tasks to complete, so that the job does not end before they do.

#!/bin/bash
#SBATCH --ntasks 56 # Number of MPI processes
#SBATCH --cpus-per-task=1 # Number of cores per process
module purge # Clean environment variables

# Loading the necessary environment variables
module load INTEL/parallel_studio_xe_2020_ce

# Simultaneous task launch
srun -n 8 --cpu-bind=cores ./a.out > a.output 2>&1 &
srun -n 38 --cpu-bind=cores ./b.out > b.output 2>&1 &
srun -n 10 --cpu-bind=cores ./c.out > c.output 2>&1 &

# Waiting for all processes to complete
wait

8. Launching neural network training on 4 GPUs and 16 CPUs:

#!/bin/bash
#SBATCH --job-name=net_training
#SBATCH --error=stderr.%j.txt
#SBATCH --output=stdout.%j.txt
#SBATCH --partition=normal
#SBATCH --gpus=4
#SBATCH --cpus-per-task 16

# Activation of the Anaconda environment
# Before activating the environment, it is recommended to deactivate the current one
source deactivate
source activate pytorch

echo "Working on node `hostname`"
echo "Assigned GPUs: $CUDA_VISIBLE_DEVICES"
cd /home/testuser/pytorch
export NGPUS=4
python -m torch.distributed.launch --nproc_per_node=$NGPUS /home/testuser/pytorch/train_net.py --config-file "configs/network_X_101_32x8d_FPN_1x.yaml"

Note for calculations in python with output to terminal

An important note for python calculations that print results to the output:
By default, python buffers its output when it is written to a file (i.e. the output is NOT written to the resulting file immediately).
For example, in the case of code like this:

from time import sleep
print('Start')
for i in range(60):
    sleep(1)
print('Done')

The initial "Start" output will only appear in the output file (slurm-<jobid>.out) after 60 iterations (along with "Done").
For tasks whose results are evaluated during execution (intermediate values, etc. are displayed), this is not suitable (for example, if the task ends by timeout, there will be no output in the file).
There are 2 options for instant display of results:
1. Add the argument flush=True to the print function: "print('Start', flush=True)" (this works only with Python 3.3 or later).
2. Add the -u argument to the interpreter itself when starting the task: for example, "python3 -u myprog.py" (no changes to the code are needed in this case).
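
The first option can be wrapped in a small helper so that every progress message is flushed. This is a minimal sketch: the names report and run and the message texts are hypothetical.

```python
import sys
from time import sleep


def report(message, stream=None):
    """Print a progress message and flush it to the output at once."""
    target = stream if stream is not None else sys.stdout
    print(message, file=target, flush=True)


def run(iterations, delay, stream=None):
    """Do some work step by step, reporting progress after each step."""
    report("Start", stream)
    for i in range(iterations):
        sleep(delay)  # stands in for the real computation
        report(f"iteration {i + 1} of {iterations}", stream)
    report("Done", stream)
```

With run(60, 1) in the job script, each line reaches the output file as soon as it is printed, so the intermediate results survive even if the task is stopped by the time limit.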

Application launch

To run application software (Gromacs, Octave, BEAST, IQ-Tree, MATLAB, etc.), see the examples of their use in the Application Software section of the HSE University Supercomputer.


Supercomputer Modeling Unit