Running Jobs on the HSE University Supercomputer
Basic rules for performing calculations on the supercomputer
- On the “login server” (the head server, sms), you may only run operations that do not require significant computing resources: compilation, builds, test launches, etc.
- All computing tasks should be performed only on computing nodes.
- All user tasks are performed through the Slurm job management system.
How to perform calculations on the supercomputer?
When you connect to the HSE University supercomputer, you are connected to the “login server”, not to the computing nodes.
The login server is a shared resource that all users have access to. From this server, users submit tasks, view the queue status, and manage their files. On this server, only operations that do not require significant computing resources are allowed: compilation, builds, test launches, etc.
To run actual calculations, use the Slurm task management system through the scripts and commands described below. They tell the supercomputer where to run the task, what resources it needs, and how long it will take. Slurm will automatically run the task on suitable nodes or queue it if the requested resources are currently unavailable.
Basic commands needed to work with the task queue
sbatch - runs a task in "batch" mode (with the ability to fully prepare the environment, etc.); this is the preferred way.
srun - allows you to perform a task in interactive mode, usually used for debugging.
squeue - display all tasks in the queue.
mj - display the status of only your tasks.
mj --exp - display the expected start time of your pending tasks.
scancel <job_id> - cancel your task with the specified id.
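A typical session might therefore look like the following sketch (the script name and the job id 123456 are placeholders):
sbatch my_numpy_task.sbatch   # submit a batch script; Slurm replies with "Submitted batch job 123456"
squeue                        # show all tasks in the queue
mj                            # show only your own tasks
mj --exp                      # show the expected start time of your pending tasks
scancel 123456                # cancel the task if it is no longer needed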
Task creation
A task consists of two parts: a resource request and execution steps. The resource request indicates the required number of processor cores and graphics accelerators, the expected execution time, memory, disk space, etc. The execution steps describe the tasks and programs that need to be run.
Tasks can be launched in “batch” mode (sbatch) with the ability to prepare the environment and load the necessary modules, as well as in “interactive” mode (srun), with the output of the calculation results on the screen.
Tasks in the queue are started for calculation according to priorities. In some cases, even if there are free resources, the task will not be able to occupy them if there is a higher priority task that is waiting for additional cores. One way to increase the priority of your task is to reduce the maximum possible time to complete it using the --time option.
Storing calculations
For convenient storage of calculations and their results, it is recommended to organize them into categories according to the software used to perform them, for example: python, MATLAB, etc. Then create a subdirectory for each specific task in the corresponding directory, for example, my_numpy_task in the python directory. In my_numpy_task, place the my_numpy_task.py script and the my_numpy_task.sbatch file used to start the calculation with sbatch. When sbatch my_numpy_task.sbatch is executed from this directory, the calculation results will also appear in it.
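For instance, the layout described above could be created as follows (all names are placeholders chosen for the example):
mkdir -p ~/python/my_numpy_task    # category directory plus a subdirectory for the task
cd ~/python/my_numpy_task
# place my_numpy_task.py and my_numpy_task.sbatch in this directory, then submit:
sbatch my_numpy_task.sbatch        # logs and results will appear in the current directory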
Running a task in batch using sbatch
The power of a cluster installation is most fully revealed in batch mode. To use it, prepare a script file that requests the required resources, prepares the environment and then runs the necessary programs. While the task is running, its output and errors are saved to disk for further processing. Any task that does not require user interaction during execution can be run in batch mode. Most scientific software has special launch options for working in batch mode.
The general startup syntax is as follows: sbatch [options] script_name.sbatch or sbatch [options] --wrap="start command"
The <script_name> should be a script file that describes the required resources and the steps for completing the task. The general syntax of a script passed to sbatch is as follows:
#!/bin/bash
#SBATCH <sbatch key> <value>
...
#SBATCH <sbatch key> <value>
<user commands>
...
<user commands>
Options for sbatch can be passed both on the command line when invoking the command, and in the script itself (#SBATCH ...).
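Both forms are equivalent. For example, a one-hour limit (the value is a placeholder) can be requested either way:
sbatch --time=1:00:00 my_numpy_task.sbatch   # option passed on the command line
#SBATCH --time=1:00:00                       # the same option written inside the script itself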
Recommended script template for sbatch:
#!/bin/bash
#SBATCH --job-name={task}       # Job name
#SBATCH --error={task}-%j.err   # File for error output
#SBATCH --output={task}-%j.log  # File for results output
#SBATCH --time={12:00:00}       # Maximum execution time
#SBATCH --ntasks={16}           # Number of MPI processes
#SBATCH --nodes={1}             # Required number of nodes
#SBATCH --gpus={4}              # Required number of GPUs
#SBATCH --cpus-per-task={2}     # Required number of CPUs
module load {modules}           # Load a module
srun {options} {command}        # Calculation
Do not forget to change the parameters indicated in {curly brackets} to yours.
For example, to run a python calculation on 4 CPUs and 2 GPUs that requires preparing the environment, write the following script:
#!/bin/bash
#SBATCH --job-name=my_numpy_task       # Job name
#SBATCH --error=my_numpy_task-%j.err   # File for error output
#SBATCH --output=my_numpy_task-%j.log  # File for results output
#SBATCH --time=1:00:00                 # Maximum execution time
#SBATCH --cpus-per-task=4              # Number of CPUs per task
#SBATCH --gpus=2                       # Required GPUs
module load Python/Anaconda_v11.2021   # Load Anaconda module
srun python my_numpy_task.py           # Performing the calculation
The program output will be written to the file my_numpy_task-<job_id>.log, as specified by the --output option.
If no preparatory actions are needed to complete the task, you can do without a script and instead wrap the command you are running with the --wrap option, for example:
sbatch -n 1 -c 2 --wrap="/home/testuser/my_program arg1 arg2"
Running a task interactively using srun
srun, unlike sbatch, performs the task interactively - i.e. you can redirect input and output or directly interact with the program through the terminal.
Resource allocation works in the same way, except that srun first allocates the resources rather than submitting the calculation itself; once the resources are allocated, the program is executed interactively.
An important feature of srun is the ability to automatically manage the pool of MPI processes. For example, to run an MPI program interactively on 20 processes, you can run the command srun -n 20 ./my_mpi_task
* srun can be executed inside sbatch (it can be useful for MPI tasks).
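As a rough sketch, an interactive shell on a compute node with 4 CPU cores and 1 GPU could be requested as follows (the resource values are arbitrary, and the commands run inside the session are only examples):
srun --pty --cpus-per-task=4 --gpus=1 bash   # open an interactive bash session on a compute node
nvidia-smi                                   # inspect the allocated GPU from inside the session
exit                                         # leave the shell and release the allocation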
Automatic restart of tasks
The Slurm scheduler supports automatic task restart. This is useful for software that uses a checkpoint mechanism during the calculation, for example, GROMACS or ANSYS. The task will be restarted automatically in the event of a node failure or after maintenance. To disable the possibility of a restart when queuing a task, add the --no-requeue option:
sbatch --job-name=myjob --time=5-0:0 --no-requeue --wrap="/home/testuser/simple_program"
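Conversely, a job that relies on checkpoints can explicitly allow requeueing and keep appending to the same log file after a restart. A minimal sketch, assuming the program itself finds its checkpoint files and continues from them (program name and paths are placeholders):
#!/bin/bash
#SBATCH --job-name=checkpointed_job       # Job name
#SBATCH --time=1-0:0                      # Maximum execution time (1 day)
#SBATCH --requeue                         # Explicitly allow Slurm to requeue the job
#SBATCH --open-mode=append                # Append to the log after a restart instead of overwriting it
#SBATCH --output=checkpointed_job.%j.log  # File for results output
/home/testuser/checkpointed_program       # The program is assumed to resume from its own checkpoints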
Task priority
The task queue on the supercomputer is built on the basis of priorities – the higher the priority of a task, the closer it is to the beginning of the queue. The task at the very beginning of the queue reserves the required system resources for itself as they become available. That is, even if there are free computing nodes, they cannot be occupied if there is a higher-priority task that is waiting for additional resources.
There is an exception to this rule – if the execution of task X with a lower priority does not affect the start time of task Y with a higher priority, then task X can be launched before Y. This scheduling method is supported by the backfill scheduling plugin. One way to increase the priority of your task is to reduce the maximum possible time to complete it using the --time and --deadline options.
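For example, if a calculation is known to finish well within the default 24-hour limit, declaring a shorter limit at submission makes it a better backfill candidate (the value below is only a placeholder):
sbatch --time=2:00:00 my_numpy_task.sbatch   # a tight but realistic time limit improves the chance of backfill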
Selecting the type of computing node
The HSE supercomputer uses five types of computing nodes (type_a, type_b, type_c, type_d, type_e). By default, tasks are started on any free node according to the priority: d -> a -> c -> b -> e. Only nodes that satisfy the requested resources are selected (for example, type_d will not be considered in the case of a GPU request).
To force the selection of the node type on which the task should run, use the option --constraint=<node_type> or -C <node_type>. It is also possible to use a disjunction (logical OR): --constraint="type_c|type_d" (quotation marks are required) – in this case, both type_c and type_d nodes can be allocated for the task.
If all allocated nodes must be of the same type, but the type itself may vary, list the types in square brackets: --constraint="[type_a|type_b]" – in this case, all selected nodes will be either only type_a or only type_b.
The list of computing node types is available by the nodetypes command.
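As an illustration, the constraint syntax can be combined with the usual resource options like this (the script name is a placeholder):
sbatch --constraint=type_a --gpus=2 my_numpy_task.sbatch          # only type_a nodes will be considered
sbatch --constraint="type_c|type_d" my_numpy_task.sbatch          # either type_c or type_d nodes
sbatch -N 2 --constraint="[type_a|type_b]" my_numpy_task.sbatch   # both nodes of the same type, a or b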
Main options of sbatch and srun
-n <number> (aka --ntasks=<number>) - the number of running processes;
-c <number> (aka --cpus-per-task=<number>) - the number of CPU cores for each process;
-G <number> (aka --gpus=<number>) - the number of GPUs;
-N <number> (aka --nodes=<number>) - the minimum number of nodes allocated for the task;
--ntasks-per-node=<number> - the number of processes on each node (for example, "-n 4 --ntasks-per-node=2" means that 2 nodes are needed to run these 4 processes);
--gpus-per-task=<number> - the number of GPUs for each running process;
--gpus-per-node=<number> - the number of GPUs for each allocated node;
--cpus-per-gpu=<number> - the number of CPUs for each allocated GPU;
Additional information
sbatch -N 2 -G 8 – run the task on 2 nodes and 8 GPUs (4 GPUs on each node; it is impossible to allocate more than 8 GPUs on 2 nodes).
sbatch -N 2 -G 4 --gpus-per-node=2 – run the task on 2 nodes and 4 GPUs, using 2 GPUs on each node.
sbatch -N 2 --gpus-per-node=4 – run the task on 2 nodes, using all 4 GPUs on each node.
sbatch -n 4 --gpus-per-task=1 – run the task with 4 processes, for each of which 1 GPU is allocated (4 GPUs in total).
sbatch -N 4 --ntasks-per-node=2 --gpus-per-task=2 – run the task on 4 nodes, 2 processes on each node, each of which uses 2 GPUs (16 GPUs in total).
sbatch -G 4 --cpus-per-gpu=1 – run the task using 4 GPUs and 4 CPUs (1 CPU for each allocated GPU)
--time=<number> - the time required to complete the task (a bare number is interpreted as minutes); it can be specified in the following formats:
[minutes | minutes:seconds | hours:minutes:seconds | days-hours | days-hours:minutes | days-hours:minutes:seconds]
By default, the maximum task execution time is limited to 24 hours. If the task requires more time, use the --time parameter in the script file (see the format examples after this list).
--partition=<queue name>: the queue in which the task will be assigned (by default, “normal”);
--job-name=<task name>: the name of the task in the queue;
--deadline=<run time>: limit on the run time after which the task will be forced to complete;
--output=<file name>: the file where the output will be recorded during the execution of the task;
--wrap=<command line>: task launch line.
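A few concrete examples of the --time formats listed above (the durations themselves are arbitrary):
#SBATCH --time=90          # 90 minutes
#SBATCH --time=12:00:00    # 12 hours
#SBATCH --time=2-6:30      # 2 days, 6 hours and 30 minutes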
* Detailed information on using the Slurm scheduler is available on the official website.
Examples of sbatch script files
1. Starting a simple task using sbatch with a prepared script file: sbatch myscript.sh, where the contents of myscript.sh are:
#!/bin/bash
#SBATCH --job-name=myjob # The name of the task in the queue
#SBATCH --time=5-0:0 # Maximum execution time (5 days)
#SBATCH --output=myjob.slurm.log # Output the result to myjob.slurm.log
#SBATCH --cpus-per-task=8 # Performing calculations on 8 CPU cores
#SBATCH --nodes=1 # All allocated cores must be on 1 node
/home/testuser/simple_program # Launch software
2. Starting a simple task using sbatch indicating the requested resources (equivalent to the previous one):
sbatch --partition=normal --job-name=myjob --time=5-0:0 --output=myjob.%j.log --wrap="/home/testuser/simple_program"
3. Starting an interactive bash terminal with access to 1 GPU and 4 CPUs:
srun --pty --partition gpu-1 --ntasks 4 --gpus=1 bash
4. Launching a task on 8 cores on one node:
sbatch --cpus-per-task=8 one_node_program
5. Starting an MPI task on 2 nodes with 24 cores (myscript.sh):
#!/bin/bash
#SBATCH --job-name=mpi_job_test # Name of the task in the queue
#SBATCH --ntasks=24 # Number of MPI processes (MPI ranks)
#SBATCH --cpus-per-task=1 # Number of cores per process
#SBATCH --nodes=2 # Number of nodes for the task
#SBATCH --ntasks-per-node=12 # Number of processes on each node
#SBATCH --time=00:05:00 # Task execution time limit (hours:minutes:seconds)
#SBATCH --output=mpi_test_%j.log # Path to the output file relative to the working directory
echo "Date = $(date)"
echo "Hostname = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"
# Loading the necessary environment variables
module load INTEL/parallel_studio_xe_2020_ce
srun /home/testuser/mpi_program
6. Sequential execution of several parallel tasks. The tasks will be performed on the same cores, one after another. This option is useful when each subsequent task depends on data received from the previous one.
#!/bin/bash
#SBATCH --ntasks=24 # Number of MPI processes
#SBATCH --cpus-per-task=1 # Number of cores per process
module purge # Clean environment variables
# Loading the necessary environment variables
module load INTEL/parallel_studio_xe_2020_ce
# Sequential launch of tasks
srun ./a.out > task1_output 2>&1
srun ./b.out > task2_output 2>&1
srun ./c.out > task3_output 2>&1
7. Simultaneous execution of several parallel tasks. Note the "wait" command at the end of the script: it makes the batch script wait until all background tasks have completed.
#!/bin/bash
#SBATCH --ntasks 56 # Number of MPI processes
#SBATCH --cpus-per-task=1 # Number of cores per process
module purge # Clean environment variables
# Loading the necessary environment variables
module load INTEL/parallel_studio_xe_2020_ce
# Simultaneous task launch
srun -n 8 --cpu_bind=cores ./a.out > a.output 2>&1 &
srun -n 38 --cpu_bind=cores ./b.out > b.output 2>&1 &
srun -n 10 --cpu_bind=cores ./c.out > c.output 2>&1 &
# Waiting for all processes to complete
wait
8. Launching neural network training on 4 GPUs and 16 CPUs:
#!/bin/bash
#SBATCH --job-name=net_training
#SBATCH --error=stderr.%j.txt
#SBATCH --output=stdout.%j.txt
#SBATCH --partition=normal
#SBATCH --gpus=4
#SBATCH --cpus-per-task 16
# Activation of the Anaconda environment
# Before activating the environment, it is recommended to deactivate the current one
source deactivate
source activate pytorch
echo "Working on node `hostname`"
echo "Assigned GPUs: $CUDA_VISIBLE_DEVICES"
cd /home/testuser/pytorch
export NGPUS=4
python -m torch.distributed.launch --nproc_per_node=$NGPUS /home/testuser/pytorch/train_net.py --config-file "configs/network_X_101_32x8d_FPN_1x.yaml"
Note for calculations in python with output to terminal
An important note for python calculations that show the results in the output:
By default, python uses buffered output (i.e. the output is NOT immediately written to the resulting file).
For example, in the case of code like this:
from time import sleep
print('Start')
for i in range(60):
    sleep(1)
print('Done')
The initial "Start" output will only appear in the output file (slurm-<jobid>.out) after 60 iterations (along with "Done").
This is unsuitable for tasks whose results need to be monitored during execution (intermediate values, etc.): for example, if the task is killed on timeout, there will be no output in the file at all.
There are 2 options for instant display of results:
1. Add the argument flush=True to the print function: "print('Start', flush=True)" (this works only with python >= 3.3)
2. Add the -u argument to the interpreter itself when starting the task: for example, "python3 -u myprog.py" (no changes to the code are needed in this case)
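For example, in the sbatch script from the earlier my_numpy_task example, the launch line would then simply become:
srun python -u my_numpy_task.py   # -u disables buffering, so print() output appears in the log immediately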
Application launch
To run application software (Gromacs, Octave, BEAST, IQ-Tree, MATLAB, etc.), see the examples of their use in the Application Software section of the HSE University supercomputer documentation.