Parallelisation¶

The term "parallelization" functionally means performing parallel computing; i.e., simultaneously using multiple computer processors to solve a problem. In order for an application to benefit from parallel computing it must be developed explicitly with this goal in mind.

In high performance computing contexts, a "node" practically equates to a single computer (i.e., a processor with multiple processing cores, memory accessible to those processors, and a storage system). High-performance clusters can then be conceptually modeled as multiple nodes networked together. So, in terms of the number of nodes involved in any computation, one can distinguish between:

Single-node parallelism: In practice, a program that runs similarly to those on your local computer or a Virtual Machine (that might benefit from memory shared across multiple processors on the same computer).
Multi-node parallelism: A single program that runs multiple processes and can benefit from the distributed memory (i.e., memory made available across multiple nodes); MPI is one library that can enable such a use case. For more information about MPI on ScienceCluster please see the MPI section.

Note

Due to the composition of the UZH ScienceCluster network interconnects, we recommend all researchers with multi-node (CPU and multi-GPU based) workflows use the Supercomputing service. CSCS's platforms (Eiger for CPU-based and Daint for multi-GPU-based computations) provide the most up-to-date interconnects to ensure optimal multi-node performance.

Another special case of parallelism arises when the same program must be executed many times (dozens to millions of times) but with varied inputs. This is called an embarrasingly parallel workflow. The standard approach is to submit each execution as a separate independent job. But keeping track of these jobs is not easy for the user nor for the schedule system (slurm). For this use case there is a better way:

Job arrays: each job is submitted separately but the array of jobs can be managed as a whole

Please contact us if you would like assistance.

Single-node parallelism¶

This is the simplest case. You only need to request the number of CPUs that your application can efficiently use then execute the job on a single node via the option -N=1(or --nodes=1).

In most cases, there is a parameter (or a set of parameters) that you would need to specify when calling the application. The application's documentation should explain what parameter to use. You can find sample job scripts in the Job Submission section.

If you're paralleizing your own custom-written workflows, this means you will need to study all of the functions you use then determine where and how to apply parallelization across multiple processing cores (on the same node). Otherwise stated, for custom research workflows you should implement parallelization deliberately and with proper monitoring / benchmarking to ensure optimal efficiency.

Job arrays¶

In general, job arrays are useful for applying the same processing routine to a collection of multiple input data files or different sets of parameters to the same input data files. Job arrays offer a very simple way to submit a large number of independent processing jobs. In this example, the --array=1-16 option will cause 16 array tasks (numbered 1, 2, ..., 16) to be spawned when this master job script is submitted. The array tasks are simply copies of this master script that are automatically submitted to the scheduler on your behalf. Thus, exactly the same amount of resources is requested for each array task. However, in each array task an environment variable called SLURM_ARRAY_TASK_ID will be set to a unique value. In this example, the number will be in the range 1, 2, ..., 16. In your script, you can use this value to select, for example, a specific data file that each array task will be responsible for processing.

#!/usr/bin/bash -l
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-16
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3850

your_application $SLURM_ARRAY_TASK_ID

Job array indices can be specified in a number of ways. For example:

A job array with index values between 0 and 31
```
#SBATCH --array=0-31
```
A job array with index values of 1, 2, 5, 19, 27
```
#SBATCH --array=1,2,5,19,27
```
A job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5, 7)
```
#SBATCH --array=1-7:2
```

In the example above, you can see that the output and error file names have special placeholders %A and %a. For each array tasks, they will be replaced with job ID and array task ID respectively.

Warning

Please make sure that each array task writes to its own unique set of files. For example, you can add SLURM_ARRAY_TASK_ID as a file name suffix or append it to the output directory name.

Below is a sample script that runs a parameter sweep with a hypothetical application. It iterates over two parameters: alpha and gamma. The first parameter takes values from 0 to 6 with a step of 2 (i.e., 0, 2, 4, 6). If a step of 1 is desired, the step value can be omitted; e.g., {0..10}. The second parameter can take values 'Aa', 'Bb', or 'Cc'. In total, there are 4 * 3 = 12 parameter combinations. So, the value for the --array parameter should be 0-11.

The nested loops generate two arrays that allow reconstruction of all parameter combinations. In this case, alphaArr contains 0 0 0 2 2 2 4... while gammaArr has Aa Bb Cc Aa Bb Cc Aa.... Later, $SLURM_ARRAY_TASK_ID is used to retrieve a particular combination. The array indices start with 0. Thus, when the $SLURM_ARRAY_TASK_ID variable is 2, for example, alpha is set to 0 and gamma takes the value Cc, which are subsequently passed to the myapp application.

The output file is saved to the results directory. To make the output file unique, both parameter values are added to its name. For example, when $SLURM_ARRAY_TASK_ID is 2, the output path will be results/output_a0_gCc.txt. Note the use of curly braces (e.g., {}) around the variable names. Since variable names can contain an underscore, without the braces bash would identify the first variable as $alpha_g. The curly braces around gamma are technically redundant but they can help to prevent a potential bug if another variable is added to the file name.

The back slash (i.e., \) is a line continuation character. There must be no spaces after back slashes.

Slurm exports several environment variables when it submits a job for execution. You may find SLURM_CPUS_PER_TASK particularly useful. In this example, the variable is used to specify the number of threads available to the application.

#!/usr/bin/bash -l
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=3850
#SBATCH --job-name=param_sweep
#SBATCH --output=param_sweep_%A_%a.out
#SBATCH --array=0-11

alphas=({0..6..2})
gammas=(Aa Bb Cc)

alphaArr=()
gammaArr=()
for alpha in "${alphas[@]}"; do
   for gamma in "${gammas[@]}"; do
      alphaArr+=($alpha)
      gammaArr+=($gamma)
   done
done

alpha=${alphaArr[$SLURM_ARRAY_TASK_ID]}
gamma=${gammaArr[$SLURM_ARRAY_TASK_ID]}
myapp \
   --alpha=$alpha \
   --gamma=$gamma \
   --threads=$SLURM_CPUS_PER_TASK \
   --output=results/output_a${alpha}_g${gamma}.txt

Before submitting the job, it would be helpful to test the script to ensure that it translates the task id correctly into the parameters. This can be done by setting $SLURM_ARRAY_TASK_ID to the first command line argument before its first use and adding echo before myapp.

# ...
SLURM_ARRAY_TASK_ID=$1
alpha=${alphaArr[$SLURM_ARRAY_TASK_ID]}
gamma=${gammaArr[$SLURM_ARRAY_TASK_ID]}
echo myapp \
   --alpha=$alpha \
   --gamma=$gamma \
   --threads=$SLURM_CPUS_PER_TASK \
   --output=results/output_a${alpha}_g${gamma}.txt

Suppose the job script has been saved as myjob and added execution permissions chmod u+x myjob. Now, when you call it with different ids it will print the application call string that would have been executed. Notice that you need to run the script directly without sbatch.

./myjob 2
./myjob 5
./myjob 9

Caution

Make sure you remove both changes before submitting the job!

MPI¶

Message Passing Interface (MPI) is a library of routines that can be used to create parallel programs. The MPI standard was developed to ameliorate interoperability problems between programming language constructs. It is a library that supplies commonly-available operating system services to create parallel processes and exchange information among these processes. MPI is typically utilized for multi-node parallelism but can also be used for single-node parallelism.

MPI is designed to allow users to create programs that can run efficiently on most parallel architectures. The design process included vendors (such as IBM, Intel, TMC, Cray, Convex, etc.), parallel library authors (involved in the development of PVM, Linda, etc.), and applications specialists. The final version for the draft standard became available in May of 1994.

Currently the following are installed: OpenMPI

OpenMPI and IntelMPI and cluster modules¶

By default ethernet optimized versions of MPI are loaded when loading an MPI module without setting a hardware constraint. This is because ethernet is available on all nodes in the cluster.

Here's an examples that loads an ethernet MPI variant:

module load dev2025a openmpi

Example MPI script¶

An MPI slurm script that uses 4 MPI tasks on 1 node, 2GB memory, and 2 CPU cores per task (assuming a "hybrid" MPI application that is designed to also use single-node parallelism):

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00
#SBATCH --mem=2G  
#SBATCH --output=mpi_test.%j.out
#SBATCH --error=mpi_test.%j.err

module load dev2025a 
module load openmpi

## some settings for ScienceCluster 
export PMIX_MCA_psec=^munge
export PMIX_MCA_gds=hash

## ScienceCluster OpenMpi module is CUDA aware 
export OMPI_MCA_accelerator=^cuda
export OMPI_MCA_rcache=^gpusm,rgpusm
export OMPI_MCA_opal_cuda_support=0   ## set to 0 because this is a cpu-only job

## some optional checks:
echo "Running on node:"
scontrol show hostnames $SLURM_NODELIST
echo "Tasks: $SLURM_NTASKS"
srun hostname

srun ./hello-world   ## mpi hello world