Ouessant : Code execution in batch mode

Batch jobs are managed by the LSF software on all the nodes. They are distributed into “classes”, principally according to the elapsed time limit and the number of cores requested.
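As an illustration, the classes (LSF queues) defined on the machine can be listed with the standard LSF commands below (the class names themselves depend on the machine configuration):

    # List the classes (LSF queues) and their limits
    bqueues
    # Detailed description of one class
    bqueues -l <class_name>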

To submit a batch job from Ouessant, it is necessary to:

  • Create a submission script:

    job.sh
    #!/bin/bash
    # Name of job
    #BSUB -J monJob_test
    # Output and error file
    #BSUB -e %J.err
    #BSUB -o %J.out
    # Number of MPI tasks
    #BSUB -n 4
    # Binding
    #BSUB -a p8aff(1,1,1,balance)
    # Number of GPUs
    #BSUB -gpu "num=4:mode=exclusive_process:mps=no:j_exclusive=yes"
    # Number of MPI tasks per node
    #BSUB -R "span[ptile=4]"
    # Maximum elapsed time of the job (hh:mm)
    #BSUB -W 01:00
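    # Exclusive execution (the node is not shared with other jobs)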
    #BSUB -x
     
    module load xlf smpi
    # Echo the executed commands
    set -x
     
    mpirun /pwrlocal/ibmccmpl/bin/task_prolog.sh -devices auto ./a.out
  • Submit this script via the bsub command:

      bsub < job.sh
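Once the job is submitted, it can be followed with the usual LSF commands (<job_id> being the identifier returned by bsub):

      # State of your jobs
      bjobs
      # Standard output of a running job
      bpeek <job_id>
      # Cancel a job
      bkill <job_id>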

SMT mode

There are two POWER8 processors on each Ouessant node. Each processor has 10 cores, which makes 20 cores per node.

The POWER8 processors provide the Simultaneous Multithreading (SMT) mode which allows having up to 8 logical cores per physical core. The SMT mode corresponds to the number of logical cores per physical core (SMTx1 mode by default).

The use of several logical cores does not necessarily bring a performance gain.
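To see how many logical cores a job actually has at its disposal, the following standard Linux commands can be added to the submission script (a simple check, independent of LSF):

    # Number of logical cores visible to the job
    nproc
    # Threads per core, cores per socket and number of sockets
    lscpu | grep -E "Thread|Core|Socket"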

Core affinity

The LSF p8aff application allows specifying the core affinity.

p8aff(<num_threads_per_task>,<smt>,<cpus_per_core>,<distribution_policy>)

The meanings of these arguments are:

Argument                  Value            Meaning
<num_threads_per_task>    1-160            Number of threads per MPI task (OMP_NUM_THREADS is set to this value)
<smt>                     1, 2, 4 or 8     SMT mode
<cpus_per_core>           1-8              Number of logical cores used by the threads in each physical core
<distribution_policy>     balance or pack  MPI distribution policy

The distribution policy determines how the MPI processes are allotted to the cores.

  • pack : The processes are first placed on the first POWER8 processor, then on the second POWER8 processor.
  • balance : The processes are distributed evenly over the two POWER8 processors.

The following examples are in SMTx1 mode with 4 MPI processes:

  • With the pack policy, the 4 processes are placed on logical cores 0, 8, 16 and 24, all on POWER8 #1:

    POWER8 #1 :    0   8  16  24  32  40  48  56  64  72
                   x   x   x   x
    POWER8 #2 :   80  88  96 104 112 120 128 136 144 152

  • With the balance policy, the processes are distributed over the two processors, on logical cores 0 and 8 (POWER8 #1) and 80 and 88 (POWER8 #2):

    POWER8 #1 :    0   8  16  24  32  40  48  56  64  72
                   x   x
    POWER8 #2 :   80  88  96 104 112 120 128 136 144 152
                   x   x
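As an annotated illustration of the four p8aff arguments (this corresponds to the “4 MPI x 5 threads” case of the table below):

    # 4 MPI tasks on one node, distributed over the two POWER8 processors,
    # 5 OpenMP threads per task (OMP_NUM_THREADS=5), SMT4 mode,
    # 1 logical core per physical core used by the threads
    #BSUB -n 4
    #BSUB -R "span[ptile=4]"
    #BSUB -a p8aff(5,4,1,balance)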

Configuration example

Execution configuration    LSF directives
1 MPI                      #BSUB -n 1
                           #BSUB -R "span[ptile=1]"
                           #BSUB -a p8aff(1,1,1,balance)
1 MPI x 20 threads         #BSUB -n 1
                           #BSUB -R "span[ptile=1]"
                           #BSUB -a p8aff(20,4,1,balance)
1 MPI x 40 threads         #BSUB -n 1
                           #BSUB -R "span[ptile=1]"
                           #BSUB -a p8aff(40,4,2,balance)
2 MPI x 10 threads         #BSUB -n 2
                           #BSUB -R "span[ptile=2]"
                           #BSUB -a p8aff(10,4,1,balance)
4 MPI                      #BSUB -n 4
                           #BSUB -R "span[ptile=4]"
                           #BSUB -a p8aff(1,1,1,balance)
4 MPI x 5 threads          #BSUB -n 4
                           #BSUB -R "span[ptile=4]"
                           #BSUB -a p8aff(5,4,1,balance)
20 MPI                     #BSUB -n 20
                           #BSUB -R "span[ptile=20]"
                           #BSUB -a p8aff(1,4,1,balance)
40 MPI                     #BSUB -n 40
                           #BSUB -R "span[ptile=40]"
                           #BSUB -a p8aff(1,2,1,balance)
80 MPI                     #BSUB -n 80
                           #BSUB -R "span[ptile=80]"
                           #BSUB -a p8aff(1,4,1,balance)
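As a complete example, a submission script for the “2 MPI x 10 threads” configuration could look like the following sketch (the job name and the executable are placeholders; the other lines follow the submission script shown at the top of this page):

    #!/bin/bash
    # Name of the job
    #BSUB -J monJob_hybride
    # Output and error file
    #BSUB -o %J.out
    #BSUB -e %J.err
    # 2 MPI tasks, both on the same node
    #BSUB -n 2
    #BSUB -R "span[ptile=2]"
    # 10 OpenMP threads per task (OMP_NUM_THREADS=10), SMT4, balanced distribution
    #BSUB -a p8aff(10,4,1,balance)
    # Maximum elapsed time of the job (hh:mm)
    #BSUB -W 01:00
    #BSUB -x

    module load xlf smpi
    set -x

    mpirun ./a.out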

GPU selection

GPU selection is done via the BSUB directive:

#BSUB -gpu "num=<num_gpus>:mode={shared|exclusive_process}:mps={no|yes}:j_exclusive=yes"
  • num_gpus : Number of GPUs to use on the node (maximum 4).
  • If 1 GPU is reserved per MPI task, it is necessary to use mode=exclusive_process and mps=no.
  • If 1 GPU is shared by several MPI tasks with MPS, it is necessary to set mode=exclusive_process and mps=yes (see the example after this list).
  • If 1 GPU is shared by several MPI tasks without MPS, it is necessary to set mode=shared and mps=no.
    (N.B. This method is not advised.)
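For instance, the directives below request one GPU shared by 4 MPI tasks of the same node through MPS (an illustrative combination of the directives above):

    # 4 MPI tasks on one node, all sharing a single GPU via MPS
    #BSUB -n 4
    #BSUB -R "span[ptile=4]"
    #BSUB -gpu "num=1:mode=exclusive_process:mps=yes:j_exclusive=yes"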

To simplify binding the GPUs to the processes, it is advised to use the script /pwrlocal/ibmccmpl/bin/task_prolog.sh, as shown in the job submission script example above.

CUDA-Aware MPI

By default, the behaviour of MPI calls on GPU buffers is undefined. The -gpu option of the mpirun command allows activating the CUDA-aware version of MPI (only valid for Spectrum MPI); with this version, the GPU buffers can be used directly in the communications.
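For example, based on the execution line of the script at the top of this page, CUDA-aware communications are enabled by adding the -gpu option to mpirun:

    mpirun -gpu /pwrlocal/ibmccmpl/bin/task_prolog.sh -devices auto ./a.out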