Ouessant : PowerAI

IBM PowerAI is available on Ouessant. This software stack makes it possible to quickly use several AI frameworks inside a Docker container launched in batch on the Ouessant nodes.

To use the PowerAI stack in batch, add the following directive to a standard submission script:

#BSUB -app docker

Then source the initialisation script of the framework you wish to use:

source /opt/DL/<framework>/bin/<framework>-activate

ATTENTION: To set the necessary environment variables, it is now ESSENTIAL to run the command

module load powerai

in your interactive session BEFORE running the command

bsub < monjob.sh

to submit your job.
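As a minimal sketch, these two interactive steps can be wrapped in a shell function so that the module is never forgotten. The function name `submit_powerai_job` is hypothetical. The sketch assumes that `module` is available as a shell function in your interactive session, which is why the helper should be defined there (for example by sourcing a file) rather than run as a separate script. It uses the script name `monjob.sh` from the text as its default.

```bash
# Hypothetical helper, meant to be defined in your interactive shell
# (e.g. sourced from a file), where the "module" command is available.
submit_powerai_job() {
    # Job script to submit; defaults to the name used in the text.
    local job_script="${1:-monjob.sh}"
    if [ ! -f "$job_script" ]; then
        echo "job script not found: $job_script" >&2
        return 1
    fi
    # Step 1: set the environment variables required by PowerAI.
    module load powerai || return 1
    # Step 2: submit; bsub reads the #BSUB directives from standard input.
    bsub < "$job_script"
}
```

For example, `submit_powerai_job monjob.sh` loads the module and then submits the script; if the file does not exist, nothing is loaded or submitted.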

Version 4.0 (the default) makes it possible to run computations on the GPUs of several nodes, thanks to IBM's communication library, the Distributed Deep Learning (DDL) library, which is based on MPI. The versions of the different frameworks available in PowerAI v4.0 are as follows:

The previous PowerAI version, 3.4, is also available. It can be used to work around a bug introduced in version 4.0, BUT it only allows computing on the 4 GPUs of a single node. More information on the framework versions provided in version 3.4 and how to use them is available on this page.

The following job template should be customised according to the framework you wish to use:

# PowerAI
#BSUB -app powerai
#BSUB -env "all,~BASH_FUNC_ml(),~BASH_FUNC_module()"
# Name of job
#BSUB -J monJob
# Output and error file, the same file here
#BSUB -e %J.sortie
#BSUB -o %J.sortie
# Exclusive reservation of a node 
#BSUB -x
# Number of MPI tasks 
#BSUB -n 4
# Binding
#BSUB -a p8aff(1,1,1,balance)
# Number of gpu
#BSUB -R "rusage[ngpus_shared=4]"
# Number of MPI tasks per node
#BSUB -R "span[ptile=4]"
# Maximum duration of job (1 hour)
#BSUB -W 01:00
 
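# Print each command before executing it (helps debugging)
set -x
# Activate the chosen framework (replace <framework> with its name)
source /opt/DL/<framework>/bin/<framework>-activate
# Commands that launch your own program go here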
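Rather than editing the placeholder by hand, a saved copy of the template can be customised with `sed`. This is a minimal sketch: the function name and file names are hypothetical, and the framework name must match one of the directories under `/opt/DL` (listing that directory is a simple way to see which ones are installed).

```bash
# Hypothetical helper: turn a saved copy of the job template into a
# submission script by replacing every <framework> placeholder.
customise_powerai_job() {
    local template="$1" fw="$2" out="${3:-monjob.sh}"
    if [ ! -f "$template" ]; then
        echo "template not found: $template" >&2
        return 1
    fi
    # Accept only simple names, so the value is safe inside the sed expression.
    if ! [[ "$fw" =~ ^[A-Za-z0-9_.-]+$ ]]; then
        echo "invalid framework name: '$fw'" >&2
        return 1
    fi
    sed "s/<framework>/${fw}/g" "$template" > "$out"
}
```

For example, after saving the template as `powerai_template.sh`, `customise_powerai_job powerai_template.sh tensorflow monjob.sh` writes `monjob.sh`, which is then submitted with `module load powerai` followed by `bsub < monjob.sh`. Here `tensorflow` is only an illustrative name.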

For more information, including example submission scripts and GPU selection, please consult this page. The job control commands are described here.