Jean Zay: The disk spaces

There are four distinct disk spaces accessible for each project: HOME, WORK, SCRATCH/JOBSCRATCH and the STORE.

Each space has specific characteristics suited to its usage, which are described below. The paths to these spaces are stored in five shell environment variables: $HOME, $WORK, $SCRATCH, $JOBSCRATCH and $STORE.

You can check the occupancy of the different disk spaces with the IDRIS “idr_quota” commands or with the Unix du (disk usage) command. The “idr_quota” commands return immediately, but their information is not real-time (the data shown by idr_quota_user and idr_quota_project are updated once a day, and the data shown by idrquota every 30 minutes). The du command returns real-time information, but it can take a long time to run, depending on the size of the directory concerned.
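
As an illustration of the du approach described above, here is a minimal sketch. The helper name report_usage is hypothetical (it is not an IDRIS command); it only wraps the standard du options -s (one summary line) and -h (human-readable units).

```shell
#!/bin/bash
# report_usage is a hypothetical helper name, not an IDRIS command.
# It prints the real-time size of one directory:
#   -s  print a single summary line for the whole tree
#   -h  use human-readable units
report_usage() {
    local dir="$1"
    du -sh -- "$dir"
}

# Example calls (they can be slow on large directory trees):
#   report_usage "$HOME"
#   report_usage "$WORK"
```

Because du walks the whole tree, it may be preferable to run it on a subdirectory rather than on an entire space.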

The HOME

$HOME: This is the home directory during an interactive connection. This space is intended for small, frequently used files such as shell environment files, tools, and possibly sources and libraries if they are of reasonable size. This space is limited both in size and in number of files.
The HOME characteristics are:

  • It is a permanent space.
  • It is saved via snapshots (see the section “The snapshots” below).
  • Intended to receive small-sized files.
  • In the case of a multi-project login, the HOME is unique.
  • It is subject to per-user quotas, which are intentionally rather low (3 GiB by default).
  • It is accessible interactively or in a batch job via the $HOME variable:
    $ cd $HOME
  • It is the home directory during an interactive connection.

Note: The HOME space is also available through the CCFRHOME environment variable, which follows a nomenclature shared with the other national computing centers (CINES, TGCC):

$ cd $CCFRHOME

The WORK

$WORK: This is a permanent work and storage space which is usable in batch. In this space, we generally store large-sized files for use during batch executions: very large source files, libraries, data files, executable files, result files and submission scripts.
The characteristics of WORK are:

  • It is a permanent space.
  • It is saved via snapshots (see the section “The snapshots” below).
  • Intended to receive large-sized files.
  • In the case of a multi-project login, a WORK is created for each project.
  • It is subject to per-project quotas.
  • It is accessible interactively or in a batch job.
  • It is composed of 2 sections:
    • A section in which each user has an individual part, accessed by the command:
      $ cd $WORK
    • A section common to the project to which the user belongs and into which files to be shared can be placed, accessed by the command:
      $ cd $ALL_CCFRWORK
  • The WORK is a GPFS disk space with a bandwidth of about 100 GB/s for both reading and writing. This bandwidth can be temporarily saturated during exceptionally intensive usage.

Note: The WORK space is also available through the CCFRWORK environment variable, which follows a nomenclature shared with the other national computing centers (CINES, TGCC):

$ cd $CCFRWORK

Usage recommendations

  • Batch jobs can run in the WORK. However, because several of your jobs can run at the same time, you must make sure that your execution directories or file names are unique to each job.
  • Moreover, this disk space is subject to per-project quotas, and your execution can stop abruptly if a quota is reached. In the WORK, you must therefore be aware not only of your own activity but also of that of your project colleagues. For these reasons, you may prefer to use the SCRATCH or the JOBSCRATCH for the execution of batch jobs.
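
One possible way to keep execution directories unique, as recommended above, is to include the job number in the directory name. The sketch below assumes the job runs under Slurm, which sets SLURM_JOB_ID inside a job; the helper name make_run_dir and the run_<id> naming scheme are illustrative choices, not an IDRIS convention.

```shell
#!/bin/bash
# make_run_dir is a hypothetical helper; the "run_<id>" naming is an assumption.
# It creates a directory named after the current job under the given base
# directory and prints its path.
make_run_dir() {
    local base="$1"
    # SLURM_JOB_ID is set by Slurm inside a job; outside a job, fall back
    # to the PID of the current shell so the name is still distinct.
    local run_dir="$base/run_${SLURM_JOB_ID:-$$}"
    mkdir -p -- "$run_dir" && printf '%s\n' "$run_dir"
}

# Example use in a job script:
#   run_dir=$(make_run_dir "$WORK") && cd "$run_dir"
```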

The SCRATCH/JOBSCRATCH

$SCRATCH: This is a semi-permanent work and storage space which is usable in batch; the lifespan of the files is limited to 30 days. The large files used during batch executions are generally stored here: data files, result files or computation restarts. Once post-processing has reduced the data volume, remember to copy the important files into the WORK so that they are not lost after 30 days, or into the STORE for long-term archiving.
The characteristics of the SCRATCH are:

  • The SCRATCH is a semi-permanent space with a 30-day file lifespan.
  • It is not backed up.
  • It is intended to receive large-sized files.
  • It is subject to very large security quotas:
    • disk quotas per project, about 1/10th of the total disk space for each group
    • and inode quotas per project, about 150 million files and directories.
  • It is accessible interactively or in a batch job.
  • It is composed of 2 sections:
    • A section in which each user has an individual part; accessed by the command:
      $ cd $SCRATCH
    • A section common to the project to which the user belongs into which files to be shared can be placed. It is accessed by the command:
      $ cd $ALL_CCFRSCRATCH
  • In the case of a multi-project login, a SCRATCH is created for each project.
  • The SCRATCH is a GPFS disk space with a bandwidth of about 500 GB/s for both writing and reading.

Note: The SCRATCH space is also available through the CCFRSCRATCH environment variable, which follows a nomenclature shared with the other national computing centers (CINES, TGCC):

$ cd $CCFRSCRATCH

$JOBSCRATCH: This is the temporary execution directory specific to batch jobs.
Its characteristics are:

  • It is a temporary directory whose files exist only for the duration of the batch job.
  • It is not backed up.
  • It is intended to receive large-sized files.
  • It is subject to very large security quotas:
    • disk quotas per project, about 1/10th of the total disk space for each group
    • and inode quotas per project, about 150 million files and directories.
  • It is created automatically when a batch job starts and, therefore, is unique to each job.
  • It is destroyed automatically at the end of the job. Therefore, it is necessary to manually copy the important files onto another disk space (the WORK or the SCRATCH) before the end of the job.
  • The JOBSCRATCH is a GPFS disk space with a bandwidth of about 500 GB/s for both writing and reading.
  • During the execution of a batch job, the corresponding JOBSCRATCH is accessible from the Jean Zay front end via its JOBID job number (see the output of the squeue command) and the following command:
    $ cd /gpfsssd/jobscratch/JOBID
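
Since the JOBSCRATCH is destroyed at the end of the job, the copy step mentioned above has to be part of the job script itself. The following is a minimal sketch of such a step; the helper name save_results, the program name my_program and the destination directory name are placeholders.

```shell
#!/bin/bash
# save_results is a hypothetical helper. It copies the whole contents of a
# temporary job directory into a destination directory that outlives the job.
save_results() {
    local src="$1" dest="$2"
    mkdir -p -- "$dest" || return 1
    # -r copies recursively; -p keeps the dates and access rights.
    # "$src"/. copies the contents of src, hidden files included.
    cp -rp -- "$src"/. "$dest"/
}

# Sketch of the end of a job script (my_program is a placeholder):
#   cd "$JOBSCRATCH"
#   ./my_program > result.dat
#   save_results "$JOBSCRATCH" "$WORK/results_${SLURM_JOB_ID}"
```

Copying everything is the simplest choice; in practice you may prefer to copy only the files you need, to limit the volume written to the WORK.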

Usage recommendations:

  • The JOBSCRATCH can be seen as the former TMPDIR.
  • The SCRATCH can be seen as a semi-permanent WORK which offers the maximum input/output performance available at IDRIS but limited by a 30-day file lifespan.
  • Because the SCRATCH is semi-permanent and is not purged after each job, large volumes of data can be kept there between two or more jobs which run one after another, provided that they all run within a period of a few weeks.
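
To spot files that are getting close to the 30-day limit, something like the sketch below may help. It assumes that "unused" means neither read nor modified, and that the access and modification times reported by the filesystem reflect this; how the purge is actually computed is not described here, so treat the result as an indication only. The helper name list_stale_files and the 20-day default are arbitrary choices.

```shell
#!/bin/bash
# list_stale_files is a hypothetical helper. It lists the regular files under
# a directory that have been neither read nor modified for more than N days.
list_stale_files() {
    local dir="$1" days="${2:-20}"
    # With find, "+N" matches files whose age, counted in whole 24-hour
    # periods, is greater than N. Both tests must hold for a file to match.
    find "$dir" -type f -atime +"$days" -mtime +"$days" -print
}

# Example: files in the SCRATCH unused for more than 20 days
#   list_stale_files "$SCRATCH" 20
```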

The STORE

$STORE: This is the IDRIS archiving space for long-term storage. Very large files are generally stored there, typically tar archives of directory trees of computation results produced after post-processing. This space is not meant to be accessed or modified on a daily basis but to preserve very large volumes of data over time, with only occasional consultation.
Its characteristics are:

  • The STORE is a permanent space.
  • It is not backed up.
  • We advise against systematically writing to it during a batch job.
  • It is intended to receive very large files: The maximum size is 10 TiB per file and the minimum recommended size is 250 MiB (ratio of disk size to number of inodes).
  • In the case of a multi-project login, a STORE is created per project.
  • It is subject to per-project quotas, with a small number of inodes but a very large space.
  • It is composed of 2 sections:
    • A section in which each user has an individual part, accessed by the command:
      $ cd $STORE
    • A section common to the project to which the user belongs and into which files to be shared can be placed. It is accessed by the command:
      $ cd $ALL_CCFRSTORE

Note: The STORE space is also available through the CCFRSTORE environment variable, which follows a nomenclature shared with the other national computing centers (CINES, TGCC):

$ cd $CCFRSTORE

Usage recommendations:

  • The STORE can be seen as replacing the former Ergon archive server.
  • However, there is no longer a limitation on file lifespan.
  • As this is an archive space, it is not intended for frequent access.
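
Because the STORE has few inodes and favours very large files, a results tree is usually bundled into a single tar file before being placed there. The sketch below shows one way to do this; the helper name archive_dir and the archive naming (<directory name>.tar) are assumptions made for the example.

```shell
#!/bin/bash
# archive_dir is a hypothetical helper. It bundles one directory tree into a
# single tar file placed in the destination directory, and prints its path.
archive_dir() {
    local src="$1" dest_dir="$2"
    local name
    name="$(basename -- "$src").tar"
    mkdir -p -- "$dest_dir" || return 1
    # -C moves into the parent directory first, so that the archive stores
    # paths relative to it instead of absolute paths.
    tar -cf "$dest_dir/$name" -C "$(dirname -- "$src")" "$(basename -- "$src")" \
        && printf '%s\n' "$dest_dir/$name"
}

# Example: archive a post-processed results directory into the STORE
#   archive_dir "$WORK/results" "$STORE/archives"
# The contents can be listed later with: tar -tf <archive>
```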

The DSDIR

$DSDIR: This is a storage space dedicated to public databases that are large (in size or in number of files) and needed for using AI tools. These datasets are visible to all Jean Zay users.

If you use large public databases which are not found in the $DSDIR space, IDRIS will download and install them in this disk space at your request.

The list of currently accessible databases is found on this page: Jean Zay: Datasets and models available in the $DSDIR storage space.

If your database is personal or under a license which is too restrictive, you must take charge of its management yourself in the disk spaces of your project, as described on the Database Management page.

Summary table of the main disk spaces

| Space | Default capacity | Features | Usage |
|---|---|---|---|
| $HOME | 3 GB and 150k inodes per user | Home directory at connection; backed up space | Storage of configuration files and small files |
| $WORK | 5 TB (*) and 500k inodes per project | Storage on rotating disks (100 GB/s read/write operations); backed up space | Storage of source codes and input/output data; execution in batch or interactive |
| $SCRATCH | Very large security quotas; 2.5 PB shared by all users | SSD storage (500 GB/s read/write operations); lifespan of unused files (= not read or modified): 30 days; space not backed up | Storage of voluminous input/output data; execution in batch or interactive; optimal performance for read/write operations |
| $STORE | 50 TB (*) and 100k inodes (*) per project | Space not backed up | Long-term archive storage (for lifespan of project) |

(*) Quotas per project can be increased at the request of the project manager or deputy manager via the Extranet interface, or per request to the user support team.

The snapshots

The $HOME and $WORK are saved regularly via a snapshot mechanism: These snapshots of the tree hierarchies allow you to recover a file or a directory that you have corrupted or deleted by mistake.

All the available snapshots SNAP_YYYYMMDD, where YYYYMMDD corresponds to the backup date, are visible from all the directories of your HOME and your WORK via the following command:

$ ls .snapshots
SNAP_20191022  SNAP_20191220  SNAP_20200303  SNAP_20200511
SNAP_20191112  SNAP_20200127  SNAP_20200406  SNAP_20200609 

Comment: In this example, you can see 8 backups. To recover a file from 9 June 2020, you simply need to select the directory SNAP_20200609.

Important: The .snapshots directory is not visible with the ls -a command so don't be surprised when you don't see it. Only its contents can be consulted.

For example, if you wish to recover a file which was in the $WORK/MY_DIR subdirectory, you just need to follow the procedure below:

  1. Go into the directory of the initial file:
    $ cd $WORK/MY_DIR
  2. You will find the backup which interests you via the ls command:
    $ ls .snapshots
    SNAP_20191022  SNAP_20191220  SNAP_20200303  SNAP_20200511
    SNAP_20191112  SNAP_20200127  SNAP_20200406  SNAP_20200609 
  3. You can then see the contents of your $WORK/MY_DIR directory as it was on 9 June 2020, for example, with the command:
    $ ls -al .snapshots/SNAP_20200609 
    total 2
    drwx--S--- 2 login  prj  4096 oct.  24  2019 .
    dr-xr-xr-x 2 root  root 16384 janv.  1  1970 ..
    -rw------- 1 login  prj 20480 oct.  24  2019 my_file 
  4. Finally, you can recover the file as it was on the date of 9 June 2020 by using the cp command:
    1. By overwriting the initial file, $WORK/MY_DIR/my_file (note the “.” at the end of the command):
      $ cp .snapshots/SNAP_20200609/my_file . 
    2. Or, by renaming the copy as $WORK/MY_DIR/my_file_20200609 in order to not overwrite the initial file $WORK/MY_DIR/my_file:
      $ cp .snapshots/SNAP_20200609/my_file  my_file_20200609 

Comments:

  • The ls -l .snapshots/SNAP_YYYYMMDD command always shows the contents of your current directory as they were on the given date YYYYMMDD.
  • You can add the -p option to the cp command in order to keep the date and the Unix access rights of the recovered file:
    $ cp -p .snapshots/SNAP_20200609/my_file . 
    $ cp -p .snapshots/SNAP_20200609/my_file  my_file_20200609 
  • Files are recovered from your HOME by using the same procedure.
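
The recovery procedure above can be wrapped in a small script. The sketch below is only an illustration: the helper names latest_snapshot and restore_copy are hypothetical, and the script relies on the SNAP_YYYYMMDD naming shown above, which sorts chronologically. It always writes a dated copy, so the current file is never overwritten.

```shell
#!/bin/bash
# latest_snapshot and restore_copy are hypothetical helper names.

# Print the name of the most recent snapshot visible from a directory.
latest_snapshot() {
    local dir="${1:-.}"
    # SNAP_YYYYMMDD names sort chronologically, so the last one after
    # sorting is the most recent.
    ls "$dir/.snapshots" | grep '^SNAP_[0-9]\{8\}$' | sort | tail -n 1
}

# Copy a file from the most recent snapshot next to the current file,
# adding the snapshot date as a suffix (for example my_file_20200609).
restore_copy() {
    local dir="$1" file="$2" snap
    snap=$(latest_snapshot "$dir")
    [ -n "$snap" ] || return 1
    # -p keeps the date and the Unix access rights of the recovered file.
    cp -p -- "$dir/.snapshots/$snap/$file" "$dir/${file}_${snap#SNAP_}"
}

# Example:
#   restore_copy "$WORK/MY_DIR" my_file
```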