⚠ INFORMATION
This page was translated by an AI (LLM) with a cursory human check and is awaiting full review.

Datasets and models available for AI in DSDIR

On Jean Zay, the DSDIR is a storage space dedicated to large datasets (in size and/or number of files) and widely used model collections that are necessary for the use of Artificial Intelligence tools. These datasets are public and visible to all Jean Zay users.

If you need a public dataset or model that is not yet in the DSDIR, IDRIS will download and install it there on request. Send your request to assist@idris.fr.

Public databases

Licence

IDRIS verifies that the databases present in the DSDIR may be distributed under the terms of their licences. How you use a database is your own responsibility and must also comply with the terms of its licence. You will find the licence of each database in the corresponding directory.

Datasets present on the HuggingFace Hub

Some datasets available on the HuggingFace Hub datasets page are already downloaded in the $DSDIR/HuggingFace/ directory.

To load one of these datasets, use the following lines of code:

import os
import datasets

root_path = os.environ['DSDIR'] + '/HuggingFace'
dataset_name = 'dataset_name'    # replace with the name of the dataset
dataset_subset = 'subset_name'   # replace with the name of the subset

dataset = datasets.load_from_disk(root_path + '/' + dataset_name + '/' + dataset_subset)
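
As a slightly more defensive variant of the lines above, the following minimal sketch builds the dataset path with os.path.join and checks that the directory exists before loading it. The helper names dsdir_dataset_path and load_dsdir_dataset are hypothetical (they are not part of the datasets library); datasets.load_from_disk is the same call as above.

```python
import os


def dsdir_dataset_path(dataset_name, subset_name):
    """Build the on-disk path of a HuggingFace dataset stored in the DSDIR."""
    # DSDIR is the environment variable defined on Jean Zay.
    root_path = os.path.join(os.environ["DSDIR"], "HuggingFace")
    return os.path.join(root_path, dataset_name, subset_name)


def load_dsdir_dataset(dataset_name, subset_name):
    """Load a dataset from the DSDIR, failing early if the path is missing."""
    path = dsdir_dataset_path(dataset_name, subset_name)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"No dataset directory at {path}")
    # Imported lazily so that the path helper alone needs no third-party package.
    import datasets
    return datasets.load_from_disk(path)
```

Failing early with an explicit path makes a misspelled dataset or subset name easier to spot than an error raised deeper inside the loading code.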

Search in the complete list of datasets on Jean Zay

To search for a dataset directly on Jean Zay, and thus work from an up-to-date list, use the following command, replacing search_string with the full or partial name of the dataset you are looking for (the command takes about 20 seconds to run):

find $DSDIR $DSDIR/HuggingFace/ -maxdepth 2 -path "$DSDIR/HuggingFace" -prune -o -type d -iname "*search_string*" -print
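
If you prefer to run the search from a Python program, the sketch below approximates the behaviour of the command above for a single starting directory: it lists the directories found at most two levels down whose name contains the search string, ignoring case. The function name find_dirs is hypothetical, and the sketch uses a plain substring match instead of the wildcard pattern accepted by find; it does not reproduce the -prune logic.

```python
import os


def find_dirs(root, search_string, max_depth=2):
    """Return directories at most max_depth levels below root whose
    name contains search_string (case-insensitive)."""
    needle = search_string.lower()
    matches = []

    def walk(directory, depth):
        # Entries of `directory` sit `depth` levels below the root.
        if depth > max_depth:
            return
        try:
            entries = sorted(os.scandir(directory), key=lambda e: e.name)
        except OSError:
            return  # unreadable directory: skip it
        for entry in entries:
            # Like find's -type d, symbolic links are not treated as directories.
            if entry.is_dir(follow_symlinks=False):
                if needle in entry.name.lower():
                    matches.append(entry.path)
                walk(entry.path, depth + 1)

    walk(root, 1)
    return matches


# Possible usage on Jean Zay (assumes the DSDIR variable is set):
# for path in find_dirs(os.path.join(os.environ["DSDIR"], "HuggingFace"), "search_string"):
#     print(path)
```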

Public models (HuggingFace Hub)

A large number of models among the most downloaded from the HuggingFace Hub are already downloaded in the $DSDIR/HuggingFace_Models/ directory.

Licence

Most of the available models are subject to an open source licence. For more details on the terms of use of each of the models, you can refer to the source page of the model (whose link is in the source.txt file in the directory of each model) or to the list below. The licence associated with the model can be found in the tags at the top of the page.

The following file summarises some terms and conditions of the licences under which the models are published.

Using a model available on the DSDIR

The models are organised as follows:

  • they are located in the folder: $DSDIR/HuggingFace_Models/ (referred to as root below)
  • each model is located in the folder: root/model_name (e.g.: root/cross-encoder/ms-marco-MiniLM-L-12-v2/)

To load a model from the DSDIR, call the from_pretrained function of the model class you want to load (this requires importing the transformers library in your program):

  • transformers.AutoModel.from_pretrained(root+'/'+model_name) for a generic model
  • transformers.BertModel.from_pretrained(root+'/'+'bert_base_uncased') to load a specific model supported in the HuggingFace API.

Similarly, the tokenizers associated with each model are located in the model's folder. You also need to use the from_pretrained function of the desired tokenizer:

  • transformers.AutoTokenizer.from_pretrained(root+'/'+model_name) for a tokenizer associated with a generic model
  • transformers.BertTokenizer.from_pretrained(root+'/'+'bert_base_uncased') for a tokenizer associated with a model supported in the HuggingFace API.
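
Putting the two steps together, here is a minimal sketch that loads a generic model and its tokenizer from the same DSDIR folder. The helper names dsdir_model_path and load_model_and_tokenizer are hypothetical; only AutoModel.from_pretrained and AutoTokenizer.from_pretrained come from the transformers library.

```python
import os


def dsdir_model_path(model_name):
    """Return the folder of a model stored under $DSDIR/HuggingFace_Models."""
    root = os.path.join(os.environ["DSDIR"], "HuggingFace_Models")
    return os.path.join(root, model_name)


def load_model_and_tokenizer(model_name):
    """Load a generic model and its tokenizer from the same DSDIR folder."""
    # Imported lazily so that the path helper alone needs no third-party package.
    import transformers

    path = dsdir_model_path(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(path)
    model = transformers.AutoModel.from_pretrained(path)
    return model, tokenizer


# Possible usage, with the example folder given in the layout above:
# model, tokenizer = load_model_and_tokenizer("cross-encoder/ms-marco-MiniLM-L-12-v2")
```

Because both objects are loaded from one local path, the tokenizer is guaranteed to be the one stored alongside the model.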

Search in the complete list of models on Jean Zay

To directly search for a model on Jean Zay and thus have access to an up-to-date list, you can use the following command by replacing search_string with the name (even partial) of the model you are looking for or the author of the model:

find $DSDIR/HuggingFace_Models -maxdepth 2 -type d -iname '*search_string*'
