Protein folding on Jean Zay

IDRIS provides several protein folding software packages.

Tips

Using AlphaFold and ColabFold involves two steps:

  • Multiple sequence alignment
  • Folding of the protein

The sequence alignment step is fairly long and has not been ported to GPU. It is best to run it outside of a GPU reservation so as not to waste compute hours.

One option is to run it on the prepost partition, then use its results for the folding step.
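As a sketch of this two-step workflow (the script names align.slurm and fold.slurm are placeholders, not files provided by IDRIS), a Slurm job dependency can hold the GPU folding job until the CPU alignment job has completed successfully:

```shell
# Submit the alignment job on the prepost partition and capture its job ID.
# align.slurm and fold.slurm are hypothetical script names.
jid=$(sbatch --parsable align.slurm)

# Submit the GPU folding job; it will only start if the alignment succeeds.
sbatch --dependency=afterok:${jid} fold.slurm
```

With `afterok`, the folding job is cancelled automatically if the alignment job fails, so no GPU hours are consumed on incomplete alignments.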

AlphaFold

Useful links

Available versions

  • 2.3.1
  • 2.2.4
  • 2.1.2
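To check which versions are actually installed (the list above may evolve), you can query the module environment with the standard `module avail` command:

```shell
# List the AlphaFold modules available on the machine.
module avail alphafold
```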

Example submission scripts

AlphaFold 2.3.1

Monomer A100
alphafold-2.3.1-A100.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=8    # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH -C a100              # Use A100 partition
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=alphafold # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS (max 20h)
##
## Please refer to the comments below for
## more information about the last 3 options.
##SBATCH --account=<account>@a100      # To specify gpu accounting: <account> = echo $IDRPROJ
##SBATCH --partition=<partition>       # To specify partition (see IDRIS web site for more info)
##SBATCH --qos=qos_gpu-dev             # Uncomment for job requiring less than 2 hours
 
module purge
module load cpuarch/amd
module load alphafold/2.3.1
export TMP=$JOBSCRATCH
export TMPDIR=$JOBSCRATCH
 
fafile=test.fa
 
python3 $(which run_alphafold.py) \
    --output_dir=outputs_${fafile} \
    --uniref90_database_path=${ALPHAFOLDDB}/uniref90/uniref90.fasta \
    --mgnify_database_path=${ALPHAFOLDDB}/mgnify/mgy_clusters_2022_05.fa \
    --template_mmcif_dir=${ALPHAFOLDDB}/pdb_mmcif \
    --obsolete_pdbs_path=${ALPHAFOLDDB}/pdb_mmcif/obsolete.dat \
    --bfd_database_path=${ALPHAFOLDDB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --pdb70_database_path=${ALPHAFOLDDB}/pdb70/pdb70 \
    --uniref30_database_path=${ALPHAFOLDDB}/uniref30/UniRef30_2021_03 \
    --use_gpu_relax \
    --model_preset=monomer \
    --fasta_paths=${fafile} \
    --max_template_date=2022-01-01 \
    --data_dir=${ALPHAFOLDDB}/model_parameters/2.3.1

For a monomer

alphafold.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=10   # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=alphafold # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS (max 100h)
##
## Please refer to the comments below for
## more information about the last 4 options.
##SBATCH --account=<account>@v100       # To specify gpu accounting: <account> = echo $IDRPROJ
##SBATCH --partition=<partition>        # To specify partition (see IDRIS web site for more info)
##SBATCH --qos=qos_gpu-dev              # Uncomment for job requiring less than 2 hours
##SBATCH --qos=qos_gpu-t4               # Uncomment for job requiring more than 20h (max 16 GPUs)
 
module purge
module load alphafold/2.2.4
export TMP=$JOBSCRATCH
export TMPDIR=$JOBSCRATCH
 
## In this example we do not let the structures relax with OpenMM
 
python3 $(which run_alphafold.py) \
    --output_dir=outputs \
    --uniref90_database_path=${DSDIR}/AlphaFold/uniref90/uniref90.fasta \
    --mgnify_database_path=${DSDIR}/AlphaFold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=${DSDIR}/AlphaFold/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=${DSDIR}/AlphaFold/pdb_mmcif/obsolete.dat \
    --bfd_database_path=${DSDIR}/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=${DSDIR}/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --pdb70_database_path=${DSDIR}/AlphaFold/pdb70/pdb70 \
    --fasta_paths=test.fa \
    --max_template_date=2021-07-28 \
    --use_gpu_relax=False \
    --norun_relax \
    --data_dir=${DSDIR}/AlphaFold/model_parameters/2.2.4

For a multimer

Warning: the FASTA file must contain the sequences of all the monomers.
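As an illustration, a minimal multimer FASTA file contains one record per chain. The sequences below are hypothetical placeholders, not real proteins:

```shell
# Write a hypothetical two-chain FASTA file; each ">" header starts one monomer.
cat > test.fasta <<'EOF'
>chain_A_placeholder
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>chain_B_placeholder
MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGR
EOF

# Sanity check: count the header lines (one per monomer chain).
grep -c '^>' test.fasta   # → 2
```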

alphafold_multimer.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=10   # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=alphafold # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS (max 100h for V100, 20h for A100)
##
## Please refer to the comments below for
## more information about the last 4 options.
##SBATCH --account=<account>@v100       # To specify gpu accounting: <account> = echo $IDRPROJ
##SBATCH --partition=<partition>        # To specify partition (see IDRIS web site for more info)
##SBATCH --qos=qos_gpu-dev              # Uncomment for job requiring less than 2 hours
##SBATCH --qos=qos_gpu-t4               # Uncomment for job requiring more than 20h (max 16 GPUs, V100 only)
 
module purge
module load alphafold/2.2.4
export TMP=$JOBSCRATCH
export TMPDIR=$JOBSCRATCH
 
## In this example we let the structures relax with OpenMM
 
python3 $(which run_alphafold.py) \
    --output_dir=outputs \
    --uniref90_database_path=${DSDIR}/AlphaFold/uniref90/uniref90.fasta \
    --mgnify_database_path=${DSDIR}/AlphaFold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=${DSDIR}/AlphaFold/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=${DSDIR}/AlphaFold/pdb_mmcif/obsolete.dat \
    --bfd_database_path=${DSDIR}/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --pdb_seqres_database_path=${DSDIR}/AlphaFold/pdb_seqres/pdb_seqres.txt \
    --uniclust30_database_path=${DSDIR}/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --uniprot_database_path=${DSDIR}/AlphaFold/uniprot/uniprot.fasta \
    --use_gpu_relax \
    --model_preset=multimer \
    --fasta_paths=test.fasta \
    --max_template_date=2022-01-01 \
    --data_dir=${DSDIR}/AlphaFold/model_parameters/2.2.4

ColabFold

Useful links

Tips for the alignment step

The software used for the alignment step is MMseqs2. The way it reads the database files is very inefficient on the Spectrum Scale shared file system used at IDRIS.

If you have a large number of sequences to fold, you can copy the database into the RAM (/dev/shm) of a prepost node to speed up the computation. This is not worthwhile if you have fewer than 20 sequences.

colab_align.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1                   # Number of nodes
#SBATCH --ntasks-per-node=1         # Number of tasks per node
#SBATCH --cpus-per-task=10          # Number of OpenMP threads per task
#SBATCH --hint=nomultithread        # Disable hyperthreading
#SBATCH --job-name=align_colabfold  # Jobname
#SBATCH --output=%x.o%j             # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j              # Error file
#SBATCH --time=10:00:00             # Expected runtime HH:MM:SS (max 20h)
#SBATCH --partition=prepost  
 
DS=$DSDIR/ColabFold
DB=/dev/shm/ColabFold
input=test.fa
 
mkdir $DB
cp $DS/colabfold_envdb_202108_aln* $DS/colabfold_envdb_202108_db.* $DS/colabfold_envdb_202108_db_aln.* $DS/colabfold_envdb_202108_db_h* $DS/colabfold_envdb_202108_db_seq* $DB
cp $DS/uniref30_2103_aln* $DS/uniref30_2103_db.* $DS/uniref30_2103_db_aln.* $DS/uniref30_2103_db_h* $DS/uniref30_2103_db_seq* $DB
cp $DS/*.tsv $DB
 
module purge
module load colabfold/1.3.0
colabfold_search ${input} ${DB} results

Example submission script for the folding step

colab_fold.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=10   # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=colabfold # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS
 
module purge 
module load colabfold/1.3.0 
export TMP=$JOBSCRATCH 
export TMPDIR=$JOBSCRATCH
 
## This script assumes that the results folder was generated beforehand
## with colabfold_search (see colab_align.slurm above).
## We do not advise performing the alignment in the same job as the folding.
## The folding results will be stored in results_batch.
 
colabfold_batch --data=$DSDIR/ColabFold results results_batch