Protein folding on Jean Zay

IDRIS provides several software packages for protein folding.

Advice

AlphaFold and ColabFold work in two steps:

  • Multiple sequence alignment
  • Protein folding

The alignment step is quite long and does not run on GPU. It is therefore recommended to perform it outside of a GPU job so as not to waste GPU hours.

You can use the prepost partition for this step and then reuse the results for the folding, for example by chaining the two jobs as sketched below.
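With ColabFold, for instance, a Slurm dependency can be used so that the GPU folding job starts only once the alignment job on the prepost partition has completed successfully (a minimal sketch based on the colab_align.slurm and colab_fold.slurm scripts given further down this page):

jobid=$(sbatch --parsable colab_align.slurm)
sbatch --dependency=afterok:${jobid} colab_fold.slurm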

AlphaFold

Available versions

  • 2.2.4
  • 2.1.2
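The corresponding environment module can be listed and loaded in the usual way, for example:

module avail alphafold
module load alphafold/2.2.4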

Submission script example

For a monomer

alphafold.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=10   # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=alphafold # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS (max 100h for V100, 20h for A100)
##
## Please refer to the comments below for
## more information about these last 4 options.
##SBATCH --account=<account>@gpu       # To specify gpu accounting: <account> = echo $IDRPROJ
##SBATCH --partition=<partition>       # To specify partition (see IDRIS web site for more info)
##SBATCH --qos=qos_gpu-dev      # Uncomment for job requiring less than 2 hours
##SBATCH --qos=qos_gpu-t4      # Uncomment for job requiring more than 20h (max 16 GPU, V100 only)
 
module purge
module load alphafold/2.2.4
export TMP=$JOBSCRATCH
export TMPDIR=$JOBSCRATCH
 
## In this example we do not let the structures relax with OpenMM
 
python3 $(which run_alphafold.py) \
    --output_dir=outputs \
    --uniref90_database_path=${DSDIR}/AlphaFold/uniref90/uniref90.fasta \
    --mgnify_database_path=${DSDIR}/AlphaFold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=${DSDIR}/AlphaFold/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=${DSDIR}/AlphaFold/pdb_mmcif/obsolete.dat \
    --bfd_database_path=${DSDIR}/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=${DSDIR}/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --pdb70_database_path=${DSDIR}/AlphaFold/pdb70/pdb70 \
    --fasta_paths=test.fa \
    --max_template_date=2021-07-28 \
    --use_gpu_relax=False \
    --norun_relax \
    --data_dir=${DSDIR}/AlphaFold/model_parameters/2.2.4

For a multimer

Attention: the FASTA file must contain the different monomers (see the example below).
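For illustration, a FASTA file for a complex of two chains could look like the following (the chain identifiers and sequences are placeholders, not real proteins):

test.fasta
>chain_A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR
>chain_B
GSHMKQLEDKVEELLSKNYHLENEVARLKKLVGER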

alphafold_multimer.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=10   # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=alphafold # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS (max 100h for V100, 20h for A100)
##
## Please refer to the comments below for
## more information about these last 4 options.
##SBATCH --account=<account>@gpu       # To specify gpu accounting: <account> = echo $IDRPROJ
##SBATCH --partition=<partition>       # To specify partition (see IDRIS web site for more info)
##SBATCH --qos=qos_gpu-dev      # Uncomment for job requiring less than 2 hours
##SBATCH --qos=qos_gpu-t4      # Uncomment for job requiring more than 20h (max 16 GPU, V100 only)
 
module purge
module load alphafold/2.2.4
export TMP=$JOBSCRATCH
export TMPDIR=$JOBSCRATCH
 
## In this example we let the structures relax with OpenMM
 
python3 $(which run_alphafold.py) \
    --output_dir=outputs \
    --uniref90_database_path=${DSDIR}/AlphaFold/uniref90/uniref90.fasta \
    --mgnify_database_path=${DSDIR}/AlphaFold/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=${DSDIR}/AlphaFold/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=${DSDIR}/AlphaFold/pdb_mmcif/obsolete.dat \
    --bfd_database_path=${DSDIR}/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --pdb_seqres_database_path=${DSDIR}/AlphaFold/pdb_seqres/pdb_seqres.txt \
    --uniclust30_database_path=${DSDIR}/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --uniprot_database_path=${DSDIR}/AlphaFold/uniprot/uniprot.fasta \
    --use_gpu_relax \
    --model_preset=multimer \
    --fasta_paths=test.fasta \
    --max_template_date=2022-01-01 \
    --data_dir=${DSDIR}/AlphaFold/model_parameters/2.2.4

ColabFold


Advice for the alignment

Alignments are performed with MMseqs2. This software uses a file-reading feature which is very inefficient on Spectrum Scale, the parallel file system of Jean Zay.

If you have a large number of sequences to align, it is possible to copy the databases into memory (/dev/shm) on a prepost node. This is not recommended for a small number of sequences since the in-memory copy itself can take quite a long time.

colab_align.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1                   # Number of nodes
#SBATCH --ntasks-per-node=1         # Number of tasks per node
#SBATCH --cpus-per-task=10          # Number of OpenMP threads per task
#SBATCH --hint=nomultithread        # Disable hyperthreading
#SBATCH --job-name=align_colabfold  # Jobname
#SBATCH --output=%x.o%j             # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j              # Error file
#SBATCH --time=10:00:00             # Expected runtime HH:MM:SS (max 20h)
#SBATCH --partition=prepost  
 
DS=$DSDIR/ColabFold
DB=/dev/shm/ColabFold
input=test.fa
 
mkdir $DB
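# Copy the ColabFold sequence databases into memory (/dev/shm) on the prepost node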
cp $DS/colabfold_envdb_202108_aln* $DS/colabfold_envdb_202108_db.* $DS/colabfold_envdb_202108_db_aln.* $DS/colabfold_envdb_202108_db_h* $DS/colabfold_envdb_202108_db_seq* $DB
cp $DS/uniref30_2103_aln* $DS/uniref30_2103_db.* $DS/uniref30_2103_db_aln.* $DS/uniref30_2103_db_h* $DS/uniref30_2103_db_seq* $DB
cp $DS/*.tsv $DB
 
module purge
module load colabfold/1.3.0
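# Run the MMseqs2 search against the in-memory databases; the alignments are written to the results directory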
colabfold_search ${input} ${DB} results

Submission script example for the folding

colab_fold.slurm
#!/usr/bin/env bash
#SBATCH --nodes=1            # Number of nodes
#SBATCH --ntasks-per-node=1  # Number of tasks per node
#SBATCH --cpus-per-task=10   # Number of OpenMP threads per task
#SBATCH --gpus-per-node=1    # Number of GPUs per node
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --job-name=fold_colabfold  # Jobname
#SBATCH --output=%x.o%j      # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x.o%j       # Error file
#SBATCH --time=10:00:00      # Expected runtime HH:MM:SS (max 100h for V100, 20h for A100)
 
module purge 
module load colabfold/1.3.0 
export TMP=$JOBSCRATCH 
export TMPDIR=$JOBSCRATCH
 
## This script works if you generated the results folder beforehand with colabfold_search (see colab_align.slurm above).
## We do not advise performing the alignment in the same job as the folding.
## The results of the folding will be stored in results_batch.
 
colabfold_batch --data=$DSDIR/ColabFold results results_batch