Tips on running AlphaFold-related applications on Biowulf
Biowulf offers several AlphaFold-related applications, including Alphafold2, Alphafold3, ColabFold and AlphaPulldown, all of which have similar workflows and output files. This page provides brief guidance on selecting the appropriate application, running AlphaFold jobs efficiently and managing intermediate and result files effectively.
Choose the appropriate app
- Alphafold2: suitable for high-accuracy structure prediction of single proteins or protein complexes. Note, however, that "AlphaFold is trained to predict the structure of proteins as they might appear in the PDB" (mostly macromolecules with naturally occurring sequences). For limitations of Alphafold-related models (for example, mutants or disordered regions), see details on the Alphafold FAQ.
- ColabFold: an accelerated AlphaFold2 implementation that replaces the original homology search with MMseqs2, a faster and more scalable algorithm. It can batch-process multiple input sequences, which makes it suitable for high-throughput structure prediction of proteins or protein complexes.
- Alphafold3: specialized for high-accuracy structure prediction of protein-DNA and protein-ligand complexes; it also has significantly higher antibody-antigen prediction accuracy. For limitations of Alphafold3, see details on the Alphafold3 FAQ.
- AlphaPulldown: a customized implementation of AlphaFold2-Multimer, designed to facilitate high-throughput screening and modeling of protein-protein interactions (PPIs).
Run jobs in two steps
It is strongly recommended to split AlphaFold-related jobs into two separate steps on Biowulf to maximize computing resource efficiency, particularly GPU usage. The first step—the Multiple Sequence Alignment (MSA) search—is CPU-intensive and does not require GPUs, while the subsequent model inference and structure prediction steps are GPU-intensive. By separating these stages, you can allocate resources more effectively and improve overall job efficiency.
- CPU-based MSA step: MSA generation is CPU-intensive and should be run on CPU nodes on Biowulf. See each application's documentation page above for details, for instance the Alphafold2 example.
- GPU-based modeling step: Use the MSA results generated in the first step as input for this stage. Run the model inference job on a single GPU, such as a100, v100x or v100 on Biowulf. Refer to the above documentation for each application for detailed instructions on submitting jobs to the GPU partition.
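The two-step pattern above can be sketched as a pair of Slurm submissions in which the GPU job is held until the CPU job has finished successfully. This is only an illustrative sketch, not the documented Biowulf procedure: the script names `msa_step.sh` and `model_step.sh`, the CPU and memory amounts, and the partition name `gpu` are assumptions, while `--cpus-per-task`, `--mem`, `--partition`, `--gres` and `--dependency=afterok` are standard `sbatch` options. The code only builds and prints the commands; it does not submit anything.

```python
import shlex


def msa_command(script="msa_step.sh", cpus=16, mem_gb=64):
    """Build the sbatch command for the CPU-only MSA step (no GPU requested).

    The script name and resource amounts are placeholders, not recommendations.
    """
    return [
        "sbatch",
        f"--cpus-per-task={cpus}",
        f"--mem={mem_gb}g",
        script,
    ]


def model_command(msa_job_id, script="model_step.sh", gpu_type="a100", partition="gpu"):
    """Build the sbatch command for the GPU step, held until the MSA job succeeds."""
    return [
        "sbatch",
        f"--partition={partition}",            # assumed partition name
        f"--gres=gpu:{gpu_type}:1",            # a single GPU, as recommended above
        f"--dependency=afterok:{msa_job_id}",  # start only if the MSA job exits with code 0
        script,
    ]


if __name__ == "__main__":
    # "12345" stands in for the job ID that sbatch reports for the first submission.
    print(shlex.join(msa_command()))
    print(shlex.join(model_command("12345")))
```

Chaining the jobs this way means the GPU is requested only for the inference stage, so it does not sit idle while the alignment search runs. The actual resource requests should come from each application's documentation page.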
Clean up intermediate files
- Retain Only the Top-Scoring Structures: For each MSA input, multiple predicted structures (by default, five) will be generated. These structures are evaluated and ranked based on their confidence scores. The output pdb files are labeled as "rank_n" to indicate their ranking, except in the case of AlphaFold3, which generates a ranking_scores.csv file listing the rankings of predicted structures in cif format (instead of pdb). It is recommended to remove lower-scoring structures and their associated intermediate files (such as .pkl and .json files) to conserve storage space.
- Analyze .pkl Results Files: The model .pkl files in Alphafold2 (or .json files in Alphafold3) contain extensive information about the predicted models and can be significantly larger than the corresponding .pdb files, sometimes up to a thousand times larger. These files are primarily used to generate figures such as the predicted Local Distance Difference Test (pLDDT) and Predicted Aligned Error (PAE). Consider deleting them after analysis, especially if you do not need to generate these figures.
- Reuse or Delete MSA Result Files: After structure prediction is complete, consider either reusing or deleting the MSA result files generated during the initial CPU-based MSA step. These files (including .msa and .sto alignment files, as well as .pkl feature files) can consume significant disk space, often several to dozens of MB each. Retaining only the necessary files will help optimize storage usage.
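Before deleting anything, it can help to see which file types dominate an output directory. The following is a minimal sketch under stated assumptions: the helper names `usage_by_suffix` and `cleanup_candidates` are hypothetical, and the default suffix list (`.pkl`, `.sto`, `.msa`) is taken from the file types named above rather than from any application's documented layout. The script only reports sizes and lists candidates; it never removes files.

```python
from collections import defaultdict
from pathlib import Path


def usage_by_suffix(out_dir):
    """Return total bytes per file suffix under out_dir (searched recursively)."""
    totals = defaultdict(int)
    for path in Path(out_dir).rglob("*"):
        if path.is_file():
            totals[path.suffix] += path.stat().st_size
    return dict(totals)


def cleanup_candidates(out_dir, suffixes=(".pkl", ".sto", ".msa")):
    """List files with the given suffixes, largest first. Nothing is deleted."""
    files = [
        p for p in Path(out_dir).rglob("*")
        if p.is_file() and p.suffix in suffixes
    ]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


if __name__ == "__main__":
    import sys

    # Directory to inspect; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for suffix, size in sorted(usage_by_suffix(target).items(), key=lambda kv: -kv[1]):
        print(f"{suffix or '(none)':>10} {size / 1e6:10.1f} MB")
    for path in cleanup_candidates(target):
        print("candidate:", path)
```

Review the candidate list by hand before removing anything, since the top-ranked structures and any files still needed for pLDDT or PAE figures should be kept.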