RepeatMasker on Biowulf

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. Its output is a detailed annotation of the repeats present in the query sequence, together with a modified version of the query sequence in which all annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence is currently masked by the program.

Documentation

Important Notes

Module Name: repeatmasker (see the modules page for more information)
⚠️ RepeatMasker's parallelization option (-pa/-parallel) does not set the maximum number of threads to use.⚠️
Instead, it sets how many concurrent operations to run, each of which uses a fixed number of threads that depends on the aligner (at the time of writing: 4 for RMBlast, 2 for nhmmer; see RepeatMasker -help to confirm). For example, running with -pa 32 and the RMBlast engine may use up to 32 x 4 = 128 threads, severely overloading the job's allocation. To prevent this, set -pa N, where N is the number of allocated CPUs divided by 4 (RMBlast engine) or by 2 (nhmmer engine). See the examples below for details.
Our RepeatMasker installation uses the Dfam database in conjunction with the final Repbase RepeatMasker edition libraries. Users should be aware of the Repbase academic license agreement before using RepeatMasker on the NIH HPC Systems.

The Repbase RepeatMasker Edition libraries have not been updated since 2018, as the RepeatMasker team has shifted to the Dfam database rather than maintaining and reconciling the latest Repbase versions. See https://github.com/Dfam-consortium/RepeatMasker/issues/113#issuecomment-848973098 for a more detailed explanation.

On Biowulf, RepeatMasker is configured with RMBlast as the default search engine. To use HMMER instead, pass -engine hmmer to the RepeatMasker command.
Setting a species/clade: As of v4.1.1, RepeatMasker uses the FamDB format for the Dfam database. As a result, RepeatMasker may be stricter about which names it accepts for the -species flag. To check for valid names, query the database with famdb.py. See famdb.py --help for usage information; the following example uses our copy of the database:
famdb.py -i /fdb/dfam/current/Dfam.h5 names mammal
Additional RepeatMasker databases are available via the repeatmasker-lib module:

dfam-3.8-t2t-human-ape-extension_rmrb-20181026

Repbase RepeatMasker Edition + Dfam 3.8, augmented with repeats found in T2T-CHM13 according to https://github.com/jessicaStorer88/RepeatMasker_library_CHM13

After the repeatmasker-lib module is loaded, RepeatMasker uses this library automatically; no additional command-line arguments are needed.
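
The -pa sizing rule from the notes above can be sketched as a small shell helper. This is only an illustration, not part of the Biowulf installation: the function name pa_for_engine is hypothetical, the threads-per-operation values (4 for RMBlast, 2 for nhmmer) are the ones quoted above and should be confirmed with RepeatMasker -help, and the floor of 1 is an added assumption so that a small allocation never yields -pa 0.

```shell
# Hypothetical helper (not part of RepeatMasker): derive -pa from a CPU allocation.
# Usage: pa_for_engine <allocated_cpus> <rmblast|hmmer>
pa_for_engine() {
    case "$2" in
        rmblast) threads_per_op=4 ;;  # value quoted in the notes above
        hmmer)   threads_per_op=2 ;;  # nhmmer; value quoted in the notes above
        *) echo "unknown engine: $2" >&2; return 1 ;;
    esac
    pa=$(( $1 / threads_per_op ))
    # Assumption: clamp to 1 so a small allocation never produces -pa 0.
    if [ "$pa" -lt 1 ]; then pa=1; fi
    echo "$pa"
}

# Print (do not run) the command for the current allocation.
# Outside a Slurm job the variable is unset, so 4 CPUs are assumed here.
echo "RepeatMasker -engine rmblast -pa $(pa_for_engine "${SLURM_CPUS_PER_TASK:-4}" rmblast) sample.fasta"
```

Even with the clamp, a job with fewer CPUs than a single operation needs will still run more threads than it was allocated, so it is safer to request at least 4 CPUs (RMBlast) or 2 CPUs (nhmmer).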

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the `$` prompt):

[user@biowulf]$ sinteractive --cpus-per-task 4
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load repeatmasker
[user@cn3144 ~]$ RepeatMasker -engine hmmer -pa $(expr $SLURM_CPUS_PER_TASK / 2) sequence.fasta # for any sequence file sequence.fasta
RepeatMasker version open-4.0.7
Search Engine: HMMER [ 3.1b2 (February 2015) ]
Master RepeatMasker Database: /usr/local/apps/repeatmasker/4.0.7/Libraries/Dfam.hmm ( Complete Database: Dfam_2.0 )


analyzing file sequence.fasta
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. repeatmasker.sh). For example:

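#!/bin/sh
# Stop at the first failing command.
set -e
module load repeatmasker
# RMBlast uses 4 threads per operation, so run one operation per 4 allocated CPUs.
RepeatMasker -engine rmblast -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample.fasta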

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=# [--mem=#] repeatmasker.sh
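
To help choose a value for --cpus-per-task, the short loop below prints the -pa value that the batch script above would compute for a few allocations. It is a sketch only: the CPU counts are arbitrary examples, and the divisor of 4 assumes the RMBlast engine used in that script.

```shell
# Print the -pa value repeatmasker.sh would compute for a few example allocations.
for cpus in 4 8 16 32; do
    pa=$(expr "$cpus" / 4)   # same arithmetic as in the batch script (RMBlast)
    echo "sbatch --cpus-per-task=$cpus repeatmasker.sh  ->  -pa $pa"
done
```

Allocations that are multiples of 4 leave no CPUs idle with RMBlast; for the nhmmer engine the same reasoning applies with a divisor of 2.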

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. repeatmasker.swarm). For example:

RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample1.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample2.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample3.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample4.fasta

Submit this job using the swarm command.

swarm -f repeatmasker.swarm [-g #] -t # --module repeatmasker

where

`-g #`	Number of Gigabytes of memory required for each process (1 line in the swarm command file)
`-t #`	Number of threads/CPUs required for each process (1 line in the swarm command file).
`--module repeatmasker`	Loads the repeatmasker module for each subjob in the swarm
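
When there are many input files, the swarmfile can be generated rather than written by hand. The following is a minimal sketch under stated assumptions: the inputs are *.fasta files in a single directory, their paths contain no spaces, and the function name make_swarmfile is hypothetical. The single quotes keep $(expr ...) literal, so the division is evaluated inside each subjob, where SLURM_CPUS_PER_TASK is set.

```shell
# Hypothetical helper: write one RepeatMasker command per FASTA file.
# Usage: make_swarmfile <input_dir> <swarmfile>
make_swarmfile() {
    : > "$2"                          # start with an empty swarmfile
    for fasta in "$1"/*.fasta; do
        [ -e "$fasta" ] || continue   # skip the unmatched glob when no files exist
        # Single quotes: the division must run in the subjob, not now.
        printf 'RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) %s\n' "$fasta" >> "$2"
    done
}

# Example call (commented out so that defining the function has no side effects):
# make_swarmfile /path/to/fasta_dir repeatmasker.swarm
```

The resulting file can then be submitted with the swarm command shown above.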