RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. The output of the program is a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence is currently masked by the program.
- Module Name: repeatmasker (see the modules page for more information)
- ⚠️ RepeatMasker's parallelization option (-pa/-parallel) does not set the maximum number of threads to use. It sets the number of concurrent operations to run, each of which uses a fixed number of threads depending on the aligner (at the time of writing: 4 for RMBlast, 2 for nhmmer; see RepeatMasker -help to confirm). For example, with the RMBlast engine, running with -pa 32 may use up to 128 threads and severely overload the job's allocation. To prevent this, set -pa N, where N is the number of allocated CPUs divided by 4 (RMBlast engine) or 2 (nhmmer engine). See the examples below for details.
- Our RepeatMasker installation uses the Dfam database in conjunction with the final Repbase RepeatMasker edition libraries. Users should be aware of the Repbase academic license agreement before using RepeatMasker on the NIH HPC Systems.
- The Repbase RepeatMasker Edition libraries are not updated past 2018, as the RepeatMasker team has shifted towards the Dfam database rather than maintain and reconcile the latest RepBase versions. See https://github.com/Dfam-consortium/RepeatMasker/issues/113#issuecomment-848973098 for a more detailed explanation.
- On Biowulf, RepeatMasker has been configured to have RMBlast as the default search engine. HMMER can be used instead by passing -engine hmmer to the RepeatMasker command.
- Setting a species/clade: As of v4.1.1, RepeatMasker uses the FamDB format for the Dfam database. You may notice that RepeatMasker is stricter about which values it accepts for the -species flag. To check for valid names, query the database using famdb.py. See famdb.py --help for usage information, and the following example using our copy of the database:
famdb.py -i /fdb/dfam/current/Dfam.h5 names mammal
- Additional RepeatMasker databases are available via the repeatmasker-lib module:
- dfam-3.8-t2t-human-ape-extension_rmrb-20181026
- RepBase RepeatMasker edition + Dfam 3.8 augmented with repeats found in T2T-CHM13 according to https://github.com/jessicaStorer88/RepeatMasker_library_CHM13
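The -pa guidance in the notes above comes down to one integer division. The following is a minimal sketch, not part of RepeatMasker: the function name pa_for_engine is hypothetical, and the threads-per-operation values (4 for RMBlast, 2 for nhmmer) are taken from the warning above and should be confirmed with RepeatMasker -help for the installed version.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with RepeatMasker): derive a -pa value
# from the number of allocated CPUs and the chosen search engine.
# Assumption: 4 threads per operation for rmblast, 2 for hmmer.
pa_for_engine() {
    cpus=$1
    engine=$2
    case "$engine" in
        rmblast) per_op=4 ;;
        hmmer)   per_op=2 ;;
        *) echo "unknown engine: $engine" >&2; return 1 ;;
    esac
    pa=$((cpus / per_op))
    # Integer division can yield 0; fall back to one concurrent operation.
    # Note that one operation still uses per_op threads.
    [ "$pa" -ge 1 ] || pa=1
    echo "$pa"
}

# Example with 32 allocated CPUs:
pa_for_engine 32 rmblast   # prints 8
pa_for_engine 32 hmmer     # prints 16
```

Inside a job, the result could be passed along as RepeatMasker -pa "$(pa_for_engine "$SLURM_CPUS_PER_TASK" rmblast)" sequence.fasta.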
Allocate an interactive session and run the program.
Sample session (user input follows the shell prompts):
[user@biowulf]$ sinteractive --cpus-per-task 4
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load repeatmasker
[user@cn3144 ~]$ RepeatMasker -engine hmmer -pa $(expr $SLURM_CPUS_PER_TASK / 2) sequence.fasta # for any sequence file sequence.fasta
RepeatMasker version open-4.0.7
Search Engine: HMMER [ 3.1b2 (February 2015) ]
Master RepeatMasker Database: /usr/local/apps/repeatmasker/4.0.7/Libraries/Dfam.hmm ( Complete Database: Dfam_2.0 )
analyzing file sequence.fasta
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output:
cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 cycle 9 cycle 10
Generating output...
masking done
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. repeatmasker.sh). For example:
#!/bin/sh
set -e
module load repeatmasker
RepeatMasker -engine rmblast -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample.fasta
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=# [--mem=#] repeatmasker.sh
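Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is given, so the expr call in the batch script fails if that option is omitted. A pre-flight check along the following lines could make that failure easier to diagnose. This is a sketch only: the function name check_cpus is hypothetical, and the minimum of 4 CPUs assumes the RMBlast engine as described in the -pa warning.

```shell
#!/bin/sh
# Hypothetical pre-flight check for repeatmasker.sh (not part of RepeatMasker).
# Assumption: each concurrent RMBlast operation uses 4 threads.
check_cpus() {
    cpus=$1
    need=4
    if [ -z "$cpus" ]; then
        echo "SLURM_CPUS_PER_TASK is not set; submit with --cpus-per-task" >&2
        return 1
    fi
    if [ "$cpus" -lt "$need" ]; then
        echo "need at least $need CPUs for -engine rmblast, got $cpus" >&2
        return 1
    fi
    if [ $((cpus % need)) -ne 0 ]; then
        # Not an error: the remainder is simply unused by -pa $((cpus / need)).
        echo "note: $cpus CPUs is not a multiple of $need; $((cpus % need)) will sit idle" >&2
    fi
    return 0
}

# In the batch script, call it before running RepeatMasker:
# check_cpus "$SLURM_CPUS_PER_TASK" || exit 1
```

Requesting a CPU count that is a multiple of 4 (for example --cpus-per-task=8 for -pa 2) avoids paying for CPUs that RepeatMasker will not use.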
Create a swarmfile (e.g. repeatmasker.swarm). For example:
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample1.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample2.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample3.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample4.fasta
Submit this job using the swarm command.
swarm -f repeatmasker.swarm [-g #] -t # --module repeatmasker
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module repeatmasker | Loads the repeatmasker module for each subjob in the swarm |
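For more than a handful of inputs, the swarmfile can be generated rather than typed by hand. The sketch below writes one RepeatMasker line per FASTA file in a directory. The function name make_swarm and the *.fasta naming pattern are assumptions for illustration, the divisor 4 assumes the default RMBlast engine, and file names containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: write one RepeatMasker command per *.fasta file.
# Usage: make_swarm <directory> <output swarmfile>
make_swarm() {
    dir=$1
    out=$2
    : > "$out"                      # create or truncate the swarmfile
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue     # skip the literal pattern if nothing matches
        # The single-quoted format keeps $SLURM_CPUS_PER_TASK unexpanded,
        # so it is evaluated inside each subjob rather than here.
        printf 'RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) %s\n' "$f" >> "$out"
    done
}

# Example:
# make_swarm /path/to/fastas repeatmasker.swarm
```

The resulting file can then be submitted with the swarm command shown above, choosing -t as a multiple of 4 so that the computed -pa value is at least 1.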