High-Performance Computing at the NIH
RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. The output of the program is a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence is currently masked by the program. The RepeatMasker program was developed at the University of Washington by Arian Smit.

RepeatMasker uses the Repbase libraries. Users should be aware of the Repbase academic license agreement before using RepeatMasker on the Helix Systems.

The input for RepeatMasker is a FASTA-format sequence file, which may contain multiple sequences. On Helix/Biowulf, RepeatMasker has been configured with HMMER as the default search engine. Please contact staff@hpc.nih.gov if you have a particular need for a different search engine.
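
As a quick sanity check before a run, the number of sequences in a multi-sequence FASTA file can be counted from its header lines. This is a minimal sketch: count_fasta_records is a hypothetical helper name, and the demo file is a throwaway created only for illustration.

```shell
# Hypothetical helper: count the records in a FASTA file.
# Each FASTA record starts with a header line beginning with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demo on a small throwaway file with two records
tmp=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$tmp"
count_fasta_records "$tmp"   # prints 2
rm -f "$tmp"
```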

Web site

IMPORTANT NOTE:
By default, RepeatMasker starts 2 threads for every CPU in the node, which badly overloads jobs. Users must prevent this by setting -pa N, where N is the number of allocated CPUs divided by 2. N must always be greater than or equal to 2: -pa 1 causes RepeatMasker to run as though the -pa option had not been specified and to start 2 threads for each CPU in the node. See the examples below for details.

RepeatMasker On Helix

In this sample session, a sequence is obtained with the EMBOSS seqret program and then analyzed with RepeatMasker. Note that on Helix, users should not use more than 8 CPUs, and RepeatMasker starts 2 threads for every processor allocated with the -pa option. Users should therefore set -pa to 4 or less on Helix.

[user@helix ~]$ module load emboss repeatmasker

************************************************************************
                     Welcome to EMBOSS 6.3.0
************************************************************************

Databases available: 
  genbank             Release 213 (14/Apr/16)
  ncbigp              NCBI genpept Rel 213 (14/Apr/16)
  refseqnt            Release 75 (14/Mar/16)
  PROSITE             Release 20.126 (11/May/16)
  Restriction Enzymes (REBASE) 605 (30/Apr/16)
  prints              Release 35_0 (23/Jul/02)
  uniprot             Release 2016_05 (11/May/16)
  allnt               including genbank,refseqnt,gbnew
  allaa               including uniprot, ncbigp, ncbigpnew
  gbnew               12/May/16, 480811 entries since 14/Apr/16 rel 213
  ncbigpnew           11/May/16, 2988151 entries since rel 213

     Type 'wossname keyword' to find a program
     Type 'showdb' to display available databases
     Type 'tfm programname' to display the program help
     Type 'programname -help' to list command-line options

     HELP! Helix Staff: 301-594-6248 or email: staff@hpc.nih.gov
*********************************************************************


[+] Loading repeatmasker 4.0.6 on helix.nih.gov
[user@helix ~]$ seqret
Read and write (return) sequences
Input (gapped) sequence(s): genbank:ay001401
Warning: ajBtreeIdentFetchHit called for cache '/spin1/db/embossdb/genbank.new/genbank.xid' with 1 reference files
output sequence(s) [ay001401.fasta]: 
[user@helix ~]$ repeatmasker -pa 4 ay001401.fasta 
RepeatMasker version open-4.0.6
Search Engine: HMMER [ 3.1b1 (May 2013) ]
Master RepeatMasker Database: /usr/local/apps/repeatmasker/4.0.6/Libraries/Dfam.hmm ( Complete Database: Dfam_2.0 )



analyzing file ay001401.fasta
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done
[user@helix ~]$
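
To see how much of a sequence was masked, the Ns in the masked output can be counted. The sketch below assumes RepeatMasker's usual behavior of writing the masked sequence next to the input with a .masked suffix (for the session above that would be ay001401.fasta.masked, which the transcript itself does not show) and the default masking with Ns. masked_fraction is a hypothetical helper name. Any Ns already present in the input are counted as well, so the result is an upper bound.

```shell
# Hypothetical helper: fraction of residues that are N in a FASTA file.
masked_fraction() {
    awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            total  += length($0)           # all residues on this line
            masked += gsub(/[Nn]/, "")     # gsub returns the number of Ns removed
        }
        END {
            if (total > 0) printf "%.3f\n", masked / total
            else print "0.000"
        }
    ' "$1"
}

# Demo on a throwaway file: 6 of 16 residues are N
tmp=$(mktemp)
printf '>s1\nACGTNNNN\nNNAA\n>s2\nACGT\n' > "$tmp"
masked_fraction "$tmp"   # prints 0.375
rm -f "$tmp"
```

On real output the call would look like masked_fraction ay001401.fasta.masked, under the file-name assumption stated above.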

Running a single RepeatMasker job on Biowulf

Set up a batch script along the following lines:

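#!/bin/bash
# file called myjob.bat

module load repeatmasker
cd /data/mydir || exit 1

# RepeatMasker starts 2 threads per -pa unit, so use half the allocated CPUs
par=$((SLURM_JOB_CPUS_PER_NODE / 2))
# -pa 1 behaves as if -pa were not given, so never go below 2
if [ "$par" -lt 2 ]
then
    par=2
fi

repeatmasker -pa "$par" myfile.fasta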

Note that this batch script automatically checks how many CPUs have been allocated (via $SLURM_JOB_CPUS_PER_NODE) and sets the -pa option to half that number. It will not set -pa below 2, because -pa 1 instructs RepeatMasker to start 2 threads for each CPU it finds in the node.

Submit this job with:

[user@biowulf ~]$ sbatch myjob.bat
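
Because the script derives -pa from the allocation, the number of CPUs requested at submission determines how parallel the run is. The sketch below works the rule in reverse: given a desired -pa value, it computes the CPUs to request (2 per -pa unit, with -pa never below 2). It assumes CPUs are requested with Slurm's standard --cpus-per-task option, which this page does not show; cpus_for_pa is a hypothetical helper name, and the command is printed rather than executed.

```shell
# Hypothetical helper: CPUs to request for a desired -pa value.
cpus_for_pa() {
    pa=$1
    # -pa 1 behaves as if -pa were absent, so treat 2 as the minimum
    if [ "$pa" -lt 2 ]; then
        pa=2
    fi
    # each -pa unit accounts for 2 threads
    echo $((pa * 2))
}

# Print (rather than run) the submission command for -pa 4
echo "sbatch --cpus-per-task=$(cpus_for_pa 4) myjob.bat"
```

One consequence of the rule is that the smallest allocation that is not oversubscribed is 4 CPUs, since -pa 2 already starts 4 threads.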

Running a swarm of RepeatMasker jobs on Biowulf

Set up a swarm command file containing one line for each of your RepeatMasker runs. Typically, only the input sequence name changes from line to line, as in the example below.

Sample swarm command file

# --------file myjobs.swarm----------
repeatmasker -pa 2 /data/username/file1.seq
repeatmasker -pa 2 /data/username/file2.seq
repeatmasker -pa 2 /data/username/file3.seq
....
repeatmasker -pa 2 /data/username/fileN.seq
-------------------------------------
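
For many input files, the command file can be generated rather than typed by hand. This is a minimal sketch: make_swarm_lines is a hypothetical helper name, and the paths are the placeholder paths from the sample above. It assumes file names contain no spaces, since each path is written unquoted into its command line.

```shell
# Hypothetical helper: print one repeatmasker command per input file.
# Usage: make_swarm_lines PA FILE...
make_swarm_lines() {
    pa=$1
    shift
    for f in "$@"; do
        printf 'repeatmasker -pa %s %s\n' "$pa" "$f"
    done
}

# Prints two command lines, one per file
make_swarm_lines 2 /data/username/file1.seq /data/username/file2.seq
```

In practice the output would be redirected into the command file, for example make_swarm_lines 2 /data/username/*.seq > myjobs.swarm.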

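Submit this set of runs to the batch system by typing the command below. The --threads-per-process 4 option allocates 4 CPUs to each line of the command file, matching the 4 threads that -pa 2 starts.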

[user@biowulf ~]$ swarm --threads-per-process 4 --module repeatmasker myjobs.swarm

For details on using swarm see Swarm on Biowulf.

Documentation

RepeatMasker is extensively documented. On the HPC systems, typing repeatmasker without arguments at the prompt prints a brief description of its options; typing repeatmasker --help gives more detailed information.

[user@biowulf ~]$ repeatmasker

[user@biowulf ~]$ repeatmasker --help

This information can also be found online.