RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. The output of the program is a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence is currently masked by the program.
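Since masked bases are replaced by Ns by default, the masked share of a sequence can be estimated by counting Ns. The following is a minimal sketch, not part of RepeatMasker itself: the function name `masked_fraction` is hypothetical, and it assumes the masked sequence has been written to a FASTA file (RepeatMasker normally names it after the input with a `.masked` suffix, which is an assumption here). Ns already present in the input are counted as well.

```shell
#!/bin/sh
# Hypothetical helper: report the fraction of bases that are N in a FASTA file.
# Assumes hard masking (repeats replaced by N), which is the default.
masked_fraction() {
    awk '
        /^>/ { next }                  # skip FASTA header lines
        {
            total  += length($0)       # count all bases on this line
            masked += gsub(/N/, "")    # gsub returns the number of Ns replaced
        }
        END {
            if (total > 0) printf "%.3f\n", masked / total
            else print "0.000"
        }
    ' "$1"
}

# Example usage (file name is a placeholder):
#   masked_fraction sequence.fasta.masked
```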
famdb.py -i /fdb/dfam/current/Dfam.h5 names mammal
Allocate an interactive session and run the program.
Sample session:
[user@biowulf]$ sinteractive --cpus-per-task 4
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load repeatmasker
[user@cn3144 ~]$ RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sequence.fasta # for any sequence file sequence.fasta
RepeatMasker version open-4.0.7
Search Engine: HMMER [ 3.1b2 (February 2015) ]
Master RepeatMasker Database: /usr/local/apps/repeatmasker/4.0.7/Libraries/Dfam.hmm ( Complete Database: Dfam_2.0 )
analyzing file sequence.fasta
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output: cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 cycle 9 cycle 10
Generating output...
masking done
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. repeatmasker.sh). For example:
#!/bin/sh
set -e
module load repeatmasker
RepeatMasker -engine rmblast -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample.fasta
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=# [--mem=#] repeatmasker.sh
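The commands above pass `-pa` the allocated CPU count divided by 4, which suggests that each parallel job is expected to occupy about four CPUs; that ratio is taken from the commands as given and is not verified here. Because `expr` performs integer division, the result is 0 when fewer than 4 CPUs are requested, which is presumably not intended. The sketch below shows one way to guard against that. The function name `pa_from_cpus` and the fallback to one job are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive a -pa value from the CPUs allocated by Slurm.
# Uses the first argument if given, else $SLURM_CPUS_PER_TASK, else 4.
pa_from_cpus() {
    cpus=${1:-${SLURM_CPUS_PER_TASK:-4}}
    pa=$((cpus / 4))
    # Integer division yields 0 for fewer than 4 CPUs; fall back to 1 job.
    if [ "$pa" -lt 1 ]; then
        pa=1
    fi
    echo "$pa"
}

# Print the command that would be run for an 8-CPU allocation (not executed).
echo "RepeatMasker -engine rmblast -pa $(pa_from_cpus 8) sample.fasta"
```

In a batch script, `$(pa_from_cpus)` could replace the `$(expr $SLURM_CPUS_PER_TASK / 4)` expression, so the same script also works for small allocations.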
Create a swarmfile (e.g. repeatmasker.swarm). For example:
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample1.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample2.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample3.fasta
RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) sample4.fasta
Submit this job using the swarm command.
swarm -f repeatmasker.swarm [-g #] -t # --module repeatmasker
where
-g # | Number of gigabytes of memory required for each process (1 line in the swarm command file). |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module repeatmasker | Loads the repeatmasker module for each subjob in the swarm. |
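With many input files, a swarmfile in the format shown earlier can be generated instead of written by hand. The following is a minimal sketch under stated assumptions: the function name `make_swarm_lines` is hypothetical, file names are assumed to contain no spaces, and each line reuses the same `-pa` expression as the example swarmfile.

```shell
#!/bin/sh
# Hypothetical helper: print one swarm command line per FASTA file.
# The single-quoted format string keeps $SLURM_CPUS_PER_TASK unexpanded,
# so it is evaluated later inside each subjob rather than when generating.
make_swarm_lines() {
    for fasta in "$@"; do
        printf 'RepeatMasker -pa $(expr $SLURM_CPUS_PER_TASK / 4) %s\n' "$fasta"
    done
}

# Print swarm lines for two placeholder file names.
make_swarm_lines sample1.fasta sample2.fasta
```

Redirecting the output to a file, for example `make_swarm_lines *.fasta > repeatmasker.swarm`, would give a swarmfile that can be submitted with the swarm command shown above.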