High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
SoftSearch on Biowulf & Helix

SoftSearch is used for discovering structural variant breakpoints in Illumina paired-end next-generation sequencing data. SoftSearch combines multiple strategies for detecting SV including split-read, discordant read-pair, and unmated pairs. Co-localized split-reads and discordant read pairs are used to refine the breakpoints.

Reference: SoftSearch: Integration of Multiple Sequence Features to Identify Breakpoints of Structural Variations. Steven N. Hart, Vivekananda Sarangi, Raymond Moore, Saurabh Baheti, Jaysheel D. Bhavsar, Fergus J. Couch, Jean-Pierre A. Kocher. PloS One. (2013).

There may be multiple versions of SoftSearch available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail SoftSearch

To select a module use

module load SoftSearch/[version]

where [version] is the version of choice.

Note that most genome reference data is available in /fdb/igenomes/

On Helix

Sample session:

[user@helix ~]$ module load SoftSearch
[+] Loading samtools 1.3.1 ...
[+] Loading SoftSearch 2_2 ...

[user@helix ~]$ SoftSearch -b KM12.bam -f /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
Start Time : 2016-05-25 15:18:40
Writing results to 6_42VYGAAXX.vcf
Usage = SoftSearch.pl -l 5 -q 20 -r 5 -d 300 -m 5 -s 6 -c  -b KM12.bam -f /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o 6_42VYGAAXX.vcf

The mean insert size is 131.332 +/- 61.5167 (sd)
The upper limit = 500
The lower limit = -237

Making SAM file of soft-clipped reads
Identifying soft-clipped regions that are at least 5 bp long
Calling sides of soft-clips
Making BAM file that has discordant read pairs

Note that SoftSearch can produce fairly large temp files of a few GB each. You may want to use the '-t no' flag to Softsearch to clear up these temp files.

Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is SoftSearch.sh

module load SoftSearch
cd /data/$USER/mydir

SoftSearch.pl -b KM12.bam -f /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa

And submit to the queue with sbatch:

biowulf$ sbatch --mem=5g SoftSearch.sh
Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

# this file is SoftSearch.swarm
SoftSearch.pl -b /data/CloudData/sean/bam/KM12.bam -f /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
SoftSearch.pl -b /data/CloudData/sean/bam/2_7035GAAXX.bam -f /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
SoftSearch.pl -b /data/CloudData/sean/bam/UO-31.bam -f /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa

And submit to the queue with swarm

biowulf$ swarm -g 5 -f SoftSearch.swarm --module SoftSearch/1.0

Softsearch website

Available Perl scripts in the SoftSearch package include:

Typing the name of a SoftSearch perl script with no arguments will get you a Usage line. e.g.
helix% SoftSearch.pl

usage: SoftSearch.pl [-cqlrmsd] -b <BAM> -f <Genome.fa>
-q	Minimum mapping quality [20]
-l	Minimum length of soft-clipped segment [5]
-r	Minimum depth of soft-clipped reads at position [5]
-m	Minimum number of discordant read pairs [5]
-s	Number of sd away from mean to be considered discordant [6]
-u	Number of unmated pairs [0]
-d	Minimum distance between soft-clipped segments and
	discordant read pairs [300]
-o	Output file name [output.vcf]
-t	Print temp files for debugging [no|yes]
-c	use only this chrom
-p	use paired-end mode only [no|yes]
-g	disable paired-only seach [no|yes]

-a	set the minimum distance for a discordant read pair without soft-clipping info [10000]
-e	disable strict quality filtering of base qualities in soft-clipped reads [no]