High-Performance Computing at the NIH

Population-scale detection of novel sequence insertions

A method for discovering and genotyping novel sequence insertions. PopIns takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions.

Web site

PopIns On Helix

Test data have been copied to /usr/local/apps/popins/05b1dd1/testdata. In this example, a user copies these data to their own data directory and then runs an example popins command. User input follows the shell prompt.

[user@helix ~]$ mkdir /data/$USER/popins_test

[user@helix ~]$ cd /data/$USER/popins_test

[user@helix popins_test]$ cp /usr/local/apps/popins/05b1dd1/testdata/* .

[user@helix popins_test]$ module load popins
[+] Loading popins 05b1dd1 on helix.nih.gov

[user@helix popins_test]$ popins --help
PopIns - population-scale detection of novel sequence insertions

    popins COMMAND [OPTIONS]

    assemble   Crop unmapped reads from a bam file and assemble them.
    merge      Merge contigs from assemblies of unmapped reads into supercontigs.
    contigmap  Map unmapped reads to (super-)contigs.
    place      Find position of (super-)contigs in the reference genome.
    genotype   Genotype insertions for an individual.

    damp_v1-32-g05b1dd1, Date: Fri Sep 11 11:14:43 2015 +0000

Try `popins COMMAND --help' for more information on each command.

[user@helix popins_test]$ popins assemble --help
popins assemble - Assembly of unmapped reads.

    popins assemble [OPTIONS] BAM FILE

    Finds the unmapped reads in a bam files. If a fasta file is specified, the unmapped reads will first be remapped
    to this reference using bwa and only reads that remain unmapped are further processed. All unmapped reads are
    quality filtered using sickle and passed to assembly with velvet.

    -h, --help
          Displays this help message.
    --version
          Display version information
    -d, --directory PATH
          Path to working directory. Default: current directory.
    -tmp, --tmpdir PATH
          Path to a temporary directory ending with XXXXXX. Default: same as working directory.
    -k, --kmerLength INT
          The k-mer size for velvet assembly. Default: 47.
    -a, --adapters STR
          Enable adapter removal for Illumina reads. Default: no adapter removal. One of HiSeq and HiSeqX.
    -r, --reference FILE
          Fasta file with reference sequences for remapping. Default: no remapping. Valid filetypes are: fa, fna, and
    -f, --filter INT
          Consider reads with low quality alignments as unmapped only for first INT sequences in the reference file.
          Requires reference file for remapping to be set.
    -t, --threads INT
          Number of threads to use for bwa. In range [1..inf]. Default: 1.
    -m, --memory STR
          Maximum memory for samtools sort. Default: 500000000.

    popins assemble version: damp_v1-32-g05b1dd1
    Last update Fri Sep 11 11:14:43 2015 +0000
[user@helix popins_test]$ popins assemble --directory . --tmpdir tmp_XXXXXX --threads 8 test.bam > popins.log 2>&1
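
Each of the other subcommands listed by popins --help has its own options, which this page does not reproduce. As a convenience, the sketch below saves the help text of every subcommand to a file for later reading. It uses only the popins COMMAND --help form shown above. The function name save_popins_help, the output file names, and the check that popins is on the PATH are illustrative assumptions, not part of PopIns.

```shell
#!/bin/bash
# Subcommands in the order listed by `popins --help`.
POPINS_COMMANDS="assemble merge contigmap place genotype"

# save_popins_help is a hypothetical helper: it writes
# popins_<command>_help.txt into the directory given as $1 (default: .).
save_popins_help() {
    outdir=${1:-.}
    # Stop early if the popins module has not been loaded.
    if ! command -v popins >/dev/null 2>&1; then
        echo "popins not found; run 'module load popins' first" >&2
        return 1
    fi
    for cmd in $POPINS_COMMANDS; do
        popins "$cmd" --help > "$outdir/popins_${cmd}_help.txt" 2>&1
    done
}
```

After module load popins, calling save_popins_help /data/$USER/popins_test would leave one help file per subcommand in the test directory.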

Running a single PopIns job on Biowulf

Set up a batch script along the following lines:

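#!/bin/bash
# file called myjob.bat

module load popins
# Run from node-local scratch (allocated with --gres=lscratch) so temporary files land there
cd /lscratch/$SLURM_JOBID || exit 1
popins assemble --directory /data/$USER/data_dir --tmpdir tmp_XXXXXX --threads 8 /data/$USER/data_dir/test.bam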

Submit this job with:

[user@biowulf ~]$ sbatch --cpus-per-task=8 --gres=lscratch:5 myjob.bat

Set --cpus-per-task=N so that N equals the --threads argument in the job script. The generic resource lscratch (--gres=lscratch:5, i.e. 5 GB) provides node-local temporary disk space under /lscratch/$SLURM_JOBID. For more information on submitting jobs to Slurm, see Job Submission in the Biowulf User Guide.
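
To keep --cpus-per-task and --threads in step, the thread count can be read from the Slurm allocation instead of being typed twice. The following is a minimal sketch, not part of the original instructions. It relies on the standard Slurm environment variable SLURM_CPUS_PER_TASK, which Slurm sets when the job is submitted with --cpus-per-task. The helper name popins_threads is hypothetical.

```shell
#!/bin/bash
# Sketch: take --threads from the Slurm allocation.
# popins_threads is a hypothetical helper, not part of PopIns or Slurm.
popins_threads() {
    # SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
    # fall back to 1 (the popins default) outside of such a job.
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Print the command the batch script would run; drop the leading echo to run it.
echo popins assemble --directory "/data/$USER/data_dir" --tmpdir tmp_XXXXXX \
    --threads "$(popins_threads)" "/data/$USER/data_dir/test.bam"
```

With this in the batch script, changing --cpus-per-task on the sbatch command line is enough; the --threads value follows automatically.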

Running a swarm of PopIns jobs on Biowulf

Sample swarm command file

# --------file myjobs.swarm----------
popins assemble --directory /data/$USER/data_dir --tmpdir tmp_XXXXXX --threads 8 /data/$USER/data_dir/file1.bam
popins assemble --directory /data/$USER/data_dir --tmpdir tmp_XXXXXX --threads 8 /data/$USER/data_dir/file2.bam
popins assemble --directory /data/$USER/data_dir --tmpdir tmp_XXXXXX --threads 8 /data/$USER/data_dir/file3.bam
popins assemble --directory /data/$USER/data_dir --tmpdir tmp_XXXXXX --threads 8 /data/$USER/data_dir/fileN.bam
# -----------------------------------

Submit this set of runs to the batch system by typing:

[user@biowulf ~]$ swarm --module popins --threads-per-process 8 --gres lscratch:5 -f myjobs.swarm

For details on using swarm see Swarm on Biowulf.
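
For many BAM files, a swarm command file like the sample above can be generated rather than written by hand. The sketch below prints one popins assemble line per .bam file in a directory, using the same options as the sample. The function name make_popins_swarm and its arguments are hypothetical, and the sketch assumes file paths contain no spaces.

```shell
#!/bin/bash
# Sketch: print one popins assemble line per BAM file in a directory.
# make_popins_swarm is a hypothetical helper: $1 = BAM directory, $2 = threads (default 8).
make_popins_swarm() {
    bam_dir=$1
    threads=${2:-8}
    for bam in "$bam_dir"/*.bam; do
        # Skip the unexpanded pattern when the directory holds no BAM files.
        [ -e "$bam" ] || continue
        echo "popins assemble --directory $bam_dir --tmpdir tmp_XXXXXX --threads $threads $bam"
    done
}
```

For example, make_popins_swarm /data/$USER/data_dir 8 > myjobs.swarm writes the command file; the threads value passed here should match --threads-per-process on the swarm command line.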


To read the help doc, type popins --help. You can get help for specific commands with popins COMMAND --help. See also: