Biowulf High Performance Computing at the NIH
pychopper on Biowulf

Pychopper is used to identify, orient and trim full-length Nanopore cDNA reads. The tool is also able to rescue fused reads.
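
In the sessions below, `cdna_classifier.py` writes the classified full-length reads to standard output when its positional output argument is `-`, so the stream can be compressed on the fly. A minimal sketch of that streaming pattern, with `cat` standing in for `cdna_classifier.py` so the sketch runs anywhere:

```shell
# Tiny stand-in FASTQ (2 records); on Biowulf the real input would come
# from e.g. $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz
printf '@read1\nACGT\n+\n!!!!\n@read2\nTTGG\n+\n!!!!\n' > input.fastq

# 'cat' stands in here for a call like:
#   cdna_classifier.py -w rescued.fastq input.fastq -
# The '-' sends full-length reads to stdout, which gzip compresses on the fly.
cat input.fastq | gzip -c > full_length.fastq.gz

# The compressed output round-trips to the original stream
zcat full_length.fastq.gz | head -1    # prints @read1
```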

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=10g --cpus-per-task=6 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load pychopper
[user@cn3144]$ zcat $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz > input.fastq
[user@cn3144]$ cdna_classifier.py -r report.pdf -u unclassified.fastq -t $SLURM_CPUS_PER_TASK \
                    -w rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/full_length.fastq.gz
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Total fastq records in input file: 25000
Tuning the cutoff parameter (q) on 9834 sampled reads (40.0%).
Optimizing over 30 cutoff values.
100%|██████████████████████████████████| 30/30 [04:53<00:00,  9.79s/it]
Best cutoff (q) value is 1.0345 with 92% of the reads classified.
Processing the whole dataset using a batch size of 3125:
100%|██████████████████████████████████| 25000/25000 [00:28<00:00, 884.24it/s]

Move the rescued and unclassified reads and the report to /data if you need them before ending the session.

[user@cn3144]$ gzip -c rescued.fastq > /data/$USER/temp/rescued.fastq.gz
[user@cn3144]$ gzip -c unclassified.fastq > /data/$USER/temp/unclassified.fastq.gz
[user@cn3144]$ mv report.pdf /data/$USER/temp
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
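
Since FASTQ stores each record on four lines, counting records in the gzipped outputs is a quick sanity check that the full-length, rescued, and unclassified reads account for all input reads. Shown here on a tiny synthetic file; substitute your real output files:

```shell
# Tiny synthetic gzipped FASTQ with 3 records (4 lines per record)
printf '@r1\nACGT\n+\n!!!!\n@r2\nTTAA\n+\n!!!!\n@r3\nGGCC\n+\n!!!!\n' \
    | gzip -c > demo.fastq.gz

# Count records: total line count divided by 4
zcat demo.fastq.gz | awk 'END {print NR/4}'    # prints 3
```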

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. pychopper.sh). For example:

#!/bin/bash
module load pychopper/2.0.3 || exit 1
cd /lscratch/$SLURM_JOB_ID
zcat $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz > input.fastq
cdna_classifier.py -r report.pdf -u unclassified.fastq -t $SLURM_CPUS_PER_TASK \
    -w rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/full_length.fastq.gz
gzip -c rescued.fastq > /data/$USER/temp/rescued.fastq.gz
gzip -c unclassified.fastq > /data/$USER/temp/unclassified.fastq.gz
mv report.pdf /data/$USER/temp

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=10g --gres=lscratch:10 pychopper.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. pychopper.swarm). For example:

zcat input1.fastq.gz > /lscratch/$SLURM_JOB_ID/input1.fastq && \
  cdna_classifier.py -t $SLURM_CPUS_PER_TASK /lscratch/$SLURM_JOB_ID/input1.fastq - \
  | gzip -c > /data/$USER/temp/full_length1.fastq.gz
zcat input2.fastq.gz > /lscratch/$SLURM_JOB_ID/input2.fastq && \
  cdna_classifier.py -t $SLURM_CPUS_PER_TASK /lscratch/$SLURM_JOB_ID/input2.fastq - \
  | gzip -c > /data/$USER/temp/full_length2.fastq.gz

Submit this job using the swarm command.

swarm -f pychopper.swarm -g 10 -t 6 --gres=lscratch:10 --module pychopper/2.0.3
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file)
--gres=lscratch:# Number of Gigabytes of local scratch space allocated for each process
--module pychopper/2.0.3 Loads the pychopper module for each subjob in the swarm