pychopper on Biowulf

Pychopper is used to identify, orient and trim full-length Nanopore cDNA reads. The tool is also able to rescue fused reads.
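A single pychopper run reads a FASTQ file of Nanopore cDNA reads and writes the trimmed, correctly oriented full-length reads, with optional side outputs for rescued (fused) and unclassified reads and a PDF report. A minimal sketch (file names are placeholders; the primer kit defaults to PCS109 in the version used below):

pychopper -r report.pdf -u unclassified.fastq -w rescued.fastq reads.fastq full_length.fastq

The Biowulf sessions below wrap this same call in sinteractive, sbatch, and swarm jobs.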

Documentation
Important Notes
- Module Name: pychopper (see the modules page for more information)
- pychopper is multithreaded; set -t to match the number of allocated CPUs
- Example data is available in $PYCHOPPER_TEST_DATA

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=10g --cpus-per-task=6 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load pychopper
[user@cn3144]$ zcat $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz > input.fastq
[user@cn3144]$ ## pychopper was called cdna_classifier.py in versions ≤ 2.4.0
[user@cn3144]$ pychopper -r report.pdf -u unclassified.fastq -t $SLURM_CPUS_PER_TASK \
                    -w rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/full_length.fastq.gz
Using kit: PCS109
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Total fastq records in input file: 25000
Tuning the cutoff parameter (q) on 9465 sampled reads (40.0%) passing quality filters (Q ≥ 7.0).
Optimizing over 30 cutoff values.
100%|████████████████████████████████████████████████████████| 30/30
Best cutoff (q) value is 0.3448 with 88% of the reads classified.
Processing the whole dataset using a batch size of 4166:
 94%|██████████████████████████████████████████████████      | 23614/25000
Finished processing file: input.fastq
Input reads failing mean quality filter (Q < 7.0): 1386 (5.54%)
Output fragments failing length filter (length < 50): 0
-----------------------------------
Reads with two primers: 86.93%
Rescued reads:          3.16%
Unusable reads:         9.91%
-----------------------------------
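A quick optional sanity check: each FASTQ record is four lines, so the number of full-length reads already written to /data can be counted and compared with the percentages reported above (a sketch):

[user@cn3144]$ echo $(( $(zcat /data/$USER/temp/full_length.fastq.gz | wc -l) / 4 ))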

Move the rescued and unclassified reads and the report out of lscratch if you need them before ending the session.

[user@cn3144]$ gzip -c rescued.fastq > /data/$USER/temp/rescued.fastq.gz
[user@cn3144]$ gzip -c unclassified.fastq > /data/$USER/temp/unclassified.fastq.gz
[user@cn3144]$ mv report.pdf /data/$USER/temp
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. pychopper.sh) that processes the test data set. For example:

#!/bin/bash
module load pychopper/2.7.1 || exit 1
cd /lscratch/$SLURM_JOB_ID
zcat $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz > input.fastq
## use cdna_classifier.py instead of pychopper for versions ≤ 2.4.0
pychopper -r report.pdf -u unclassified.fastq -t $SLURM_CPUS_PER_TASK \
    -w rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/full_length.fastq.gz
gzip -c rescued.fastq > /data/$USER/temp/rescued.fastq.gz
gzip -c unclassified.fastq > /data/$USER/temp/unclassified.fastq.gz
mv report.pdf /data/$USER/temp

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=10g --gres=lscratch:10 pychopper.sh
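The script above hard-codes the test data set. For a reusable version, a sketch like the following (hypothetical script name pychopper_run.sh) takes the gzipped input FASTQ and an output prefix as arguments passed through sbatch:

#!/bin/bash
# sketch of a parameterized batch script (hypothetical): $1 = gzipped input FASTQ, $2 = output prefix
module load pychopper/2.7.1 || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1
zcat "$1" > input.fastq
pychopper -r "$2"_report.pdf -u "$2"_unclassified.fastq -t $SLURM_CPUS_PER_TASK \
    -w "$2"_rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/"$2"_full_length.fastq.gz
gzip -c "$2"_rescued.fastq > /data/$USER/temp/"$2"_rescued.fastq.gz
gzip -c "$2"_unclassified.fastq > /data/$USER/temp/"$2"_unclassified.fastq.gz
mv "$2"_report.pdf /data/$USER/temp

which could then be submitted as, for example:

sbatch --cpus-per-task=6 --mem=10g --gres=lscratch:10 pychopper_run.sh /data/$USER/myreads.fastq.gz mysample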
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. pychopper.swarm). For example:

zcat input1.fastq.gz > /lscratch/$SLURM_JOB_ID/input1.fastq && \
  pychopper -t $SLURM_CPUS_PER_TASK /lscratch/$SLURM_JOB_ID/input1.fastq - \
  | gzip -c > /data/$USER/temp/full_length1.fastq.gz
zcat input2.fastq.gz > /lscratch/$SLURM_JOB_ID/input2.fastq && \
  pychopper -t $SLURM_CPUS_PER_TASK /lscratch/$SLURM_JOB_ID/input2.fastq - \
  | gzip -c > /data/$USER/temp/full_length2.fastq.gz

Submit this job using the swarm command.

swarm -f pychopper.swarm -g 10 -t 6 --gres=lscratch:10 --module pychopper/2.7.1
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file)
--gres=lscratch:# Number of Gigabytes of local scratch space allocated for each process (1 line in the swarm command file)
--module pychopper Loads the pychopper module for each subjob in the swarm
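For many samples the swarm file itself can be generated with a short shell loop. A sketch, assuming the inputs are gzipped FASTQ files named *.fastq.gz in the current directory (output locations are placeholders):

for f in *.fastq.gz; do
    n=$(basename "$f" .fastq.gz)
    echo "zcat $PWD/$f > /lscratch/\$SLURM_JOB_ID/${n}.fastq && pychopper -t \$SLURM_CPUS_PER_TASK /lscratch/\$SLURM_JOB_ID/${n}.fastq - | gzip -c > /data/\$USER/temp/${n}_full_length.fastq.gz"
done > pychopper.swarm

Each line of the resulting pychopper.swarm is one independent subjob and can be submitted with the swarm command shown above.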