pychopper on Biowulf
Pychopper is used to identify, orient and trim full-length Nanopore cDNA reads. The tool is also able to rescue fused reads.
Documentation
- pychopper on GitHub
Important Notes
- Module Name: pychopper (see the modules page for more information)
- pychopper can use multiple CPUs. Please match your allocation with the number of threads used (-t).
- pychopper cannot read compressed fastq files natively and under some common conditions reads the fastq file twice. Therefore it's best to unpack fastq files into lscratch. See example below.
- Example files in $PYCHOPPER_TEST_DATA
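The two notes above translate into a short pattern: decompress once into lscratch, then run cdna_classifier.py with -t set to the allocated CPUs. A minimal sketch, assuming a hypothetical input file my_reads.fastq.gz in your data directory (the full sessions below use the bundled test data):

module load pychopper
cd /lscratch/$SLURM_JOB_ID
zcat /data/$USER/my_reads.fastq.gz > input.fastq      # unpack once into node-local scratch
cdna_classifier.py -t $SLURM_CPUS_PER_TASK input.fastq - \
    | gzip -c > /data/$USER/full_length.fastq.gz      # match -t to the CPU allocation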
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=10g --cpus-per-task=6 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load pychopper
[user@cn3144]$ zcat $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz > input.fastq
[user@cn3144]$ cdna_classifier.py -r report.pdf -u unclassified.fastq -t $SLURM_CPUS_PER_TASK \
                 -w rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/full_length.fastq.gz
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Total fastq records in input file: 25000
Tuning the cutoff parameter (q) on 9834 sampled reads (40.0%). Optimizing over 30 cutoff values.
100%|██████████████████████████████████| 30/30 [04:53<00:00,  9.79s/it]
Best cutoff (q) value is 1.0345 with 92% of the reads classified.
Processing the whole dataset using a batch size of 3125:
100%|██████████████████████████████████| 25000/25000 [00:28<00:00, 884.24it/s]
Copy the rescued and unclassified reads and the report out of lscratch if you need them before ending the session.
[user@cn3144]$ gzip -c rescued.fastq > /data/$USER/temp/rescued.fastq.gz
[user@cn3144]$ gzip -c unclassified.fastq > /data/$USER/temp/unclassified.fastq.gz
[user@cn3144]$ mv report.pdf /data/$USER/temp
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
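As an optional sanity check (a hypothetical step, not part of the documented workflow), the number of classified reads in the compressed output can be counted from the login node:

[user@biowulf]$ zcat /data/$USER/temp/full_length.fastq.gz | awk 'END {print NR/4, "full-length reads"}'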
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. pychopper.sh). For example:
#!/bin/bash
module load pychopper/2.0.3 || exit 1
cd /lscratch/$SLURM_JOB_ID
zcat $PYCHOPPER_TEST_DATA/SIRV_E0_pcs109_25k.fq.gz > input.fastq
cdna_classifier.py -r report.pdf -u unclassified.fastq -t $SLURM_CPUS_PER_TASK \
    -w rescued.fastq input.fastq - | gzip -c > /data/$USER/temp/full_length.fastq.gz
gzip -c rescued.fastq > /data/$USER/temp/rescued.fastq.gz
gzip -c unclassified.fastq > /data/$USER/temp/unclassified.fastq.gz
mv report.pdf /data/$USER/temp
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=6 --mem=10g --gres=lscratch:10 pychopper.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. pychopper.swarm). For example:
zcat input1.fastq.gz > /lscratch/$SLURM_JOB_ID/input1.fastq && \
  cdna_classifier.py -t $SLURM_CPUS_PER_TASK /lscratch/$SLURM_JOB_ID/input1.fastq - \
  | gzip -c > /data/$USER/temp/full_length1.fastq.gz
zcat input2.fastq.gz > /lscratch/$SLURM_JOB_ID/input2.fastq && \
  cdna_classifier.py -t $SLURM_CPUS_PER_TASK /lscratch/$SLURM_JOB_ID/input2.fastq - \
  | gzip -c > /data/$USER/temp/full_length2.fastq.gz
Submit this job using the swarm command.
swarm -f pychopper.swarm -g 10 -t 6 --gres=lscratch:10 --module pychopper/2.0.3

where
-g #               | Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #               | Number of threads/CPUs required for each process (1 line in the swarm command file)
--module pychopper | Loads the pychopper module for each subjob in the swarm