High-Performance Computing at the NIH
FastQC on Helix & Biowulf

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

FastQC can import data from FastQ, BAM, or SAM files, provide a quick PASS/WARN/FAIL overview of which analysis modules found problems, and export the results as a permanent HTML report.

FastQC is developed by Simon Andrews, Babraham Bioinformatics.

Make sure X-Windows (X11 forwarding) is enabled when connecting to Helix if you want to use the FastQC graphical interface.

Running on Helix

Sample session:

$ module load fastqc
$ fastqc -o output_dir [-f fastq|bam|sam] -c contaminant_file seqfile1 .. seqfileN
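For each input file, FastQC writes an HTML report and a zip archive (e.g. seqfile1_fastqc.html and seqfile1_fastqc.zip) into the output directory. Inside the archive, summary.txt holds one PASS/WARN/FAIL line per analysis module, which is convenient for scripted checks. A minimal sketch, using a toy summary.txt standing in for real FastQC output:

```shell
# summary.txt format: STATUS<TAB>Module name<TAB>Input filename.
# The file below is toy data for illustration only; FastQC generates
# the real one inside the <input>_fastqc.zip archive.
mkdir -p demo_fastqc
printf 'PASS\tBasic Statistics\tsample.fastq.gz\nWARN\tPer base GC content\tsample.fastq.gz\nFAIL\tOverrepresented sequences\tsample.fastq.gz\n' > demo_fastqc/summary.txt

# List any modules that failed QC:
grep '^FAIL' demo_fastqc/summary.txt
```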

Batch job on Biowulf

First, create a batch script along the following lines:

#!/bin/bash

cd /data/$USER/mydir

module load fastqc
fastqc -o output_dir [-f fastq|bam|sam] -c contaminant_file seqfile1 .. seqfileN

Then submit this batch file to the cluster:

$ sbatch myscript
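A filled-in version of the template above might look like the following, written out with a heredoc so the script is created in one step. The directory name mydir, the output directory fastqc_out, and the *.fastq.gz glob are placeholders to adapt to your own data:

```shell
# Write a concrete FastQC batch script (all paths are hypothetical placeholders)
cat > myscript <<'EOF'
#!/bin/bash
set -e                      # stop on the first error
cd /data/$USER/mydir
module load fastqc
mkdir -p fastqc_out         # fastqc requires the output directory to exist
fastqc -o fastqc_out *.fastq.gz
EOF
```

Then submit it with sbatch myscript as shown above.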

The default memory is 4 GB. Users may request more memory using the --mem flag:

$ sbatch --mem=10g myscript


Running an interactive job

Users sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node; instead, allocate an interactive node as described below and run the job there.

biowulf$ sinteractive
salloc.exe: Granted job allocation 176863

[user@cnXXX]$ module load fastqc
[user@cnXXX]$ cd /data/$USER/fastqc/run1
[user@cnXXX]$ fastqc -o output_dir [-f fastq|bam|sam] -c contaminant_file seqfile1 .. seqfileN

[user@cnXXX]$ exit
salloc.exe: Relinquishing job allocation 176863
salloc.exe: Job allocation 176863 has been revoked.

Running a swarm job

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file:

cd /data/user/somedir1; fastqc -o output_dir [-f fastq|bam|sam] -c contaminant_file seqfile1 .. seqfileN
cd /data/user/somedir2; fastqc -o output_dir [-f fastq|bam|sam] -c contaminant_file seqfile1 .. seqfileN
cd /data/user/somedir3; fastqc -o output_dir [-f fastq|bam|sam] -c contaminant_file seqfile1 .. seqfileN
[...]
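Rather than writing each line by hand, the command file can be generated with a short loop. A sketch, assuming one sample per directory; the directory names somedir1..somedir3 and the *.fastq.gz glob are placeholders:

```shell
# Emit one fastqc command per sample directory into the swarm command file
# (directory names are hypothetical placeholders)
for d in somedir1 somedir2 somedir3; do
    echo "cd /data/user/$d; fastqc -o output_dir *.fastq.gz"
done > cmdfile
```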

Submit this job with:

$ swarm -f cmdfile --module fastqc

By default, each line of the swarm command file is executed with 1 core and 1.5 GB of memory. If a command needs more, specify the required memory with the -g # flag, where # is the number of gigabytes required. For example, if each command requires 10 GB of memory, submit with:

$ swarm -g 10 -f cmdfile --module fastqc

For more information regarding running swarm, see swarm.html

Documentation

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/