Bioawk extends awk with support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and an command line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.
There may be multiple versions of bioawk available. An easy way of selecting the version is to use modules. To see the modules available, type
module avail bioawk
To select a module use
module load bioawk/[version]
where [version]
is the version of choice.
$PATH
$BIOAWK_TEST_DATA
Allocate an interactive session with sinteractive and use as described below
biowulf$ sinteractive node$ module load bioawk samtools node$ # what formats are supported and what automatic variables are created for each of them? node$ bioawk -c help bed: 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts sam: 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual vcf: 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info gff: 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute fastx: 1:name 2:seq 3:qual 4:comment node$ # create fasta from bam file. Reverse complement the read sequence if the node$ # read aligned to the minus strand node$ samtools view $BIOAWK_TEST_DATA/aln.bam \ | bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}' \ | head >D00777:83:C8R9PACXX:3:2111:6151:30663 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACACTGACGCGAACGGAAAGAGTAA >D00777:83:C8R9PACXX:3:2111:6151:30663 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACACTGACGCGAACGGAAAGAGTAA >D00777:83:C8R9PACXX:3:2111:6151:30663 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACACTGACGCGAACGGAAAGAGTAA >D00777:83:C8R9PACXX:3:2111:6151:30663 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACACTGACGCGAACGGAAAGAGTAA >D00777:83:C8R9PACXX:3:2111:6151:30663 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACACTGACGCGAACGGAAAGAGTAA node$ samtools view $BIOAWK_TEST_DATA/aln.bam | bioawk -c sam '$seq ~ /^ATG/ {printf("%s:%i\t%s\n", $rname, $pos, $seq)}' chr1:14686 ATGGAGCCCCCTACGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCT chr1:17535 ATGCCCTGGGTCCCCACTAAGCCAGGCCGGGCCTCCCGCCCACACCCCTCGGCCCTGCCCTCTGGCCATACAGGTTCTCGGTGGTGTTGAAGAGCAGCAAG chr1:17535 ATGCCCTGGGTCCCCACTAAGCCAGGCCGGGCCTCCCGCCCACACCCCTCGGCCCTGCCCTCTGGCCATACAGGTTCTCGGTGGTGTTGAAGAGCAGCAAG node$ exit biowulf$
Create a batch script similar to the following example:
#! /bin/bash # this file is bioawk.batch module load bioawk || exit 1 bioawk -c gff '$feature == "exon" and (end - start) < 100' refseq.gff
Submit to the queue with sbatch:
biowulf$ sbatch --time=10 bioawk.batch
Create a swarm command file similar to the following example:
# this file is bioawk.swarm bioawk -c fastx '{print ">" $name; print revcomp($seq)}' seq1.fa.gz | gzip -c > seq1.rc.fa.gz bioawk -c fastx '{print ">" $name; print revcomp($seq)}' seq2.fa.gz | gzip -c > seq2.rc.fa.gz bioawk -c fastx '{print ">" $name; print revcomp($seq)}' seq3.fa.gz | gzip -c > seq3.rc.fa.gz
And submit to the queue with swarm
biowulf$ swarm -f bioawk.swarm --module bioawk --time 10