FCS on Biowulf

FCS is a toolset to elimate contaminant sequences from a genome assembly. Currently, there are two tools in this toolset: fcs-adaptor and fcs-gx.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. From the FCS tutorials at the FCS Github site:

[user@biowulf]$ sinteractive --mem=3g --cpus-per-task=4 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4224 are ready for job

[user@cn4224 ~]$ module load fcs
[+] Loading fcs  0.4.0  on cn4285
[+] Loading singularity  3.10.5  on cn4285

[user@cn4244 ~]$ cd /lscratch/${SLURM_JOB_ID}
[user@cn4224 ~]$ cp ${FCS_TEST_DATA}/fcsadaptor_prok_test.fa.gz inputdir/.
[user@cn4224 ~]$ mkdir inputdir outputdir
[user@cn4224 ~]$ run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test.fa.gz --output-dir ./outputdir --prok --container-engine singularity --image ${FCS_HOME}/fcs-adaptor.sif
[WARN  tini (2647065)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
Output will be placed in: /output-volume
Executing the workflow
Resolved '/app/fcs/progs/ForeignContaminationScreening.cwl' to 'file:///app/fcs/progs/ForeignContaminationScreening.cwl'
[workflow ] start
[workflow ] starting step ValidateInputSequences
[step ValidateInputSequences] start


[job all_skipped_trims] completed success
[step all_skipped_trims] completed success
[workflow ] starting step all_cleaned_fasta
[step all_cleaned_fasta] start
[step all_cleaned_fasta] completed success
[workflow ] completed success

[user@cn4224 ~]$ cp ${FCS_TEST_DATA}/fcsgx_test.fa.gz .
[user@cn4224 ~]$ SOURCE_DB_MANIFEST="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only/test-only.manifest"
[user@cn4224 ~]$ LOCAL_DB=/lscratch/${SLURM_JOB_ID}/gxdb
[user@cn4224 ~]$ fcs.py db get --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/test-only"
[user@cn4224 ~]$ fcs.py db check --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
Source:      https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only
Destination: /app/db/gxdb
Resuming failed transfer in /app/db/gxdb...
Space check: Available:15.71GiB; Existing:0B; Incoming:4.29GiB; Delta:4.29GiB

Requires transfer: 56B test-only.meta.jsonl

Requires transfer: 5.92KiB test-only.taxa.tsv

Requires transfer: 21.33KiB test-only.seq_info.tsv.gz

Requires transfer: 7.85MiB test-only.blast_div.tsv.gz

Requires transfer: 67.56MiB test-only.gxs

Requires transfer: 4.21GiB test-only.gxi

[user@cn4224 ~]$ GXDB_LOC=/lscratch/${SLURM_JOB_ID}/gxdb
[user@cn4224 ~]$ fcs.py screen genome --fasta ./fcsgx_test.fa.gz --out-dir ./gx_out/ --gx-db "$GXDB_LOC/test-only"  --tax-id 6973

tax-id    : 6973
fasta     : /sample-volume/fcsgx_test.fa.gz
size      : 8.55 MiB
split-fa  : True
BLAST-div : roaches
gx-div    : anml:insects
w/same-tax: True
bin-dir   : /app/bin
gx-db     : /app/db/gxdb/test-only/test-only.gxi
gx-ver    : Mar 10 2023 15:34:33; git:v0.4.0-3-g8096f62
output    : /output-volume//fcsgx_test.fa.6973.taxonomy.rpt



fcs_gx_report.txt contamination summary:
                                seqs      bases
                               ----- ----------
TOTAL                            243   27170378
-----                          ----- ----------
prok:CFB group bacteria          243   27170378


fcs_gx_report.txt action summary:
                                seqs      bases
                               ----- ----------
TOTAL                            243   27170378
-----                          ----- ----------
EXCLUDE                          214   25795430
REVIEW                            29    1374948


[user@cn4224 ~]$ zcat fcsgx_test.fa.gz | fcs.py clean genome --action-report ./gx_out/fcsgx_test.fa.6973.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
Applied 214 actions; 25795430 bps dropped; 0 bps hardmasked.

[user@cn4224 ~]$ ls gx_out
fcsgx_test.fa.6973.fcs_gx_report.txt  fcsgx_test.fa.6973.taxonomy.rpt

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. fcsadaptor.sh) similar to the following.

#! /bin/bash

module load fcs
cd /lscratch/${SLURM_JOB_ID}
mkdir inputdir outputdir
cp ${FCS_TEST_DATA}/fcsadaptor_prok_test.fa.gz inputdir/.

run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test.fa.gz --output-dir ./outputdir --prok --container-engine singularity --image ${FCS_HOME}/fcs-adaptor.sif

cp -r outputdir /data/${USER}/.

Submit these jobs using the Slurm sbatch command.

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile to run fcs-adaptor (e.g. fcsadaptor.swarm). For example:

run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test1.fa.gz --output-dir ./outputdir1 --prok --container-engine singularity --image ${FCS_HOME}/fcs-adaptor.sif; cp -r outputdir1 /data/${USER}/. 
run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test2.fa.gz --output-dir ./outputdir2 --prok --container-engine singularity --image ${FCS_HOME}/fcs-adaptor.sif; cp -r outputdir2 /data/${USER}/.
run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test3.fa.gz --output-dir ./outputdir3 --prok --container-engine singularity --image ${FCS_HOME}/fcs-adaptor.sif; cp -r outputdir3 /data/${USER}/.

Submit this job using the swarm command.

swarm -f fcsadaptor.swarm [-g #] --module fcs
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
--module fcs Loads the fcs module for each subjob in the swarm