xTea (x-Transposable element analyzer) is a tool for identifying transposable element (TE) insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. xTea outperforms other short-read-based methods for both germline and somatic TE insertion discovery.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=8g -c8 --gres=lscratch:10
[user@cn3335 ~]$ module load xTea
[+] Loading singularity  3.10.5  on cn3335
[+] Loading xTea  0.1.9
[user@cn3335 ~]$ xtea -h
Usage: xtea [options]

Options:
  -h, --help            show this help message and exit
  -D, --decompress      Decompress the rep lib and reference file
  -M, --mosaic          Calling mosaic events from high coverage data
  -C, --case_control    Run in case control mode
  --denovo              Run in de novo mode
  -U, --user            Use user specific parameters instead of automatically
                        calculated ones
  --force               Force to start from the very beginning
  --hard                This is hard-cut for fitering out coverage abnormal
                        candidates
  --tumor               Working on tumor samples
  --purity=PURITY       Tumor purity
  --lsf                 Indiates submit to LSF system
  --slurm               Indiates submit to slurm system
  --resume              Resume the running, which will skip the step if output
                        file already exists!
  -V, --version         Print xTea version
  -i FILE, --id=FILE    sample id list file
  -a FILE, --par=FILE   parameter file
  -l FILE, --lib=FILE   TE lib config file
  -b FILE, --bam=FILE   Input bam file
  -x FILE, --x10=FILE   Input 10X bam file
  -n CORES, --cores=CORES
                        number of cores
  -m MEMORY, --memory=MEMORY
                        Memory limit in GB
  -q PARTITION, --partition=PARTITION
                        Which queue to run the job
  -t TIME, --time=TIME  Time limit
  -p WFOLDER, --path=WFOLDER
                        Working folder
  -r REF, --ref=REF     reference genome
  -g GENE, --gene=GENE  Gene annotation file
  --xtea=XTEA           xTEA folder
  -f FLAG, --flag=FLAG  Flag indicates which step to run (1-clip, 2-disc,
                        4-barcode, 8-xfilter, 16-filter, 32-asm)
  -y REP_TYPE, --reptype=REP_TYPE
                        Type of repeats working on: 1-L1, 2-Alu, 4-SVA,
                        8-HERV, 16-Mitochondrial
  --flklen=FLKLEN       flank region file
  --nclip=NCLIP         cutoff of minimum # of clipped reads
  --cr=CLIPREP          cutoff of minimum # of clipped reads whose mates map
                        in repetitive regions
  --nd=NDISC            cutoff of minimum # of discordant pair
  --nfclip=NFILTERCLIP  cutoff of minimum # of clipped reads in filtering step
  --nfdisc=NFILTERDISC  cutoff of minimum # of discordant pair of each sample
                        in filtering step
  --teilen=TEILEN       minimum length of the insertion for future analysis
  -o FILE, --output=FILE
                        The output file
  --blacklist=FILE      Reference panel database for filtering, or a blacklist
                        region
[user@cn3335 ~]$ ls $XTEA_BIN
python  shell  xtea  xtea_hg19  xtea_long

Download xTea source code:
[user@cn3335 ~]$ wget https://github.com/parklab/xTea/archive/refs/tags/v0.1.9.tar.gz
[user@cn3335 ~]$ tar -zxf v0.1.9.tar.gz && rm -f v0.1.9.tar.gz

Enter the Demo folder, download and preprocess sample data:
[user@cn3335 ~]$ cd xTea-0.1.9/Demo
[user@cn3335 ~]$ python $XTEA_SRC/gnrt_pipeline_local.py -h
Usage: gnrt_pipeline_local.py [options]

Options:
  -h, --help            show this help message and exit
  -D, --decompress      Decompress the rep lib and reference file
  -M, --mosaic          Calling mosaic events from high coverage data
  -C, --case_control    Run in case control mode
  --denovo              Run in de novo mode
  -U, --user            Use user specific parameters instead of automatically
                        calculated ones
  --force               Force to start from the very beginning
  --tumor               Working on tumor samples
  --purity=PURITY       Tumor purity
  --lsf                 Indiates submit to LSF system
  --slurm               Indiates submit to slurm system
  --resume              Resume the running, which will skip the step if output
                        file already exists!
  -V, --version         Print xTea version
  -i FILE, --id=FILE    sample id list file
  -a FILE, --par=FILE   parameter file
  -l FILE, --lib=FILE   TE lib config file
  -b FILE, --bam=FILE   Input bam file
  -x FILE, --x10=FILE   Input 10X bam file
  -n CORES, --cores=CORES
                        number of cores
  -m MEMORY, --memory=MEMORY
                        Memory limit in GB
  -q PARTITION, --partition=PARTITION
                        Which queue to run the job
  -t TIME, --time=TIME  Time limit
  -p WFOLDER, --path=WFOLDER
                        Working folder
  -r REF, --ref=REF     reference genome
  -g GENE, --gene=GENE  Gene annotation file
  --xtea=XTEA           xTEA folder
  -f FLAG, --flag=FLAG  Flag indicates which step to run (1-clip, 2-disc,
                        4-barcode, 8-xfilter, 16-filter, 32-asm)
  -y REP_TYPE, --reptype=REP_TYPE
                        Type of repeats working on: 1-L1, 2-Alu, 4-SVA,
                        8-HERV, 16-Mitochondrial
  --flklen=FLKLEN       flank region file
  --nclip=NCLIP         cutoff of minimum # of clipped reads
  --cr=CLIPREP          cutoff of minimum # of clipped reads whose mates map
                        in repetitive regions
  --nd=NDISC            cutoff of minimum # of discordant pair
  --nfclip=NFILTERCLIP  cutoff of minimum # of clipped reads in filtering step
  --nfdisc=NFILTERDISC  cutoff of minimum # of discordant pair of each sample
                        in filtering step
  --teilen=TEILEN       minimum length of the insertion for future analysis
  -o FILE, --output=FILE
                        The output file
  --blacklist=FILE      Reference panel database for filtering, or a blacklist
                        region

Prepare the data to be used:
[user@cn3335 ~]$ wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR194/ERR194147/NA12878_S1.bam
[user@cn3335 ~]$ samtools index NA12878_S1.bam
[user@cn3335 ~]$ wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.chr_patch_hapl_scaff.basic.annotation.gff3.gz
[user@cn3335 ~]$ gunzip gencode.v33.chr_patch_hapl_scaff.basic.annotation.gff3.gz
[user@cn3335 ~]$ ln -s /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa
[user@cn3335 ~]$ ln -s /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.index
[user@cn3335 ~]$ ln -s /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai
[user@cn3335 ~]$ wget https://github.com/brentp/poverlap/raw/master/data/hg19.centromere.bed
[user@cn3335 ~]$ wget https://github.com/parklab/xTea/raw/master/rep_lib_annotation.tar.gz
[user@cn3335 ~]$ tar -zxf rep_lib_annotation.tar.gz
[user@cn3335 ~]$ cat sample_id.txt
NA12878
[user@cn3335 ~]$ cat sample_bam.txt
NA12878 ./NA12878_S1.bam

Create a script to be submitted to the cluster:
[user@cn3335 ~]$ python $XTEA_SRC/gnrt_pipeline_local.py -i sample_id.txt -b sample_bam.txt \
    -p . -o submit_jobs.sh -q short -n 8 -m 16 -t 0-05:00 \
    -l ./ -r ./genome.fa \
    -g $XTEA_DATA/gencode.v33lift37.annotation.gff3 \
    --xtea ../../xTea-0.1.9/xtea --nclip 4 --cr 2 --nd 5 --nfclip 4 --nfdisc 5 \
    --flklen 3000 -f 5907 -y 7 --slurm -q norm \
    --blacklist ./hg19.centromere.bed
[user@cn3335 ~]$ cat submit_jobs.sh
#!/bin/bash
sbatch < ./NA12878/L1/run_xTEA_pipeline.sh
sbatch < ./NA12878/Alu/run_xTEA_pipeline.sh
sbatch < ./NA12878/SVA/run_xTEA_pipeline.sh

Submit the script:
[user@cn3335 ~]$ source submit_jobs.sh
59538871
59538874
59538877
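
The -y 7 and -f 5907 values passed to gnrt_pipeline_local.py appear to be sums of the power-of-two codes listed in the help text; this reading is an inference from the help output and from the fact that -y 7 produced L1, Alu and SVA job scripts, not something the help text states outright. The following minimal Python sketch decodes such a value. The helper name decode_flags is hypothetical and is not part of xTea. Bits above 32 in the -f value are not described in the help text shown here, so the sketch only reports them as a leftover number rather than guessing their meaning.

```python
# Codes copied from the `xtea -h` help text shown earlier on this page.
REP_TYPES = {1: "L1", 2: "Alu", 4: "SVA", 8: "HERV", 16: "Mitochondrial"}
STEPS = {1: "clip", 2: "disc", 4: "barcode", 8: "xfilter", 16: "filter", 32: "asm"}


def decode_flags(value, names):
    """Hypothetical helper: split an integer into named bits and leftover bits.

    Assumes the option value is a bit mask (sum of the listed codes).
    Returns (list of names whose bit is set, integer of undocumented bits).
    """
    selected = [name for bit, name in sorted(names.items()) if value & bit]
    known_mask = sum(names)  # sum of the dict keys, i.e. all documented bits
    return selected, value & ~known_mask


if __name__ == "__main__":
    # -y 7 from the example command
    print(decode_flags(7, REP_TYPES))   # (['L1', 'Alu', 'SVA'], 0)
    # -f 5907 from the example command; 5888 is the part not covered
    # by the six step codes listed in the help text
    print(decode_flags(5907, STEPS))    # (['clip', 'disc', 'filter'], 5888)
```

Under this assumption, -y 7 selects L1, Alu and SVA, which matches the three sbatch lines in submit_jobs.sh.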