High-Performance Computing at the NIH
SNPhylo on Biowulf & Helix

SNPhylo is a pipeline that generates a phylogenetic tree from large SNP datasets.

[SNPhylo website]

SNPhylo can be CPU- and memory-intensive, so it should not be run on Helix.

Batch job on Biowulf

The example below uses the soybean HapMap data available in /usr/local/apps/snphylo/soybean.hapmap.gz. Set up a batch script along the following lines:

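#!/bin/bash

# Load SNPhylo and its dependencies
module load snphylo

# Work in a directory under /data (it must already exist)
cd "/data/$USER/snphylo" || exit 1
# Copy and decompress the example soybean HapMap data
cp /usr/local/apps/snphylo/soybean.hapmap.gz .
gunzip soybean.hapmap.gz
# Build the tree from the HapMap file
snphylo.sh -H soybean.hapmap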

Submit this job with:

sbatch jobscript
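
The batch script above copies and decompresses the example data on every run, and gunzip will not silently overwrite an existing soybean.hapmap, so resubmitting the job in the same directory may not behave as expected. A minimal sketch of one way to make the staging step repeatable is shown below. The function name stage_hapmap is hypothetical and not part of SNPhylo; it uses only standard tools (mkdir, gunzip, basename).

```shell
#!/bin/bash
# Hypothetical helper (not part of SNPhylo): stage a gzipped HapMap file
# into a working directory, decompressing it only if it is not already there.
stage_hapmap() {
    local src="$1"        # path to the .gz source file
    local workdir="$2"    # directory where the job will run
    local name
    name="$(basename "$src" .gz)"

    mkdir -p "$workdir" || return 1
    if [ ! -s "$workdir/$name" ]; then
        # gunzip -c writes to stdout, so the source file is left untouched
        gunzip -c "$src" > "$workdir/$name" || { rm -f "$workdir/$name"; return 1; }
    fi
    # Print the path of the decompressed file for the caller
    printf '%s\n' "$workdir/$name"
}

# Example use inside the batch script (paths taken from the example above):
# hapmap="$(stage_hapmap /usr/local/apps/snphylo/soybean.hapmap.gz "/data/$USER/snphylo")" || exit 1
# cd "$(dirname "$hapmap")" && snphylo.sh -H "$(basename "$hapmap")"
```

Decompressing with gunzip -c avoids the separate cp step and leaves the shared copy under /usr/local/apps unchanged.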

Interactive job on Biowulf

Allocate an interactive node on Biowulf and run SNPhylo there. Sample session:

[susanc@biowulf ~]$ sinteractive
salloc.exe: Granted job allocation 181190
[susanc@cn0045 ~]$ module load snphylo
[+] Loading gcc 4.4.7 ...
[+] Loading OpenMPI 1.8.1 for GCC 4.4.7 (ethernet) ...
[+] Loading tcl_tk 8.6.1
[+] Loading ATLAS 3.8.4 libraries...
[+] Loading R 3.2.0 on cn0045

[+] Loading muscle 3.8.31
[+] Loading python 2.7.9 ...

[susanc@cn0045 snphylo]$ snphylo.sh -H soybean.hapmap
Start to remove low quality data.

3931995 low quality lines were removed

WARNING: ignoring environment value of R_HOME
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
Start HapMap2GDS ...
	Scanning ...
	file: snphylo.output.filtered.hapmap
	content: 2357753 rows x 42 columns
Tue Jun 30 11:38:07 2015 	store sample id, snp id, position, and chromosome.
	start writing: 31 samples, 2357752 SNPs ...
	file: snphylo.output.filtered.hapmap
Tue Jun 30 11:56:59 2015 	Done.
Hint: it is suggested to call `snpgdsOpen' to open a SNP GDS file instead of `openfn.gds'.
SNP pruning based on LD:
Excluding 0 SNP on non-autosomes
Excluding 703227 SNPs (monomorphic: TRUE, < MAF: 0.1, or > missing rate: 0.1)
Working space: 31 samples, 1654525 SNPs
	Using 1 (CPU) core
	Sliding window: 500000 basepairs, Inf SNPs
	|LD| threshold: 0.1
Chromosome 1: 0.21%, 258/121027
Chromosome 2: 0.24%, 258/109355
Chromosome 3: 0.21%, 258/121945
Chromosome 4: 0.19%, 244/126613
Chromosome 5: 0.24%, 202/85247
Chromosome 6: 0.20%, 254/127325
Chromosome 7: 0.21%, 231/107838
Chromosome 8: 0.19%, 238/123126
Chromosome 9: 0.21%, 235/112710
Chromosome 10: 0.22%, 237/105578
Chromosome 11: 0.21%, 202/95764
Chromosome 12: 0.20%, 208/104896
Chromosome 13: 0.20%, 252/126298
Chromosome 14: 0.18%, 242/131612
Chromosome 15: 0.18%, 268/149827
Chromosome 16: 0.21%, 207/100444
Chromosome 17: 0.21%, 203/95319
Chromosome 18: 0.18%, 316/171046
Chromosome 19: 0.21%, 255/122479
Chromosome 20: 0.20%, 236/119303
4804 SNPs are selected in total.
Nucleic acid sequence Maximum Likelihood method, version 3.696

Settings for this run:
  U                 Search for best tree?  Yes
  T        Transition/transversion ratio:  2.0000
  F       Use empirical base frequencies?  Yes
  C                One category of sites?  Yes
  R           Rate variation among sites?  constant rate
  W                       Sites weighted?  No
  S        Speedier but rougher analysis?  Yes
  G                Global rearrangements?  No
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes
  5   Reconstruct hypothetical sequences?  No

  Y to accept these or type the letter for one to change

Adding species:
   1. W01
   2. W02
   3. W03
   4. W04
   5. W05
   6. W06
   7. W07
   8. W08
   9. W09
  10. W10
  11. W11
  12. W12
  13. W13
  14. W14
  15. W15
  16. W16
  17. W17
  18. C01
  19. C02
  20. C08
  21. C12
  22. C14
  23. C16
  24. C17
  25. C19
  26. C24
  27. C27
  28. C30
  29. C33
  30. C34
  31. C35

Output written to file "outfile"

Tree also written onto file "outtree"

Done.

WARNING: ignoring environment value of R_HOME
null device
          1
The tree file (snphylo.output.xx.tree) and the image (snphylo.output.xx.png) are successfully generated!
Now, you can see the tree by a program such as MEGA4 (http://www.megasoftware.net/mega4/mega.html), FigTree (http://tree.bio.ed.ac.uk/software/figtree/) and Newick utilities (http://cegg.unige.ch/newick_utils).
Good Luck!
[susanc@cn0045 ~]$ exit
exit
salloc.exe: Relinquishing job allocation 181190
salloc.exe: Job allocation 181190 has been revoked.
[susanc@biowulf ~]$
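
As a quick sanity check on a run like the one above, the per-chromosome counts printed during LD pruning can be added up and compared with the "SNPs are selected in total" line. The sketch below assumes the program output was saved to a file; the function name sum_selected_snps and the file name snphylo.log are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: add up the per-chromosome "selected/total" counts
# printed during LD pruning and print the sum of selected SNPs.
sum_selected_snps() {
    # Matching lines look like: "Chromosome 1: 0.21%, 258/121027"
    awk '/^Chromosome [0-9]+: / {
             split($NF, counts, "/")   # counts[1] = selected, counts[2] = total
             selected += counts[1]
         }
         END { print selected + 0 }' "$1"
}

# Example use, assuming the output was saved to a file named snphylo.log:
# sum_selected_snps snphylo.log
```

For the session shown above, the twenty per-chromosome counts add up to 4804, which matches the total reported in the log.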

Documentation

SNPhylo website