Biowulf High Performance Computing at the NIH
Adam on Biowulf

ADAM is a library and command line tool that enables the use of Apache Spark to parallelize genomic data analysis across cluster/cloud computing environments. ADAM uses a set of schemas to describe genomic sequences, reads, variants/genotypes, and features, and can be used with data in legacy genomic file formats such as SAM/BAM/CRAM, BED/GFF3/GTF, and VCF, as well as data stored in the columnar Apache Parquet format. On a single node, ADAM provides competitive performance to optimized multi-threaded tools, while enabling scale out to clusters with more than a thousand cores. ADAM's APIs can be used from Scala, Java, Python, R, and SQL.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem 4g -c 8
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load adam

[user@cn3144 ~]$ adam-submit transformAlignments /usr/local/apps/adam/examples/adam_small.sam sample.adam
2019-01-28 09:03:54 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-01-28 09:03:54 INFO  ADAMMain:109 - ADAM invoked with args: "transformAlignments" "/usr/local/apps/adam/examples/adam_small.sam" "sample.adam"
2019-01-28 09:03:54 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-01-28 09:03:54 INFO  SparkContext:54 - Submitted application: transformAlignments
2019-01-28 09:03:54 INFO  SecurityManager:54 - Changing view acls to: teacher
2019-01-28 09:03:54 INFO  SecurityManager:54 - Changing modify acls to: teacher
2019-01-28 09:03:54 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-01-28 09:03:54 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-01-28 09:03:54 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(teacher); groups with view permissions: Set(); users  with modify permissions: Set(teacher); groups with modify permissions: Set()
2019-01-28 09:03:55 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 46661.
2019-01-28 09:03:55 INFO  SparkEnv:54 - Registering MapOutputTracker
2019-01-28 09:03:55 INFO  SparkEnv:54 - Registering BlockManagerMaster
2019-01-28 09:03:55 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-01-28 09:03:55 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-01-28 09:03:55 INFO  DiskBlockManager:54 - Created local directory at /tmp/blockmgr-313eaeb9-e1e7-4ade-879b-30c26077462d
2019-01-28 09:03:55 INFO  MemoryStore:54 - MemoryStore started with capacity 366.3 MB
[...]
2019-01-28 09:03:59 INFO  ParquetFileReader:209 - Initiating action with parallelism: 5
2019-01-28 09:03:59 INFO  SparkHadoopWriter:54 - Job job_20190128090358_0002 committed.
2019-01-28 09:03:59 INFO  TransformAlignments:44 - Overall Duration: 5.05 secs
2019-01-28 09:03:59 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2019-01-28 09:03:59 INFO  AbstractConnector:318 - Stopped Spark@5d332969{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-01-28 09:03:59 INFO  SparkUI:54 - Stopped Spark web UI at http://cn3203:4040
2019-01-28 09:03:59 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-01-28 09:03:59 INFO  MemoryStore:54 - MemoryStore cleared
2019-01-28 09:03:59 INFO  BlockManager:54 - BlockManager stopped
2019-01-28 09:03:59 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-01-28 09:03:59 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-01-28 09:03:59 INFO  SparkContext:54 - Successfully stopped SparkContext
2019-01-28 09:03:59 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-01-28 09:03:59 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4b2ba0a6-2ba9-4cf1-b3f6-7ff4fa55f81e
2019-01-28 09:03:59 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-08d97999-c69f-4ceb-88f7-587d5d066aea


[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Note: for large-scale jobs, please refer to the NIH HPC Spark page for instructions on starting a Spark cluster.
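Within a single allocation, Spark resources can also be passed directly to adam-submit, which forwards any arguments before a bare "--" to spark-submit. The sketch below assumes the 8-CPU, 4 GB interactive allocation shown above; input.sam and output.adam are placeholder file names:

```shell
# Sketch: size Spark to match the sinteractive allocation above.
# Arguments before "--" are forwarded to spark-submit (ADAM's
# spark-args/adam-args convention); file names are placeholders.
adam-submit --master local[8] --driver-memory 3g \
    -- transformAlignments input.sam output.adam
```

Leaving some headroom between --driver-memory and the job's memory allocation avoids the driver JVM being killed by the batch system.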

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. adam.sh). For example:

#!/bin/bash
module load adam
adam-submit transformAlignments /usr/local/apps/adam/examples/adam_small.sam sample.adam

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=8 --mem=16g adam.sh
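The batch script can also size Spark from the Slurm allocation itself, so the script does not need editing when the sbatch resources change. This is a sketch: it relies on adam-submit forwarding arguments before "--" to spark-submit, and on SLURM_CPUS_PER_TASK being set by sbatch; the 12g driver memory is an assumed value leaving headroom under the 16g allocation.

```shell
#!/bin/bash
# Sketch: match the local Spark master to the Slurm allocation.
# SLURM_CPUS_PER_TASK is set by sbatch --cpus-per-task (default 8 here);
# 12g driver memory is an assumption leaving headroom under --mem=16g.
module load adam
adam-submit --master local[${SLURM_CPUS_PER_TASK:-8}] --driver-memory 12g \
    -- transformAlignments /usr/local/apps/adam/examples/adam_small.sam sample.adam
```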
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. adam.swarm). For example:

adam-submit transformAlignments sample1.bam sample1.adam
adam-submit transformAlignments sample2.bam sample2.adam
adam-submit transformAlignments sample3.bam sample3.adam

Submit this job using the swarm command.

swarm -f adam.swarm -g 16 -t 8 --module adam
where
-g #            Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #            Number of threads/CPUs required for each process (1 line in the swarm command file)
--module adam   Loads the adam module for each subjob in the swarm
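When the inputs follow a common naming pattern, the swarm file above can be generated with a short shell loop rather than written by hand. This is a sketch: the sample*.bam pattern and the adam.swarm file name are illustrative.

```shell
# Write one adam-submit line per BAM file in the current directory
# (sample*.bam is an illustrative naming pattern).
# The guard skips the literal pattern when no files match the glob.
for bam in sample*.bam; do
    [ -e "$bam" ] || continue
    echo "adam-submit transformAlignments ${bam} ${bam%.bam}.adam"
done > adam.swarm
```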