Biowulf High Performance Computing at the NIH
Data-Driven Bash Scripting

This page teaches Bash scripting through a data-driven, task-oriented approach. Regional weather data was downloaded from NOAA over a period of about two weeks. The goal of the commands on this page is to parse that data using simple scripting techniques.

These lessons assume you have a Biowulf account and are working within a temporary, local scratch disk on a node. Any changes made to the downloaded files during the session will be lost unless the files are copied to your own /home or /data directory!

Getting started

Log into Biowulf:

ssh user@biowulf.nih.gov

Make sure you are running Bash:

echo $0

Allocate an interactive session with local scratch space so that I/O operations on the data are as fast as possible. Run this command from the Biowulf login node. By default, the session has 2 CPUs and 1.5 GB of RAM and lasts for 8 hours; --gres=lscratch:5 requests 5 GB of local scratch.

sinteractive --gres=lscratch:5

Copy the data into the local scratch space.

cd /lscratch/$SLURM_JOB_ID
mkdir bash_class
cd bash_class
cp -R /data/classes/bash/* .

If you are not on Helix or Biowulf, you can download the data using either wget or curl.

wget

wget https://hpc.nih.gov/training/handouts/BashScripting.tgz

curl

curl https://hpc.nih.gov/training/handouts/BashScripting.tgz > BashScripting.tgz

Untar the data

tar -x -v -f BashScripting.tgz

Have a look at the data. There should be about 63MB of data, 71MB including examples and scripts.

ls
ls data/weather-2017-10-09-00-00-01/
ls data/weather-2017-10-09-00-00-01/md/
file data/weather-2017-10-09-00-00-01/md/mdz009.txt
stat data/weather-2017-10-09-00-00-01/md/mdz009.txt
tree data
cat data/weather-2017-10-09-00-00-01/md/mdz009.txt

Get down and dirty with the data

Run some simple and piped commands to display the temperature, humidity, and air pressure of a given city at a given time.

Look at all the data

cat data/weather-2017-10-10-06-00-01/md/mdz009.txt

Pull out only the lines that are for Gaithersburg

grep GAITHERSBURG data/weather-2017-10-10-06-00-01/md/mdz009.txt

The search is case sensitive, so a lowercase pattern finds nothing. Make it case insensitive with --ignore-case (or -i)

grep gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt
grep --ignore-case gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt

Get all the timepoints for the data, using --recursive or -R

grep --recursive --ignore-case gaithersburg data/*/md

Get rid of the file information

grep --no-filename --recursive --ignore-case gaithersburg data/*/md

Sort the data

grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort

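Sort based on temperature, the third whitespace-delimited field. This can fail, because the number of fields in front of the temperature is not guaranteed to be the same on every line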

grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k3
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -nk3
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -rnk3

Sort using character position, rather than delimiter

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | cut -c1-14,26-27
grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.26,1.27

Reverse the sort -- change ascending to descending

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.26,1.27r

Sort ascending by relative humidity -- fail because of leading space -- fix by treating as numeric

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35
grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n

Multi-column sort

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n -k 1.26,1.27r
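Why character positions are more robust than fields can be checked on a few lines without the full data set. The following is a minimal sketch, assuming GNU sort; the three sample lines are copied from the listing in the Regular expressions section of this page, and the temperature sits in columns 26-27.

```bash
#!/usr/bin/env bash
# Three fixed-width sample lines (city, sky, temperature, dew point, humidity, ...).
sample='ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F
BWI AIRPORT    MOCLDY    70  49  47 S12G20    29.63F
MD SCIENCE CTR   N/A     60  46  59 MISG      29.64F'

# Field 3 is "65" on the first line but "MOCLDY" and "CTR" on the others,
# because some city names contain spaces. A field-based key is unreliable.
printf '%s\n' "$sample" | awk '{ print $3 }'

# Character positions do not depend on spaces: sort numerically on columns 26-27.
by_temp=$(printf '%s\n' "$sample" | sort -k 1.26,1.27n)
printf '%s\n' "$by_temp"
```

The sorted output lists MD SCIENCE CTR (60) first and BWI AIRPORT (70) last, even though the three lines have different numbers of words in front of the temperature.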

Isolate the most humid moment in time

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n | tail
grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n | tail -n 1
grep --recursive "FOG       65  65 100 CALM" data/*/md/*
grep --recursive "GAITHERSBURG*  FOG       65  65 100 CALM" data/*/md/*
grep --recursive "GAITHERSBURG\*  FOG       65  65 100 CALM" data/*/md/*

Isolate the least humid moment in time

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n | head -n 1
grep --recursive "GAITHERSBURG\*  MOSUNNY   83  64  52 S15       30.02F" data/*/md/*
grep --recursive 'GAITHERSBURG\*  MOSUNNY   83  64  52 S15       30.02F' data/*/md/*
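The asterisk needs escaping because grep treats the search string as a regular expression. The sketch below illustrates this on a single line taken from the search strings above; the -F variant is an alternative that is not used elsewhere on this page.

```bash
#!/usr/bin/env bash
# Sample line taken from the search string used above.
line='GAITHERSBURG*  FOG       65  65 100 CALM'

# Unescaped: "G*" means "zero or more G", so the literal "*" in the data is never matched.
unescaped=$(printf '%s\n' "$line" | grep -c 'GAITHERSBURG*  FOG')

# Escaped: "\*" matches a literal asterisk.
escaped=$(printf '%s\n' "$line" | grep -c 'GAITHERSBURG\*  FOG')

# Fixed-string mode: -F turns off regular expressions altogether.
fixed=$(printf '%s\n' "$line" | grep -c -F 'GAITHERSBURG*  FOG')

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
```

Single quotes are the safer habit for search strings, since the shell expands nothing inside them.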

Get the average temperature

grep  -h -r -i gaithersburg data/*/md | cut -c26-27 | awk '{sum+=$1; num++ } END { print sum,num,sum/num}'
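The awk program keeps a running sum and a count, then prints the sum, the count, and the average once all input has been read. Here is a small sketch of the same pipeline on three sample lines copied from later on this page; the guard against an empty input is an addition, not part of the original command.

```bash
#!/usr/bin/env bash
sample='ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F
BWI AIRPORT    MOCLDY    70  49  47 S12G20    29.63F
MD SCIENCE CTR   N/A     60  46  59 MISG      29.64F'

# cut isolates columns 26-27 (temperature); awk accumulates and averages.
# The "num > 0" test avoids a division by zero when nothing matched.
result=$(printf '%s\n' "$sample" | cut -c26-27 |
    awk '{ sum += $1; num++ } END { if (num > 0) print sum, num, sum / num }')
echo "$result"
```

For these three lines the temperatures are 65, 70 and 60, so the output is 195 3 65.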

Scripting

Had enough of grep? Had enough of typing again and again? Let's create a script. nano is a simple text-based file editor that is intuitive and easy to use.

nano script_01a.sh
grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT'
grep -h -r -i gaithersburg data/*

Now run the script.

bash script_01a.sh

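This prints every timestamp line, followed by every Gaithersburg line. We want to paste the timestamps onto the weather data. Edit the script, saving it as script_01b.sh, so that each output is redirected to a file (here named 1 and 2) and the two files are combined with the paste command.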

nano script_01b.sh
grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT' > 1
grep -h -r -i gaithersburg data/* > 2
paste 1 2

Now it looks nicer.

bash script_01b.sh
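The files named 1 and 2 are left behind in the working directory. One possible alternative, sketched below, is Bash process substitution, which lets paste read the output of two commands directly. The two functions are hypothetical stand-ins for the grep commands, and the timestamps are made up in the format the data uses.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the two grep commands in script_01b.sh.
timestamps() {
    printf '%s\n' '600 AM EDT TUE OCT 10 2017' '700 AM EDT TUE OCT 10 2017'
}
readings() {
    printf '%s\n' 'GAITHERSBURG*  FOG       65  65 100 CALM' \
                  'GAITHERSBURG*  MOSUNNY   83  64  52 S15       30.02F'
}

# <( ... ) presents each command's output to paste as if it were a file,
# so no temporary files are created.
pasted=$(paste <(timestamps) <(readings))
printf '%s\n' "$pasted"
```

paste joins line N of the first input to line N of the second with a tab, so this only lines up when both commands produce the same number of lines in the same order.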

Reduce, reuse, recycle

Rather than edit script files, it would be easier to pass the name of a city into the script to get the data.

nano script_02a.sh
grep -h -r -i -B100 $1 data/* | grep 'M EDT' > 1
grep -h -r -i $1 data/* > 2
paste 1 2

Now pass a city name as an argument to the script.

bash script_02a.sh gaithersburg
bash script_02a.sh leesburg
bash script_02a.sh manassas
bash script_02a.sh dulles
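An unquoted $1 breaks for names containing a space, and an empty $1 leaves grep without a pattern. A defensive variant might look like the sketch below; city_table and its optional second argument are hypothetical names, not part of the class scripts.

```bash
#!/usr/bin/env bash
# city_table CITY [DATA_DIR]: hypothetical function form of script_02a.sh.
city_table() {
    local city=$1
    local data_dir=${2:-data}

    # Refuse to run without a city; an empty pattern would match every line.
    if [[ -z $city ]]; then
        echo "usage: city_table CITY [DATA_DIR]" >&2
        return 1
    fi

    # Quoting "$city" keeps a name such as "bwi airport" together as one pattern.
    paste <(grep -h -r -i -B100 -- "$city" "$data_dir" | grep 'M EDT') \
          <(grep -h -r -i -- "$city" "$data_dir")
}
```

After sourcing this file, city_table "bwi airport" would search the data directory, while city_table with no argument prints the usage line and returns a non-zero status.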

Walking the data

Use for loops to generate tables of the data.

nano script_03a.sh
for city in gaithersburg leesburg manassas dulles
do
    grep -h -r -i -B100 $city data/* | grep 'M EDT' > 1
    grep -h -r -i $city data/* > 2
    paste 1 2
done

bash script_03a.sh | sort -k6,6n -k2,2 -k1,1n

nano script_03b.sh
for city in {gaithersburg,leesburg,manassas,dulles}
do
    grep -h -r -i -B100 $city data/* | grep 'M EDT' > 1
    grep -h -r -i $city data/* > 2
    paste 1 2
done

bash script_03b.sh | sort -k6,6n -k2,2 -k1,1n

Parsing the data

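A while read loop is the way to walk through a single file, one line at a time. The file is redirected into the loop's STDIN. Note the difference between the first two commands: without quotes around $line, runs of spaces are collapsed.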

while read line ; do echo $line ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
while read line ; do echo "$line" ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
while read line ; do echo "--- $line ---" ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
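The effect of the quotes is easiest to see on one fixed-width line. The sketch below uses a sample line from later on this page; IFS= and -r are additions that stop read from trimming leading spaces and interpreting backslashes.

```bash
#!/usr/bin/env bash
line_in='ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F'

# Unquoted: the shell splits $line on whitespace and echo rejoins the words
# with single spaces, so the fixed-width columns are lost.
collapsed=$(while read line; do echo $line; done <<< "$line_in")

# Quoted, with IFS= and -r: the line is passed through exactly as read.
preserved=$(while IFS= read -r line; do echo "$line"; done <<< "$line_in")

echo "$collapsed"
echo "$preserved"
```

Unquoted variables are also subject to filename expansion, so a city name such as GAITHERSBURG* could be replaced by matching file names. Quoting avoids both problems, and it matters here because cut and sort rely on the column positions.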

Create a script to do this.

nano script_04a.sh
while read line
do
    echo "--- $line ---"
done

nano script_04b.sh
while read line
do
    echo "---${line}---"
done

Dealing with adversity

Use conditionals to handle unknowns and odd issues.

Only grab the data below "CITY"

nano script_04c.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done

bash script_04c.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt

Don't print "CITY"

nano script_04d.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done

bash script_04d.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt

Get rid of remainder

nano script_04e.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if ( echo "$line" | grep -q "^$" ) ; then
            exit
        else
            echo " ---${line}---- "
        fi
    fi
done

bash script_04e.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt

Yuck, get rid of '$$' as well

nano script_04f.sh

this won't work, because inside double quotes the shell expands $$ to its own process ID

        if ( echo "$line" | grep -q "$$" ) ; then
            exit
        else

this will, but we're still left with blank lines

        if ( echo "$line" | grep -q '$$' ) ; then
            exit
        else

more cleverness

        if ( echo "$line" | grep -q '$$' ) ; then
            exit
        elif ( echo "$line" | grep -q "^$" ) ; then
            exit
        else

even more cleverness

        if  [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
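Put together, the fragments above might assemble into something like the following sketch. It is written as a function reading STDIN so it can be tried without the data; extract_rows is a hypothetical name, the sample input is made up in the general shape of the weather files, and the Bash pattern match [[ $line == CITY* ]] is swapped in for the grep subshell.

```bash
#!/usr/bin/env bash
# extract_rows: hypothetical STDIN filter assembling the fragments above.
extract_rows() {
    local line good=""
    while IFS= read -r line; do
        # Start printing after the header line that begins with CITY.
        if [[ $line == CITY* ]]; then
            good=1
            continue
        fi
        # Keep only lines containing a letter: this skips blank lines and "$$".
        if [[ -n $good && $line =~ [[:alpha:]] ]]; then
            echo " ---${line}---- "
        fi
    done
}

# Made-up input; the quoted 'EOF' stops the shell from expanding $$.
extract_rows <<'EOF'
600 AM EDT TUE OCT 10 2017
CITY           SKY/WX    TMP DP  RH WIND       PRES   REMARKS
ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F

$$
EOF
```

Only the ANNAPOLIS line is printed. Testing for a letter is simpler than listing every kind of line to reject, at the cost of also keeping any free-text lines that follow the table.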

now walk all the files

for file in data/weather-2017-10-10-00-00-02/md/* ; do bash script_04f.sh < $file ; done

Wot? Repeats?

for file in data/weather-2017-10-10-00-00-02/md/* ; do bash script_04f.sh < $file ; done | sort -u

Regular expressions

Huh? Looking through the output shows the same cities repeated with different readings

---ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F----
---ANNAPOLIS      CLEAR     80  77  90 CALM      29.98R----

---BWI AIRPORT    MOCLDY    70  49  47 S12G20    29.63F----
---BWI AIRPORT    PTCLDY    75  75 100 CALM      29.98R----

---MD SCIENCE CTR   N/A     60  46  59 MISG      29.64F----
---MD SCIENCE CTR   N/A     79  75  87 MISG      29.97R----

Use the Expires tag to filter out very old data:

nano script_05a.sh
regex="^Expires:([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})"
cutoff=201710010000

while read line
do

    if [[ "$line" =~ $regex ]]; then
        echo year=${BASH_REMATCH[1]}
        echo month=${BASH_REMATCH[2]}
        echo day=${BASH_REMATCH[3]}
        echo hour=${BASH_REMATCH[4]}
        echo minute=${BASH_REMATCH[5]}
        if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}${BASH_REMATCH[5]}" ]]; then
            exit
        fi

    fi

    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if  [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
    fi
done

for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -10) ; do bash script_05a.sh < $file ; done | sort -u
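The =~ operator stores whatever the parenthesised groups matched in the BASH_REMATCH array. The sketch below isolates that step; is_expired is a hypothetical helper, and the two Expires values are invented for illustration rather than taken from the data set.

```bash
#!/usr/bin/env bash
regex='^Expires:([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})'
cutoff=201710010000

# is_expired LINE: succeed when LINE is an Expires header older than the cutoff.
is_expired() {
    [[ $1 =~ $regex ]] || return 1
    # BASH_REMATCH[0] is the whole match; [1]..[5] are the five groups.
    local stamp="${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}${BASH_REMATCH[5]}"
    # Year, month, day, hour and minute concatenate into one comparable number.
    (( 10#$cutoff > 10#$stamp ))
}

# Hypothetical header values.
is_expired 'Expires:201709150800' && echo "stale"
is_expired 'Expires:201710100800' || echo "fresh"
```

Because the fields run from most to least significant, comparing the concatenated digits as a single integer orders the timestamps correctly.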

Simpler:

nano script_05b.sh
regex="^Expires:([0-9]{12})"
cutoff=201710010000

while read line
do

    if [[ "$line" =~ $regex ]]; then
        if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
            exit
        fi
    fi

    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if  [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
    fi
done

for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -10) ; do bash script_05b.sh < $file ; done | sort -u

Automation

Create a function to automate what we did:

nano script_06a.sh
cutoff=201710010000

function extract_data
{
    regex="^Expires:([0-9]{12})"
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                exit
            fi
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${line}
                else
                    value=${value}$'\n'${line}
                fi
            fi
        fi
    done < $1
    echo "$value"
}

extract_data $1

for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -10) ; do bash script_06a.sh $file ; done | sort -u

That's good, but what if we want all the data for a given time and state?

nano script_06b.sh
cutoff=201710010000

function extract_data
{
    regex="^Expires:([0-9]{12})"
    unset good
    unset value
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${line}
                else
                    value=${value}$'\n'${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

bash script_06b.sh data/weather-2017-10-10-00-00-02/az
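The unset lines matter because the function is now called many times in one shell, and ordinary shell variables keep their values between calls. The same goes for return instead of exit: return leaves only the function, so the loop moves on to the next file. Declaring the variables local is another way to get a clean slate on each call, sketched below with a hypothetical count_rows function and two made-up files.

```bash
#!/usr/bin/env bash
# count_rows FILE: hypothetical helper that counts data rows below the CITY header.
count_rows() {
    # "local" gives every call fresh variables, so no "unset" is needed.
    local good="" count=0 line
    while IFS= read -r line; do
        if [[ $line == CITY* ]]; then
            good=1
            continue
        fi
        if [[ -n $good && $line =~ [[:alpha:]] ]]; then
            count=$((count + 1))
        fi
    done < "$1"
    echo "$count"
}

# Two made-up files: the second has no CITY header, so it should yield 0.
tmp=$(mktemp -d)
printf '%s\n' 'CITY' 'ANNAPOLIS' 'BWI AIRPORT' > "$tmp/a.txt"
printf '%s\n' 'no header here' > "$tmp/b.txt"

# Both calls run in the same subshell, so leaked state would show up here.
counts=$(count_rows "$tmp/a.txt"; count_rows "$tmp/b.txt")
echo $counts
rm -rf "$tmp"
```

If good were a global that was never reset, it would still be set from the first file and the second call would count its line.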

Selectivity and format

Getting back to time stamps. How can we format time?

nano script_07a.sh
cutoff=201710010000

function month_str_to_num {
# Convert month_3char to numeric
    case $1 in
        "JAN") echo 1 ;;
        "FEB") echo 2 ;;
        "MAR") echo 3 ;;
        "APR") echo 4 ;;
        "MAY") echo 5 ;;
        "JUN") echo 6 ;;
        "JUL") echo 7 ;;
        "AUG") echo 8 ;;
        "SEP") echo 9 ;;
        "OCT") echo 10 ;;
        "NOV") echo 11 ;;
        "DEC") echo 12 ;;
        *) { echo Bad month format; exit 1; } ;;
    esac
}

function timestr_to_clock {

    local c=$1
    local ap=$2
    local h=""
    local m=""

    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi

  if [[ "$ap" == "PM" ]]; then
      ((h+=12))
  fi

  echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {

    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"

    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}

        local month=$(month_str_to_num $month_3char)
        local clock=$(timestr_to_clock $hour_min $am_pm)

        echo "${year}-${month}-${day}T${clock}"

    else

        { echo Bad timestamp format; exit 1; }

    fi
}

function extract_data
{
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

Simplify. The date command can parse time, somewhat:

nano script_07b.sh
cutoff=201710010000

function timestr_to_clock {

    local c=$1
    local ap=$2
    local h=""
    local m=""

    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi

  if [[ "$ap" == "PM" ]]; then
      ((h+=12))
  fi

  echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {

    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"

    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}

        local clock=$(timestr_to_clock $hour_min $am_pm)

        echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")

    else

        { echo Bad timestamp format; exit 1; }

    fi
}

function extract_data
{
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u
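The date call can be tried on its own. This is a minimal sketch, assuming GNU date (the -d option behaves differently on BSD and macOS); the timestamp is a made-up example in the order that parse_time_stamp passes to date, and TZ=UTC is set only so that the result does not depend on the machine's time zone.

```bash
#!/usr/bin/env bash
# Assumes GNU date. The input order is: weekday, month, day, clock, zone, year.
iso=$(TZ=UTC date -d "TUE OCT 10 06:00:00 EDT 2017" +"%FT%T")
echo "$iso"
```

With GNU date this should print 2017-10-10T10:00:00, since 06:00 EDT is 10:00 UTC. Without TZ=UTC, as in the script, date converts the time to the local zone of the node. The "somewhat" is that date does not accept the compact 600 AM form directly, which is why timestr_to_clock is still needed.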

Stupid bug: 12 PM becomes hour 24. Only add 12 when the hour is not already 12:

nano script_07c.sh

cutoff=201710010000

function timestr_to_clock {

    local c=$1
    local ap=$2
    local h=""
    local m=""

    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi

    if [[ "$ap" == "PM" ]]; then
        if [[ $h != 12 ]]; then
            ((h+=12))
        fi
    fi

  echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {

    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"

    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}

        local clock=$(timestr_to_clock $hour_min $am_pm)

        echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")

    else

        { echo Bad timestamp format; exit 1; }

    fi
}

function extract_data
{
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

What changed?

diff script_07c.sh script_07b.sh
20,24c20,22
<     if [[ "$ap" == "PM" ]]; then
<         if [[ $h != 12 ]]; then
<             ((h+=12))
<         fi
<     fi
---
>   if [[ "$ap" == "PM" ]]; then
>       ((h+=12))
>   fi
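Two related cases are still not covered by this fix: 12 AM should become hour 00, and printf '%02d' reads values with a leading zero, such as 08 or 09, as invalid octal numbers. A possible variant that handles both is sketched below; to_24h is a hypothetical name, not one of the class functions.

```bash
#!/usr/bin/env bash
# to_24h HHMM AM|PM: hypothetical variant of timestr_to_clock.
to_24h() {
    local c=$1 ap=$2 h m
    case ${#c} in
        3) h=${c:0:1}; m=${c:1:2} ;;
        4) h=${c:0:2}; m=${c:2:2} ;;
        *) echo "Bad time format" >&2; return 1 ;;
    esac

    # 10# forces base 10, so "08" and "09" are not rejected as octal.
    h=$((10#$h))
    m=$((10#$m))

    # 12 AM is hour 0; 12 PM stays 12; every other PM hour gains 12.
    if [[ $ap == AM && $h -eq 12 ]]; then
        h=0
    elif [[ $ap == PM && $h -ne 12 ]]; then
        h=$((h + 12))
    fi

    printf '%02d:%02d:00\n' "$h" "$m"
}

to_24h 1200 AM
to_24h 1200 PM
to_24h 608 AM
to_24h 1145 PM
```

The four calls print 00:00:00, 12:00:00, 06:08:00 and 23:45:00. The error branch uses return rather than exit and writes to STDERR, so a bad value does not end up inside the captured timestamp.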

Compartmentalization

Our script is getting out of hand. So, create a separate file to hold the functions, then source it:

nano script_08a.sh
cutoff=201710010000

source function.sh

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done

Kind of messy, add a sort step:

nano script_08b.sh
cutoff=201710010000

source function.sh

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done
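The source command runs another file in the current shell, so the functions it defines become available to the calling script. The sketch below shows the mechanism with a throwaway file; the file contents and the function name greet_city are made up and merely stand in for function.sh.

```bash
#!/usr/bin/env bash
# Write a tiny stand-in library to a temporary file (hypothetical contents).
lib=$(mktemp)
cat > "$lib" <<'EOF'
greet_city() { echo "${prefix} $1"; }
EOF

# Variables set by the caller are visible inside sourced functions when called.
prefix="data for"

# "source" executes the file in this shell rather than in a child process.
source "$lib"
result=$(greet_city gaithersburg)
echo "$result"
rm -f "$lib"
```

This is also why cutoff can stay in the main script: the sourced functions read it when they are called, not when they are defined.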

Parallelization

It's kind of slow to parse each file, one after another. Instead, let's parse them in parallel:

nano script_09a.sh
cutoff=201710010000

source function.sh

file_array=()

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=($file)
    done
done

parallel --max-procs 8 extract_data ::: "${file_array[@]}" | sort -u

... except that this fails. parallel runs each job in a new shell, which knows nothing about our functions or the cutoff variable. We need to export them to the environment:

nano script_09b.sh
export cutoff=201710010000

source function.sh
export -f timestr_to_clock
export -f parse_time_stamp
export -f extract_data

file_array=()

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=($file)
    done
done

parallel --max-procs 8 extract_data ::: "${file_array[@]}" | sort -u
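What export -f changes can be seen without parallel by starting a child bash by hand. The following is a small sketch with a hypothetical function shout and variable prefix; it assumes bash is on the PATH.

```bash
#!/usr/bin/env bash
shout() { echo "${prefix}: $1"; }
prefix="city"

# Before exporting, a child bash knows neither the function nor the variable.
before=$(bash -c 'shout gaithersburg' 2>/dev/null || echo "not found")

# export -f places the function in the environment; export does the same for prefix.
export -f shout
export prefix
after=$(bash -c 'shout gaithersburg')

echo "before: $before"
echo "after:  $after"
```

The first attempt reports "not found" and the second prints "city: gaithersburg". Exported functions are a Bash feature, so this only helps when the jobs are run by bash.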

How much speed up do we get?

time bash script_08b.sh > /dev/null

real	1m31.466s
user	0m26.841s
sys	0m24.627s

time bash script_09b.sh > /dev/null

real	0m11.546s
user	0m23.677s
sys	0m18.794s
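The speed-up is the ratio of the two wall-clock ("real") times. A quick way to work it out in the shell, as a sketch using awk for the floating-point division:

```bash
#!/usr/bin/env bash
# Wall-clock times from the two runs above, converted to seconds.
serial=91.466    # script_08b.sh: 1m31.466s
par=11.546       # script_09b.sh: 0m11.546s

speedup=$(awk -v s="$serial" -v p="$par" 'BEGIN { printf "%.1f", s / p }')
echo "speedup: ${speedup}x"
```

That comes to roughly 7.9 times faster with up to 8 simultaneous processes. The CPU time (user plus sys) is of the same order in both runs, about 51 seconds versus 42 seconds, so the gain comes from overlapping the work rather than from doing less of it.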