This page teaches Bash scripting through a data-driven, task-oriented approach. Regional weather data was downloaded from NOAA over a period of about two weeks. The goal of the commands on this page is to parse that data using simple scripting techniques.
These lessons assume you have a Biowulf account and are working within a temporary, local scratch disk on a node. Any changes made to the downloaded files during the session will be lost unless the files are copied to your own /home or /data directory!
Getting started
Log into Biowulf:
ssh user@biowulf.nih.gov
Make sure you are running Bash:
echo $0
Allocate an interactive session with local scratch space so that I/O operations on the data are as fast as possible. This command assumes you are already logged into Biowulf. By default, you will have access to 2 CPUs and 1.5GB of RAM. The session will last for 8 hours.
sinteractive --gres=lscratch:5
Copy the data into the local scratch space.
cd /lscratch/$SLURM_JOB_ID
mkdir bash_class
cd bash_class
cp -R /data/classes/bash/* .
tar -C data -x -z -f data/weather_data.tgz
If you are not on Helix or Biowulf, you can download the data using either wget or curl.
wget https://hpc.nih.gov/training/handouts/BashScripting.tgz
curl https://hpc.nih.gov/training/handouts/BashScripting.tgz > BashScripting.tgz
Untar the data
tar -x -v -f BashScripting.tgz
tar -C data -x -z -f data/weather_data.tgz
Have a look at the data. There should be about 63MB of data, 71MB including examples and scripts.
ls
ls data/weather-2017-10-09-00-00-01/
ls data/weather-2017-10-09-00-00-01/md/
file data/weather-2017-10-09-00-00-01/md/mdz009.txt
stat data/weather-2017-10-09-00-00-01/md/mdz009.txt
tree data
cat data/weather-2017-10-09-00-00-01/md/mdz009.txt
Get down and dirty with the data
Run some simple and piped commands to display the temperature, humidity, and air pressure of a given city at a given time.
Look at all the data
cat data/weather-2017-10-10-06-00-01/md/mdz009.txt
Pull out only the lines that are for Gaithersburg
grep GAITHERSBURG data/weather-2017-10-10-06-00-01/md/mdz009.txt
Case insensitive search this time
grep gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt
grep --ignore-case gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt
Get all the timepoints for the data, using --recursive (short form -r)
grep --recursive --ignore-case gaithersburg data/*/md
Get rid of the file information
grep --no-filename --recursive --ignore-case gaithersburg data/*/md
Sort the data
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort
Sort based on temperature - fail
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k3
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -nk3
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -rnk3
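One reason a plain -k3 sort can mislead is that sort compares text character by character unless told otherwise. The sketch below shows the difference on three made-up, whitespace-delimited lines (not taken from the NOAA files); the function names are hypothetical helpers.

```shell
# Made-up sample: name, sky condition, temperature
sample() {
  printf '%s\n' 'A CLEAR 9' 'B CLEAR 100' 'C CLEAR 65'
}

# Lexical sort on field 3: "100" sorts before "65", which sorts before "9"
sort_lexical() { sort -k3,3; }

# Numeric sort on field 3: 9, 65, 100
sort_numeric() { sort -k3,3n; }

sample | sort_lexical
sample | sort_numeric
```

Note that -k3,3 limits the key to field 3 only, whereas -k3 uses everything from field 3 to the end of the line.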
Sort using character position, rather than delimiter
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | cut -c1-14,26-27
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.26,1.27
Reverse the sort -- change ascending to descending
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.26,1.27r
Sort ascending by relative humidity -- fail because of leading space -- fix by treating as numeric
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n
Multi-column sort
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n -k 1.26,1.27r
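Character-position keys are handy when the number of words before a value varies from line to line. Here is a minimal sketch on made-up fixed-width records, assuming GNU sort as found on Biowulf; the station names, the column layout, and the helper names are all hypothetical.

```shell
# Build fixed-width records: text padded to 25 characters, temperature in columns 26-27
make_records() {
  printf '%-25s%2d\n' 'STATION_A CLEAR' 65 'STATION_B FOG' 48 'STATION_C SUNNY' 83
}

# Show only the temperature columns
temps_only() { cut -c26-27; }

# Sort on characters 26-27, numerically, in descending order
sort_desc_by_cols() { sort -k 1.26,1.27nr; }

make_records | temps_only
make_records | sort_desc_by_cols
```

The n and r modifiers are attached to the key itself, so they apply only to that key and not to any other -k option on the same command line.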
Isolate the most humid moment in time
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n | tail
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n | tail -n 1
grep --recursive "FOG 65 65 100 CALM" data/*/md/*
grep --recursive "GAITHERSBURG* FOG 65 65 100 CALM" data/*/md/*
grep --recursive "GAITHERSBURG\* FOG 65 65 100 CALM" data/*/md/*
Isolate the least humid moment in time
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n | head -n 1
grep --recursive "GAITHERSBURG\* MOSUNNY 83 64 52 S15 30.02F" data/*/md/*
grep --recursive 'GAITHERSBURG\* MOSUNNY 83 64 52 S15 30.02F' data/*/md/*
Get the average temperature
grep -h -r -i gaithersburg data/*/md | cut -c26-27 | awk '{sum+=$1; num++ } END { print sum,num,sum/num}'
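The awk one-liner above divides by the number of lines read, which fails if grep matches nothing. A small sketch of the same averaging idea with a guard for empty input follows; the function name average is a hypothetical helper, and the input numbers are made up.

```shell
# Read one number per line on stdin and print the mean with one decimal place.
# Prints "no data" instead of dividing by zero when the input is empty.
average() {
  awk '{ sum += $1; n++ }
       END { if (n > 0) printf "%.1f\n", sum / n; else print "no data" }'
}

printf '60\n70\n80\n' | average
printf '' | average
```

With the real data, this could be used as the last stage of the pipeline in place of the inline awk program, for example after cut -c26-27.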
Scripting
Had enough of grep? Had enough of typing again and again? Let's create a script. nano is a simple text-based file editor that is intuitive and easy to use.
nano scripts/script_01a.sh
grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT'
grep -h -r -i gaithersburg data/*
Now run the script.
bash scripts/script_01a.sh
This should print a long list of timestamps, followed by a long list of weather lines. We want to paste the timestamps onto the weather data. Open the file again and edit it, redirecting the output and adding the paste command.
nano scripts/script_01b.sh
grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT' > 1
grep -h -r -i gaithersburg data/* > 2
paste 1 2
Now it looks nicer.
bash scripts/script_01b.sh
Reduce, reuse, recycle
Rather than edit script files, it would be easier to pass the name of a city into the script to get the data.
nano scripts/script_02a.sh
grep -h -r -i -B100 $1 data/* | grep 'M EDT' > 1
grep -h -r -i $1 data/* > 2
paste 1 2
Now pass a city name as an argument to the script.
bash scripts/script_02a.sh gaithersburg
bash scripts/script_02a.sh leesburg
bash scripts/script_02a.sh manassas
bash scripts/script_02a.sh dulles
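If the script is run without an argument, the unquoted $1 disappears and grep treats the first file name as its search pattern. A minimal sketch of an argument check is shown below; check_city_arg and the usage text are hypothetical and not part of the class scripts.

```shell
# Validate that exactly one non-empty city name was passed in.
# Returns 1 and prints a usage message to stderr otherwise.
check_city_arg() {
  if [ "$#" -ne 1 ] || [ -z "$1" ]; then
    echo "usage: script.sh CITY" >&2
    return 1
  fi
  echo "searching for: $1"
}

check_city_arg gaithersburg
check_city_arg || echo "no city given"
```

In a script, the check could be called as check_city_arg "$@" || exit 1 before the grep commands run.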
Walking the data
Use for loops to generate tables of the data.
nano scripts/script_03a.sh
for city in gaithersburg leesburg manassas dulles
do
    grep -h -r -i -B100 $city data/* | grep 'M EDT' > 1
    grep -h -r -i $city data/* > 2
    paste 1 2
done
bash scripts/script_03a.sh | sort -k6,6n -k2,2 -k1,1n
nano scripts/script_03b.sh
for city in {gaithersburg,leesburg,manassas,dulles}
do
    grep -h -r -i -B100 $city data/* | grep 'M EDT' > 1
    grep -h -r -i $city data/* > 2
    paste 1 2
done
bash scripts/script_03b.sh | sort -k6,6n -k2,2 -k1,1n
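The same loop pattern works when the list of cities comes from the command line instead of being typed into the script. The sketch below counts matching lines per city on a few made-up lines standing in for the grep output; sample_lines and count_per_city are hypothetical names.

```shell
# Made-up lines standing in for the weather report rows
sample_lines() {
  printf '%s\n' 'GAITHERSBURG FOG 65' 'LEESBURG CLEAR 60' 'GAITHERSBURG CLEAR 70'
}

# Loop over every argument and print "city count" for each one
count_per_city() {
  for city in "$@"; do
    # grep -c prints 0 and returns non-zero when nothing matches
    n=$(sample_lines | grep -c -i "$city" || true)
    echo "$city $n"
  done
}

count_per_city gaithersburg leesburg dulles
```

Quoting "$@" keeps each argument as one word, so a city name containing a space would still be handled as a single item.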
Parsing the data
while .. read .. line is the way to walk through a single file. The file is redirected into STDIN.
while read line ; do echo $line ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
while read line ; do echo "$line" ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
while read line ; do echo "--- $line ---" ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
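The three loops above differ in quoting, but read itself also changes the line: it trims leading and trailing whitespace and interprets backslashes. For fixed-width data that can matter. Here is a small sketch comparing plain read with IFS= read -r on a made-up line; both function names are hypothetical.

```shell
# Preserve each line exactly: IFS= keeps the surrounding blanks, -r keeps backslashes
show_lines() {
  while IFS= read -r line; do
    printf '[%s]\n' "$line"
  done
}

# Plain read: leading and trailing whitespace is trimmed
show_lines_trimmed() {
  while read line; do
    printf '[%s]\n' "$line"
  done
}

printf '  a  b \n' | show_lines
printf '  a  b \n' | show_lines_trimmed
```

The brackets in the output make the kept or lost blanks visible.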
Create a script to do this.
nano scripts/script_04a.sh
while read line
do
    echo "--- $line ---"
done
nano scripts/script_04b.sh
while read line
do
    echo "---${line}---"
done
Dealing with adversity
Use conditionals to handle unknowns and odd issues.
Only grab the data below "CITY"
nano scripts/script_04c.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done
bash scripts/script_04c.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
Don't print "CITY"
nano scripts/script_04d.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done
bash scripts/script_04d.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
Get rid of remainder
nano scripts/script_04e.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if ( echo "$line" | grep -q "^$" ) ; then
            exit
        else
            echo " ---${line}---- "
        fi
    fi
done
bash scripts/script_04e.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
Yuck, get rid of '$$' as well
nano scripts/script_04f.sh
this won't work (inside double quotes, $$ expands to the process ID of the shell)
if ( echo "$line" | grep -q "$$" ) ; then
    exit
else
this will (single quotes keep $$ literal), but we're still left with blank lines
if ( echo "$line" | grep -q '$$' ) ; then
    exit
else
more cleverness
if ( echo "$line" | grep -q '$$' ) ; then
    exit
elif ( echo "$line" | grep -q "^$" ) ; then
    exit
else
even more cleverness (only print lines that contain at least one letter)
if [[ "$line" =~ [[:alpha:]] ]] ; then
    echo " ---${line}---- "
fi
now walk all the files
for file in data/weather-2017-10-10-00-00-02/md/* ; do bash scripts/script_04f.sh < $file ; done
Wot? Repeats?
for file in data/weather-2017-10-10-00-00-02/md/* ; do bash scripts/script_04f.sh < $file ; done | sort -u
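The scripts above start a grep process for every line they test. As an alternative sketch, the same "rows after CITY, up to a blank line or $$" logic can be written with Bash's built-in pattern matching. The function name extract_rows is hypothetical, the input is made up, and unlike the class scripts it resets the flag instead of exiting.

```shell
# Print only the rows between a line starting with CITY
# and the next blank line or line containing "$$"
extract_rows() {
  local good="" line
  while IFS= read -r line; do
    if [[ $line == CITY* ]]; then
      good=1
      continue
    fi
    if [[ -z $line || $line == *'$$'* ]]; then
      good=""
      continue
    fi
    if [[ -n $good ]]; then
      echo "$line"
    fi
  done
  return 0
}

printf '%s\n' 'header' 'CITY SKY TMP' 'ANNAPOLIS CLEAR 65' 'BWI AIRPORT MOCLDY 70' '' 'footer' '$$' | extract_rows
```

Because [[ ... ]] is evaluated by the shell itself, no extra processes are started per line, which tends to matter once many files are walked.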
Regular expressions
Huh? Rolling around through the data shows some repeats
---ANNAPOLIS CLEAR 65 48 54 VRB7 29.65F----
---ANNAPOLIS CLEAR 80 77 90 CALM 29.98R----
---BWI AIRPORT MOCLDY 70 49 47 S12G20 29.63F----
---BWI AIRPORT PTCLDY 75 75 100 CALM 29.98R----
---MD SCIENCE CTR N/A 60 46 59 MISG 29.64F----
---MD SCIENCE CTR N/A 79 75 87 MISG 29.97R----
Use the Expires tag to filter out very old data:
nano scripts/script_05a.sh
regex="^Expires:([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})"
cutoff=201710010000
while read line
do
    if [[ "$line" =~ $regex ]]; then
        echo year=${BASH_REMATCH[1]}
        echo month=${BASH_REMATCH[2]}
        echo day=${BASH_REMATCH[3]}
        echo hour=${BASH_REMATCH[4]}
        echo minute=${BASH_REMATCH[5]}
        if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}${BASH_REMATCH[5]}" ]]; then
            exit
        fi
    fi
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
    fi
done
for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_05a.sh < $file ; done | sort -u
Simpler:
nano scripts/script_05b.sh
regex="^Expires:([0-9]{12})"
cutoff=201710010000
while read line
do
    if [[ "$line" =~ $regex ]]; then
        if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
            exit
        fi
    fi
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
    fi
done
for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_05b.sh < $file ; done | sort -u
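To try the Expires check on its own, it can be pulled into a small function. This is a sketch under the assumption that the line has the form Expires: followed by 12 digits, as the regex above implies; is_fresh and its return codes are hypothetical.

```shell
# Return 0 if the Expires stamp is at or after the cutoff,
# 1 if it is older, and 2 if the line is not an Expires line.
is_fresh() {
  local line=$1 cutoff=$2
  local regex='^Expires:([0-9]{12})'
  if [[ $line =~ $regex ]]; then
    # BASH_REMATCH[1] holds the text captured by the first parenthesized group
    if (( BASH_REMATCH[1] >= cutoff )); then
      return 0
    fi
    return 1
  fi
  return 2
}

is_fresh 'Expires:201710100600' 201710010000 && echo "fresh"
is_fresh 'Expires:201709010000' 201710010000 || echo "stale"
```

Keeping the regex in a variable and leaving it unquoted on the right of =~ matters: a quoted right-hand side is matched as a literal string rather than as a regular expression.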
Automation
Create a function to automate what we did:
nano scripts/script_06a.sh
cutoff=201710010000

function extract_data {
    regex="^Expires:([0-9]{12})"
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                exit
            fi
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${line}
                else
                    value=${value}$'\n'${line}
                fi
            fi
        fi
    done < $1
    echo "$value"
}

extract_data $1
for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_06a.sh $file ; done | sort -u
That's good, but what if we want all the data for a given time and state?
nano scripts/script_06b.sh
cutoff=201710010000

function extract_data {
    regex="^Expires:([0-9]{12})"
    unset good
    unset value
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${line}
                else
                    value=${value}$'\n'${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u
bash scripts/script_06b.sh data/weather-2017-10-10-00-00-02/az
Selectivity and format
Getting back to time stamps. How can we format time?
nano scripts/script_07a.sh
cutoff=201710010000

function month_str_to_num {
    # Convert month_3char to numeric
    case $1 in
        "JAN") echo 1 ;;
        "FEB") echo 2 ;;
        "MAR") echo 3 ;;
        "APR") echo 4 ;;
        "MAY") echo 5 ;;
        "JUN") echo 6 ;;
        "JUL") echo 7 ;;
        "AUG") echo 8 ;;
        "SEP") echo 9 ;;
        "OCT") echo 10 ;;
        "NOV") echo 11 ;;
        "DEC") echo 12 ;;
        *) { echo Bad month format; exit 1; } ;;
    esac
}

function timestr_to_clock {
    local c=$1
    local ap=$2
    local h=""
    local m=""
    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi
    if [[ "$ap" == "PM" ]]; then
        ((h+=12))
    fi
    echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {
    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}
        local month=$(month_str_to_num $month_3char)
        local clock=$(timestr_to_clock $hour_min $am_pm)
        echo "${year}-${month}-${day}T${clock}"
    else
        { echo Bad timestamp format; exit 1; }
    fi
}

function extract_data {
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value="${timestr} ${line}"
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u
bash scripts/script_07a.sh data/weather-2017-10-06-12-49-01/nc
Simplify. The date command can parse time, somewhat:
nano scripts/script_07b.sh
cutoff=201710010000

function timestr_to_clock {
    local c=$1
    local ap=$2
    local h=""
    local m=""
    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi
    if [[ "$ap" == "PM" ]]; then
        ((h+=12))
    fi
    echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {
    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}
        local clock=$(timestr_to_clock $hour_min $am_pm)
        echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")
    else
        { echo Bad timestamp format; exit 1; }
    fi
}

function extract_data {
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u
bash scripts/script_07b.sh data/weather-2017-10-06-12-49-01/nc
Stupid bug
nano scripts/script_07c.sh
cutoff=201710010000

function timestr_to_clock {
    local c=$1
    local ap=$2
    local h=""
    local m=""
    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi
    if [[ "$ap" == "PM" ]]; then
        if [[ $h != 12 ]]; then
            ((h+=12))
        fi
    fi
    echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {
    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}
        local clock=$(timestr_to_clock $hour_min $am_pm)
        echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")
    else
        { echo Bad timestamp format; exit 1; }
    fi
}

function extract_data {
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u
bash scripts/script_07c.sh data/weather-2017-10-06-12-49-01/nc
What?
diff scripts/script_07c.sh scripts/script_07b.sh
20,24c20,22
<     if [[ "$ap" == "PM" ]]; then
<         if [[ $h != 12 ]]; then
<             ((h+=12))
<         fi
<     fi
---
>     if [[ "$ap" == "PM" ]]; then
>         ((h+=12))
>     fi
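The diff shows the 12 PM fix. Two related edge cases are worth a look: 12 AM should become hour 00, and minutes such as 08 or 09 are rejected as invalid octal numbers by printf '%02d' and by shell arithmetic. The sketch below is a hypothetical variant, not the function from the class scripts, that handles both; the name to_24h is an assumption.

```shell
# Convert "hmm" or "hhmm" plus AM/PM into HH:MM:SS
to_24h() {
  local c=$1 ap=$2 h m
  case ${#c} in
    3) h=${c:0:1}; m=${c:1:2} ;;
    4) h=${c:0:2}; m=${c:2:2} ;;
    *) echo "Bad time format" >&2; return 1 ;;
  esac
  # 10# forces base 10, so "08" and "09" are not parsed as octal
  h=$((10#$h))
  m=$((10#$m))
  if [[ $ap == PM && $h -ne 12 ]]; then
    h=$((h + 12))        # 1 PM .. 11 PM become 13 .. 23
  elif [[ $ap == AM && $h -eq 12 ]]; then
    h=0                  # 12 AM is midnight
  fi
  printf '%02d:%02d:00\n' "$h" "$m"
}

to_24h 1200 PM
to_24h 1200 AM
to_24h 108 PM
```

Using return instead of exit on bad input leaves the decision about stopping to the caller.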
Compartmentalization
Our script is getting out of hand. So, create a separate file to hold the functions, then source it:
nano scripts/script_08a.sh
cutoff=201710010000
source scripts/function.sh

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done
Run it:
bash scripts/script_08a.sh
Kind of messy, add a sort step:
nano scripts/script_08b.sh
cutoff=201710010000
source scripts/function.sh

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done
Run it:
bash scripts/script_08b.sh
Parallelization
It's kind of slow to parse each file, one after another. Instead, let's parse them in parallel:
nano scripts/script_09a.sh
cutoff=201710010000
source scripts/function.sh

file_array=()
for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=($file)
    done
done

parallel --max-procs 8 extract_data ::: ${file_array[*]} | sort -u
Run it:
bash scripts/script_09a.sh
... except that this fails. The shells started by parallel do not know about our functions or the cutoff variable, so we need to export them to the environment:
nano scripts/script_09b.sh
export cutoff=201710010000
source scripts/function.sh
export -f timestr_to_clock
export -f parse_time_stamp
export -f extract_data

file_array=()
for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=($file)
    done
done

parallel --max-procs 8 extract_data ::: ${file_array[*]} | sort -u
Run it:
bash scripts/script_09b.sh
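The effect of export -f can be seen without parallel. In this sketch a plain bash -c stands in for the new shell that parallel starts for each job (an assumption made to keep the example small); the greet function is made up.

```shell
greet() { echo "hello $1"; }

# Before exporting: the child shell has no idea what greet is
bash -c 'greet world' 2>/dev/null || echo "child: greet not found"

# After exporting: the function definition travels through the environment
export -f greet
bash -c 'greet world'
```

The same applies to plain variables such as cutoff, which is why the script uses export cutoff=... as well.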
How much speed up do we get?
time bash scripts/script_08b.sh > /dev/null
real 1m7.579s
user 0m37.302s
sys 0m46.612s
time bash scripts/script_09b.sh > /dev/null
real 0m9.910s
user 0m39.404s
sys 0m49.298s
Not quite 8-fold speed up, but pretty good nonetheless:
echo "scale=2;67.58/9.91" | bc
6.81
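The ratio can also be computed straight from the two real times printed by time, without converting minutes to seconds by hand. This is a small sketch using awk; to_seconds and speedup are hypothetical helper names.

```shell
# Convert a duration such as "1m7.579s" into seconds
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print the ratio of two durations, rounded to two decimal places
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
      'BEGIN { printf "%.2f\n", a / b }'
}

speedup 1m7.579s 0m9.910s
```

The result differs from the bc output in the last digit because bc with scale=2 truncates, while printf rounds.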