#+STARTUP: showall

#+TITLE:     Class 8: More emacs and script files (DRAFT)
#+AUTHOR:    Kurt Schwehr
#+EMAIL:     schwehr@ccom.unh.edu
#+DATE:      <2011-09-22 Thu>
#+DESCRIPTION: Marine Research Data Manipulation and Practices
#+KEYWORDS: emacs, org-mode
#+LANGUAGE:  en
#+OPTIONS:   H:3 num:nil toc:t \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
#+OPTIONS:   TeX:t LaTeX:nil skip:t d:nil todo:t pri:nil tags:not-in-toc
#+INFOJS_OPT: view:nil toc:nil ltoc:t mouse:underline buttons:0 path:http://orgmode.org/org-info.js
#+LINK_HOME: http://vislab-ccom.unh.edu/~schwehr/Classes/2011/esci895-researchtools/

# * todo
# * todo items

# - Why am I teaching these tools? http://slashgeo.org/2011/09/16/FOSS4G-2011-Brian-Timoney-GeoSpatial-One-Stop-Irene-USA
# - create homework
# - http://manpages.ubuntu.com and use the ones for 11.04
# - http://stackoverflow.com/
# - finding out about software in Ubuntu
# - How to create a script
# - Getting images from the healy


* Introduction

* Getting help                                                :stackoverflow:

http://stackoverflow.com/questions/7431167/escaping-org-mode-example-block-inside-of-an-example-block

* Videos on Emacs

Sadly, http://showmedo.com does not have any good videos on the basics
of using emacs.  You might find the [[http://www.youtube.com/user/rpdillon#g/u][Hack Emacs]] videos on YouTube by
rpdillon useful for getting more comfortable with emacs.

* Creating a log file                                               :logging:

For the rest of the semester, you need to keep a log file for this class.
Start by making a directory to hold it with =mkdir ~/Dropbox/logs=.


* Working with the ocean drilling projects site database                :odp:

#+BEGIN_SRC sh
mkdir -p class/8
cd class/8
#+END_SRC

Today, we will use a program very similar to wget called [[http://curl.haxx.se/][curl]]
([[http://manpages.ubuntu.com/manpages/natty/en/man1/curl.1.html][curl.1 man page]]) to fetch data.

#+BEGIN_SRC sh
sudo apt-get install curl

curl -O http://vislab-ccom.unh.edu/~schwehr/Classes/2011/esci895-researchtools/examples/holes.csv.bz2
# /esci895-researchtools/examples/holes.csv.bz2
#  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100 38953  100 38953    0     0   224k      0 --:--:-- --:--:-- --:--:--  358k
#+END_SRC

Uncompress and take a first look at this file:

#+BEGIN_SRC sh
bunzip2 holes.csv.bz2

wc -l holes.csv 
# 3047 holes.csv

head holes.csv 
#+END_SRC

The beginning of the holes.csv file looks like this:

#+BEGIN_EXAMPLE 
Expedition,Site,Hole,Program,Longitude,Latitude,Water Depth (m),Core Recovered (m)
1,1,,DSDP,-92.1833,25.8583,2827,50
1,2,,DSDP,-92.0587,23.0455,3572,13
1,3,,DSDP,-92.0433,23.03,3747,47
1,4,,DSDP,-73.792,24.478,5319,15
1,4,A,DSDP,-73.792,24.478,5319,5.8
1,5,,DSDP,-73.641,24.7265,5354,6.4
1,5,A,DSDP,-73.641,24.7265,5354,1.8
1,6,,DSDP,-67.6477,30.8398,5124,28
1,6,A,DSDP,-67.6477,30.8398,5124
#+END_EXAMPLE

Now we are going to use a program called [[http://manpages.ubuntu.com/manpages/natty/en/man1/cut.1.html][cut]] to try to extract
the "Program" column of the file.  You can see above in the comma
separated value (CSV) formatted data that there is at least a "DSDP",
which is the [[http://en.wikipedia.org/wiki/Deep_Sea_Drilling_Program][Deep Sea Drilling Project]] that ran from 1968 to 1983.  Cut
can work a couple of different ways, but here we are going to ask it to
work in "field mode" and tell it that commas (",") are the delimiter
(or separator) between fields.  We do that with a "-d" and the comma
character.  We then specify the number of the field we want.  Looking
at the first line of the file, you can see that "Program" appears in
the fourth position.

#+BEGIN_SRC sh
cut -d, -f4 holes.csv | head
# Program
# DSDP
# DSDP
# DSDP
# DSDP
# DSDP
# DSDP
# DSDP
# DSDP
# DSDP
#+END_SRC

When you run the above command, you will only see the first 10 lines
on the screen. That is not very helpful. We would like to see how many
unique entry types there are. The [[http://manpages.ubuntu.com/manpages/natty/en/man1/uniq.1.html][uniq]] command collapses runs of
identical adjacent lines in the text that it receives.  That is enough
here because the rows in holes.csv are grouped by program.

#+BEGIN_SRC sh
cut -d, -f4 holes.csv | uniq
# Program
# DSDP
# ODP
# IODP
#+END_SRC
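
Because =uniq= only compares neighboring lines, input that is not
already grouped needs to go through =sort= first.  Here is a minimal
sketch with made-up, ungrouped input.  It also shows the =-c= option,
which counts how often each entry occurs.

#+BEGIN_SRC sh
# uniq alone removes nothing here: no two neighboring lines are identical
printf 'DSDP\nODP\nDSDP\nODP\nIODP\n' | uniq | wc -l
# 5

# Sorting first puts identical lines next to each other
printf 'DSDP\nODP\nDSDP\nODP\nIODP\n' | sort | uniq
# DSDP
# IODP
# ODP

# -c prefixes each line with its count (the amount of padding varies)
printf 'DSDP\nODP\nDSDP\nODP\nIODP\n' | sort | uniq -c
#       2 DSDP
#       1 IODP
#       2 ODP
#+END_SRC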

Next, let's see how many lines there are for each program.  We can
search for each program name with =egrep=, which prints only the lines
that match, and pass that output to the word count program we used
before.  =wc= has an option to only print the number of lines, so we
will add "-l" to the command line.

The data gets passed from one program to another by a *pipe*.
What goes in one side, comes out the other.  A pipe is created by the
vertical bar character: "|".

#+BEGIN_SRC sh
egrep DSDP holes.csv | wc -l  # the letter "l" as in Lima, not the number 1
# 1116

egrep ODP holes.csv | wc -l
# 1930

egrep IODP holes.csv | wc -l
# 153
#+END_SRC

We have a slight problem here in that the counts are not adding up.
The string ODP is found in both the ODP and IODP entries. Here I am
using the "binary calculator" ([[http://manpages.ubuntu.com/manpages/natty/en/man1/bc.1.html][bc.1 man page]]) to do a little math. I
suspect you can just do this by hand, but the example shows another
pipe.

#+BEGIN_SRC sh
# The 3 results from the word counts above
echo  "1116 + 1930 + 153" | bc
# 3199

# That adds up to more than the number of lines in the file
wc -l holes.csv
# 3047 holes.csv
#+END_SRC
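
If =bc= is not installed, bash can do integer arithmetic on its own
with the =$(( ))= syntax.  This is a small alternative sketch using
the same numbers.  The second calculation shows that the overcount is
exactly the IODP count, which hints that the IODP lines were counted
twice.

#+BEGIN_SRC sh
# Integer arithmetic built into the shell
echo $((1116 + 1930 + 153))
# 3199

# 3047 lines minus 1 header line leaves 3046 data rows, so the overcount is
echo $((3199 - 3046))
# 153
#+END_SRC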

We can use the "," that precedes the ODP to help avoid the IODP.

#+BEGIN_SRC sh
egrep 'ODP' holes.csv  | wc -l
# 1930

egrep ',ODP' holes.csv  | wc -l
# 1777
#+END_SRC

There are lots of other ways that we could have solved this, but this
way is pretty simple compared to some of the others.
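
As a sketch of one of those other ways, you can pull out the Program
column first and then count only exact matches.  The three input rows
are made up so that the example stands on its own.  =grep= accepts the
same =-c= (count) and =-x= (match the whole line) options as =egrep=.

#+BEGIN_SRC sh
# Plain matching still counts the IODP row
printf '1,1,,DSDP,-92.1,25.8\n2,2,,ODP,-87.1,28.8\n3,3,,IODP,-127.7,47.7\n' \
  | cut -d, -f4 | grep -c ODP
# 2

# With -x the whole line must be "ODP", so IODP no longer matches
printf '1,1,,DSDP,-92.1,25.8\n2,2,,ODP,-87.1,28.8\n3,3,,IODP,-127.7,47.7\n' \
  | cut -d, -f4 | grep -c -x ODP
# 1
#+END_SRC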

* Writing results to a file and making a quick plot with Gnuplot :gnuplot:redirection:

It is always important to get a graphical view of spatial data.  Later
in this chapter, we will start using Google Earth and in a future
chapter, we will load our data into a Geographical Information System
(GIS).  For now, we will draw the locations with [[http://www.gnuplot.info/][Gnuplot]].  This
graphing program is not as flexible as matplotlib that we will cover
in the programming in Python chapters, but it can definitely get the
job done.

Gnuplot works most easily with files that have space delimited rather
than comma delimited text data values.  We need to pull out the
longitude and latitude values from the holes.csv file.  We can start
back with the cut command that we used before.  This time we will give
it two different fields in the csv to print with "-f5-6".  This means
we are asking for fields 5 through 6.  We could also have said
"-f5,6", which would be fields 5 and 6.

#+BEGIN_SRC sh
cut -d, -f5-6 holes.csv | head
#+END_SRC

#+BEGIN_EXAMPLE
Longitude,Latitude
-92.1833,25.8583
-92.0587,23.0455
-92.0433,23.03
-73.792,24.478
-73.792,24.478
-73.641,24.7265
-73.641,24.7265
-67.6477,30.8398
-67.6477,30.8398
#+END_EXAMPLE
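
The difference between a range and a list only shows up when the
fields are not neighbors.  A quick sketch, using part of the header
line of holes.csv as input (the variable name =header= is arbitrary):

#+BEGIN_SRC sh
header='Expedition,Site,Hole,Program,Longitude,Latitude'

# A range: fields 1 through 3
echo "$header" | cut -d, -f1-3
# Expedition,Site,Hole

# A list: only fields 1 and 3
echo "$header" | cut -d, -f1,3
# Expedition,Hole

# Lists and ranges can be combined
echo "$header" | cut -d, -f1,4-6
# Expedition,Program,Longitude,Latitude
#+END_SRC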

Gnuplot will get confused by the "Longitude,Latitude" strings on the
first line.  We can get rid of this line with the egrep command.
Normally, egrep returns the lines that match, but we can ask it to
return all lines that do not match by giving it the inverse option of
"-v".  We then give it the string "Longitude" to match and it returns
all lines that do not match.

#+BEGIN_SRC sh
egrep -v Longitude holes.csv | cut -d, -f5-6 | head
#+END_SRC

#+BEGIN_EXAMPLE
-92.1833,25.8583
-92.0587,23.0455
-92.0433,23.03
-73.792,24.478
-73.792,24.478
-73.641,24.7265
-73.641,24.7265
-67.6477,30.8398
-67.6477,30.8398
-68.2967,30.134
#+END_EXAMPLE
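
Matching on "Longitude" works because that word only appears in the
header.  An alternative sketch that does not depend on what the header
says is =tail -n +2=, which prints everything from line 2 onward.  The
input here is the first three lines of the =cut= output shown earlier.

#+BEGIN_SRC sh
# "-n +2" means start the output at line 2, which skips a one-line header
printf 'Longitude,Latitude\n-92.1833,25.8583\n-92.0587,23.0455\n' | tail -n +2
# -92.1833,25.8583
# -92.0587,23.0455
#+END_SRC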

The output above is pretty close to being usable, but there is a ","
character between each longitude and latitude.  We can use the
[[http://manpages.ubuntu.com/manpages/natty/en/man1/tr.1.html][tr]] (translate) command to exchange the "," for a " " (space).
Make sure to place the =tr= after the =cut= command or cut
will not be able to tell the comma separated fields apart.

#+BEGIN_SRC sh
egrep -v Longitude holes.csv | cut -d, -f5-6 | tr "," " " | head
#+END_SRC

#+BEGIN_EXAMPLE
-92.1833 25.8583
-92.0587 23.0455
-92.0433 23.03
-73.792 24.478
-73.792 24.478
-73.641 24.7265
-73.641 24.7265
-67.6477 30.8398
-67.6477 30.8398
-68.2967 30.134
#+END_EXAMPLE

This is the format that we need for Gnuplot, but we need the longitude
and latitude lines saved to a file.  The ">" (greater-than character)
"redirects" the output from the last program in the chain of pipes to
a file that is named after the ">".  Be warned that ">" will overwrite
a previous file with the same name if one existed.  First, try a
simpler example to see ">" in action.  Here, I also use the *cat*
(concatenate and print files) command to dump the contents of the
"listing" file to the terminal.  *cat* is much simpler than
*less*, but if a file is very long or you are not sure how long
the file is, you are better off using *less*.

Note: ">>" appends to a file if it already exists or creates a new file
when needed, whereas ">" will clobber a file if one already exists.

#+BEGIN_SRC sh
ls -la > listing

# Your output may be different depending on the files you have in your
# current directory
cat listing
#+END_SRC

#+BEGIN_EXAMPLE
total 124
-rw-r--r-- 1 researchtools researchtools 125861 2011-09-22 04:46 holes.csv
#+END_EXAMPLE
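
To see the difference between ">" and ">>" for yourself, here is a
small sketch that writes into a scratch directory made by =mktemp=, so
it cannot clobber any of your class files.  The names =scratch= and
=notes.txt= are arbitrary.

#+BEGIN_SRC sh
# Make a throwaway directory and remember its name
scratch=$(mktemp -d)

echo "first"  >  "$scratch/notes.txt"   # creates notes.txt
echo "second" >> "$scratch/notes.txt"   # appends a second line
cat "$scratch/notes.txt"
# first
# second

echo "third"  >  "$scratch/notes.txt"   # overwrites: the old lines are gone
cat "$scratch/notes.txt"
# third
#+END_SRC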

Now that you know how to redirect the output to a file, send the
results of the chain of pipes consisting of =egrep=, =cut=,
and =tr= to the file "xy.dat".

#+BEGIN_SRC sh
egrep -v Longitude holes.csv | cut -d, -f5-6 | tr "," " " > xy.dat

head xy.dat
#+END_SRC

#+BEGIN_EXAMPLE
-92.1833 25.8583
-92.0587 23.0455
-92.0433 23.03
-73.792 24.478
-73.792 24.478
-73.641 24.7265
-73.641 24.7265
-67.6477 30.8398
-67.6477 30.8398
-68.2967 30.134
#+END_EXAMPLE
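
The same result can be produced in one step with =awk=, which we have
not covered, so treat this as an optional sketch rather than the
method to use.  =-F,= sets the input separator to a comma, =NR > 1=
skips the header line, and a comma inside =print= puts a space between
the two fields.  The input is a shortened three-line stand-in for
holes.csv.

#+BEGIN_SRC sh
printf 'Expedition,Site,Hole,Program,Longitude,Latitude\n1,1,,DSDP,-92.1833,25.8583\n1,2,,DSDP,-92.0587,23.0455\n' \
  | awk -F, 'NR > 1 {print $5, $6}'
# -92.1833 25.8583
# -92.0587 23.0455
#+END_SRC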

It is time to give gnuplot a quick try.  This does not give you much
of a sense of what =gnuplot= can do, but we can at least look at the
locations of the cores.  

Note for Cygwin users:  You must be running a shell through X11 to be
able to plot with Gnuplot.  If you are on Linux or Mac, this should
just work with a graph popping up on your screen.

#+BEGIN_SRC sh
gnuplot
plot 'xy.dat'
# There should be a plot of the data on your screen.
quit
#+END_SRC

That looks really wrong!  Check it out with the =GMT minmax= command
from the homework:

#+BEGIN_SRC sh
GMT minmax xy.dat
#+END_SRC

The maximum latitude reported by minmax is far out of range:

#+BEGIN_EXAMPLE 
xy.dat: N = 3046	<-179.5558/179.738>	<-77.4413/5736.4>
#+END_EXAMPLE

A latitude higher than 90 North is definitely wrong.  Let's constrain
the plot to the globe and see what we get.

#+BEGIN_SRC sh
gnuplot
set yrange [-90:90]
plot 'xy.dat'
quit
#+END_SRC
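
Restricting the axis hides the bad points but does not tell you which
rows are wrong.  One way to list them, sketched here with =awk=, is to
print every line whose second column falls outside -90 to 90.  The
input is a made-up stand-in for xy.dat.  Only the 5736.4 comes from
the minmax output, and the longitude paired with it is invented.

#+BEGIN_SRC sh
# With no action given, awk prints each line where the test is true
printf '%s\n' '-92.1833 25.8583' '-73.792 24.478' '142.2 5736.4' \
  | awk '$2 < -90 || $2 > 90'
# 142.2 5736.4
#+END_SRC

On the real data, the same test would be run as =awk '$2 < -90 || $2 > 90' xy.dat=.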

To get this database to work, we will clearly need to do some fixing
of problems.  Lesson:

#+BEGIN_VERSE 
Real data has real warts.
#+END_VERSE

This will be the last time that we use =gnuplot=.  We will do the rest
of our plotting using matplotlib in python!

You can see examples of the wide range of plots that can be made with
Gnuplot here:

http://www.gnuplot.info/screenshots/

* Creating a Google Earth KML                               :googleearth:kml:

Now we are going to create our first KML file.  We are going to cheat
a bit and not try to understand the file format, but this will at
least show you how easy it can be.

First, get the header and footer text for the KML line format:

#+BEGIN_SRC sh
curl -O http://vislab-ccom.unh.edu/~schwehr/Classes/2011/esci895-researchtools/google-earth-line-start.kml
curl -O http://vislab-ccom.unh.edu/~schwehr/Classes/2011/esci895-researchtools/google-earth-line-end.kml
#+END_SRC

These two pieces give you the front and back of the KML and all we
need to do is provide the coordinates for the line.

Get the Boston construction coordinates file used during the
homework:

#+BEGIN_SRC sh
curl -O http://vislab-ccom.unh.edu/~schwehr/Classes/2011/esci895-researchtools/examples/2007-boston-construction.csv.bz2

bunzip2 2007-boston-construction.csv.bz2
#+END_SRC

Take a look at the file:

#+BEGIN_SRC sh
head 2007-boston-construction.csv 
#+END_SRC

#+BEGIN_EXAMPLE 
-70.5014566667,42.1006833333,1179617934
-70.5016466667,42.101755,1179617991
-70.501845,42.1028766667,1179618051
-70.5020833333,42.1039,1179618111
-70.5022083333,42.1049116667,1179618176
-70.5022883333,42.1059316667,1179618233
-70.502515,42.1069266667,1179618296
-70.5027566667,42.10796,1179618356
-70.5028616667,42.1090066667,1179618416
-70.5029816667,42.1102133333,1179618486
#+END_EXAMPLE

We can reuse the cut command to get just the X and Y coordinates:

#+BEGIN_SRC sh
cut -d, -f1,2 2007-boston-construction.csv | head
# -70.5014566667,42.1006833333
# -70.5016466667,42.101755
# -70.501845,42.1028766667
# -70.5020833333,42.1039
# -70.5022083333,42.1049116667
# -70.5022883333,42.1059316667
# -70.502515,42.1069266667
# -70.5027566667,42.10796
# -70.5028616667,42.1090066667
# -70.5029816667,42.1102133333
#+END_SRC

We are lucky!  KML expects coordinates to come as x,y,z or x,y
(longitude,latitude, with altitude as the optional z).

#+BEGIN_SRC xml
  <Placemark>
    <LineString>
      <coordinates>
        -125.810021667,48.4840316667
        -125.810295,48.483705
      </coordinates>
    </LineString>
  </Placemark>
#+END_SRC

Let's create the x,y pairs in a file:

#+BEGIN_SRC sh
cut -d, -f1,2 2007-boston-construction.csv > 2007-boston-construction.xy
#+END_SRC

We can now put the header, points and tail together to create a KML
file.  Google Earth has trouble with lines with too many points in
them, so we will use head to only output some of the points.

#+BEGIN_SRC sh
cat          google-earth-line-start.kml >  2007-boston-construction.kml
head -n 1000 2007-boston-construction.xy >> 2007-boston-construction.kml
cat          google-earth-line-end.kml   >> 2007-boston-construction.kml
#+END_SRC
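
If you want to try the assembly step without downloading anything,
here is a self-contained sketch.  The =start.kml= and =end.kml=
contents are a guess at a minimal KML LineString wrapper, not the
contents of the class files, and all of the file names are made up.
The three points come from the top of the Boston file.  It runs in a
scratch directory.

#+BEGIN_SRC sh
cd "$(mktemp -d)"

# Hypothetical stand-in for google-earth-line-start.kml
cat > start.kml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark>
    <LineString>
      <coordinates>
EOF

# Hypothetical stand-in for google-earth-line-end.kml
cat > end.kml <<'EOF'
      </coordinates>
    </LineString>
  </Placemark>
</Document>
</kml>
EOF

printf '%s\n' '-70.5014566667,42.1006833333' '-70.5016466667,42.101755' \
  '-70.501845,42.1028766667' > points.xy

# Header, then only the first 2 points, then footer
cat       start.kml >  line.kml
head -n 2 points.xy >> line.kml
cat       end.kml   >> line.kml

wc -l line.kml
#+END_SRC

The final =wc -l= should report 13 lines: 6 from the header, 2
points, and 5 from the footer.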

* Creating a script

There are some key tricks to understanding variables in bash.  First,
you must have no spaces before or after the equal sign.  Bash is very
picky about this.  The other part is where your variable is available.
Without the *export*, the variable is not available to other programs
that are called from the command line.  For us, right now, the export
is not important, but later on for things like the PATH variable that
controls where to look for programs, *export* is essential.

To demonstrate variables, we will use the *echo* command which
will just print out to the screen whatever we pass to it.  Give it a
try.  The "$" character starts the use of a variable.

#+BEGIN_SRC sh
# Set a variable
testing=123

# Print the variable
echo $testing
# 123

# Start a new bash shell inside the original one
bash

# See that "testing" is not set.  If there is no variable, bash gives
# an empty string
echo $testing

# quit back to the main bash shell
exit

# Set testing to have a value that will be inherited
export testing="hello world"

bash

# Now see that the exported variable went through
echo $testing
# hello world
#+END_SRC
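
The same experiment can be run without typing into a nested shell by
asking =bash -c= to run one command in a child process.  The variable
names =local_only= and =shared= are arbitrary.  The single quotes
matter: they stop the outer shell from filling in the variable before
the child bash starts.

#+BEGIN_SRC sh
local_only=123
export shared="hello world"

# The child bash does not inherit the variable that was not exported
bash -c 'echo "local_only is [$local_only]"'
# local_only is []

# The exported variable is inherited
bash -c 'echo "shared is [$shared]"'
# shared is [hello world]
#+END_SRC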

How can we use a variable to help out?  What if we want to download
one image every hour from one day on the USCGC Healy?  Here is the
2010 set of images for the Healy:

http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/

Open emacs, create a file called "healy.sh", and start typing:

# for hour in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23

#+BEGIN_SRC sh
for hour in 01 02 03 04 05 06 07 
do
  echo $hour
done
#+END_SRC

Try running that from the terminal.

#+BEGIN_SRC sh
source healy.sh
#+END_SRC

You should see:

#+BEGIN_EXAMPLE 
01
02
03
04
05
06
07
#+END_EXAMPLE

Now we can try to construct a curl command in the echo.

#+BEGIN_SRC sh 
for hour in 01 02 03 04 05 06 07 
do
  echo curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-${hour}01.jpeg
done
#+END_SRC

Try it and you should see:

#+BEGIN_EXAMPLE 
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0101.jpeg
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0201.jpeg
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0301.jpeg
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0401.jpeg
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0501.jpeg
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0601.jpeg
curl -O http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010/20101009-0701.jpeg
#+END_EXAMPLE
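
Typing every hour by hand gets tedious once you want the whole day.
One option, assuming the =seq= program is available (it is part of
GNU coreutils on Ubuntu), is to let =seq -w= generate the zero-padded
numbers.  The variable name =base= is arbitrary.  This sketch still
only echoes the commands, so nothing is downloaded.

#+BEGIN_SRC sh
base=http://mgds.ldeo.columbia.edu/healy/reports/aloftcon/2010

# $(...) runs seq and puts its output (01 02 ... 23) into the for loop
for hour in $(seq -w 1 23)
do
  echo curl -O ${base}/20101009-${hour}01.jpeg
done
#+END_SRC

The first line printed should be the same as in the hand-typed version
above, and there should be 23 lines in total.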