Class 5: File types and Emacs (DRAFT)


Rough plan - you will have to wait for a newer version of these lecture notes to have an introduction.

New IRC server

I am switching the class to use an IRC server at CCOM using http://ngircd.barton.de/. This will give us total control of the channels that we want to have. There are many different ways to connect to the IRC server.

Text based - irssi

You can do this either from inside your Ubuntu virtual machine inside the CCOM network or by using ssh/putty to connect to researchtools.ccom.nh. If you are using the class virtual machine, first install the irssi IRC client:

sudo apt-get install irssi

Now start irssi and connect:

/connect researchtools.ccom.nh
/join #unhresearchtools

You will need to read the irssi documentation about how to use the client, but the most important command you probably need to start with is how to quit:

/quit

Browser based - Chatzilla

Graphical Client - Pidgin

From the virtual machine, run Pidgin. From the menu, it is Applications -> Internet -> Pidgin Internet Messenger. Then from the Buddy List, Accounts -> Manage Accounts. Then press the "Add…" button. Under the Add Account "Basic" tab, select the protocol as: IRC. Pick a username (should match your CCOM username), and select the server as researchtools.ccom.nh. There is no password, so leave the password blank.

Make sure that the account is "Enabled" (put a check mark in the account).

Then under buddies, select "Join a Chat". You can put in "#unhresearchtools" or you can push the "Room List" button.

The Cloud - Tools on the internet - Bookmarking

The idea of cloud bookmarking is that you save a link in your web browser and you can do more than just use it locally on that computer in the particular web browser that you made the bookmark from. First, if you switch browsers or computers, your bookmarks should follow you. Second, you should be able to share bookmarks with others and also be able to keep a bookmark private if you do not want others to see it (there are many reasons to keep bookmarks private).

There are many services on the internet for storing your bookmarks to web pages in the cloud. The old standby is http://delicious.com. This service was owned by Yahoo for a while, but is now under new management. It is fairly simple and has an easy to use interface for Firefox, Chrome, and IE.


I have been using delicious for a while now. My main page on delicious is:


I have been trying to be good about consistently tagging bookmarks to bring order to the chaos that is the internet. For example, I have a tag for this Research Tools course: researchtools (http://www.delicious.com/goatbar/researchtools). Anything that I think might be relevant for the class gets that tag. Delicious tries to offer you suggested tags when you mark a page. You can adopt those and/or create your own strategy. I don't know who invented tag clouds, but they can be helpful to understand focus. Take a look at my tag cloud:


WARNING: If you create an account for yourself at any of these bookmarking services, make sure that you create a unique password for the site. Also remember that, even if the service says that a link you marked as private is private, it might not always be secure. Some links should never be put into a linking service.

Both Chrome and Firefox have built in sync services. I do not have much experience with these, but they are likely very good.

There are many many other bookmark services. Some are free, some for pay.


For whichever bookmark service you choose to use, make sure that the service allows you to back up your bookmarks (a.k.a. export), and then actually back them up! For example, here is the export / backup feature for delicious:


SEE ALSO: Comparison of browser synchronizers (Wikipedia)

Loading the sample data

Today, we are going to start exploring data types in Linux. I have put together a collection of various files that we will use to learn how to look at files. We will learn more about many of these file types over the semester. For now, we will only scratch the surface of these files.

Open a terminal in your Linux virtual machine. I have created a TinyURL so that you do not have to type the whole URL to the file, which is:


We will use the command "wget" to pull the file down in the terminal. This is similar to doing a right-click and "Save Link As" in a web browser.

wget http://tinyurl.com/examples-20110913

100%[==========================================================>] 100,421,141 8.72M/s   in 11s

Take a look at what we have downloaded. First use the list files command, "ls", with a "-l" for a long listing.

ls -l

-rw-r--r-- 1 schwehr schwehr 100421141 2011-09-13 08:52 examples-20110913

If you look at the whole original URL, you will see that we wanted the file to be called "examples-20110913.tar.bz2". We can rename the file using the mv command to move the file to the correct name. Remember that in the shell, you can use the TAB key to complete filenames. If you type "ex" and press the TAB key, it will complete as far as it can, in this case to "examples". Press TAB again and it will show you all the options (there is also examples.desktop). Finish by typing "-" and pressing TAB again to get "examples-20110913".

mv examples-20110913 examples-20110913.tar.bz2

The ".tar.bz2" at the end of the name is a hint at the type of file. First, start from the right with the "bz2". This implies (but does not guarantee) that the file is compressed with the bzip2 program. We don't have to worry about this yet, as the next hint will cover us for now. The ".tar" implies that this is a "tape archive". This is much like a zip file that you may already be familiar with. The idea is that one file acts as a container in which you can stuff a whole bunch of files. You can then move around that single file, email it, etc., much more easily than you would a whole tree of files.

If you want to learn more about tar, check out the web page for GNU tar and the wikipedia entry on the TAR file format. Or you can use the command line:

tar --help
man tar  # remember that "q" quits out of a man page

The tar program knows how to handle certain types of compression, and that includes the bzip2 format. We can ask tar to first list the contents of what is inside of the tar. It is safer to look at what is in the tar before unpacking it. If it starts taking a while to list the files, you can break out by pressing Control-C. You will see a "^C" appear in the terminal when you do.

tar tfvv examples-20110913.tar.bz2
drwxr-xr-x schwehr/schwehr   0 2011-09-10 11:47 examples-20110913/
-rw-r--r-- schwehr/schwehr 48128 2011-09-12 17:34 examples-20110913/Presentation1.ppt
-rwxr-xr-x schwehr/schwehr    32 2011-09-12 17:01 examples-20110913/shell-script.sh
-rw-r--r-- schwehr/schwehr 3715206 2011-09-13 08:11 examples-20110913/0479_20080620_175447_RVCS.all.bz2
-rw-r--r-- schwehr/schwehr  143781 2011-09-12 17:42 examples-20110913/mov02175.mp4
-rwxr-xr-x schwehr/schwehr      46 2011-09-12 17:05 examples-20110913/perldemo.pl

The tar looks good, so go ahead and extract it.

tar xf examples-20110913.tar.bz2
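If you want to practice the list-then-extract habit without touching the class download, here is a small sketch that builds its own archive in a scratch directory. The directory and file names (demo, a.txt, b.txt) are made up for illustration, and the fallback to gzip is only there in case bzip2 is not installed on your machine.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Build a tiny directory tree to archive (names are made up for this demo).
mkdir demo
echo "first file" > demo/a.txt
echo "second file" > demo/b.txt

# Use bzip2 (j) when it is installed, otherwise fall back to gzip (z).
if command -v bzip2 > /dev/null; then comp=j; ext=bz2; else comp=z; ext=gz; fi

# c = create an archive, f = the archive file name comes next
tar c${comp}f demo.tar.$ext demo

# t = list the contents without unpacking anything
listing=$(tar tf demo.tar.$ext)
echo "$listing"

# Remove the original tree, then x = extract it again from the archive
rm -r demo
tar xf demo.tar.$ext
cat demo/a.txt
```

Listing first costs almost nothing on a small archive and tells you whether the files will land in their own directory or spill into the current one.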

It is time to start examining the example files. A first command to see what is in there is tree. I will not show the results of tree, as they do not reproduce in the text mode of these notes.


That gives you a look at the structure of the directories and gives some hint to file type with the colors. Blue text is directories, red files are compressed, yellow-green are files marked "executable".

Go into the directory and do a long listing, but also add the -h for "human readable file sizes".

cd examples-20110913

ls -l -h 
total 110M
-rw-r--r-- 1 schwehr schwehr 3.6M 2011-09-13 08:11 0479_20080620_175447_RVCS.all.bz2
-rw-r--r-- 1 schwehr schwehr 4.0M 2011-09-06 15:03 13003_1.KAP
-rw-r--r-- 1 schwehr schwehr 7.3K 2011-09-06 15:03 13003.BSB
-rw-r--r-- 1 schwehr schwehr  77K 2011-09-12 14:09 20110912-1801.jpeg
drwxr-xr-x 2 schwehr schwehr 4.0K 2011-09-13 08:18 a-folder
-rw-r--r-- 1 schwehr schwehr 737K 2011-09-12 17:47 bags.sqlite
-rw-r--r-- 1 schwehr schwehr 2.6K 2011-09-12 14:01 delicious.htm
-rw-r--r-- 1 schwehr schwehr  117 2011-09-13 08:34 dos-text.txt
-rw-r--r-- 1 schwehr schwehr    0 2011-09-13 08:18 empty-file
-rw-r--r-- 1 schwehr schwehr 5.1M 2011-09-10 14:25 Field_Procedures_Manual_May_2011.pdf

You will also see colors again for the file types.

The default way to open a file

There is a command line program that can attempt to open files based on its best guess for how a file should be opened: xdg-open (or just open on the Mac). This works well for some image types.

xdg-open Field_Procedures_Manual_May_2011.pdf

You might be accustomed to using the "double click" in graphical interfaces, but knowing how to open files in the default application is very helpful for working from a scripting environment.
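As a rough sketch of what "working from a scripting environment" could look like, the function below picks between the two commands mentioned above. The function name opener_for_this_system is hypothetical, and the check assumes that uname -s reports "Darwin" on a Mac.

```shell
# Pick the command that opens a file in its default application.
# Assumption: "open" on a Mac (Darwin), "xdg-open" everywhere else.
opener_for_this_system() {
  case "$(uname -s)" in
    Darwin) echo open ;;
    *)      echo xdg-open ;;
  esac
}

opener=$(opener_for_this_system)
echo "This system uses: $opener"

# In a real script you would then run, for example:
#   "$opener" Field_Procedures_Manual_May_2011.pdf
```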

The NOAA Field Procedures Manual

A quick aside: Stop and take a quick look at the NOAA Field Procedures Manual. There are a number of very helpful documents available for material related to this course. NOAA has put together this document to talk about how they do Hydrographic Surveying. This might be different for you if you do hydrographic surveying for some other organization or surveying for goals other than hydrography. However, this and many documents like it provide excellent reference and background material.

If you are at CCOM, you can see some of the documents we have links to in the wiki:


We will talk more about helpful references and how to manage these documents throughout the rest of the semester.

Using file

Now we can try asking the computer more about these files. The file command tries to look at a little bit of the beginning of each file to see if it can figure out what type of data is in that file.

WARNING: Always be aware that file names are just a hint to a file type. Renaming a file to some random characters does not change the contents of the file. Some programs count on the "extensions" on the end of the file name (e.g. ".tar"), but you will find that those are not always consistent with the content of the file.

file *
0479_20080620_175447_RVCS.all.bz2:    bzip2 compressed data, block size = 900k
13003_1.KAP:                          data
13003.BSB:                            ASCII English text, with CRLF line terminators
20110912-1801.jpeg:                   JPEG image data, JFIF standard 1.01
a-folder:                             directory
bags.sqlite:                          SQLite 3.x database
delicious.htm:                        exported SGML document text
dos-text.txt:                         ASCII text, with CRLF line terminators
empty-file:                           empty
Field_Procedures_Manual_May_2011.pdf: PDF document, version 1.4
foo.csv:                              ASCII text
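You can imitate a tiny piece of what file does by looking at the first bytes yourself. This sketch assumes nothing about the class data: it makes its own gzip file with a misleading name (notes.txt is made up) and dumps the first two bytes in hexadecimal with od. gzip data always begins with the bytes 1f 8b.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Compress some text with gzip, but give the result a misleading ".txt" name.
echo "this is really gzip data" | gzip > notes.txt

# Dump the first two bytes in hexadecimal and strip the spaces od adds.
magic=$(head -c 2 notes.txt | od -An -tx1 | tr -d ' \n')
echo "first two bytes: $magic"

# Renaming changes only the name; the bytes inside stay the same.
mv notes.txt random-characters
magic_after=$(head -c 2 random-characters | od -An -tx1 | tr -d ' \n')
echo "after renaming:  $magic_after"
```

The real file command knows a very large table of such signatures, which is why it can name so many formats from only the start of a file.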

There are a large number of files in this directory, but just look at the first file that comes up:

0479_20080620_175447_RVCS.all.bz2:    bzip2 compressed data, block size = 900k

The file command does not get past the fact that the file is compressed. We need to uncompress all of the files that have been shrunk with bzip2 or another program called gzip. gzip files tend to end in ".gz".

ls -l *.bz2 *.gz
-rw-r--r-- 1 schwehr schwehr  3715206 2011-09-13 08:11 0479_20080620_175447_RVCS.all.bz2
-rw-r--r-- 1 schwehr schwehr 42990137 2011-09-13 08:08 reson7111-201005.s7k.bz2
-rw-r--r-- 1 schwehr schwehr     8868 2010-10-16 12:16 terrain.grd.gz
-rwxr-xr-x 1 schwehr schwehr 15541594 2011-09-13 08:11 y1104-02.segy.bz2

bunzip2 *.bz2
gunzip *.gz

ls -l *.bz2 *.gz
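The second ls above finds nothing to list because uncompressing replaces each compressed file with its uncompressed version. Here is a minimal sketch of that behavior using gzip on a made-up file (terrain.txt), so you can watch the names change without touching the class data. bzip2 and bunzip2 behave the same way with ".bz2".

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Make a small text file and compress it.
# gzip replaces terrain.txt with terrain.txt.gz.
echo "some depth values" > terrain.txt
gzip terrain.txt
ls

# gunzip reverses this: terrain.txt.gz goes away and terrain.txt comes back.
gunzip terrain.txt.gz
ls

# Check by name which of the two files exists now.
if [ -e terrain.txt.gz ]; then
  state="still compressed"
else
  state="uncompressed"
fi
echo "$state"
```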

Now when we run file on 0479_20080620_175447_RVCS.all, we get:

file 0479_20080620_175447_RVCS.all 
0479_20080620_175447_RVCS.all: data

So, it turns out that file does not know anything about our ".all" file. It's binary data, but we can try some other means to see if we can identify the file… we can look into the file. If we try the pager command less, it will ask us if we are sure we want to look at the binary data. Yes, we would like to take a look.

less 0479_20080620_175447_RVCS.all 
"0479_20080620_175447_RVCS.all" may be a binary file.  See it anyway?

Answer yes, and you will see lots of weird stuff with some characters in there. Type a "q" to quit out of less and we will try another helpful command to try to hide the "noise" of the binary data and see if there is any useful text in the file. The command strings will go through a file and return only readable characters from a file. We can see how many strings there are first by "piping" the output from strings to a command called word count (wc).

strings 0479_20080620_175447_RVCS.all | wc
  21912   26248  143940

The vertical bar is the "pipe" operator: it sends the output of the command on its left to the command on its right. wc tells us the number of lines on the left, the number of words in the middle, and the number of bytes (the same as the number of characters for plain ASCII text) on the right. 21 thousand lines is too many to look at, so we should just look at the first few strings returned using the head command (there is also an equivalent tail command for the other end of the stream).

strings 0479_20080620_175447_RVCS.all | head
WLZ=0.57,SMH=481, ... DSV=3.0.7 040104,SID=Hydro2008_Day2,COM=Day1 of Summer Hydro 2008. 
" xj

I have replaced a whole lot of text with "…" above, but this first line returned tells us quite a bit about this file. From this, I know that the file has something to do with the CCOM Summer Hydro class in 2008. We don't have anything that tells us much more, but now we have something we can use for a web search or to ask around about the Summer Hydro class.
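The pipe, wc, head, and tail pattern used above works on any stream of text, not just the output of strings. Here is a small sketch you can run anywhere. It uses seq to produce stand-in output (the numbers 1 through 20, one per line), which is my choice for the demo and not part of the class data.

```shell
# wc -l counts only the lines coming through the pipe.
count=$(seq 1 20 | wc -l)
echo "lines: $count"

# head keeps the start of the stream, tail keeps the end.
first=$(seq 1 20 | head -n 3)
last=$(seq 1 20 | tail -n 2)
echo "$first"
echo "$last"

# Pipes can be chained: take the first 10 lines, then the last 2 of those.
seq 1 20 | head -n 10 | tail -n 2
```

The final command prints 9 and 10, which is a handy trick for pulling a few lines out of the middle of a long output.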

Using grep to search for strings

We can use a program called "grep" to search for text patterns in the results of the file command. For pictures, file returns a line that contains the word "image". For example, I found this JPEG image:

20110912-1801.jpeg:                   JPEG image data, JFIF standard 1.01

We can pass the output through a pipe to the grep command and ask it to search for the word "image".

file * | grep image
20110912-1801.jpeg:                   JPEG image data, JFIF standard 1.01
H11296_5m-hillshade.tif:              TIFF image data, little-endian
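grep has a few options that are worth knowing from the start. This sketch runs them on two lines shaped like the output of file, stored in a variable so that it works even without the example files. The options -v, -c, and -i are standard grep options.

```shell
# Two lines shaped like the output of the file command.
listing='20110912-1801.jpeg: JPEG image data, JFIF standard 1.01
dos-text.txt: ASCII text, with CRLF line terminators'

# Keep only the lines that contain "image".
images=$(echo "$listing" | grep image)
echo "$images"

# -v inverts the match: keep the lines that do NOT contain "image".
others=$(echo "$listing" | grep -v image)
echo "$others"

# -c counts matching lines instead of printing them; -i ignores case.
n=$(echo "$listing" | grep -c -i ascii)
echo "ASCII files: $n"
```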

ImageMagick to examine images

We have a JPEG image and a TIFF image. The output of file doesn't tell us much about those images, but we can find out more with other tools. The first tool that we will use is ImageMagick. It comes with a command called identify.

identify  *.jpeg *.tif 
20110912-1801.jpeg JPEG 1280x960 1280x960+0+0 8-bit DirectClass 78.4KB 0.000u 0:00.000
H11296_5m-hillshade.tif[1] TIFF 3205x1278 3205x1278+0+0 8-bit Grayscale DirectClass 4.107MB 0.000u 0:00.000
identify: H11296_5m-hillshade.tif: unknown field with tag 33550 (0x830e) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/704.
identify: H11296_5m-hillshade.tif: unknown field with tag 33922 (0x8482) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/704.
identify: H11296_5m-hillshade.tif: unknown field with tag 34735 (0x87af) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/704.

Ignore the warnings with "unknown field" in them. This is a geotiff with information that ImageMagick does not know about. identify has told us the size of the images and a little about the type of content. Better yet, take a look at the images.

display  *.jpeg *.tif 

The jpeg turns out to be an image from the camera above the bridge of the USCG Ice Breaker Healy. The tiff looks weird, but is actually a gray scale image of some Lidar data from the area near the UNH campus.

Using gdal to ask more about the images

The Geospatial Data Abstraction Library (GDAL) has tools for identifying both raster (e.g. images) and vector (e.g. line) data that has spatial data attached to it. First try gdalinfo on the JPEG image:

gdalinfo 20110912-1801.jpeg 
Files: 20110912-1801.jpeg
Size is 1280, 960
Coordinate System is `'
  EXIF_GPSVersionID=0x2 0x2 00 00
  EXIF_GPSLatitude=(81) (14) (24.5814)
  EXIF_GPSLongitude=(126) (47) (30.5148)
Image Structure Metadata:
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  960.0)
Upper Right ( 1280.0,    0.0)
Lower Right ( 1280.0,  960.0)
Center      (  640.0,  480.0)
Band 1 Block=1280x1 Type=Byte, ColorInterp=Red
  Image Structure Metadata:
Band 2 Block=1280x1 Type=Byte, ColorInterp=Green
  Image Structure Metadata:
Band 3 Block=1280x1 Type=Byte, ColorInterp=Blue
  Image Structure Metadata:

JPEG images can carry special data called "EXIF" tags that record more than just the image. In this case, the camera has saved the GPS location of the ship from when the picture was taken. The ship was at roughly 81 North and 126 West when the picture was taken. That is way up in the Arctic!
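The EXIF position above is written as degrees, minutes, and seconds. Most mapping tools want decimal degrees instead, and the conversion is degrees + minutes/60 + seconds/3600. Here is a minimal sketch using awk for the arithmetic. The function name dms_to_decimal is made up, and the negative sign for the longitude follows the "West" in the text above, since the tags shown do not include the hemisphere.

```shell
# Convert degrees, minutes, seconds to decimal degrees.
# Usage: dms_to_decimal DEG MIN SEC [SIGN]
# SIGN is optional; pass -1 for south or west (defaults to 1).
dms_to_decimal() {
  awk -v d="$1" -v m="$2" -v s="$3" -v sign="${4:-1}" \
    'BEGIN { printf "%.4f\n", sign * (d + m / 60 + s / 3600) }'
}

# Values copied from the EXIF_GPSLatitude and EXIF_GPSLongitude lines.
lat=$(dms_to_decimal 81 14 24.5814)
lon=$(dms_to_decimal 126 47 30.5148 -1)   # assumed west, per the text
echo "$lat $lon"
```

This prints 81.2402 -126.7918.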

Now take a look at the results for the "GeoTiff":

gdalinfo H11296_5m-hillshade.tif 
Driver: GTiff/GeoTIFF
Files: H11296_5m-hillshade.tif
Size is 3205, 1278
Coordinate System is:
        SPHEROID["WGS 84",6378137,298.2572235629972,
Origin = (-70.778190923162583,43.023240535377035)
Pixel Size = (0.000058406003444,-0.000058406003444)
Image Structure Metadata:
Corner Coordinates:
Upper Left  ( -70.7781909,  43.0232405) ( 70d46'41.49"W, 43d 1'23.67"N)
Lower Left  ( -70.7781909,  42.9485977) ( 70d46'41.49"W, 42d56'54.95"N)
Upper Right ( -70.5909997,  43.0232405) ( 70d35'27.60"W, 43d 1'23.67"N)
Lower Right ( -70.5909997,  42.9485977) ( 70d35'27.60"W, 42d56'54.95"N)
Center      ( -70.6845953,  42.9859191) ( 70d41'4.54"W, 42d59'9.31"N)
Band 1 Block=3205x2 Type=Byte, ColorInterp=Gray
  NoData Value=0

The results for the JPEG picture were different than for this GeoTiff. The picture is located just at one point on the Earth. In the tif, we see that there is a Coordinate System definition (WGS84) and that there are 4 corners of the rectangle defined to say where this area is (Newcastle, NH to the Isles of Shoals).
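The corner coordinates are not stored separately. They follow from the origin, the pixel size, and the image size that gdalinfo reported. As a quick check, here is a sketch that recomputes the lower right corner with awk. All of the numbers are copied from the gdalinfo output above, and the variable names are my own.

```shell
lower_right=$(awk 'BEGIN {
  # Origin is the upper left corner (longitude, latitude).
  origin_x = -70.778190923162583; origin_y = 43.023240535377035
  # Pixel height is negative because rows run from north to south.
  pixel_w  =  0.000058406003444;  pixel_h  = -0.000058406003444
  cols = 3205; rows = 1278

  # Step cols pixels east and rows pixels south from the origin.
  printf "%.7f %.7f\n", origin_x + cols * pixel_w, origin_y + rows * pixel_h
}')
echo "Lower Right: $lower_right"
```

The result, -70.5909997 42.9485977, matches the "Lower Right" line in the gdalinfo listing.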

Creating a script

Now we would like to start creating a script to be able to record and rerun basic operations that we have been doing on the command line.

First you can review what we have done today in the shell with the history command. Your results will look different than what I have here:

420  gunzip *.gz
422  bunzip2 *.bz2
423  file 0479_20080620_175447_RVCS.all 
424  less 0479_20080620_175447_RVCS.all 
426  strings 0479_20080620_175447_RVCS.all | wc
427  strings 0479_20080620_175447_RVCS.all | head
428  file *
433  file * | grep image
438  identify  *.jpeg *.tif 
439  display  *.jpeg *.tif 
440  gdalinfo 20110912-1801.jpeg 
441  ls *.tif
442  gdalinfo H11296_5m-hillshade.tif 
443  history

If we want to make it easy to rerun some commands, we will want to start creating "scripts".

Emacs - a powerful text editor

First, we should add GNU Emacs to the quick start short cut location on the toolbar at the top of the screen.

Right click on the gray bar, and select "Add to Panel…". Double click "Application Launcher… Copy a launcher from the applications menu". Press the small arrow to the left of "Accessories" to expand the items inside of accessories. Double click on "GNU Emacs 23". Click "Close" to stop adding programs to the "Panel".

Click the "E" icon to start Emacs. You are now faced with a very powerful tool that can handle very complicated editing jobs. Emacs is very complicated, but it is worth the time to learn.

On your own time, I strongly suggest that you go through the "Emacs Tutorial" and "Emacs Guided Tour". Emacs has key strokes that you can use for every command, but starting off, you can use the menus at the top.

To start with, open a new file. "File" -> "Visit New File". Click the "Browse for other folders" arrow to see directories. Double click "examples-20110913". Now type in "my-script.sh" into the name.

You are now looking at a blank space where you can create your script. Start off by creating a very basic script that prints one line of text just to get started:

echo "hello from my script"

Save this file with "File" -> "Save".

Now switch back to your terminal and take a look at the file you created. It will be the newest file, so we can do a list files, but add sort by time (-t) and reverse the sort order (-r) to put the newest files at the bottom. I will pipe the output through tail to give us only the last 10 lines.

ls -ltr | tail
-rw-r--r-- 1 schwehr schwehr    46393 2011-09-13 08:29 sample-audio.m4a
-rw-r--r-- 1 schwehr schwehr    70310 2011-09-13 08:30 sample-audio.mp3
-rw-r--r-- 1 schwehr schwehr   416460 2011-09-13 08:31 sample-audio.wav
-rw-r--r-- 1 schwehr schwehr   208512 2011-09-13 08:31 sample-audio.ac3
-rw-r--r-- 1 schwehr schwehr       78 2011-09-13 08:33 foo.csv
-rw-r--r-- 1 schwehr schwehr      117 2011-09-13 08:34 dos-text.txt
-rw-r--r-- 1 schwehr schwehr      294 2011-09-13 08:36 sample.org
-rw-r--r-- 1 schwehr schwehr      980 2011-09-13 08:36 sample.tex
-rw-r--r-- 1 schwehr schwehr    64942 2011-09-13 08:37 sample.pdf
-rw-r--r-- 1 schwehr schwehr       28 2011-09-13 10:38 my-script.sh
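To see the time sorting on its own, here is a small sketch that makes three empty files with made-up names and timestamps and then asks ls for the newest one. It uses touch -t, which sets a file's modification time in the form YYYYMMDDhhmm.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Create three empty files with made-up timestamps (YYYYMMDDhhmm).
touch -t 201109130829 old.txt
touch -t 201109130836 middle.txt
touch -t 201109131038 new.txt

# -t sorts by modification time (newest first); -r reverses that order,
# so the newest file ends up at the bottom.
ls -tr

# With the newest file last, tail -n 1 picks it out.
newest=$(ls -tr | tail -n 1)
echo "newest: $newest"
```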

You should see my-script.sh. Take a look at the file. We can use cat to do that.
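Here is a sketch of that step, plus one way to run the script afterwards. So that it works even if you have not made the file in Emacs yet, it first writes the same one-line script into a scratch directory. Running the file by handing it to bash is my assumption about the simplest next step, not something these notes have covered yet.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Write the one-line script without an editor (same text as typed into Emacs).
echo 'echo "hello from my script"' > my-script.sh

# cat prints the contents of the file to the terminal.
cat my-script.sh

# wc -c counts the bytes: 28, the size shown in the long listing above.
wc -c < my-script.sh

# Hand the file to bash to run the commands inside it.
output=$(bash my-script.sh)
echo "$output"
```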

Author: Kurt Schwehr

Date: <2011-09-13 Tue>
