Intro to Scientific Programming and Data Processing

Kurt Schwehr

$Id: intro-programming.html,v 1.20 2005/10/16 14:59:41 schwehr Exp $

Oct 2005 UPDATE: Hi all, I have not had any time to work on this document in a long time. If you have an improvement and want to contribute, that is great, but I won't get to make any additions myself until after I finish my PhD thesis. I will, however, incorporate things that people send me (thanks Joost!).

Ah yes... it turns out that trying to write a large document in straight html is no fun. This thing is getting painful.


NOTE: Perhaps I should be writing this in docbook format instead of raw html circa 1994!

The idea is that this book would be an introductory text for a class that I would like to teach, and that once students have gone through such a class, the text would be a starting point for their own data processing. It is essential that this book not be set in stone. Paper editions are fine, but this topic will never end. For most projects you will want to go out and find other resources after you have worked through the introductory material. This book will not make you a master programmer, but I hope it will give you the tools to process some data, spend more time on your research, and maybe even save you from Fortran. Hopefully you will leave with tools and some background that you can use to build your skill set in your own direction.

Table of Contents

FIX: this needs a little structure so it is not such a blob.
  1. Quick Start
  2. Introduction
  3. Why Open Source Software
    1. Cost and Relearning
    2. The Scientific Method
  4. What books/references to buy?
  5. Packages for common tasks
  6. Choosing which Operating System
  7. Choosing which Programming Language(s)
  8. Finding software
  9. Useful commercial software that I have ignored
  10. Using the bash shell and Essential Unix commands
  11. Editing with Emacs
  12. Revision Control with RCS and CVS
  13. Document what you do and work on
  14. Where to get help?
  15. Basic scripting with bash
  16. Makefiles and compiling
  17. Documenting your code with doxygen
  18. General C++
  19. C++ Standard Template Library
  20. C++ Complex Template
  21. C++ String Template
  22. C++ Vector Template and File Parsing
  23. Linking to Fortran 77 code
  24. Python
  25. Python unittest
  26. Python pydoc
  27. SQL (Structured Query Language) using sqlite
  28. Basic HTML
  29. Command line arguments - Using gengetopt
  30. Making manual pages for programs - help2man
  31. autoconf - mastering the configure beast
  32. GNU Scientific Library
  33. Parallel Processing
  34. How to write a bug report
  35. Creating 3D geometry with OpenInventor/Coin

Larger systems/packages - Volume 2?

  1. gnuplot
  2. octave - Matlab-like system
  3. gdl - similar to IDL
  4. R
  5. OpenDX

Processing datasets by example

Quick Start

Okay, so let's make a couple of quick programs that do something. Anything will do! So open a shell and start typing... We will first use some shell tricks to avoid using a text editor. Start by compiling and running hello world in C and then in C++:

	cat << 'EOF' > first_c.c
	#include <stdio.h>   /* printf */
	#include <stdlib.h>  /* EXIT_SUCCESS */

	int main (int argc, char *argv[]) {
	  printf ("Hello World\n");
	  return (EXIT_SUCCESS);  /* C style comment */
	}
	EOF

	gcc first_c.c -o first_c -g -Wall
	./first_c
You should then see:
	Hello World
Time to do a similar thing in C++:
	cat << 'EOF' > first_cplusplus.C
	#include <cstdlib>   // EXIT_SUCCESS
	#include <iostream>  // cout, endl
	using namespace std;

	int main (int argc, char *argv[]) {
	  cout << "Hello World" << endl;
	  return (EXIT_SUCCESS);  // C++ style comment
	}
	EOF

	g++ first_cplusplus.C -o first_cplusplus -g -Wall
	./first_cplusplus
This will again print out:
	Hello World
Now a quick jump to a more complicated example in C++ where we load some data, sort it, and find the smallest and largest values. It uses the standard template library (STL) vector data type.
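	cat << 'EOF' > 2_complicated.C
	#include <algorithm>  // sort
	#include <cstdlib>    // EXIT_SUCCESS, EXIT_FAILURE
	#include <iostream>   // cin, cout, endl
	#include <vector>
	using namespace std;

	int main (int argc, char *argv[]) {
	  vector<int> data;
	  int new_value;
	  while (cin >> new_value) {  // read integers until the input runs out
	    data.push_back(new_value);
	  }
	  if (data.empty()) return (EXIT_FAILURE);  // nothing to sort or report

	  cout << "This is what you entered:" << endl;
	  for (size_t i=0; i<data.size(); i++) {
	    cout << data[i] << endl;
	  }

	  sort(data.begin(), data.end());
	  cout << "This is the data sorted" << endl;
	  for (vector<int>::iterator i=data.begin(); i!=data.end(); i++) {
	    cout << *i << endl;
	  }

	  // Once sorted, the smallest value is first and the largest is last.
	  cout << "Minimum value: " << *(data.begin()) << endl
	       << "Maximum value: " << *(data.end()-1) << endl;
	  cout << "Minimum value: " << data[0] << endl
	       << "Maximum value: " << data[data.size()-1] << endl;

	  return (EXIT_SUCCESS);
	}
	EOF

	# Example input; any integers will do, one per line.
	cat << 'EOF' > data.int
	5
	3
	8
	1
	EOF

	make 2_complicated CXXFLAGS="-g -Wall"

	./2_complicated < data.int
With the example input above, here is what you should get back:
	This is what you entered:
	5
	3
	8
	1
	This is the data sorted
	1
	3
	5
	8
	Minimum value: 1
	Maximum value: 8
	Minimum value: 1
	Maximum value: 8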
That gives you a couple of programs to look at and run, just to get started, for those who like to jump right in. The last example is much more advanced, so do not worry if it looks kind of crazy.


Check out the C tutorial by Peter Shearer at http://mahi.ucsd.edu/shearer/COMPCLASS/c.txt for more help. C is a subset of C++ about 99% of the time, so nearly all of that document applies to C++ as well.

This text will hopefully become a beginning tutorial to programming for geology and geophysics students. There are an infinite number of ways to approach this topic, so this will reflect my take on how new students should approach learning how to write programs that will help with data reduction, analysis, and presentation.

For this document, I will focus on unix style software on Mac OSX. Most of it also applies to Linux, NetBSD, FreeBSD, SGI's IRIX, Sun's SunOS/Solaris, and cygwin. There may be differences if you are using a system other than Darwin/Mac OSX, and it will be up to you to adapt the material here to those systems.

As a scientist or engineer (I presume that's who you are if you are reading this), I presume that your goal is to make discoveries and to be able to support and prove your results. A big part of the scientific method is creating reproducible results. "It works for me" is definitely not good enough. Just ask the cold fusion folks from 1989. With data analysis and interrogation, you can strive to make the process repeatable. It may not always be possible, but that should be the target. If you give someone a tar archive of raw data and scripts, they should be able to completely follow what you did and end up with the same results.

Others may not be able to get the same raw data. For example, if your study is on the measurements of a particular supernova and no one else used the same type of instrument, it will be impossible to go back in time. However, what you do with the measurements needs to be repeatable. Part of this is avoiding GUI type systems whenever possible, or else taking note of all the parameters and methods used throughout. An example of processing that is currently essential but not repeatable is swath sonar ping editing. We can all apply exactly the same mbclean to the data, but when it comes to deleting bad pings by hand, two people may delete 90% of the same pings while the remaining 10% is a judgment call that people make differently.

Why Open Source Software

There is often a tug of war between commercial software and free software. You almost always have a restricted budget (wow, is it crazy when you don't!), so choices must be made as to where to put the money. You will have to balance a number of factors, some of which are listed here: Do you buy the commercial software? Hire a consultant to adapt open source software to your needs? Pay the commercial vendor to add a needed feature? Write something in house from scratch? Buy a better computer or optimize the software? Always tough choices.

Cost and Relearning

As a practical measure, I will try to avoid commercial software as much as possible. Each scientist will have a different budget situation, which may preclude purchasing and maintaining expensive software packages, and this situation can vary dramatically during your career. By choosing free and open software, I hope to maximize the number and quality of tools available to you while minimizing the number of times that you will have to learn a new tool to re-solve an old problem. Early in my career, I was able to use a number of expensive software tools and libraries. I then tried to convince a number of universities to use the software that I had built on top of them. Their response was that the licenses for the libraries and tools needed by my free code would cost more than their graduate students cost them. As a result, that software was shelved long ago, never to see the light of day again.

Then there are the days that you find out some critical piece of commercial software you now rely on is gone. The reasons for this happening are numerous. For example:

When you are working with open source software, you have the option to do with the code whatever you need. If you need to hire a software engineer to maintain some old piece of critical code, you have the ability to do so. If the vendor is gone, you, a colleague, or the community can take responsibility for a body of code.

The Scientific Method

Parallel to the arguments about cost and troubles with commercial vendors is the scientific method. Other scientists need to be able to reproduce your results and know exactly how the data were processed. There is nothing like a binary only software package or library to hide what is really going on. In what version of the software was a critical bug really fixed? With open source, you know you have the option to see how the algorithm works under the hood. You probably really don't want to look in there, but when the day comes that you must, you have the option. When I think there is something wrong inside of Matlab or IDL, I can only bug the vendor and cross my fingers that they will give me a decent answer.

I think this section deserves an entire essay, but that is all I will say for now.

What books/references to buy?

It would be nice to say that everything you need is online in electronic form, but that just isn't so. Sometimes you just have to go with dead trees. Computer screens still don't have the utility of a good book. Buying books is also a great way to support authors who put a huge amount of energy into writing software and documenting it. More advanced books:

Packages for common tasks

This section will talk about what programs you can use for which tasks.

Choosing which Operating System

This text focuses on Mac OSX 10.3 and newer.

Often this will not be your choice. You are stuck with what you have for any number of reasons or you don't want to learn anything new. If you have the opportunity to switch, here is my take on the options.

Caution! Opinionated sections! Not that the rest of this text isn't heavily laced with opinions.

Finding software

So you have a shiny new computer or inherited some old beast from the dark ages of the last century. How do you find the software to make it do your thesis for you???
  1. Fink - Only for Mac OSX. FinkCommander is the bomb. It would not be right to leave out DarwinPorts
  2. Fresh Meat - summaries and searches for software.
  3. FSF Free Software Directory - The home of GNU and Richard Stallman
  4. VersionTracker - More commercially oriented, but some free software
  5. Yes, there is Google too.

Choosing which Programming Language

There are hundreds of programming languages out there. Here are my opinions on some of them. There are many reasons for choosing one over another, so take my list and descriptions with a salt dome.

First, if you want to see the simplest of comparisons between languages, look at the hello world page. This page has about 200 programming languages. You'll see pretty quickly that it is missing many a language. For example, there is no Arc Macro Language (AML), which is used by ESRI's Arc/Info.

If you want to or must choose a different set of programming languages than those described here, you will need to go get some different docs. This is not necessarily a bad thing; you just won't get much help from this text beyond the introduction section. Here are common programming languages and, eventually, my take on each one. They all have strengths and weaknesses. My main philosophy is to not lock yourself into one platform. Learn general skills that apply no matter what you end up doing. If you just learn the Microsoft world of VC++, C#, and VB, you are missing out.

That is enough languages for now. You will get different opinions on the above list depending on who you talk to. There are also a billion languages specific to different programs like Matlab, Mathematica, IDL, SAS, Arc Macro Language, etc.

Useful commercial software that I have ignored

These are programs and libraries that can be extremely powerful and useful for certain problems, but that I have not covered. You will need to look elsewhere for information on them. They are included just for completeness. I may have given some arguments above for why not to use commercial software, but I still do use a ton of it. I keep trying to estimate how much has been spent on me personally by my employers. My current estimate is more than half a million dollars. It all adds up! There is no particular order to the madness...

Using the bash shell and Essential Unix commands

In this section, we need to cover how to get around in the shell, run some programs, and look at files. You really need to go get and read a basic unix book. FIX: I need to find one that is affordable, short, and "easy". Does anything like that exist? However, to keep this text self contained, we will cover the basics of unix shells. If you remember using PC-DOS or MS-DOS, delete that knowledge from your brain right now. Better to start from scratch!

Dealing with basic files. How to copy, move/rename, and remove files.
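
As a rough sketch of those three operations, the commands below copy, rename, move, and remove a throwaway file. The file and directory names (notes.txt, field_notes.txt, archive) are made up for illustration, and everything happens in a scratch directory from mktemp so that no real files are touched.

```shell
workdir=$(mktemp -d)              # scratch directory for the experiment
(
  # The parentheses run these commands in a subshell,
  # so the cd does not change your current directory.
  cd "$workdir" || exit 1
  echo "some notes" > notes.txt
  cp notes.txt notes_backup.txt   # copy: both files now exist
  mv notes.txt field_notes.txt    # rename: notes.txt is gone
  mkdir archive
  mv notes_backup.txt archive/    # move into another directory
  ls -R                           # look at the result
  rm archive/notes_backup.txt     # remove a file (there is no undo!)
  rmdir archive                   # remove the now-empty directory
)
echo "scratch files are in $workdir"
```

When you are done looking around, rm -r followed by the scratch directory name cleans up the whole thing.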

There are some dangerous things about naming files. Here are some guidelines on how not to get in trouble. Stick to [a-zA-Z0-9._-] in your filenames (letters, digits, dots, underscores, and hyphens). Do NOT use spaces in filenames. Do not give files names that differ only in capitalization. This works on many systems, but it will kill you on Mac OS X and Windows, where the default file systems treat such names as the same file.
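
The naming guidelines above can be checked mechanically. Here is a minimal sketch, assuming that digits and dots are acceptable too (the file names used in this text, such as 2_complicated.C, contain both). The helper names is_safe_name and case_collisions are hypothetical, not standard commands.

```shell
# is_safe_name NAME: succeed only if NAME is non-empty and built from
# letters, digits, dot, underscore, and hyphen. (hypothetical helper)
is_safe_name() {
  case "$1" in
    '' | *[!A-Za-z0-9._-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# case_collisions: read file names on standard input and print, in lower
# case, any name that appears more than once when capitalization is
# ignored. (hypothetical helper)
case_collisions() {
  tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

for name in "my data.txt" "run_2.dat"; do
  if is_safe_name "$name"; then
    echo "ok:  $name"
  else
    echo "bad: $name"
  fi
done

ls | case_collisions    # prints nothing if the current directory is clean
```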

How to view files with less. "Less is more"

Grep grep grep. It would be a scary world without grep.

Dealing with columns and rows of data. tail, head, awk'ing of columns. If your awk is longer than one line, stop now. Put down awk and go use python or perl.
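
To make the row and column tools concrete, here is a small sketch that builds a practice file and picks it apart. The numbers are made up for illustration, and nth_column is a hypothetical helper that wraps a one-line awk.

```shell
# Make a tiny three-column file to practice on (made-up values).
tmpfile=$(mktemp)
cat > "$tmpfile" << 'EOF'
# depth temperature salinity
10 14.2 33.5
20 13.1 33.7
30 11.8 33.9
40 10.4 34.0
EOF

head -n 2 "$tmpfile"                          # first two lines
tail -n 2 "$tmpfile"                          # last two lines
grep -v '^#' "$tmpfile" | awk '{ print $2 }'  # second column, comments skipped

# nth_column N: print column N of whitespace-separated input.
# (hypothetical helper)
nth_column() {
  awk -v col="$1" '{ print $col }'
}

grep -v '^#' "$tmpfile" | nth_column 3        # third column
rm -f "$tmpfile"
```

Note that each awk program here fits on one line, which is about where awk stops being the right tool.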

Editing files with emacs

Emacs and vi are both tricky editors when you are just starting out. However, after a few days in emacs, things will get easier, and the power of having emacs as your text editor is amazing. I know many people really favor integrated development environments, but I have been able to use emacs for many languages over the last 14 years, while I have learned lots of GUIs for different development environments that were only good for certain platforms or languages. Yes, emacs is way more powerful than vi (even vim).

In addition to using the drop down menus from the top of the window, you will want to know some basic emacs key commands. When you see a C-, that means hold down the CTRL key and press the key that follows. The M- is the META key. Don't see a meta key on the keyboard? Then you can press and RELEASE the ESC key. Then type the letter that follows the "-". Need to talk about creating a .emacs file.

Revision Control with RCS and CVS

There is a CVS book that is available in print or here.

Here is the CVS manual.

See my document here for now: Revision Control using CVS (Concurrent Versions System)

Of course, Aurelio will tell me that I really need to get on the Subversion (svn) bandwagon.

Where to get help?

RTFM == Read The F'ing Manual. This is usually what you do not want to hear from someone that you are asking help from. But if you are stuck with figuring it out for yourself, here are some things you can do.

Document what you do and work on

Create a text file in which you log what you do. Make an entry, in order, for each time or day that you work. Watch out for proprietary programs. If you do use a proprietary program, make sure that you can export all your logs to flat ASCII. Programs go away and file formats change. Your research can easily still be important 40 years after you did it. Keep a journal/notebook too. I highly recommend the bound art books that you can get from your campus bookstore. What you do is valuable. Treat your work and yourself with respect. This is a place to draw, doodle, write ideas/frustrations/successes.
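
One low-effort way to keep such a text log is a tiny shell function that appends a dated line to a flat ASCII file. This is only a sketch: the function name lognote, the WORKLOG variable, and the default file location are all assumptions, so adjust them to taste.

```shell
# Where the log lives. WORKLOG is an assumed variable name; the default
# location below is only a suggestion.
WORKLOG="${WORKLOG:-$HOME/worklog.txt}"

# lognote TEXT...: append one timestamped entry to the plain text log.
# (hypothetical helper)
lognote() {
  printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M')" "$*" >> "$WORKLOG"
}
```

After putting this in your bash startup file, an entry is one command away, for example: lognote "converted all tif images to png with ImageMagick convert". Entries land in the order you make them, and the file can be read with less or searched with grep.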

Basic scripting with bash

  1. csh - Do NOT use csh. You have tcsh. See the tcsh entry for why you shouldn't even use tcsh. csh is for masochists.
  2. tcsh - Why would you use tcsh when you have bash? If you are going to learn a shell, learn one that will really work well for you. At first glance, tcsh and bash are the same, but deep down, bash is so much better than tcsh. I switched from tcsh to bash in 2000. You should switch too!
  3. sh - Use sh when you have to write scripts that must be portable no matter what. Do not make it your shell unless you like pain.
  4. korn - (pdksh too) Okay, so I haven't really used ksh. Might be okay, but it's not used by many people that I've run into.
  5. python/perl - If you're adventurous, these could be pretty productive to use.
Here is the official manual for bash.

So now that we have all agreed to use bash, we can go on with life. How about some simple examples? Here is a handy one that I use a lot. I often need to batch convert digital images from one format to another. ImageMagick has a nice program called convert that makes it pretty easy. Now we just need to call that program for each image in a directory. Here is the code:

	for image in *.tif; do
	  echo "Converting $image to ${image%%.tif}.png"
	  convert "$image" "${image%%.tif}.png"   # quotes protect odd file names
	done
Here is what you might see if you type these commands into the bash shell with 3 tif files in the current directory called 1.tif, 2.tif, and 3.tif:
	Converting 1.tif to 1.png
	Converting 2.tif to 2.png
	Converting 3.tif to 3.png
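
The piece of that loop that does the renaming is the parameter expansion: ${image%.tif} is the value of image with a trailing .tif removed (for a fixed suffix like this, % and %% behave the same). The sketch below is a dry run that needs no ImageMagick. The helper png_name and the sample file names are made up for illustration.

```shell
# png_name FILE: print FILE with a trailing .tif replaced by .png.
# (hypothetical helper)
png_name() {
  printf '%s\n' "${1%.tif}.png"
}

workdir=$(mktemp -d)                  # scratch directory with fake images
touch "$workdir/1.tif" "$workdir/core 2.tif"

for image in "$workdir"/*.tif; do
  [ -e "$image" ] || continue         # the pattern matched no files at all
  target=$(png_name "$image")
  echo "would run: convert '$image' '$target'"
done

rm -r "$workdir"
```

Printing the command before running it for real is a cheap way to catch a mistake in a loop that touches many files.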

Makefiles and compiling

Here is the official manual for GNU make.

Write lots about how cool make is.

Documenting your code with doxygen

Here is the official manual for doxygen.

Doxygen makes great documentation of your code much easier. No fortran support yet, but it is great for lots of other languages!

General C++

This text is not going to teach you C++ programming in depth. It will just get you going with a few hints and ideas of what to look for. A good online resource for C++ programming is C++ Annotations Version 6.1.1b

C++ Standard Template Library

C++ Complex Template

C++ String Template

C++ Vector Template and File Parsing

Linking to Fortran 77 code


Basic HTML

You really should know at least enough of the basics of HTML to make a quick web page. Graphical programs can make pretty pages, but for pure content, consider writing just a little HTML yourself. First, a couple of ways to make HTML without having to know about HTML tags. The ones that most people know are MS Word and Powerpoint.

Command line arguments - Using gengetopt

Here is the official manual for gengetopt.

Seen programs like GNU grep and tar that have --help? Want that for your programs? Then gengetopt is the easiest way to go.

Making manual pages for programs - help2man

Here is the official manual for help2man.

Writing groff based man pages is no fun, so let's use help2man to make life easier! help2man uses the program's --help to get the basic man page and then you can insert small pieces.

autoconf - mastering the configure beast

I found out you link to http://mdcc.cx/autobook from http://schwehr.org/papers/intro-programming.html . That's cool :)

Perhaps you could add links to copies of recent autoconf, automake and libtool info files: http://www.gnu.org/software/autoconf/manual/ http://sources.redhat.com/automake/automake.html and http://www.gnu.org/software/libtool/manual.html . These are more actively maintained. (The Unofficial Autobook text is getting obsolete, just as fast as the official text... :( )

Your document looks very promissing, thanks for your work!

Bye, Joost

The Autobook describes the whole automake/autoconf/libtool process. However this book is getting "old" and has not been updated since 2001. That is a long time for tools like this that have to adapt to every operating system and tool release from every major vendor.

You might prefer to use the Unofficial AutoBook, which is being more actively maintained, or the autotut (which seems to be a cranky link, so consider the google cache option).

Getting started with autoconf can be overwhelming, but if you are writing a lot of programs, this will help you get them running on lots of platforms with ease. However, the initial setup is a big task.

I have yet to get really good at this, so a section on it may be a long time coming.

GNU Scientific Library (GSL)

Here is the GSL manual.

This should be the staple starting point for data analysis algorithms.

Basic debugging with GDB

FIX: Write up a little sample program that hits an assert and debug the assert.
	gdb foo
	# inside gdb, start the program; you get some assert triggered
	run
	# repeat "up" until you are in a stack frame that has something
	# useful... the bottom couple will be the assert mechanisms
	up
	print myTroubleSomeVariable
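
A small program that trips an assert makes a good practice target for a session like the one above. The following is only a sketch, written in the same heredoc style as the Quick Start: the file name assert_demo.C, the function average, and the bug itself are all made up for illustration.

```shell
cat << 'EOF' > assert_demo.C
#include <cassert>
#include <cstdlib>
#include <iostream>
using namespace std;

// Hypothetical example: average() insists on at least one sample.
double average (const double *values, int count) {
  assert (count > 0);  // fires when there is no data
  double sum = 0;
  for (int i = 0; i < count; i++) { sum += values[i]; }
  return sum / count;
}

int main (int argc, char *argv[]) {
  cout << average (NULL, 0) << endl;  // bug: called with no data
  return (EXIT_SUCCESS);
}
EOF

# Compile with -g so gdb can show source lines and variable names.
if g++ assert_demo.C -o assert_demo -g -Wall; then
  ./assert_demo || echo "assert_demo stopped with exit status $?"
fi
```

Running it should abort with an assertion failure message. To investigate, start gdb ./assert_demo, type run, then type up until gdb shows the frame for average; at that point print count shows the value that tripped the assert.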

Parallel Processing

This is a monster can of worms with a multitude of solutions. I recommend using pthreads and/or lam-mpi. Try to avoid locking yourself into a solution. Some programs will run faster on a single machine than on several, and you may not be able to have more than one CPU at a particular time. The overhead of parallelizing can really hurt some applications.

Do not let vendors lock you into proprietary APIs.

How to write a bug report

In the process of writing and using your software, you will run into times when there is probably a bug in a library or program by someone else. Here are some guidelines that will help you to resolve the problem faster.
  1. Be polite - The problem may turn out to be in your code, and other people are giving you their time to look at this bug report. With open source software, they very likely are not paid for their work.
  2. Write clearly - Correct English is very important. Make sure to spell check your report. There is nothing like reading through a grammatically incorrect and misspelled email to make you not want to help someone. (Hey, maybe I should spell check this doc???)
  3. Describe the problem up front - Describe the problem in a few short sentences right at the top of the report. Don't bury it.
  4. Identify your environment - Many bugs are specific to the environment you are running. Make sure to specify the operating system and version, the type of CPU, how much physical RAM is installed, and any major changes or unusual circumstances with the system. Note that your hard disk space is NOT RAM!!!

    Common utilities that tell you about your system are: top, uname -a, hinv (SGIs only), sysinfo (Solaris sometimes), cat /etc/redhat-release (redhat, fedora, mandrake).

  5. How is the program linked - If you are reporting a problem against a linked library, give them the context that your program is in, with all the libraries the program is linked against.
  6. Stack trace - If the system is crashing down in the library in question, include a stack trace. Start up the GNU debugger like this: gdb programname. Run the program and when it fails, type: backtrace. This lists all the function/method calls and their arguments down to where the system failed.
  7. Example case - Try to create the smallest test case that causes the system to fail. You want a fellow developer to be able to recreate the same problem on their system if at all possible. You are more likely to get a quick solution then.
  8. There has got to be some more good ideas, right?
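
For the "identify your environment" item, it can help to collect the basics with one command and paste the output into the report. This is a minimal sketch: describe_system is a hypothetical helper, and it only reports what uname, gcc, and (where present) /etc/redhat-release can tell it.

```shell
# describe_system: print basic facts worth pasting into a bug report.
# (hypothetical helper)
describe_system() {
  echo "Date:     $(date)"
  echo "System:   $(uname -a)"
  if command -v gcc > /dev/null 2>&1; then
    echo "Compiler: $(gcc --version | head -n 1)"
  fi
  if [ -r /etc/redhat-release ]; then
    echo "Release:  $(cat /etc/redhat-release)"
  fi
}

describe_system
```

Redirect the output to a file (describe_system > system.txt) if you want to attach it rather than paste it. You will still need to add the amount of RAM and anything unusual about the machine by hand.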

That is it for now, folks. -kurt