Repeatable and transparent data analysis: making the leap from Excel to Python (with tutorial)
By John Field
For most of my time in graduate school, I avoided learning a programming language like the plague. I did all my analyses in spreadsheets and sometimes Access, and my figures in Excel. Finally things got to a point where the scale of the analysis I wanted to do got so big that there wasn’t any alternative to writing a program to automate it. After a lot of research and consultations I decided to teach myself Python via the Software Carpentry series of online tutorials, as recommended to me by Andrew Tredennick. And boy, I wish I had done it sooner- getting into basic programming for data analysis was much easier that I had feared, and has really opened up a new world for me.
“…Excel…is still notorious for errors relating to copying & pasting data or specifying cell ranges…In contrast, careful code-based analysis is highly consistent with the scientific method”
But to back up a bit, why might you want to start coding your data analyses in Python, or any other language, in the first place? For analyses that you repeat often or that you eventually intend to publish, there are several reasons to move beyond point-and-click software like Excel. In many cases the data might be too big to treat any other way. Recently several high-profile journals have published articles (for example, here and here) about how computing skills can be essential to tackle many modern scientific problems and the ‘big data’ that often underlie them, but graduate training and peer review processes have been slow to adapt to this new world. Even if you are at a scale to sneak by with Excel, it is still notorious for errors relating to copying & pasting data or specifying cell ranges, sometimes with big consequences.
In contrast, careful code-based analysis is highly consistent with the scientific method:
- repeatability– coding your analysis doesn’t mean that you won’t make mistakes (my first drafts are always full of errors!), but it means that any mistakes will happen consistently every time you run an analysis, and thus can be traced back and corrected.
- hypotheses testing– it might take longer to write the code initially than to do a point-and-click analysis, but the amount of time to run the second or third or forth analysis drops to zero (hopefully). This makes it that much more likely that you’ll repeat the analysis for different datasets in the future, or modify your script slightly to look for different effects in the existing dataset (though be careful with that!).
- transparency and sharing– it’s much easier to hand off a piece of code to a collaborator or reviewer, rather than a long set of instructions describing a series of copy & paste operations in Excel. This improves your ability to collaborate and have others help quality control your work.
Why choose Python over other languages like PERL, R, or MATLAB? For one, Python is open source and free (unlike MATLAB), and gaining popularity (unlike PERL). If you have a Mac running OSX Tiger or more recent, you already have a copy installed, and it’s on many networked Unix/Linux machines out there. Like R, it’s a high-level language that’s interpreted, so you don’t have to mess around with compiling your code every time you want to test it. I’ve heard that it’s better than R at navigating through networks, manipulating files, and talking to the computer’s operating system so you can automate running other programs from within your script. Python uses the indentation of each line of code to define where loops and functions start and end, so it’s straightforward to read and error-check- I’ve never had to spend much time chasing down hard-to-find stray brackets that cause stuff to break! And while I can’t personally vouch for Python’s ability to do fancy statistical analyses like R does, check out this blog if that’s of interest to you.
So how do you proceed?
- Download the program. Even if you do already have a copy installed, it’s a good idea to download a separate Python distribution where someone has packaged in a bunch of useful modules already- the Canopy Enthought Python Distribution (EPD) is a good one to start with.
- Start going through the Software Carpentry Python tutorial, and the Shell tutorial as well if you work on a Mac or one of the NREL network machines.
- As questions come up, try searching or posting on Stack Overflow– chances are good someone has had the same question in the past.
- If you find yourself getting into more complex programs, do yourself a favor and download an Integrated Development Environment (IDE) like PyScripter or PyCharm (that’s what I use). These are programs to help you keep track of variables and data across multiple files, and help you catch bugs in real time as you are coding.
- Learn how to do version control in GitHub or some version of SVN so you can keep tabs on your code as it evolves over time (a lot of IDEs will have this built in).
For those interested in learning more, below is a tutorial illustrating a simple script that covers many of the potential benefits of code-based data analysis mentioned above. This script reads in a directory full of DayCent model output .lis files (DayCent is a biogeochemical model that runs on a daily time step; .lis output files are generated in the form of space delimited data with variable outputs in columns, timesteps in rows), plots a subset of the data, and saves a summary of it to file. It then archives all output to a separate time-stamped directory along with a copy of itself, to help you remember what you did and when you did it, and hopefully be able to re-run it later if need be.
[sourcecode language=”python” wraplines=”false” collapse=”false”]
"""This is your script’s ‘docstring’, a place to write a general
description of what your code does, for example:
This script analyzes a set of DayCent .lis file output, computes
average rates of change in soil carbon, plots all change in soil
carbon vs. time, and archives all results.
# import statements allow you to access useful functions defined
# in other modules outside this one
import matplotlib.pyplot as plt
# You should define functions for operations that will be repeated
# multiple times or re-used in different modules. These functions
# won’t be run until they are called in the main body of the code
# below. Each function can have its own ‘docstring’ description.
def lis_reader(lis_fpath, time_column, soc_column, head_skip,
"""This function reads a DayCent .lis output file and returns
lists of time elapsed and change in soil organic carbon, as
well as the average rate of change. It’s good practice to
describe all arguments (inputs) to your function, including
the data-type (string, integer, etc.) like this:
lis_fpath (str): file/path to .lis file to be read
time_column (int): ‘time’ column in the .lis file.
Note that python starts counting from 0, not 1!
soc_column (int): ‘soc’ column in the .lis file.
head_skip (int): number of header lines to skip over
before starting to read data
tail_skip (int): number of lines to skip over at end
of file (DayCent .lis files always include an extra
# This opens the target .lis file as an object called
# ‘lis_file’ and reads its data line-by-line, selectively
# saving time data and soc data into lists of floating-point
# (i.e. decimal) values, skipping header rows as needed and
# trimming redundant data at the end.
lis_file = open(lis_fpath, ‘r’)
time_values = 
soc_values = 
for i in range(head_skip):
for line in lis_file:
for j in range(tail_skip):
# Next, we note the initial soil organic level and simulation
# starting time and re-compute our ‘soc_values’ and
# ‘time_values’ lists as relative changes.
initial_time = time_values
initial_soc = soc_values
for k in range(len(time_values)):
time_values[k] = time_values[k]-initial_time
soc_values[k] = soc_values[k]-initial_soc
# Now we compute the average rate of change in soil organic
# carbon by subtracting the first soc value from the last
# soc value, and dividing by the time elapsed
soc_rate = (soc_values[-1] – soc_values) / time_values[-1]
# Finally, we send our results back to the main program from
# which this function was called
return time_values, soc_values, soc_rate
# This is where the main body of the code starts- when you run the
# script it starts executing line-by-line from this point. This
# first statement prints out the contents of the docstring for the
# user’s reference.
# We’ll want to save all of our average soc change rate results
# in a comma-separated-value file format. We’ll start by defining
# a text string we’ll use to hold that data, beginning with a
# header of column labels separated by commas and the new-line
# character (n) at the end.
results_string = "file, soc change rate (gC/y) n"
# Next, we prompt the user for the location of the DayCent output
# data they wish to analyze. For each .lis file found there we
# run our lis_reader() function, add the results to a plot using
# the .lis file name as a label, and write the results at the end
# of our results text string.
data_path = raw_input("Please enter the file path to your data: ")
for f in glob.glob(os.path.join(data_path, ‘*.lis’)):
# First, note the name of the .lis file being read
file_name = f.split("/")[-1]
# In my .lis files, time is reported in the 1st column and soc in
# the 10th column, and I have 3 header lines and 1 redundant
# trailing line I need to skip:
time, soc, rate = lis_reader(f, 0, 9, 3, 1)
# This adds my results as an x-y series to a plot
plt.plot(time, soc, label=file_name)
# Display the results to the screen, and add the data to the
# results string
print file_name, "average rate of SOC change: ", rate, "gC/y"
results_string += "%s, %f n" % (file_name, rate)
# After having lopped through and analyzed all the data, we create
# a time-stamped results directory within our data directory….
time_stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H.%M")
results_path = data_path+"/"+time_stamp
# … add a title, legend, and axis labels to our plot and save…
plt.title("DayCent-simulated change in soil organic carbon,n
conversion of Louisiana cropland/rangeland to switchgrass")
plt.xlabel("Elapsed time (y)")
plt.ylabel("Change in SOC (gC)")
# … save our average soc rate change data in .csv format…
results_file = results_path+"/results.csv"
output = open(results_file, "w")
# … and archive a copy of the analysis script itself for future
# THE END!
Here’s what the output looks like on the command line:
nrel-105-184:blog_python_code johnfield$ python daycent_analysis.py
This is your script’s ‘docstring’, a place to write a general
description of what your code does, for example:
This script analyzes a set of DayCent .lis file output, computes
average rates of change in soils carbon, plots all results, and
archives all results and associated meta-data.
Please enter the file path to your data: ./DayCent_output
graze_C_bioenergy.lis average rate of SOC change: -27.5594 gC/y
graze_SiL_bioenergy.lis average rate of SOC change: -44.5013 gC/y
graze_SL_bioenergy.lis average rate of SOC change: -25.9306 gC/y
till_C_bioenergy.lis average rate of SOC change: 19.5517 gC/y
till_SiL_bioenergy.lis average rate of SOC change: 15.1144 gC/y
till_SL_bioenergy.lis average rate of SOC change: 8.23603 gC/y
And here’s the resulting figure:
So there you have it: easy, repeatable, self-documenting data analysis including a figure in less than 130 lines of code (and most of those are comments)!