Alex Mason

Softwarrister, friend to animals, beloved by all

Learning Pandas with U.S. Election Data

tl;dr: look at this Colab notebook I made.

As our environments become suffused with data, working with that data effectively adds power to more and more of what we do.

The old days

When I was working on SIMD intrinsics and other optimizations for the libgav1 project, I ran a lot of microbenchmarks as part of the testing routine, and the benchmark results printed as test logs. To transform those results into columnar data, I used an old-fashioned pipeline: grep -o the logs to filter lines and remove unwanted text, sed the result into refined tokens, awk those into aggregate values (min, max, and mean) grouped by labels (block sizes, in this case), and store the output in a TSV file. I even kept a makefile that would build, test, and use different regular expressions in my commands for each build configuration. Then I would import the TSV files into a spreadsheet to track the differences between changes. Sometimes a speedup on larger blocks would come at the cost of regressions in smaller blocks, for example, and this was a simple way to point that out.

I knew that my method was inflexible, but it was stable, so I never looked very deeply into more powerful tabulation tools. I glanced at the pandas documentation but thought it would over-complicate things and force simple operations to resemble enterprise fizzbuzz. In fact, the library’s classes maintain list-like and dict-like interfaces wherever possible, and it does a good job of making code more concise and expressive.

The learning project

I don’t have any benchmarking data right now, so I decided to use data that should be publicly accessible and easy to understand: U.S. presidential election vote totals. I wanted to see if a candidate’s performance in each state’s primary election could indicate their performance in that state’s general election. This breaks down into a good breadth of data tasks to touch on most of the library’s key functionality:

  • reading from spreadsheet files
  • removing decorative rows
  • joining datasets by state and candidate name
  • deriving a new column by operating on two other columns
    • or from a different column based on the value of another
  • filtering rows by various conditions
  • turning column names (state names) into an index
  • concatenating datasets vertically (for each election year)
  • grouping and aggregating

The data and full analysis are available in this Colab notebook. It’s largely a stream of consciousness, solving problems as I encounter them. I like that format because it ensures every construct is preceded by some real-world motivation.

A general tour

While the notebook is the meat of this exercise, I will close by distilling what I learned here. Pandas centers on two classes, Series and DataFrame. Both have an index (like a database table’s primary key), which is integer by default but can be anything. I think of a DataFrame as a dict of Series (columns) that are all aligned by a shared index. You can set any column as a DataFrame’s new index by calling set_index("colname").
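
For example, here’s a minimal sketch with a made-up frame (not the notebook’s data):

    import pandas as pd

    # A DataFrame acts like a dict of Series (columns) sharing one index.
    df = pd.DataFrame({
        "STATE": ["California", "Wyoming", "Georgia"],
        "GENERAL RESULTS": [100, 40, 60],  # made-up vote counts
    })

    states = df["STATE"]          # dict-like access returns a Series
    df = df.set_index("STATE")    # promote the STATE column to the index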

A trippy thing about DataFrames is that the [] subscript operator can select one or more columns, or it can select one or more rows by index, and we’ll see why that works below.

Predicate selection

An expression like my_dataframe["STATE"] == "California" will return a Series of boolean values. Passing that expression into the [] subscript operator of a Series or DataFrame will return only the elements/rows at the indices where the boolean is true. Hence, to get only the rows where the “STATE” column is “California,” we can invoke my_dataframe[my_dataframe["STATE"] == "California"]. That can get hard to read very quickly, so I think it’s good practice to assign the inner expression to a variable first.
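
For instance, with a toy frame standing in for the election data:

    import pandas as pd

    votes = pd.DataFrame({
        "STATE": ["California", "Wyoming", "Georgia"],
        "GENERAL RESULTS": [100, 40, 60],  # made-up numbers
    })

    # The comparison produces a boolean Series aligned to votes' index...
    is_california = votes["STATE"] == "California"
    # ...and subscripting with it keeps only the rows where it is True.
    california_rows = votes[is_california]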

Because the argument to [] is a Series, it is easily differentiated from a column key, but it still feels a bit like magic.

What if I wanted rows involving California, Wyoming, and Georgia? Basic Python operators like in and or don’t work here, so we have to use a special method that resembles in, by saying my_dataframe["STATE"].isin(["California", "Wyoming", "Georgia"]).
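
Continuing with the toy votes frame from the sketch above:

    # Series.isin stands in for Python's `in` across a whole column.
    mask = votes["STATE"].isin(["California", "Wyoming", "Georgia"])
    three_states = votes[mask]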

Switching columns and index values

In my election data, the state name was just another column value, but I had another dataset for state populations, where each state had its own column. To be able to join it with the election data, I needed to transpose the population data so that the state names comprised the index.

Believe it or not, DataFrame.transpose() does exactly that, with no fuss. You just have to make sure your starting index will be reasonable to use as column keys.
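
A sketch of that shape change, with rough made-up populations:

    import pandas as pd

    # One column per state, as in the population spreadsheet.
    pop_wide = pd.DataFrame(
        {"California": [39_500_000], "Wyoming": [580_000]},
        index=["POPULATION"],
    )

    # transpose() swaps rows and columns: state names become the index,
    # ready to join against election rows keyed by state.
    pop_by_state = pop_wide.transpose()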

Join operations

join() and merge() do roughly the same thing, and it appears that join() is just a more opinionated entry point that calls merge() internally. The on parameter takes a list of columns to join on, and it generally fits the model of a SQL-style join. The notebook has a decent example of usage and pitfalls.
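
For instance (the “PRIMARY RESULTS” column name and all numbers here are my own invention, not the notebook’s schema):

    import pandas as pd

    primary = pd.DataFrame({
        "STATE": ["California", "Wyoming"],
        "LAST NAME": ["Smith", "Smith"],
        "PRIMARY RESULTS": [55, 30],   # hypothetical column and numbers
    })
    general = pd.DataFrame({
        "STATE": ["California", "Wyoming"],
        "LAST NAME": ["Smith", "Smith"],
        "GENERAL RESULTS": [60, 35],   # made-up numbers
    })

    # SQL-style inner join on the shared key columns.
    combined = primary.merge(general, on=["STATE", "LAST NAME"])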

Grouping and aggregation

If I wanted to see all of a candidate’s general election votes across all states and election years, I could use this: all_data.groupby(["LAST NAME"], as_index=False).agg({"GENERAL RESULTS": "sum"}). This would give me a DataFrame with just “LAST NAME” and “GENERAL RESULTS” as columns.
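
In miniature, with made-up vote counts:

    import pandas as pd

    all_data = pd.DataFrame({
        "LAST NAME": ["Smith", "Smith", "Jones"],
        "GENERAL RESULTS": [60, 35, 50],  # made-up vote counts
    })

    totals = all_data.groupby(["LAST NAME"], as_index=False).agg(
        {"GENERAL RESULTS": "sum"}
    )
    #   LAST NAME  GENERAL RESULTS
    # 0     Jones               50
    # 1     Smith               95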

Deriving new columns

If I want to construct the elements of a new column by looking at the existing columns, I can create a function that accepts a row object which can be subscripted with my DataFrame’s column names, plus any additional arguments I want, and returns a single value. Then I can pass that function to DataFrame.apply(), as in my_data.apply(func_name, axis='columns', other_arg=other_val...). Beautifully, any extra keyword args such as other_arg will be passed directly through to func_name. The call to apply() will return a Series which I can then assign to my_data["new_col_name"].
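
A small sketch of the pattern (the gain function, its scale argument, and the column names are all my own invention):

    import pandas as pd

    my_data = pd.DataFrame({
        "PRIMARY RESULTS": [55, 30],   # hypothetical columns and numbers
        "GENERAL RESULTS": [60, 35],
    })

    # The row argument is subscriptable by column name; extra keyword
    # args to apply() (like scale=) are forwarded straight to the function.
    def gain(row, scale=1.0):
        return (row["GENERAL RESULTS"] - row["PRIMARY RESULTS"]) * scale

    my_data["GAIN"] = my_data.apply(gain, axis="columns", scale=1.0)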

That’s all for now.

cover art: Giant Panda Walking in the Snow, iPanda via Google Arts & Culture

Next time with pandas: I might recreate one of Matt Bruenig’s data tutorials using pandas, to show how it can simplify his work.