Simple and easy visualization of large, complex datasets

Posted on June 1, 2015

It was Feynman who said, “what I cannot create I don’t understand.” I’ve created a visualization tool for data exploration and analysis. It is very heavily inspired by the ggplot2 graphics library for R, which uses an expressive, powerful and flexible interface, the grammar of graphics, for producing plots. It is an underestimate to say that learning ggplot2 has increased my data exploration productivity by 10x.

Inspired by the elegance of the grammar of graphics, I implemented a toy version. For fun, I didn’t use any charting or data viz libraries like d3 or highcharts. I also built it in clojure to satisfy my longstanding curiosity about LISP enlightenment.

I’ve named it xvsy and you can find it on github.

Like ggplot2, plots are declaratively specified by what the graphic should look like, rather than how it should be generated. 1 Grammar of graphics abstractions for querying and visualizing data enable rapid insights to be gleaned from even highly complex and large datasets.

Nothing beats a good demo, so let’s get right to the one I’ve prepared below. The embedded plots are generated by my tool, and they are interactive! Mess around with them. If you get stuck, refresh the page to reset the plots to the original settings.

Demo: Which frequent flyer program to enroll in?

We’ll be using the Bureau of Transport Statistics Airline On-Time Performance Data to answer a simple question: what is the optimal frequent flyer program(s) to enroll in? For the purpose of illustrating the utility of the grammar of graphics, I’ll focus on finding airline carriers that offer direct flights to cities I frequently fly to. Excluded from the analysis are i. differences in perks for each program ii. international flights, and iii. ticket prices.

The dataset I use includes all US flights from January 2012 through March 2015, which is just under 20 million flights. Recorded for each flight was the date, the originating, Origin, and destination, Dest, airports; the arrival and deperature local times, ArrTime and DepTime; how long takeoff was delayed by, DepDelay, and over 100 other variables. Detailed explanations of each column may be found here.

I’ll be assuming that we are flying out of the San Francisco Bay Area throughout this demo, but you should customize the plots below to suit your needs.

Airline On-Time Performance Data: US Domestic Flights from Jan 1 2012 to March 31 2015

# Year Month DayofMonth Origin Dest UniqueCarrier DepTime ArrTime TailNum FlightNum
0 2012 1 1 OGG DFW AA 1801 512 N392AA 6
1 2012 1 1 DFW OGG AA 1210 1615 N392AA 7
2 2012 1 1 DFW HNL AA 1407 1808 N359AA 5
3 2012 1 1 OGG LAX AA 2201 508 N5DHAA 14
4 2012 1 1 JFK LAX AA 855 1142 N325AA 1
19689522 2015 3 31 DFW GSP MQ 2010 2337 N624MQ 3343
19689523 2015 3 31 DFW AEX MQ 1705 1821 N644MQ 3344
19689524 2015 3 31 ORD DAY MQ 1208 1418 N690MQ 3342
19689525 2015 3 31 STL LGA MQ 806 1117 N1EAMQ 3634
19689526 2015 3 31 GSP DFW MQ 559 800 N601MQ 3341

Let’s make a bar chart2

First, let’s examine the direct flights offered by the San Francisco (SFO), Oakland (OAK) and San Jose (SJC) airports.

I’ve set a filter to query only flights where the Origin airport is in one of SFO, OAK, SJC. The x-axis of our plot will correspond to the Dest airport of these flights, and the y-axis will count the number of these flights.

Behind the scenes, these parameters are transformed into SQL3.

1. Number of direct flights departing SFBA airports, from Jan 2012 to Mar 2015

Unsurprisingly, the most popular destination for flights leaving the Bay Area is Los Angeles (LAX). Nearby airports in Las Vegas (LAS) and San Diego (SAN) are also frequently flown to.

To iflyoak, or not to iflyoak, that is the question —

Choosing an airport to fly out of could be important. To that end, our bar chart would be more useful if it displayed whether the flights departed from SFO vs OAK vs SJC. This can be accomplished by setting the fill aesthetic to Origin, which causes the fill color of the bars to be mapped to the Origin airport.4

2. Number of direct flights departing SFBA airports, by Origin

Almost all direct flights to Chicago O’Hare (ORD) and New York JFK are flown out of SFO. This also holds for Boston (BOS) and Atlanta (ATL). Conversely, closer destinations such as Los Angeles, Seattle (SEA) and Denver (DEN) have flights that are nearly evenly represented across the three SFBA airports.

Plot 2 suggests the following time zone rule of thumb: fly out of SFO if the destination is in the Central or Eastern time zones. For major airports in Pacific and Mountain time zones, OAK and SJC are fine alternatives.

For Plot 3, let’s limit the analysis to flights headed to New York (EWR, JFK, LGA); Los Angeles (LAX, LGB, BUR); Chicago (ORD, MDW); Houston (IAH, HOU); Austin (AUS); Las Vegas (LAS) and Seattle (SEA). These also happen to be all the cities I’ve visited in the past 5 years, so it’s a good proxy for the cities I’m likely to fly to in the future.

Positioning the bars as dodged-bars also allows the relative flight volumes between the three SFBA airports to be readily compared.

3. Direct flights from SFBA to destinations in NYC, LA, Houston, Austin, Las Vegas and Seattle

Which frequent flyer program? Choose from United, Virgin and Southwest

Now let’s examine the destinations each airline flies to. The facet_x aesthetic slices the dataset into horizontal panels, which is used in Plot 4 to distinguish between the SFBA airports. The fill color has been changed to represent the airline carrier.

4. Direct flights from SFBA to select destinations, by Origin and UniqueCarrier

United (UA), Virgin America (VX) and Southwest (WN) are the clear winners in Plot 4. For plot 5, I’ll further restrict our analysis to only those airlines. Since I am primarily concerned with whether an airline offers direct flights to the destinations at all (instead of the exact number of flights), I’ll display the data as a 2D histogram, bin2d, plot.

5. Number of United (UA), Virgin (VX) and Southwest (WN) flights from SFBA to select destinations from 1/1/2012 - 3/31/2015

From plot 5, it is apparent that NYC is the only city Southwest doesn’t provide service to. Although Southwest is missing flights to New York (JFK), Newark Liberty (EWR), Houston Bush (IAH), and Chicago O’Hare (ORD), it does offer flights to Chicago and Houston by way of Midway (MDW) and Hobby (HOU), respectively.

Similarly, Virgin America has flights to all the cities listed except for Houston. United is the only airline that has direct flights to NYC, LA, Chicago, Houston, Seattle, Las Vegas and Austin from SFBA airports.

Ultimately, choosing an airline will involve more intricate trade offs in personal preferences. For example, if OAK is a far more convenient airport than SFO or SJC, Plot 4 indicates that Southwest may be a good choice. You may prefer Virgin over United if you appreciate Virgin’s design sensibilities and don’t visit Houston frequently.

Identifying these tradeoffs is dramatically easier with the structured and interactive approach to data analysis that the grammar of graphics enables.

Rapid data exploration facilitates interactive insights

The data visualization abstractions shown here enable rapid exploration of datasets, even when there are many columns involved. By specifying a plot by its color, position and facetting aesthetics, we can easily and iteratively determine the relationship between 4 or more variables.

In the flights dataset, we looked at the relationship between Origin and Dest airports, number of flights and the UniqueCarrier airline to glean a timezone rule and determine frequent flyer program candidates.

These abstractions are also highly performant. Although the flights dataset is tractable for a modern workstation (20 million rows, 10GB uncompressed), since the underlying queries are expressed in terms of map/aggregate functions, much larger datasets can easily be processed by Big Data frameworks based on map/reduce.

You can continue playing with the flights data on the standalone flights visualizer.

Hard drive failure analysis: the Backblaze dataset

The backup provider Backblaze has released a dataset on the failure rates of > 40,000 hard drives, which I’ve used to make a hard drive failure visualizer.

Drop me a line if you find something interesting!5

Updated June 16, 2015 to add github link to xvsy

  1. Similar to how higher order functions like map/filter/reduce abstract away the details of loop iteration.

  2. Because making a bar chart is the “hello, world!” of data viz.

      Dest AS x,
      COUNT(Dest) AS y
    FROM flights_dataset
    WHERE Origin IN ('SFO', 'OAK', 'SJC')
    GROUP BY x
  4. The SQL generated is now:

      Dest AS x,
      Origin AS fill,
      COUNT(Dest) AS y,
    FROM flights_dataset
    WHERE Origin IN ('SFO', 'OAK', 'SJC')
    GROUP BY x, fill
  5. Thanks to Chetan Narain and Mark Gray for reading drafts of this post.