Simple and easy visualization of large, complex datasets
It was Feynman who said, “what I cannot create I don’t understand.” I’ve created a visualization tool for data exploration and analysis. It is very heavily inspired by the ggplot2 graphics library for R, which uses an expressive, powerful and flexible interface, the grammar of graphics, for producing plots. It is an underestimate to say that learning ggplot2 has increased my data exploration productivity by 10x.
Inspired by the elegance of the grammar of graphics, I implemented a toy version. For fun, I didn’t use any charting or data viz libraries like d3 or highcharts. I also built it in clojure to satisfy my longstanding curiosity about LISP enlightenment.
I’ve named it xvsy and you can find it on github.
Like ggplot2, plots are declaratively specified by what the graphic should look like, rather than how it should be generated. 1 Grammar of graphics abstractions for querying and visualizing data enable rapid insights to be gleaned from even highly complex and large datasets.
Nothing beats a good demo, so let’s get right to the one I’ve prepared below. The embedded plots are generated by my tool, and they are interactive! Mess around with them. If you get stuck, refresh the page to reset the plots to the original settings.
Demo: Which frequent flyer program to enroll in?
We’ll be using the Bureau of Transport Statistics Airline On-Time Performance Data to answer a simple question: what is the optimal frequent flyer program(s) to enroll in? For the purpose of illustrating the utility of the grammar of graphics, I’ll focus on finding airline carriers that offer direct flights to cities I frequently fly to. Excluded from the analysis are i. differences in perks for each program ii. international flights, and iii. ticket prices.
The dataset I use includes all US flights from January 2012 through March 2015, which is just under 20 million flights. Recorded for each flight was the date, the originating,
Origin, and destination,
Dest, airports; the arrival and deperature local times,
DepTime; how long takeoff was delayed by,
DepDelay, and over 100 other variables. Detailed explanations of each column may be found here.
I’ll be assuming that we are flying out of the San Francisco Bay Area throughout this demo, but you should customize the plots below to suit your needs.
Airline On-Time Performance Data: US Domestic Flights from Jan 1 2012 to March 31 2015
Let’s make a bar chart2
First, let’s examine the direct flights offered by the San Francisco (SFO), Oakland (OAK) and San Jose (SJC) airports.
I’ve set a filter to query only flights where the
Origin airport is
in one of
SFO, OAK, SJC. The x-axis of our plot will correspond to the
Dest airport of these flights, and the y-axis will
count the number of these flights.
Behind the scenes, these parameters are transformed into SQL3.
1. Number of direct flights departing SFBA airports, from Jan 2012 to Mar 2015
Unsurprisingly, the most popular destination for flights leaving the Bay Area is Los Angeles (LAX). Nearby airports in Las Vegas (LAS) and San Diego (SAN) are also frequently flown to.
To iflyoak, or not to iflyoak, that is the question —
Choosing an airport to fly out of could be important. To that end, our bar chart would be more useful if it displayed whether the flights departed from SFO vs OAK vs SJC. This can be accomplished by setting the
fill aesthetic to
Origin, which causes the fill color of the bars to be mapped to the Origin airport.4
2. Number of direct flights departing SFBA airports, by Origin
Almost all direct flights to Chicago O’Hare (ORD) and New York JFK are flown out of SFO. This also holds for Boston (BOS) and Atlanta (ATL). Conversely, closer destinations such as Los Angeles, Seattle (SEA) and Denver (DEN) have flights that are nearly evenly represented across the three SFBA airports.
Plot 2 suggests the following time zone rule of thumb: fly out of SFO if the destination is in the Central or Eastern time zones. For major airports in Pacific and Mountain time zones, OAK and SJC are fine alternatives.
For Plot 3, let’s limit the analysis to flights headed to New York (EWR, JFK, LGA); Los Angeles (LAX, LGB, BUR); Chicago (ORD, MDW); Houston (IAH, HOU); Austin (AUS); Las Vegas (LAS) and Seattle (SEA). These also happen to be all the cities I’ve visited in the past 5 years, so it’s a good proxy for the cities I’m likely to fly to in the future.
Positioning the bars as
dodged-bars also allows the relative flight volumes between the three SFBA airports to be readily compared.
3. Direct flights from SFBA to destinations in NYC, LA, Houston, Austin, Las Vegas and Seattle
Which frequent flyer program? Choose from United, Virgin and Southwest
Now let’s examine the destinations each airline flies to. The
facet_x aesthetic slices the dataset into horizontal panels, which is used in Plot 4 to distinguish between the SFBA airports. The fill color has been changed to represent the airline carrier.
4. Direct flights from SFBA to select destinations, by Origin and UniqueCarrier
United (UA), Virgin America (VX) and Southwest (WN) are the clear winners in Plot 4. For plot 5, I’ll further restrict our analysis to only those airlines. Since I am primarily concerned with whether an airline offers direct flights to the destinations at all (instead of the exact number of flights), I’ll display the data as a 2D histogram,
5. Number of United (UA), Virgin (VX) and Southwest (WN) flights from SFBA to select destinations from 1/1/2012 - 3/31/2015
From plot 5, it is apparent that NYC is the only city Southwest doesn’t provide service to. Although Southwest is missing flights to New York (JFK), Newark Liberty (EWR), Houston Bush (IAH), and Chicago O’Hare (ORD), it does offer flights to Chicago and Houston by way of Midway (MDW) and Hobby (HOU), respectively.
Similarly, Virgin America has flights to all the cities listed except for Houston. United is the only airline that has direct flights to NYC, LA, Chicago, Houston, Seattle, Las Vegas and Austin from SFBA airports.
Ultimately, choosing an airline will involve more intricate trade offs in personal preferences. For example, if OAK is a far more convenient airport than SFO or SJC, Plot 4 indicates that Southwest may be a good choice. You may prefer Virgin over United if you appreciate Virgin’s design sensibilities and don’t visit Houston frequently.
Identifying these tradeoffs is dramatically easier with the structured and interactive approach to data analysis that the grammar of graphics enables.
Rapid data exploration facilitates interactive insights
The data visualization abstractions shown here enable rapid exploration of datasets, even when there are many columns involved. By specifying a plot by its color, position and facetting aesthetics, we can easily and iteratively determine the relationship between 4 or more variables.
In the flights dataset, we looked at the relationship between Origin and Dest airports, number of flights and the UniqueCarrier airline to glean a timezone rule and determine frequent flyer program candidates.
These abstractions are also highly performant. Although the flights dataset is tractable for a modern workstation (20 million rows, 10GB uncompressed), since the underlying queries are expressed in terms of map/aggregate functions, much larger datasets can easily be processed by Big Data frameworks based on map/reduce.
You can continue playing with the flights data on the standalone flights visualizer.
Hard drive failure analysis: the Backblaze dataset
Drop me a line if you find something interesting!5
Updated June 16, 2015 to add github link to xvsy
Similar to how higher order functions like map/filter/reduce abstract away the details of loop iteration.↩
SELECT Dest AS x, COUNT(Dest) AS y FROM flights_dataset WHERE Origin IN ('SFO', 'OAK', 'SJC') GROUP BY x
The SQL generated is now:
SELECT Dest AS x, Origin AS fill, COUNT(Dest) AS y, FROM flights_dataset WHERE Origin IN ('SFO', 'OAK', 'SJC') GROUP BY x, fill
Thanks to Chetan Narain and Mark Gray for reading drafts of this post.↩