Wednesday, October 22, 2014

Blog 3: Fly Happy

Those who travel often have an airport they love to hate. These airports drag out an already long and uncomfortable day. In reviewing the flight data for 2013, the airport with the greatest amount of arrival delay was Chicago O’Hare. When faced with traveling through an airport with lots of delays what can you do to minimize your chances of being delayed? As a team we took a look at the data to find some answers.

To get the data on flights we used two sources. The first was http://www.rita.dot.gov/ and the second was http://openflights.org/ . RITA, Research and Innovative Technology Administration, is a site managed by the United States Government and it provides large data sets pertaining to transportation. RITA provided the core data about flights, delays, airports, and much more that we needed for analysis. Unfortunately, some of the data is kind of cryptic. For example the airports are listed with a three character airport code. This does not help in knowing the full airport name or in knowing the exact location of the airport.

This is where the second data set comes in. Openflights.org provides a dataset with all the location and identification data for all the airports. By combining the two data sets based on the three character airport ID we are able to get a more complete understanding of the data.

Once we had the data we needed a way to make sense of it. First we cleaned the data by removing information we did not need, such as on time flights. Using the power of pivot tables and strong integration with MySQL, Excel has helped us to quickly clean & make meaning out of data. The data is historical and structured simplifying the migration to MySQL for quick query processing. MySQL also integrated well with our visualization tool of choice, tableau. The ease of using measures, dimensions, and coming up with dashboards is what inspired us to use tableau. Also, using Tableau Public, we were able to publish our dashboards online and obtain JavaScript’s to embed.

With the data prepared we were ready to answer some questions. The first visualization provides insights into flight delays for routes originating from Chicago O'Hare International Airport. The darkness of the routes corresponds to the length of the arrival delay. The pie charts illustrate how each delay type contributes to the total arrival delay.
 

First we wanted to know who which airports have the greatest arrival delay from Chicago. The next visualization shows the top 10 destination airports from Chicago with maximum delay, where dark orange represents the most delayed destination – ‘La Guardia’. This could be due to the large number of flights from Chicago for ‘La Guardia’.

As a passenger there are not a lot of options for different layovers when traveling to your destination.  If you must travel through an airport with heavy delays what can you do to mitigate the chances of being delayed? This question drove our next several questions. Starting with when is the best time of day and day of the week to fly to avoid delays? In analyzing the data, we plotted delays based on the time of day and day of the week, creating a map of when to fly. We can see early hours (5a.m. to 8 a.m.) are best to fly while flight departing 3:00 p.m. to 6 p.m. have most arrivals delays. Also, Wednesday, Thursday, and Sunday have highest delays in afternoons. Sunday is day 7.

We performed the same analysis for month and day showing that heavy delays are expected in June (6). This particular chart allowed us to filter based on air carrier so we can tell when each carrier had delay issues.

Code
Carrier
9E
Endeavor Air
AA
American
DL
Delta
EV
ExpressJet
MQ
American Eagle
OO
Skywest
UA
United
US
USAir
VX
Virgin America
YV
Mesa Airlines

Carrier issues was the next point of investigation. Which carrier has the least or most delays? Delta DL has the most cancelations mostly since it runs Endeavor Air as well.



We then did the same type of analysis on canceled flights.
We found that La Guardia has the most cancelations while Los Angeles has the fewest.

We also found that Saturday (6) is the best day to travel to avoid cancelations.
In conclusion we found that it is best to travel early in the morning on a Saturday. June is the worst month to fly, Delta has the most issues, and La Guardia is almost guaranteed to give you issues.



Wednesday, October 1, 2014

Blog 2: NoSQL vs. Relational Databases

In the world of databases, the trend toward NoSQL databases is increasing. There are strong opinions from both sides, with each side claiming that their database solution is superior to the other while citing different reasons. An example of this is the comparison of Datastax to other DBMS solutions as seen in the table below. The question is, who among these different sects of databases is right, or are they wrong altogether?

[3]

The Pros and Cons of NoSQL vs. Relational Databases

A good debate on what database type to use, and when to use it, is illustrated in the following video of Craig Steadmans interview of William McKnight, president of McKnight Consulting Group.


The two main advantages of NoSQL database over relational database:
  1. The NoSQL database scales very well.
With NoSQL, when the database is too slow or too big, you can easily add more servers by creating a cluster or replica-set of multiple shards. However, a relational database does not have such good scalability.
“NoSQL databases usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.” [1]
  1. The NoSQL database allows for heterogeneous data.
In a computer hardware store all of the products have a price and a vendor. However, different components of the computer have different properties. For example, CPUs have a clock rate, hard drives and RAM chips have a capacity, monitors have a resolution. In the relational database, there are two ways to deal with this real world problem. The first option is to create a very long productID-property-value table. The second option is to create a very wide and sparse product table with every property, but the problem is that most of the values in the table would be NULL. In a NoSQL database, this problem is more easily avoided because it allows each document in a collection to have a different set of properties.
The two main disadvantages of NoSQL database over relational database:
  1. Denormalization
When your data is very relational and able to be denormalized, the relational database would be your best choice because the NoSQL database doesn't use JOINs.

For denormalizing the data in a relational database, there are several rules on how to normalize the data. But the NoSQL database is rather new technology, it lacks rules about denormalization.
  1. Complex transactions
The NoSQL database does not handle complex transactions as well as the relational database. When the actions affect more than one document, the NoSQL database cannot guarantee consistency between the tables.

Its about your data
Jnan Dash, a long time database professional, points out several things to consider about the data before deciding what choice to make in a database solution.
Tabular vs Complex
If the data has a simple tabular structure, like an accounting spreadsheet, then the relational model could be adequate. Data such as geo-spatial, engineering parts, or molecular modeling data, on the other hand, tends to be very complex. It may have multiple levels of nesting and the complete data model can be complicated. Such data has, in the past, been modeled into relational tables, but has not fit into that two-dimensional row-column structure naturally. [2]
Historical vs Dynamic
“What is the volatility of the data model?" Is the data model likely to change and evolve or is it most likely going to stay the same? Generally speaking, all the facts about the data model are not known at design time, so some flexibility is needed. This presents many issues to the relational database management system (RDBMS) users of the world. [2]
Conclusion:
Each type of database serves a different purpose. NoSQL is good for complex unstructured data that is constantly changing and gaining new attributes. RDBMS is good for structured data where the data attributes are well defined and understood. RDBMS has a proven record spanning decades and is trusted, while NoSQL is a relatively new configuration lacking the same trust. So the best database is the one that best fits your data and goals. But, if you still can't make up your mind don’t worry, some database vendors are developing solutions that allow for the coexistence of NoSQL and RDBMS.[2]

References APA Format:
  1. NoSQL Databases Explained. (2014, October 1). Retrieved October 1, 2014, from http://www.mongodb.com/nosql-explained

  1. Dash, J. (2013, September 18). RDBMS vs. NoSQL: How do you pick? | ZDNet. Retrieved October 1, 2014, from http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/

  1. Relational Database to NoSQL. (2014, October 1). Retrieved October 1, 2014, from http://www.datastax.com/relational-database-to-nosql

  1. Preslar, E. (2013, September 16). McKnight: Relational vs. NoSQL databases not a winner-take-all game. Retrieved October 1, 2014, from http://searchdatamanagement.techtarget.com/video/McKnight-Relational-vs-NoSQL-databases-not-a-winner-take-all-game