
The Prediction of Flight Delays within the Data Science pipeline: Part I

[first published on RPubs.com on 09 June 2016]

Introduction


This is the first in a series of posts working through a complete example of data analysis, from data gathering to the publication or dissemination of the results. The series covers the entire Data Science workflow (see The Prediction of Flight Delays within the Data Science pipeline), and this post is a first version of steps 1 to 10 in that list.

We use a few datasets taken from the Airline on-time performance contest, http://stat-computing.org/dataexpo/2009/. Results are available on that website but, as far as I know, none of those posted use R, which is why I use it here. In passing we may touch on specific libraries and on different ways of doing things. In fact, we start with a few command-line instructions under Linux and may use different tools to show some graphical results.

The full dataset is large: “the data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed”. I therefore arbitrarily selected two datasets, 2007 and 2008, and perform similar analyses on both. You may ask why not just one: I want to verify that the outcomes of the two are more or less the same. The 2007 dataset, downloaded first, consists of over 7 million observations and 29 variables. The 2008 dataset, as we shall see, is bigger. To spot differences in the recorded data over the years, I will also look at the oldest data, from 1987. That dataset is smaller, so (as you may guess) the amount of data increased over time. We will try to see whether this is due just to increased traffic, to more information (features) being recorded, or perhaps to both.
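Before loading anything into R, a quick command-line check can show how the yearly files compare in number of observations and number of variables. What follows is only a minimal sketch, not the procedure used later in the series: the function name summarise_csv is hypothetical, and I assume the yearly files have been downloaded and saved locally as 1987.csv, 2007.csv and 2008.csv (or their .bz2 compressed versions), each with one header row whose field names contain no commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report observations (rows minus header) and
# variables (fields in the header) for one yearly CSV file.
summarise_csv() {
  local file="$1" reader="cat"
  # Read compressed files without unpacking them to disk.
  case "$file" in *.bz2) reader="bzcat" ;; esac
  "$reader" "$file" | awk -F',' -v name="$file" '
    NR == 1 { cols = NF }   # the header row defines the variables
    END { printf "%s: %d observations, %d variables\n", name, NR - 1, cols }'
}

# Assumed local file names; years that are not present are skipped.
for year in 1987 2007 2008; do
  if [ -f "$year.csv" ]; then summarise_csv "$year.csv"; fi
done
```

Counting the fields of the header only keeps the check cheap on files with millions of rows. If the number of variables differs between 1987 and 2007, more features were recorded over time; if only the row count differs, the growth comes from the number of recorded flights alone.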

Who am I?
Carlo Fanara

Following a career in IT, I spent 20+ years in physics (MSc in Nuclear Physics, Turin, Italy, and a PhD in plasma physics, Cranfield, UK).

After several years in academia, I moved to private companies in 2008, working in R&D and, since 2014, in Data Science.

My passions include Science and Languages.

I teach programming and Data Science and occasionally blog on these matters.

Current interests: Deep Learning, IoT, Time Series, and Rare Events.
