
The Prediction of Flight Delays within the Data Science pipeline: introduction

Calm down: I am not going to predict the delay of the plane you just boarded (although someone might...). I am simply showing an example of a prediction problem that happens to deal with plane delays (and past ones, at that).


The aim is to present implementation examples of the Data Science workflow.

The web is full of examples of machine learning algorithms and code chunks on various aspects of exploration and prediction. However, there are not that many examples that follow the whole pipeline, from inception down to publication in a notebook (well, the valuable Titanic example aside).

Assume that a problem has been posed; call this "step 0" if you want...

This is a long and difficult task: it requires a clear-cut definition, which comes from the elicitation phase, that is, the interaction with the customers/stakeholders, who usually do not know many (or any) of the technical aspects (and that's OK, because it is not their job). We get back 'home' after a few rounds with them (or our boss does) and are faced with "The Question(s)". Now what?

I split the 'to-dos', our workflow, into several steps. We may further group these into phases (more on this below), and I will provide code chunks dealing with the entire pipeline:

  1. Data gathering

  2. Data Exploration (get acquainted)

  3. Tidy up

  4. Questions you may want to ask (and hopefully answer)

  5. A bit of engineering

  6. Model the data (train)

  7. Evaluate your prediction

  8. The real prediction

  9. Tell somebody what you did (and how)

  10. Clean it up!

Steps 1 to 3 are (at least partially) absent from competitions like Kaggle's. They often involve an iterative procedure, especially step 2, as you may discover new aspects of the data, perhaps even later down this list.


Step 4 is what your boss is asking you, or what you think should be asked, and it is relevant to the business based on the knowledge you have acquired. This should be checked carefully against step 0 ("what is it that the customer actually wants?"), mapping our questions to the customer's problem, the 'Question'. Any mismatch should stop us from proceeding any further. In this case, we are likely to loop back to step 1 (or even 0).

Steps 5 and 6 might be swapped, as the engineering will depend on the chosen model. In step 6 you may find that some features need to be changed in format (class), dropped, or split into others, or, when they are factors, that their levels should be changed.
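To make these four kinds of change concrete, here is a minimal sketch in plain Python. It is an illustration only, not the code used in the posts that follow: the records, the column names (dep_time, carrier, tail_num, arr_delay), the "OTHER" label and the min_count threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical records; the column names are illustrative only.
flights = [
    {"dep_time": "0735", "carrier": "AA", "tail_num": "N123", "arr_delay": "12"},
    {"dep_time": "1810", "carrier": "ZZ", "tail_num": "N456", "arr_delay": "-3"},
    {"dep_time": "0905", "carrier": "AA", "tail_num": "N789", "arr_delay": "47"},
    {"dep_time": "2250", "carrier": "DL", "tail_num": "N321", "arr_delay": "0"},
    {"dep_time": "1215", "carrier": "DL", "tail_num": "N654", "arr_delay": "5"},
]


def engineer(rows, min_count=2):
    """Return new rows with the four kinds of change applied."""
    # Count how often each level of the categorical column occurs.
    counts = Counter(row["carrier"] for row in rows)
    out = []
    for row in rows:
        out.append({
            # Change of format (class): text to integer.
            "arr_delay": int(row["arr_delay"]),
            # Split: one "hhmm" column becomes two numeric columns.
            "dep_hour": int(row["dep_time"][:2]),
            "dep_minute": int(row["dep_time"][2:]),
            # Change of levels: rare categories are merged into "OTHER".
            "carrier": row["carrier"] if counts[row["carrier"]] >= min_count else "OTHER",
            # Drop: tail_num is simply not copied over.
        })
    return out


engineered = engineer(flights)
```

Which of these changes is worth making depends on the model chosen in step 6, which is exactly why the two steps may end up swapped.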


Steps 7 and 8 are seldom available in competitions, as some external algorithm will establish your accuracy and place you on the leaderboard (out of your control). In real cases you may iterate steps 6 and 7 before getting to step 8, whose performance is probably assessed by external means (by domain knowledge).
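The loop between steps 6 and 7 can be sketched as follows, assuming a simple hold-out split and the mean absolute error as the score. Everything here is a stand-in: the delays are synthetic numbers, and the two "models" are just constant predictors (mean and median), chosen only to keep the example short.

```python
import random
from statistics import mean, median

random.seed(0)

# Synthetic delays in minutes, NOT real flight data.
delays = [random.gauss(10, 15) for _ in range(200)]

# Keep part of the data aside: it is used only for evaluation (step 7).
train, holdout = delays[:150], delays[150:]


def mae(prediction, actual):
    """Mean absolute error of a constant prediction."""
    return mean(abs(a - prediction) for a in actual)


# Step 6: "train" each candidate on the training part.
candidates = {"mean": mean, "median": median}

# Step 7: score each candidate on data it has not seen.
scores = {name: mae(fit(train), holdout) for name, fit in candidates.items()}

# Keep the candidate with the lowest error, then go round again if needed.
best = min(scores, key=scores.get)
print(scores, "->", best)
```

In a competition the hold-out part is kept by the organizers; here it is in your hands, so you can repeat the loop as often as you need before the real prediction of step 8.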

Step 9 is often overlooked because you are so deep into the technical aspects that you forget that another human being needs to understand, evaluate (and maybe sell) your outcomes. It needs to be matched both to the 'Question' from step 0 and to step 4. Actually, this should be split into two parts: one for the stakeholder (the 'what') and one for yourself (the 'what' plus the 'how') as an internal reference; see also the next step.

Step 10... So you thought you were finished? Well, in a fast-paced environment you might be and, unfortunately, there may be little time to polish things to the point where you feel satisfied (I never am...). But consider this: if you are able to polish by generalizing, standardizing (for example, transforming some code chunks into functions) and, above all, documenting what you have done (!), you'll be ready for re-use next time.

And this is also a good occasion to measure the time spent on each activity. "What?! Time sheets for such an arty-crafty-highly-intellectual endeavour?" Yes, do it. If not for the single steps (perhaps difficult to tell apart), at least for the major phases.

It will help, NOT to measure your performance, but to improve your estimates when faced with a similar problem ("it will be ready in one week/month/year", etc.).
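A time sheet does not need to be elaborate. The sketch below is one possible way to do it in Python, with a small context manager that adds up the seconds spent in each phase. The names phase and time_sheet are made up for this example, and the phase labels and the work done inside each block are placeholders.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Seconds accumulated per phase (hypothetical phase names below).
time_sheet = defaultdict(float)


@contextmanager
def phase(name):
    """Add the time spent inside the 'with' block to the given phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises an error.
        time_sheet[name] += time.perf_counter() - start


with phase("data preparation"):
    sum(range(100_000))  # placeholder for the real work

with phase("modelling"):
    sorted(range(100_000), reverse=True)  # placeholder for the real work

with phase("data preparation"):
    sum(range(50_000))  # a second visit adds to the same phase

for name, seconds in time_sheet.items():
    print(f"{name}: {seconds:.4f} s")
```

Because visits to the same phase add up, the iterative nature of the workflow (looping back to an earlier step) is reflected in the totals.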

I am sure that each of these parts can be done better, more efficiently, and presented in a nicer manner. But they represent an effort to standardize your workflow, and that's why I propose step 10. Time permitting, I will transform the whole lot into a package.


So then, the first post will be Air Flights 2007 and 2008 - Part I, which contains all the steps mentioned. However, it is a working document from which we start our analysis of the individual steps in the posts that follow.

 

Who am I?
Carlo Fanara

Following a career in IT, I spent 20+ years in physics (MSc in Nuclear Physics, Turin, Italy, and a PhD in plasma physics, Cranfield, UK).

After several years in academia, in 2008 I moved to private companies, working in R&D and, since 2014, in Data Science.

My passions include Science and Languages.

I teach programming and Data Science and occasionally blog on these matters.

Current interests: Deep Learning, IoT, Time Series, and Rare Events.
