
What is the maths you really need in Data Science?

Highly debatable subject, especially because the job title "Data Scientist" is still fluid and means different things to different people. Assuming the term 'Data' fits everybody, the addition of 'Scientist' leads to a number of ideas about what a scientist is (or should be) and does (or should do), like here or here or ... I'll just stop here.


If we stick to the notion that a scientist investigates the unknown and contributes original and novel insights, then she needs a few more skills, maybe (just maybe) in addition to the job specification of the Data Analyst. So, yes, there is a risk of being seen as the 'jack of all trades'.


Let's be non-parametric and skip questions like "does a Data Scientist need a PhD/MSc?" and let the data speak (my data; yours may differ, and that's perfectly fine).

The list may grow as you touch upon more exotic subjects (Random Walks, Markov Chains, Boltzmann Machines...), but this is what I think I use in my work, beyond the obvious grasp of statistics (in contrast to other opinions on the matter):

  1. Computational cost

  2. System Theory

  3. Signal analysis

  4. Design of Experiments

Let's look at them. Note, for each, I provide (i) one link to the basics (Wikipedia or other introductory material) and (ii) another, more advanced taster (because it's impossible to be exhaustive).


1. Computational Cost. Or perhaps 'just' the evaluation of algorithms in terms of performance against resources, typically memory, CPU time, etc. To do this you need basic algebra and specifically elementary combinatorics (sometimes under the umbrella of discrete mathematics). Basically, you need to be able to count: go through the loops, if there are any, and estimate the number of lines of code that will actually be executed. This can be quite difficult and you will rarely know exactly, so you estimate an 'order of magnitude', the famous "of the order of", which in maths is expressed as "O(some function)" or "Big O". So, quite literally, you need to know your limits (yes, that's calculus).
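A minimal sketch of that counting exercise (the function and its name are illustrative, not from any particular library): instrument a nested loop, count how many times the inner body runs, and watch the count grow as the input grows.

```python
def count_pairwise_ops(n):
    """Count inner-loop executions of a nested pairwise comparison."""
    ops = 0
    for i in range(n):
        for j in range(n):
            ops += 1  # one unit of work per inner iteration
    return ops

# Doubling n roughly quadruples the count: the signature of O(n^2).
print(count_pairwise_ops(10))   # 100
print(count_pairwise_ops(20))   # 400
```

When the count scales as the square of the input size, you have found your "Big O" empirically, without a formal proof.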


2. System Theory. Well, this is perhaps a long shot. What I mean here is the ability to look at an algorithm as a 'system' you know nothing about. Why? Because it will soon become complex and/or deal with a large amount of data and/or with a large number of variables (100, 1,000, 1M features?). So you will need to relate input parameters to outputs, and do so in a systematic way, formalizing what your black box does even if you do not know how it does it. This is the situation not only in software engineering (e.g. in testing) but also in some ML algorithms like Random Forest (RF) and in Neural Network applications (which do not even need to be 'deep'). People just do not know why these marvels work (e.g. an open question is: "can we predict which nodes will fire during training?").
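One hedged sketch of that black-box attitude: pretend you cannot read the model's internals and estimate how its output responds to a small input perturbation. The `black_box` function below is a hypothetical stand-in for any model.

```python
def black_box(x):
    """Stand-in for a model whose internals we pretend not to know."""
    return 3.0 * x + 1.0

def sensitivity(f, x, eps=1e-4):
    """Finite-difference estimate of how the output responds to the input."""
    return (f(x + eps) - f(x)) / eps

# For this (secretly linear) box, the response is the slope, ~3.0.
print(round(sensitivity(black_box, 2.0), 3))
```

The same probe works unchanged on a Random Forest's prediction function or a trained network: you characterize the input-output relation without ever opening the box.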


3. Signal Analysis. (Or is it just about "functions"?) If your 'code' is a system, then you may want to see what it does (the response) when you feed it some input (data and parameters). This is the typical situation in regression, linear or not, but also when you deal with Neural Networks or Deep Learning. People have switched from the sigmoid (early days) to more modern and efficient activation functions, like the Rectified Linear Unit (ReLU). So you need to know your basics on functions and, more generally, what a graphed function is telling you (that falls again under "Calculus", but it should be a skill of the analyst as well).
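The two activations mentioned above are simple enough to write down directly; a quick sketch like this lets you tabulate their responses and see why the graphs differ (sigmoid saturates, ReLU does not):

```python
import math

def sigmoid(x):
    """Classic sigmoid: smooth, bounded in (0, 1), saturates for large |x|."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

# Tabulate the responses to a few inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), relu(x))
```

Reading such a table (or its graph) is exactly the "response to an input" habit the section is about.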


4. Design of Experiments. To do the above, you need Design of Experiments (DOE): designing a suitable set of "experiments" in order to test your hypothesis (because you have one or more, haven't you?). You may ask questions like "how many experiments do I need to run with parameter x1 fixed and parameters x2, ..., xk varying?" Or: "shall I change one element at a time, keeping all the rest constant?" (incidentally, you may find this is not the ideal thing to do). Or else, you may want to find variables ("factors") that impact your target, or even new factors you didn't know about. DOE is a statistically sound way of optimizing your tests, based on the fact that you have finite time and finite resources.
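The counting question above has a concrete answer in the simplest design, the full factorial: one run per combination of factor levels. A minimal sketch (the factor names and levels are made up for illustration):

```python
from itertools import product

# Hypothetical factors, each with its candidate levels.
factors = {
    "x1": [0.1, 0.5],        # two levels
    "x2": ["low", "high"],   # two levels
    "x3": [1, 2, 3],         # three levels
}

# Full factorial design: the Cartesian product of all levels.
design = list(product(*factors.values()))

# Number of runs is the product of the level counts: 2 * 2 * 3 = 12.
print(len(design))
```

With many factors this count explodes, which is precisely why DOE offers fractional designs that probe the factor space with far fewer runs.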


Is that all? I am not sure. And besides, these elements may not all be used at once in all of your work.

Finally, an open ended question for you: do you have examples of other maths elements deployed in your analysis?




Who am I?
Carlo Fanara

Following a career in IT, I spent 20+ years in physics (MSc in Nuclear Physics, Turin, Italy, and a PhD in plasma physics, Cranfield, UK).

After several years in academia, in 2008 I moved to private companies, working in R&D and, since 2014, in Data Science.

My passions include Science and Languages.

I teach programming and Data Science, and occasionally blog on these matters.

Current interest: Deep Learning, IoT, Time Series, and Rare Events.
