What are Advanced Analytics, Data Science, and Machine Learning, and why do they matter?

Rafal Lukawiecki asks “What is the value of machine learning to business?”

I have been working in data science for more than a decade, using machine learning algorithms long before their recent popularity. I have noticed that business people frequently ask me to explain the value of machine learning. They would like to find out if there is a need for them to get involved with it. If you have (been) asked about it, perhaps I can help!

First let’s introduce some terminology.

Machine learning and data mining

For most business users, data mining, an older term dating from the mid-1960s, is the same thing as machine learning. But there is an important difference. Data mining is the application of machine learning to detect patterns hiding in data sets. Machine learning, on the other hand, deals with the algorithms that can find those patterns.

There are hundreds of such algorithms, e.g. boosted decision trees, expectation maximisation clustering, or deep neural networks. While understanding them is of great academic and commercial interest, it suffices for a practical user to know that they all belong to just a few common groups. The most notable classes are classifiers, regressions, clustering, and algorithms that discover similarities or links between cases or events. Classifiers and clustering are used for many purposes, such as predicting whether something is broken, whether a customer is happy, or whether fraud is about to occur; understanding what behaviour is unusual and should be analysed further; or making shopping recommendations.

Advanced analytics

When I talk about advanced analytics, I usually think of a mix of data science (mainly descriptive statistics) supplemented with modern, exploratory data visualisation (think Power BI, Tableau, QlikView, or ggplot2 in R), perhaps some traditional BI (think cubes or tabular models), and a good dose of data acquisition and cleaning. Data preparation may include traditional ETL (Extract-Transform-Load) but usually means hunting for the data you need—like programmatic trawling through transaction or event logs, or even generating new data through experimentation.

Today, advanced analytics is a human-intensive activity. Most of it focuses on data exploration, and as a result it offers the business a new understanding of what has been happening without their being aware of it, e.g. why they are losing some good clients, what is being missed, or even why something happens in certain situations. For instance, it can help you diagnose why a system fails when you least expect it, which outliers triggered a false-positive security denial, or even what your customers want to do next that you are not delivering to them.

Even market basket analysis (MBA), which is definitely a form of advanced analytics, is possible thanks to machine learning algorithms such as association rules. Interestingly, this exploratory side of advanced analytics leads neatly to the predictive, automated decision-making side of AI (Artificial Intelligence). It helps to be good at advanced analytics before you start serious AI.
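To make the idea of association rules a little more concrete, here is a minimal sketch in plain Python. It only looks at pairs of items and computes the three usual measures (support, confidence, and lift). The baskets, the function name pair_rules, and the min_support threshold are all invented for illustration; a real project would use a proper library and far more data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]       # P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            rules.append((x, y, support, confidence, lift))
    # Highest lift first.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


for x, y, support, confidence, lift in pair_rules(baskets):
    print(f"{x} -> {y}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than chance alone would explain, which is the kind of pattern a shopping recommendation might build on.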

Data science

Data science is the application of the scientific method to decision-making, based on data representing facts – usually about people or objects and events connecting them. Data science uses statistics, statistical and machine learning, some big data technologies mainly for optimisation needs, and a good deal of often tedious data wrangling. As a data scientist I often play the role of a janitor, cleaning and prepping data – and a lot of my peers agree that over 80% of our workload is data wrangling (aka munging). Being good at that makes a productive data scientist, but it also enables good AI and advanced analytics.

What makes data science powerful is that we apply statistical reasoning when we claim that our findings are somehow significant. Unlike a naive application of machine or statistical learning, we apply a rigorous process of validating our data, our models, and their predictions. Essentially, we can stand by the claims we make because we can explain our results, we avoid black-box solutions (i.e. “computer says no”), and we express our level of confidence or doubt using good old statistics.
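As one small illustration of expressing confidence or doubt with statistics, the sketch below puts an approximate 95% interval around a measured accuracy, using the Wilson score interval for a proportion. The figures (87 correct out of 100 held-out cases) are hypothetical, and this is just one of several reasonable interval methods, not a prescription.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical result: a model got 87 of 100 held-out cases right.
low, high = wilson_interval(87, 100)
print(f"measured accuracy 0.87, plausible range {low:.3f} to {high:.3f}")

# The same accuracy measured on ten times more cases gives a narrower range.
low_big, high_big = wilson_interval(870, 1000)
print(f"with 1000 cases: {low_big:.3f} to {high_big:.3f}")
```

Reporting the range rather than the single number makes it clearer how much a headline accuracy figure can be trusted given the amount of test data behind it.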

Artificial Intelligence

The current most-in-vogue term is not new. AI is more than 60 years old, but it has recently found some practical and impressive applications, especially in the domains of image, voice, and text recognition. However, AI is much more than telling frowns from smiles in photos. Indeed, the most interesting part of AI is yet to come to the masses: automated reasoning.

Being able to make autonomous low-level decisions that increase the chance of reaching predefined success criteria at a higher, macro or organisational level is the holy grail of AI. We are still very far from achieving it, perhaps because as an industry we have taken a side-step into recognition-based applications (back to distinguishing photos of cats from dogs, etc.). That side-step gave birth to the fairly complex field of deep learning with neural networks, a subclass of machine learning algorithms that are great at recognising things but not so good at reasoning and decision making.

Having said that, it is possible to safely dip your toes into reasoning and AI by building transparent (that is, explicable rather than black-box) and well-secured systems that work in conjunction with other forms of AI that are not based on machine learning. However, this is a subject for another post. In the meantime, feel free to learn more about the risks of going there too early by reading about one of my favourite subjects, Artificial Stupidity: https://projectbotticelli.com/future-series-2019

So now we understand the terms, let’s return to my earlier question: what is the value of machine learning to businesses?

There are five common reasons my customers need it, in this order:

  1. Discover reason behind success, failure
  2. Understand customers, products, patterns
  3. Accurately plan future
  4. Experiment before making decisions
  5. Experiment with autonomous decision making: AI

Few of my customers get to number 5 yet, but they seem to be aiming for it. 60% of my customers need just 1 and 2; fewer than 40% also need 3 and 4.

Validity

My main concern with the widespread use of machine learning is that I too often see a supposedly production-quality predictive model fail to work as well as expected, or perhaps it only works for a short period of time, before decaying. This is a well-known issue stemming from the lack of model reliability testing. Reliability, unlike accuracy of a model, is hard to test. There are techniques, such as cross-validation (CV) that test for it, but those tests can only tell you if a model is unreliable—they cannot guarantee that a model is reliable. In other words, if you fail the CV test, you know you have a problem, but if you pass it, you may…still have a reliability problem!
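For readers who have not seen cross-validation before, here is a rough sketch of a k-fold version in plain Python. The data is synthetic, and the "model" (classify by the nearer class mean) is a deliberately simple stand-in; the function names k_fold_scores, fit_means, and predict_nearest are made up for this example.

```python
import random
import statistics


def k_fold_scores(rows, fit, predict, k=5, seed=0):
    """Return one accuracy score per fold of a k-fold cross-validation."""
    rows = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit(train)
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return scores


def fit_means(train):
    """Stand-in model: remember the mean feature value of each class."""
    labels = {y for _, y in train}
    return {
        label: statistics.mean(x for x, y in train if y == label)
        for label in labels
    }


def predict_nearest(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


# Synthetic, made-up data: two overlapping groups on one numeric feature.
rng = random.Random(42)
data = [(rng.gauss(0, 1), "ok") for _ in range(50)]
data += [(rng.gauss(2, 1), "faulty") for _ in range(50)]

scores = k_fold_scores(data, fit_means, predict_nearest, k=5)
print("fold accuracies:", [round(s, 2) for s in scores])
print("mean:", round(statistics.mean(scores), 2),
      "spread (stdev):", round(statistics.stdev(scores), 2))
```

A large spread between folds is a warning sign of an unreliable model. A small spread, as noted above, is encouraging but is not proof of reliability.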

Too often a model gets deployed without even a full cross-validation test, let alone other tests such as real-world experiments or an A/B or randomised trial. Above all, even a valid model will slowly decay as the world around us changes: for example, changes in the business environment and conditions will eventually make most models useless. Models need to be updated and revalidated all the time.

It is important to ensure that the process of building, deploying, and managing machine learning solutions is built on a strong framework of continuous validation, which tests not only model accuracy but also its continuing reliability and usefulness. I would say, without hesitation, that validating models is the most important skill I teach on my machine learning and data science courses.
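One very small piece of continuous validation can be sketched as follows: keep track of how a deployed model performs on its most recent predictions and raise a flag when accuracy falls noticeably below the level measured at validation time. The class name AccuracyMonitor and the baseline, window, and tolerance values are assumptions chosen for illustration; real monitoring would also look at input data, reliability, and business usefulness.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag decay."""

    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline      # accuracy measured at validation time
        self.tolerance = tolerance    # allowed drop before raising a flag
        self.outcomes = deque(maxlen=window)  # keeps only the latest results

    def record(self, predicted, actual):
        """Store whether one prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_revalidation(self):
        """True once a full window shows accuracy below baseline - tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # too little evidence to judge yet
        return self.accuracy() < self.baseline - self.tolerance


# Hypothetical stream: nine correct predictions followed by three misses.
monitor = AccuracyMonitor(baseline=0.90, window=10, tolerance=0.05)
for predicted, actual in [("yes", "yes")] * 9 + [("yes", "no")] * 3:
    monitor.record(predicted, actual)

print("recent accuracy:", monitor.accuracy())
print("revalidate?", monitor.needs_revalidation())
```

The flag is only a prompt to investigate and revalidate; it does not say why the model has decayed.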

It is far too easy to deploy machine learning solutions today without validating them properly. Good validation is not easy, but it is essential for providing sustained value to the business and for avoiding disappointed project sponsors, not to mention inconvenienced users or even financial losses.

Data-driven decisions

Data-driven decision-making (DDD) is a concept that summarises the application of advanced analytics and data science to decision making. There is serious academic research showing that companies that adopt data-driven decision-making have output and productivity 5–6% higher than would be expected given their other investments and use of information technology. If you would like to convince someone about DDD, see this well-respected paper: “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?” (2011) by Brynjolfsson, Hitt, and Kim, from MIT and Penn’s Wharton School.

Ultimately, that is the value of judiciously applied machine learning: you will perform better in business.

From a corporate perspective, data science also introduces a pattern of human interaction between people who make decisions, those who maintain the data, and those who build machine learning models and do the data science. This creates a human space where significant organisational intelligence can be found. It makes the entire decision-making process more precise, more reliable, less worrying, and perhaps even happier. And it makes experimenting with new ideas more fun!

Oxford Computer Training is now offering my Practical Machine Learning courses in the UK.
