Many people focus on a model's predictive power but overlook the importance of its interpretability. Why does model interpretability matter so much? The author explains this from five angles.
Model interpretability reveals what your model has learned, and seeing inside a model is more useful than most people realize.
I’ve interviewed many data scientists over the past ten years, and model interpretability is my favorite interview topic: I use it to distinguish the best data scientists from average ones.
Some people treat machine learning models as black boxes that make predictions no one can understand, but the best data scientists know how to extract real-world insights from any model. Given any model, these data scientists can easily answer questions such as:
Which features in the data does the model consider most important?
For any single prediction, how does each feature in the data affect that prediction?
Which interactions between features have the greatest impact on the model’s predictions?
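To make the first question concrete, here is a minimal, hand-rolled sketch of one model-agnostic technique for it, permutation importance: shuffle one feature at a time and measure how much the model’s error grows. The dataset, the least-squares model, and all numbers below are invented for illustration; a real project would typically use a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on x0, weakly on x1, not at all on x2.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# A plain least-squares linear model stands in for "any model".
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda M: M @ coef

def permutation_importance(X, y, predict, n_repeats=10, seed=0):
    """Error increase when each column is shuffled: bigger increase = more important."""
    rng = np.random.default_rng(seed)
    base_error = np.mean((predict(X) - y) ** 2)
    importances = []
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and y
            increases.append(np.mean((predict(Xp) - y) ** 2) - base_error)
        importances.append(np.mean(increases))
    return np.array(importances)

imp = permutation_importance(X, y, predict)
print(imp)  # x0 should dominate, x2 should be near zero
```

Because the method only needs a `predict` function, it works identically for a gradient-boosted tree or a neural network.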
The answers to these questions are more useful than most people realize. Inspired by this, I created a micro-course on model interpretability on Kaggle. Whether you learn through Kaggle or a more comprehensive resource such as The Elements of Statistical Learning, these techniques will transform how you build, validate, and deploy machine learning models.
Why are these insights valuable?
The five most important applications of model insights are:
Debugging
Guiding Feature Engineering
Directing Future Data Collection
Informing Human Decision-Making
Building Trust
Debugging
The world is full of unreliable, messy, and noisy data. Every line of preprocessing code you write adds a potential source of error. Add the possibility of target leakage, and in real data science projects mistakes at some stage are the norm, not the exception.
Given how frequent errors are and how disastrous their consequences can be, debugging is one of the most valuable skills in data science. Understanding the patterns a model is looking for helps you spot where the model disagrees with your knowledge of the real world, and that is usually the first step in tracking down a bug.
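As one illustration of this kind of debugging, the sketch below flags a feature that correlates almost perfectly with the target before any modeling, a classic symptom of target leakage. All feature names (such as paid_off_flag) and the data-generating process here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical loan data. "paid_off_flag" was derived after the outcome
# was already known, so it leaks the target into the training features.
n = 1000
income = rng.normal(50, 10, n)
default = (income + rng.normal(0, 5, n) < 45).astype(float)
leaky = default + rng.normal(0, 0.01, n)  # near-copy of the label

features = {"income": income, "paid_off_flag": leaky}

# A feature that almost perfectly predicts the target is usually a sign
# of leakage, not of a brilliant predictor.
suspects = {name: abs(np.corrcoef(col, default)[0, 1])
            for name, col in features.items()}
leak_suspect = max(suspects, key=suspects.get)
print(leak_suspect, round(suspects[leak_suspect], 3))
```

In practice you would run the same check on model-derived importances: a single feature that dwarfs all others deserves the same suspicion.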
Guiding Feature Engineering
Feature engineering is usually the most effective way to improve model accuracy. It typically means repeatedly creating new features by transforming the raw data or previously created features.
Sometimes intuition about the underlying subject matter is enough. But when you have more than 100 raw features, or lack background knowledge about the problem at hand, you need more guidance.
A Kaggle competition on predicting loan defaults is an extreme example. It has over 100 raw features which, for privacy reasons, are not given meaningful English names but code names such as f1, f2, and f3. This simulates a scenario in which you know little about the raw data.
One contestant found that the difference between two of these features, f527 − f528, made a powerful new feature. A model that included this difference performed far better than one without it. But with hundreds of variables, how would you ever think to create it?
The techniques taught in this course make it easy to see that f527 and f528 are both important and closely related, which guides you to consider transformations of the two and discover the “golden feature” f527 − f528.
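A toy reconstruction of this situation might look like the following: two highly correlated anonymous features each look nearly useless on their own, while their difference is strongly predictive. Only the names f527 and f528 come from the contest; the data-generating process below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two highly correlated anonymous features; the signal lives entirely
# in their small difference (the "golden feature").
n = 2000
base = rng.normal(0, 100, n)       # large shared component -> f527, f528 correlate
f527 = base + rng.normal(0, 1, n)
f528 = base + rng.normal(0, 1, n)
y = (f527 - f528 > 0).astype(int)  # label driven only by the difference

golden = f527 - f528

# Each raw feature barely correlates with the label, while the
# engineered difference is strongly predictive.
corr_raw = abs(np.corrcoef(f527, y)[0, 1])
corr_golden = abs(np.corrcoef(golden, y)[0, 1])
print(round(corr_raw, 3), round(corr_golden, 3))
```

This is why importance and interaction analysis matters: a per-feature screen would discard both f527 and f528, but noticing that they are important together points you at the transformation.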
Today’s datasets easily contain hundreds or thousands of raw features, so this approach is becoming more important every day.
Directing Future Data Collection
You have no control over datasets you download from the internet, but many businesses and organizations that use data science can expand the kinds of data they collect. Collecting new types of data is expensive and inconvenient, so they only want to collect data that is worth the effort. Model-based insights give you a better understanding of the value of your current features, which helps you reason about which new types of data would be most valuable.
Informing Human Decision-Making
Some decisions are made automatically by models: when you log in to Amazon, no human is deciding on the fly what to show you. But many important decisions are still made by people, and for those decisions a model’s insights can be more valuable than its predictions.
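One common way to hand model insights to a human decision-maker is a partial-dependence-style summary: force one feature to each value on a grid, average the model’s predictions, and show the resulting curve. The two-feature toy model below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy risk model: predicted risk rises with loan amount, falls with income
# (both columns standardized). Invented for illustration.
X = rng.normal(size=(300, 2))  # columns: loan_amount, income
predict = lambda M: 0.8 * M[:, 0] - 0.5 * M[:, 1]

def partial_dependence(X, predict, col, grid):
    """Average prediction when feature `col` is forced to each grid value."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, col] = v
        out.append(predict(Xv).mean())
    return np.array(out)

grid = np.linspace(-2, 2, 5)
pd_loan = partial_dependence(X, predict, 0, grid)
# A monotonically rising curve ("larger loans -> higher predicted risk")
# is something a human underwriter can act on directly.
print(pd_loan)
```

The curve, not the raw prediction, is the artifact a decision-maker consumes: it says how the model believes risk changes as one input changes, holding everything else at observed values.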
Building Trust
Many people will not trust your model enough to base important decisions on it without verifying some basic facts, and given how often data goes wrong, that is a sensible precaution. In practice, showing insights that match users’ general understanding of the problem helps build trust in the model, even among users with little knowledge of data science.