## Intro

This is a brief look at some of the bits and pieces of code I’ve put together to do weird/interesting/cool things (mostly at work) with machine learning classification models. All of my code here is heavily reliant on the pandas and scikit-learn libraries, though it wouldn’t be too difficult to rewrite most of it to avoid the need for pandas. The code can be found in its entirety here on github.

## Demo data set

Imagine four “blobs” of data points, centered at the corners of a square, but partly overlapping. See the diagram on the left below. Points in the upper-right and lower-left blobs are **positive** cases (shown in blue), and the other two blobs are made of **negative** cases (in red). Now, imagine four more blobs just like these, but “below” the first four (in three dimensional space). The second and third diagrams here give a more thorough overview of this simple data set:

For the demonstrations below, let’s randomly split the data into a training set and a test set:

from sklearn.model_selection import train_test_split Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

We’ll use this simple data set going forward, but all the code introduced here should work just as well with any data set in the form of pandas dataframes.

## Feature importances

By symmetry, feature1 and feature2 should be equally important in learning to classify points in this data set as positive or negative, while feature3 should be completely unimportant.

Tree-based classifiers in scikit-learn provide a measure of feature importance sometimes known as Gini importance. Let’s build a simple random forest classifier from our training data:

from sklearn.ensemble import RandomForestClassifier clf = RandomForestClassifier(n_jobs = -1, n_estimators = 100) clf = clf.fit(Xtrain, ytrain)

Here is the code I generally use to pull out and organize the feature importances from a classifier in scikit-learn:

importances = pd.DataFrame(X.columns, columns=['feature']) importances['importance'] = clf.feature_importances_ importances = importances.set_index('feature').sort_values( 'importance', ascending=False )

The resulting importances are as follows — feature1: 0.380, feature2: 0.377, feature3: 0.243.

Don’t worry too much about the “units” here; what really matters is the relative size of each measurement of importance. According to scikit-learn, feature3 appears to be less important than the other two, but still far from zero importance.

There is another measure of feature importance that I’ve found to be superior to the previous one in most ways. We’ll refer to this one as permutation importance. The idea is this: see what happens if you mix up all the values of one of the features in the testing data. If that feature is important, your classifier should be much less successful when the feature is scrambled. But if the feature is unimportant, scrambling it should have no effect on the classifier’s performance. I’ve written a simple function to calculate permutation importances. To use it in our example here, I would simply replace the relevant line in the code above to the following:

importances['importance'] = permutation_importances(clf, Xtest, ytest)

Here are the inner workings of the new function:

from sklearn.metrics import roc_auc_score def permutation_importances(clf, Xtest, ytest): auc = roc_auc_score(ytest, clf.predict_proba(Xtest)[:,1]) p_imp = [] for column in Xtest.columns: Xtemp = Xtest.copy() Xtemp[column] = Xtemp[column].sample(frac = 1.0).values p_imp.append( auc - roc_auc_score(ytest, clf.predict_proba(Xtemp)[:,1]) ) return p_imp

The permutation importances are 0.284 for feature1, 0.287 for feature2, and -0.011 for feature3, much more in agreement with our expectations for this artificial data set.

In real-life applications of machine learning, feature importances are a vital part of trying to make sense of the reasoning (so to speak) behind a model’s decisions. But that level of understanding is not always necessary or desired. Still, in situations where a model’s performance far outweighs any consideration of its underlying “reasoning,” feature importances are often an invaluable tool. With their help, we can judge which features are worth including in a data set, and may even find some clues about new features to pursue.

Speaking of feature selection, no discussion of extending the capabilities of scikit-learn would be complete without a nod to the wonderful mlxtend library. Among many other things, it includes some feature selection capabilities that I’ve used in situations that may have been intractable with purely scikit-learn tools.

## Evaluating a classifier’s performance

Once a classifier is trained, it is of obvious interest to know how well the model does its job. One of the standard tools for evaluating a classifier’s performance is the ROC curve and the area under the curve (the AUC). A classifier that does no better than random guessing should have a diagonal line for its ROC curve, with an AUC of 0.5, while a perfect classifier’s AUC should be 1.0. The scikit-learn library has a function for generating ROC curves, but it still requires a surprising amount of code to create a decent-looking graph; I used the following function to produce the ROC curve shown down below for our example classification problem.

import matplotlib.pyplot as plt from sklearn.metrics import roc_curve plot_roc(clf, Xtest, ytest): pred = clf.predict_proba(Xtest)[:,1] fpr, tpr, _ = roc_curve(ytest, pred) plt.plot([0, 1], [0, 1], linestyle='--', c = 'black', lw = .5) plt.plot(fpr, tpr, c='red', lw = 3) plt.xlim([0,1]) plt.ylim([0,1]) plt.xlabel('FPR') plt.ylabel('TPR') plt.title('ROC Curve (AUC %.4f)' % roc_auc_score(ytest, pred)) plt.show()

Related to the ROC curve is the precision-recall curve, shown below for our example problem. Again, scikit-learn provides a function for generating such a curve, but it takes some effort to display the curve in the way I choose. In my work, what I usually choose is to mark and label some particular point on the PR curve as a specific illustration. For this purpose, my custom PRC function allows the user to input a value for either the precision or the recall so that the best matching point on the graph may be displayed:

```
from sklearn.metrics import precision_recall_curve as prc
def plot_prc(
clf, Xtest, ytest,
precision = None, recall = None, threshold = None
):
pr, rc, ts = prc(ytest, clf.predict_proba(Xtest)[:,1])
pr, rc, ts = list(pr), list(rc), list(ts)
i = None
if precision is not None:
i = ts.index(min(t for t, p in zip(ts, pr) if p > precision))
elif recall is not None:
i = ts.index(max(t for t, r in zip(ts, rc) if r > recall))
elif threshold is not None:
i = ts.index(max(t for t in ts if t < threshold))
plt.plot(pr, rc, c='red', lw = 3)
if i is not None:
plt.plot([pr[i]], [rc[i]], marker = 'o', color = 'black')
plt.text(
pr[i], rc[i], '(%.2f, %.2f) ' % (pr[i], rc[i]),
fontdict = {'ha':'right', 'va':'center'}
)
plt.grid()
plt.xlim([0,1])
plt.ylim([0,1])
plt.title('Precision & Recall')
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.show()
```

Closer examination of the testing data and classifier output shows that we can achieve this result (91% precision and 17% recall) if we select the top 9% of data points, as ranked by the classifier. In other words, 17% of all positives are found in the top-ranked 9% of data points. Meanwhile, only about 1% of all negatives are found in the top 9%. Putting together many observations like these, we can form a less-common type of graph called a Kolmogorov-Smirnoff chart, as seen to the right. Based on the chart, if we select the top 20% of data points, around 34% of all positives will be among them, and only 7% of all negatives. In the top 50%, we have around 71% of all positives and 28% of all negatives. It is at this point that the curve for positives and the curve for negatives reach their largest separation — 43% — which is known as the KS statistic, a reasonable measure of classifier quality similar to the AUC.

My function for generating the KS chart and statistic is as follows:

def _std_results(clf, Xtest, ytest): '''Helper function for the generating various custom graphs.''' return pd.DataFrame({ 'truth' : ytest, 'pred' : clf.predict_proba(Xtest)[:,1] }).sort_values('pred', ascending = False) def plot_ks(clf, Xtest, ytest): ''' Plot the Kolmogorov-Smirnov chart. Returns the KS statistic. ''' x = np.linspace(0, 1, 21) results = _std_results(clf, Xtest, ytest) y_t = results.truth.sum() y_m = [ results.head(int(len(results)*p)).truth.sum() / y_t for p in x ] y_f = (results.truth == 0).sum() y_n = [ (results.head(int(len(results)*p)).truth == 0).sum() / y_f for p in x ] KS = [y_m[i] - y_n[i] for i in range(len(x))] KSi = KS.index(max(KS)) plt.plot(x, y_m, 'b', x, y_n, 'r', lw = 3, zorder = 10) plt.plot([x[KSi],x[KSi]], [y_n[KSi], y_m[KSi]], c = 'gray', lw = 5) plt.text( x[KSi] + .02, (y_n[KSi] + y_m[KSi])/2, 'KS: %.2f' % max(KS), zorder = 9000, fontdict = {'ha': 'left', 'va': 'center', 'rotation': 90} ) plt.xlabel('% from top') vals = plt.gca().get_xticks() plt.gca().set_xticklabels(['{:3.0f}%'.format(x*100) for x in vals]) vals = plt.gca().get_yticks() plt.gca().set_yticklabels(['{:3.0f}%'.format(x*100) for x in vals]) plt.title('Kolmogorov-Smirnov') plt.legend(['% of all positives', '% of all negatives']) plt.show() return max(KS)

The last somewhat standard visual we’ll see here is the cumulative lift chart. As mentioned before, if we select the top 20% of data points in our example based on the rank assigned by the classifier, 34% of all positives will be among them. This means that 85% of those selected data points are positive. Compare that to what would happen if we simply chose any group of data points at random, in which case we should expect 50% of them to be positives. Now, 85/50 is 1.7, and this number is called the “lift” for the top 20%. As another example, if we select only the top 5% of data points, almost 100% of them will be positives, which is a lift of almost 2. The graph below to the left shows the cumulative lift chart for our example data and classifier, as created by the following function:

```
from scipy.spatial import ConvexHull
from scipy.interpolate import UnivariateSpline
def plot_cumlift(clf, Xtest, ytest, show_spf = False):
'''
Plot the cumulative lift chart.
Returns an approximate lower convex envelope of the graph.
'''
x = np.linspace(0, 1, 21)
results = _std_results(clf, Xtest, ytest)
y_l = [
results.head(int(len(results)*p)).truth.mean()
/ results.truth.mean() for p in x[1:]
]
hull = ConvexHull([[i,j] for i,j in zip(x[1:], y_l)])
ihull = np.roll(hull.vertices, -list(hull.vertices).index(0))
ihull = ihull[:list(ihull).index(max(ihull))+1]
xhull = [x[1:][i] for i in ihull]
yhull = [y_l[i] for i in ihull]
spf = UnivariateSpline(xhull, yhull, k = 1, s = 0, ext = 'const')
xplot = np.linspace(0, 1, 1000)
plt.plot(
x[1:], y_l,
linewidth = 1 if show_spf else 3,
marker = 'o' if show_spf else None,
c = 'blue' if show_spf else None
)
plt.xlabel('% from top')
plt.ylabel('Cumulative lift')
vals = plt.gca().get_xticks()
plt.gca().set_xticklabels(['{:3.0f}%'.format(x*100) for x in vals])
plt.title('Cumulative Lift Chart')
if show_spf:
plt.plot(xplot, spf(xplot), lw = 2, c='red', zorder = 42)
plt.legend(['Actual lift', 'Lower envelope'])
plt.show()
return spf
```

As you can see, this plotting function also creates and returns an approximate lower convex envelope for the cumulative lift chart, which can be graphed with the use of an optional keyword argument. This functionality is shown to the right:

You might wonder what possible use I could have for the lower convex envelope of the lift chart. I’ll explain this in another post.

## Individual-level visualization

Here’s a type of visual that I find useful in communicating the ability of machine learning classifiers to prioritize individuals that are likely to be positives, separating them from the negative cases. If we line up all the test data points in order of the rank they receive from our classifier, and color each one according to its actual label (blue for positive and red for negative), the results look like this:

For reference, here is the result a perfect classifier would produce, and also the result of random guessing:

I find these individual-level visuals more compelling in the setting of a real-life prediction problem, especially where the number of positives is relatively small. Here’s the function used to generate the visuals, which stretches the intended usages of pyplot possibly a bit too far.

def indiv_plot(clf, Xtest, ytest, width=12, height=9, fillfactor=2500): ''' Visualize classifier performance on an individual level. Optional arguments: `width` and `height` determine the visual's shape, `fillfactor` (optional) controls size of markers ''' dims = width, height plt.rcParams['figure.figsize'] = dims plt.rcParams['font.size'] = sum(dims) // 2 results_xy = _std_results(clf, Xtest, ytest).reset_index() rowlen = int((dims[0] / dims[1] * len(results_xy)) ** .5) results_xy['x'] = results_xy.index % rowlen + 1 results_xy['y'] = results_xy.index // rowlen + 1 collen = results_xy.y.max() results_xy['y'] = collen - results_xy.y + 1 results_xy.plot( 'x', 'y', c = ['blue' if t == 1 else 'red' for t in results_xy.truth], kind = 'scatter', marker = 's', s = dims[0]*dims[1]*2500 / (rowlen*collen) ) for i in range(12): _, j, k = results_xy.iloc[i].values plt.text( j, k, str(i+1) if i < 9 else '...', fontdict = { 'color':'white', 'weight':'bold', 'ha':'center', 'va':'center' } ) plt.text( results_xy.iloc[-1].x + .5, 1, '← lowest-ranked', fontdict = { 'color':'black', 'weight':'bold', 'ha':'left', 'va':'center' } ) plt.axis('off') plt.tight_layout() plt.show()

## Labeled confusion matrix

To be perfectly honest, I rarely find any use for confusion matrices in my day-to-day work, and this is a bit of a dull topic to end with, but there’s a small adjustment to scikit-learn’s confusion matrix function that I’ve found helpful on one or two occasions.

Here is a confusion matrix for our example classifier, using a score threshold of 0.8:

A score threshold of 0.5 is used implicitly throughout the scikit-learn library. That is, data points receiving a score over 0.5 from a classifier are considered to be predicted positives, and those below are predicted as negative. By adjusting the threshold to 0.8, we make the requirements for a positive prediction more strict, which improves the precision of those predictions, at the cost of predicting fewer of the actual positives overall.

Apart from a flexible threshold, the main purpose of the following slightly enhanced confusion matrix function is to place explanatory labels on the matrix, as seen above.

from sklearn.metrics import confusion_matrix as cm def cm_labeled(clf, Xtest, ytest, threshold = 0.5): return pd.DataFrame( cm( ytest, clf.predict_proba(Xtest)[:,1] >= threshold, labels = [1,0] ), columns = ['Predicted positive', 'Predicted negative'], index = ['Actually positive', 'Actually negative'] )