Motivation
One of the standard tools for evaluating a classifier’s performance is the ROC curve and the area under the curve (the AUC). For the classifiers I run at work, predictions of upcoming events are usually stored in some SQL table and refreshed with updated predictions from time to time. During this refresh process, it’s a good idea to look back at previous predictions and compare them to what really happened.
Generating AUC and other metrics of model performance can be done in Python or other data-friendly programming languages, but I recently started using SQL queries like the ones I’ll introduce here instead, in order to incorporate AUC calculation into the refresh process more seamlessly. It was also fun, and maybe a little enlightening, to work out how AUC can be calculated using pure SQL.
Setup
The code given at the end of this post will assume that there is a temporary table called #MODEL with the following columns:
- CATEGORY has group identifiers, for breaking the results into separate AUC’s.
- PREDICT has the scores assigned by the model.
- TRUTH has the real-life outcomes, in the form of 1’s and 0’s.
If you don’t need the results broken down into groups with separate AUC’s, just give every row of the table the same CATEGORY. The scores in the PREDICT column can be in any numerical form, as long as it’s “the bigger, the better.”
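If it helps to see the expected shape, here is a minimal sketch of the table definition (the column types are my assumption, not a requirement; anything comparable works, as long as PREDICT is numeric and TRUTH is 0/1):

    -- minimal sketch of #MODEL (types are assumptions, not requirements)
    create table #MODEL (
        CATEGORY varchar(100),  -- group identifier, one AUC per distinct value
        PREDICT  float,         -- model score: bigger means more likely positive
        TRUTH    int            -- actual outcome: 1 or 0
    );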
For example, the largest machine learning model I maintain right now involves predicting many individuals’ purchases of several product types across multiple countries. The predictions (raw model scores) for an upcoming month are stored in a certain SQL table, and at the end of the month I can check our transactional data tables to see whether each individual actually purchased each product type. To prepare to calculate AUC’s, broken down by product type in each country, I can create the required #MODEL table using a query like this:
    select COUNTRYNAME + ' ' + PRODUCT_TYPE as CATEGORY,
           RAWSCORE as PREDICT,
           case when SALES > 0 then 1 else 0 end as TRUTH
    into #MODEL
    from <table of predictions joined to table of actual transactions>
An example done by hand
Before getting to the SQL queries, let’s walk through a small example by hand. I hope this will make the later queries easier to follow.
Suppose the following 30 rows are the entire contents of our #MODEL table.
CATEGORY | PREDICT | TRUTH | CATEGORY | PREDICT | TRUTH |
--- | --- | --- | --- | --- | --- |
A | 0.4895 | 0 | A | 0.0395 | 0 |
A | 0.0483 | 0 | A | 0.8702 | 1 |
A | 0.1381 | 0 | A | 0.0216 | 0 |
A | 0.5444 | 1 | A | 0.6847 | 0 |
A | 0.7835 | 0 | A | 0.7943 | 1 |
A | 0.3123 | 0 | A | 0.6628 | 0 |
A | 0.4802 | 0 | A | 0.5026 | 0 |
A | 0.7255 | 1 | A | 0.8990 | 1 |
A | 0.9688 | 1 | A | 0.6003 | 0 |
A | 0.9545 | 1 | A | 0.8107 | 0 |
A | 0.9097 | 1 | A | 0.9599 | 1 |
A | 0.8498 | 1 | A | 0.8565 | 0 |
A | 0.5321 | 0 | A | 0.9398 | 0 |
A | 0.2862 | 0 | A | 0.2400 | 0 |
A | 0.7329 | 0 | A | 0.8720 | 0 |
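If you would like to follow along in SQL, one way to load these 30 rows into a #MODEL table built as sketched earlier is a plain insert (the values below are simply the table above, read down the left half and then the right half):

    -- load the worked example into #MODEL (assumes the table was created as sketched above)
    insert into #MODEL (CATEGORY, PREDICT, TRUTH) values
        ('A', 0.4895, 0), ('A', 0.0483, 0), ('A', 0.1381, 0), ('A', 0.5444, 1), ('A', 0.7835, 0),
        ('A', 0.3123, 0), ('A', 0.4802, 0), ('A', 0.7255, 1), ('A', 0.9688, 1), ('A', 0.9545, 1),
        ('A', 0.9097, 1), ('A', 0.8498, 1), ('A', 0.5321, 0), ('A', 0.2862, 0), ('A', 0.7329, 0),
        ('A', 0.0395, 0), ('A', 0.8702, 1), ('A', 0.0216, 0), ('A', 0.6847, 0), ('A', 0.7943, 1),
        ('A', 0.6628, 0), ('A', 0.5026, 0), ('A', 0.8990, 1), ('A', 0.6003, 0), ('A', 0.8107, 0),
        ('A', 0.9599, 1), ('A', 0.8565, 0), ('A', 0.9398, 0), ('A', 0.2400, 0), ('A', 0.8720, 0);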
In the code that follows later, the number of rows (30 here) is called ALLCNT, and the number of actual positive outcomes (10 here) is called POSCNT. To keep the present example simple, there is only one CATEGORY, so we will only be calculating a single AUC. The very first step is to sort the table by PREDICT, in descending order:
CATEGORY | PREDICT | TRUTH |
--- | --- | --- |
A | 0.9688 | 1 |
A | 0.9599 | 1 |
A | 0.9545 | 1 |
A | 0.9398 | 0 |
A | 0.9097 | 1 |
A | 0.8990 | 1 |
A | 0.8720 | 0 |
⋮ | ⋮ | ⋮ |
Next, condense the table down into “bins” of roughly equal size. The more bins we use, the more precise the end result will be. In this example, we will use 10 bins, so 3 rows of the original (sorted) table will fall into each bin.
The first three rows of the table above (the ones with the highest PREDICT values) fall into bin 10. Those rows contain three positive cases, so bin 10 gets a TRUTH count of 3. The next three rows become bin 9, where the TRUTH count is 2. The rest of the binned results appear in the table below.
After this, for each bin, we calculate the cumulative false positive rate (FPR) and true positive rate (TPR). For bin 10, the FPR is 0/20, because there are zero negatives in bin 10, out of 20 negatives total. The TPR is 3/10, because 3 out of the 10 positives are in bin 10. Next, for bin 9, the FPR is 1/20, because 1 of the 20 negatives fell into bins 9 or 10. The TPR is 5/10, because 5 of the 10 positives fell in bins 9 or 10. For bin 8, we see that 3 of the 20 negatives fell in bins 8 through 10 (FPR 3/20) and 6 of the 10 positives fell in bins 8 through 10 (TPR 6/10).
Continuing in this way, we end up with the following results:
CATEGORY | BIN | CNT | TRUTH | FPR | TPR |
--- | --- | --- | --- | --- | --- |
A | 10 | 3 | 3 | 0 | 0.3 |
A | 9 | 3 | 2 | 0.05 | 0.5 |
A | 8 | 3 | 1 | 0.15 | 0.6 |
A | 7 | 3 | 2 | 0.2 | 0.8 |
A | 6 | 3 | 1 | 0.3 | 0.9 |
A | 5 | 3 | 0 | 0.45 | 0.9 |
A | 4 | 3 | 1 | 0.55 | 1 |
A | 3 | 3 | 0 | 0.7 | 1 |
A | 2 | 3 | 0 | 0.85 | 1 |
A | 1 | 3 | 0 | 1 | 1 |
The ROC curve is created by plotting the (FPR, TPR) points from this table and connecting them with line segments. Of course, our end goal is not this curve itself, but the area under it.
The AUC can be calculated by breaking the total area up into trapezoids (9 trapezoids in this example) and summing their areas. For example, consider the trapezoid between the points (0.05, 0.5) and (0.15, 0.6). Its area is its width multiplied by the average of the heights of its two vertical sides: (0.15 – 0.05) × (0.5 + 0.6) / 2. Using this formula for all the trapezoids and summing, we find that the AUC is 0.875.
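For completeness, here are all nine trapezoid areas (width × average height), taken straight from the (FPR, TPR) table above:

    (0.05 - 0)    × (0.3 + 0.5)/2 = 0.020
    (0.15 - 0.05) × (0.5 + 0.6)/2 = 0.055
    (0.20 - 0.15) × (0.6 + 0.8)/2 = 0.035
    (0.30 - 0.20) × (0.8 + 0.9)/2 = 0.085
    (0.45 - 0.30) × (0.9 + 0.9)/2 = 0.135
    (0.55 - 0.45) × (0.9 + 1.0)/2 = 0.095
    (0.70 - 0.55) × (1.0 + 1.0)/2 = 0.150
    (0.85 - 0.70) × (1.0 + 1.0)/2 = 0.150
    (1.00 - 0.85) × (1.0 + 1.0)/2 = 0.150

Adding these up: 0.020 + 0.055 + 0.035 + 0.085 + 0.135 + 0.095 + 0.150 + 0.150 + 0.150 = 0.875.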
The queries
Once we have a table called #MODEL with the columns described earlier, here is how the AUC(s) can be calculated in SQL. Our first query finds the number of rows and the number of actual positive outcomes for each category:
    select CATEGORY, sum(TRUTH) as POSCNT, count(*) as ALLCNT
    into #SUMS
    from #MODEL
    group by CATEGORY
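For the hand-worked example, #SUMS ends up with a single row:

CATEGORY | POSCNT | ALLCNT |
--- | --- | --- |
A | 10 | 30 |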
We must ignore any category that has either all or none of its outcomes positive (to avoid division by zero later):
    delete from #SUMS
    where POSCNT = 0 or ALLCNT = POSCNT
We condense the original #MODEL table down into bins of roughly equal size, counting the rows and the number of actual positives in each bin:
    select CATEGORY, BIN, count(*) as CNT, sum(TRUTH) as TRUTH
    into #CONDENSED
    from (
        select CATEGORY, TRUTH,
               ntile(100) -- increase for a more precise calculation
                   over (partition by CATEGORY order by PREDICT) as BIN
        from #MODEL
    ) M
    group by CATEGORY, BIN
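The ntile(100) splits each category into 100 bins; to reproduce the 10-bin hand-worked example exactly, change it to ntile(10). With that change, the top bins of #CONDENSED for the example match the CNT and TRUTH columns of the earlier table:

CATEGORY | BIN | CNT | TRUTH |
--- | --- | --- | --- |
A | 10 | 3 | 3 |
A | 9 | 3 | 2 |
A | 8 | 3 | 1 |
⋮ | ⋮ | ⋮ | ⋮ |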
We calculate the cumulative false positive rate (FPR) and true positive rate (TPR) for each bin:
    select C1.CATEGORY,
           cast(sum(C2.CNT) - sum(C2.TRUTH) as float) / (ALLCNT - POSCNT) as FPR,
           cast(sum(C2.TRUTH) as float) / POSCNT as TPR,
           row_number() over (partition by C1.CATEGORY order by C1.BIN desc) as i
    into #RATES
    from #CONDENSED C1
    join #CONDENSED C2
        on C1.CATEGORY = C2.CATEGORY and C2.BIN >= C1.BIN
    join #SUMS S
        on S.CATEGORY = C1.CATEGORY
    group by C1.CATEGORY, C1.BIN, POSCNT, ALLCNT
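Run against the example (again with ntile(10)), this reproduces the FPR and TPR columns of the hand-worked table, with i numbering the bins from the highest scores downward:

CATEGORY | FPR | TPR | i |
--- | --- | --- | --- |
A | 0 | 0.3 | 1 |
A | 0.05 | 0.5 | 2 |
A | 0.15 | 0.6 | 3 |
⋮ | ⋮ | ⋮ | ⋮ |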
Finally, we calculate the AUC(s) using the trapezoid idea:
    select A.CATEGORY,
           sum((B.FPR - A.FPR) * (A.TPR + B.TPR) * 0.5) as AUC
    from #RATES A
    join #RATES B
        on A.CATEGORY = B.CATEGORY and A.i + 1 = B.i
    group by A.CATEGORY
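As a quick sanity check, loading the 30-row example from earlier and running these four queries with ntile(10) gives back the hand-calculated result: an AUC of 0.875 for category A (up to floating-point rounding).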
Adding indexes to some of the temporary tables involved in this process would probably improve its speed, but even without indexes I typically run these queries at work on a #MODEL table with 12 million rows, spread across about 600 categories, in about a minute.