Introduction
Many categories of products naturally fall into price “bins” that act almost like a separate level of categorization in their own right. A textbook example of this is laptop computers, where manufacturers often distinguish between “low-end” and “high-end” offerings. A similar example is graphics cards, which by my estimation probably fall into 3 or 4 price bins.
Recently at work, I have been experimenting with an automated procedure to break a given product category into price bins in a natural way – i.e. a way that agrees well with human intuition.
Why not use k-means clustering
When this price binning question was first raised, I thought, “This sounds a bit like a clustering problem, albeit a strangely one-dimensional version. How about using the well-known k-means algorithm?” But after further consideration and a few thought experiments, I realized that k-means would not be suitable.
Let’s say we have a product category in which there are a few thousand distinct products with prices (or “unit costs”) distributed as shown below.
This set of price data is an artificial example, but we will soon enough work with something more real. (Also, the distinction between a product’s “price” and its “unit cost” will be ignored here.)
It should be clear from the histogram above that the prices in this product category can be neatly grouped into at least 2 separate bins. If we apply the k-means clustering algorithm to the data, the results are as follows:
Notice that some of the products that we would intuitively group on the low end (in blue) have instead been placed in the high end (in orange). This is because of the way the k-means procedure works: a certain number of “cluster centers” (two in this case) are identified, and each data point is assigned to the cluster with the nearest center. In effect, there is a dividing line (as shown above) exactly halfway between the cluster centers.
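To make this concrete, here is a minimal sketch (using made-up bimodal prices, not the data set plotted above) of how the k-means dividing line ends up exactly halfway between the two cluster centers:

import numpy as np
from sklearn.cluster import KMeans

# Made-up "prices": a large low-end group around $40 and a smaller high-end group around $90
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(40, 5, 800),
    rng.normal(90, 8, 200)
])

# Fit k-means with two clusters on the one-dimensional data
km = KMeans(n_clusters = 2, n_init = 10).fit(prices.reshape(-1, 1))

# Each point is assigned to the nearest center, so the effective
# dividing line sits at the midpoint between the two cluster centers
centers = sorted(km.cluster_centers_.flatten())
boundary = (centers[0] + centers[1]) / 2
print('Cluster centers: %.2f and %.2f' % (centers[0], centers[1]))
print('Dividing line: %.2f' % boundary)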
By contrast, here are the results when we apply an agglomerative clustering method (also known as bottom-up hierarchical clustering) to the data:
While unsupervised learning is not really my area of expertise, in my limited real-world experience with clustering tasks I have found that the agglomerative approach usually produces more sensible/intuitive results than the basic k-means algorithm.
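For comparison, here is a minimal sketch of running the same made-up prices through scikit-learn's agglomerative clustering (continuing from the snippet above); with data like this, the split typically lands in the gap between the two groups rather than at a midpoint between centers:

from sklearn.cluster import AgglomerativeClustering

# Reusing the made-up `prices` array from the k-means sketch above
ac = AgglomerativeClustering(n_clusters = 2).fit(prices.reshape(-1, 1))

# Identify the cluster containing the cheapest product and compare
# where the two clusters end and begin
low_label = ac.labels_[prices.argmin()]
low = prices[ac.labels_ == low_label]
high = prices[ac.labels_ != low_label]
print('Low-end cluster tops out at %.2f' % low.max())
print('High-end cluster starts at %.2f' % high.min())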
A real example with code
I’ll now walk through some of the code I have been putting together as I work on the question of how best to divide products into price bins. All the code given below can be found on GitHub.
We’ll use the following tools:
import pandas as pd
from scipy.stats import boxcox
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import numpy as np
Now for some data. In my own application, the price data for each category of products is drawn from a SQL query similar to this:
select SKU, UNITCOST, sum(REVENUE_6MO_USD) as REV
from <aggregated table of recent sales data>
where PRODUCT_TYPE = <some product type of interest>
group by SKU, UNITCOST
For each individual product, we know its unit cost. We also know the total amount of revenue the product has recently brought in; I include it because there has been some interest in keeping track of how each product category’s price bins divide up the total revenue within the category.
Obviously I can’t really share any further details of the internal data set I am working with, so for demonstration purposes, I’ve dug up a reasonably similar set of data involving the prices of video games, which can be loaded as follows:
df = pd.read_html(
    'http://garaph.info/sqldirect.php?queryid=3312',
    header = 0, index_col = 0
)[0]
The prices appear to be given in yen, so let’s convert them to approximate U.S. dollars by dividing by 100. The data set also provides the “TotalSales,” representing units sold, which we’ll convert to a measure of revenue, in order to match the nature of my own real-life data as much as possible.
df['LaunchPrice'] /= 100
df['TotalSales'] *= df.LaunchPrice
Let’s rename the dataframe’s columns to match the ones from my SQL query:
df.columns = ['NAME', 'UNITCOST', 'REV']
In terms of price, there are some products in this data set that are definite outliers. According to a couple of common criteria, it seems we should consider prices outside of the range from about $16 to $103 to be outliers. But let’s be conservative about this (i.e. hesitant to throw away data), and simply cut off anything above $120:
df = df[df.UNITCOST <= 120]
This only eliminates a handful of outliers. In my real data set, I happen to eliminate a few data points that are outliers in a very different sense than this, but never mind that. Outliers seem to make little difference to the agglomerative clustering technique as it’s used here; we’re removing them mainly for visualization purposes.
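For reference, one common outlier criterion of the kind alluded to above is the 1.5 × IQR rule; the sketch below shows how such a range could be computed. It is only illustrative, and is not necessarily how the $16 to $103 range mentioned earlier was obtained (and here it runs on the already-trimmed data):

# One common outlier criterion: values more than 1.5 IQRs beyond
# the interquartile range are flagged as outliers
q1, q3 = df.UNITCOST.quantile([.25, .75])
iqr = q3 - q1
low_cut, high_cut = q1 - 1.5*iqr, q3 + 1.5*iqr
print('Outlier range: below %.2f or above %.2f' % (low_cut, high_cut))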
From one product category to another, the shape of price distributions can vary to a surprising degree. Clustering algorithms will give poor results on extremely skewed distributions, but a basic way to automatically correct for this difficulty is to apply a one-parameter Box-Cox transformation. This is easily done in Python; the following line of code generates a transformed version of the unit cost data (I’ll call it “X” for lack of imagination), and also stores the optimal value of the transformation parameter λ:
df['X'], λ = boxcox(df.UNITCOST)
For convenience in producing visualizations later, let’s define the Box-Cox transformation function – which maps x to (x**λ − 1)/λ when λ ≠ 0, and to log(x) when λ = 0 – along with its inverse.
def bc(x):
    return (x**λ - 1)/λ if λ != 0 else np.log(x)

def ibc(x):
    if λ == 0:
        return np.exp(x)
    if x*λ + 1 < 0:
        return None
    return (x*λ + 1)**(1/λ)
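As a quick sanity check (a small usage example, not part of the original analysis), the inverse should undo the forward transformation for a few hypothetical prices:

# Round-trip check: ibc should invert bc for values in the data's range
for v in [5.0, 50.0, 120.0]:
    assert abs(ibc(bc(v)) - v) < 1e-6
print('bc(50) = %.4f, ibc(bc(50)) = %.4f' % (bc(50.0), ibc(bc(50.0))))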
Like the k-means approach, agglomerative clustering requires that we decide beforehand how many clusters we wish to use. To help decide, we can consult some measure of clustering performance. The scikit-learn implementation of k-means clustering has a built-in scoring function, but for agglomerative clustering we must define our own. The following scoring function gives the average distance of data points to their assigned clusters’ means:
def score(data, clusters):
    distsum = 0
    for c in set(clusters):
        distsum += sum(abs(
            data[clusters == c] - data[clusters == c].mean()
        ))
    return distsum / len(data)
Note that the lower the score, the better the average data point fits in with its assigned cluster. Now, let’s run the procedure with a variety of cluster numbers and observe the resulting scores.
score_x = []
score_y = []
for nc in range(1, 20):
    ac = AgglomerativeClustering(nc)
    ac.fit(df.X.values.reshape(-1, 1))
    score_x.append(nc)
    score_y.append(score(df.X, ac.labels_))

plt.rcParams['figure.figsize'] = 18, 6
plt.rcParams['font.size'] = 16
plt.plot(score_x, score_y, lw = 3)
plt.xticks(score_x)
plt.yticks([])
plt.xlabel('Number of clusters')
plt.ylabel('Score')
plt.grid()
plt.show()
Here are the results:
By the rationale of the “elbow method,” a good number of clusters to use here could be 2, 4, or 5. Let’s take a much closer look at each of these possibilities.
for nc in [2, 4, 5]:

    # Get the cluster assignments
    ac = AgglomerativeClustering(nc).fit(
        df.X.values.reshape(-1, 1)
    )
    df['cluster'] = ac.labels_

    # Find the boundaries between the clusters
    cdf = df.drop(['X','REV'], 1).groupby(
        'cluster'
    ).describe()
    cdf.columns = [c[1] for c in cdf.columns]
    bounds = (
        (cdf.sort_values('mean')['max'].values[:-1] +
         cdf.sort_values('mean')['min'].values[1:]) / 2
    )

    # Re-number the clusters in logical order
    df['cluster'] = df.cluster.replace({
        r[1].cluster : r[0]
        for r in cdf.sort_values('mean').reset_index().iterrows()
    })

    # Build special dataframe for visualizing price bins
    dfdict = {}
    for c in set(df.cluster):
        dfdict['cluster%d' % c] = df[df.cluster == c].X

    # Visualize the price bins
    pd.DataFrame(dfdict).plot.hist(
        stacked = True, bins = 50, rwidth = .95, legend = False
    )
    xt = plt.gca().get_xticks()
    xt = np.linspace(min(xt), max(xt) + 1, 20)
    xt = [x for x in xt if type(ibc(x)) == np.float64]
    plt.xticks(xt, [int(ibc(x)) for x in xt])
    plt.xlabel('unit cost (approx. USD)')
    plt.ylabel('# of products')

    # Make a legend with the revenue shares
    plt.legend([
        '%.1f%% by revenue' % (r*100)
        for r in df.groupby('cluster').agg({
            'UNITCOST' : 'median', 'REV' : 'sum'
        }).REV / df.REV.sum()
    ])

    # Show the boundaries
    ytop = plt.gca().get_ylim()[-1]
    for b in bounds:
        plt.plot(
            [bc(b), bc(b)], [0, .8*ytop], 'r--', lw = 2
        )
        plt.text(
            bc(b), ytop, '%.0f' % b, rotation = 90,
            fontdict = {
                'color' : 'red', 'size' : 20,
                'va' : 'top', 'ha' : 'center'
            }
        )

    plt.title('Product prices in %d segments' % nc)
    plt.show()
The possible bins
When products are split into two price bins, the result looks like this:
The line between regular and high-priced products is around $66. If we use four bins, the bin shown in blue above is further segmented into three separate groups:
Using five bins, the highest-priced products are divided into two groups:
Choosing the “correct” or “best” number of clusters to use for a category of products, such as those used in this example, is fairly subjective and depends on business viewpoints and trade-offs that are beyond the scope of this post. Regardless, I don’t believe that any of the three choices visualized above would be unreasonable.