The National Oceanic and Atmospheric Administration (NOAA) has an FTP site with hourly air temperature measurements (among other things) from various sensor stations across the U.S. As we’ll see, these data sets can be fun to explore and play with; in this post, I will use all data from the station designated “NC Asheville 13 S.” I suppose I chose this station because I have some deep family roots in Asheville, and also because NOAA’s records seem to go back farther for this station than anywhere else.
My full Python code is available at this link. I’ll show only a few key parts of it here. The list of import statements is rather long, but here are the ones that are most relevant to the pieces of code that will be shown here:
from matplotlib.dates import date2num
from scipy.optimize import curve_fit
import numpy as np
Loading, cleaning, visualizing
Each year of NOAA data is stored in a plain text file, and these files are roughly 2MB each. My code pulls all available data for Asheville 13 S (around 30MB total right now) and stores it locally. If the code is run again later, only the most recent year of data is updated.
We must also get a separate file of headers from the FTP site, because the data files only contain actual rows of data (no column names).
Upon examination, we can note that from the beginning of the data record (some time in the year 2001), every hour of every day is represented as a row, without exception. However, a small proportion of rows are missing valid data: air temperature (along with other measurements) is marked with a large negative sentinel value. The longest stretch of consecutive missing temperatures is 4.3 days long, but as a whole, 99.4% of all rows have valid entries for air temperature.
We’ll take two separate approaches to the missing values. For now, we’ll simply ignore (i.e. delete) those rows of the data set and use only the valid measurements. Later, we’ll need to do something slightly more sophisticated.
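The "ignore the invalid rows" approach can be sketched in a few lines of pandas. This is a self-contained toy example, not the actual download code: the exact sentinel value and column names here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sentinel for missing readings; the real NOAA files mark
# them with a large negative number (exact value assumed here)
MISSING = -9999.0

df = pd.DataFrame({
    'date': pd.date_range('2001-01-01', periods=6, freq='h'),
    'Fahrenheit': [41.2, 40.8, MISSING, 39.5, MISSING, 38.9],
})

# keep only rows whose temperature is a plausible reading
df = df[df.Fahrenheit > MISSING].reset_index(drop=True)
print(len(df))  # 4 valid rows remain
```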
With the invalid measurements removed, here is our first visualization of the air temperature data:
Air temperature in Asheville follows a very sinusoidal pattern. This is easier to see if we plot just a couple of years at a time; for example, here is March 14, 2005 through March 14, 2007:
In fact, there are other measurements in NOAA’s data that produce a cleaner sine wave, such as deep soil temperature readings. Here they are from May 2011 through May 2013:
There are all sorts of questions one might ask about these soil temperature readings, but from here on, I’ll stick to air temperature, which is much messier and therefore more interesting.
Let’s try to model air temperature as a sine function, of the form y = a sin(b·(x – c)) + d, for some numbers a, b, c, and d that will be chosen to fit the data. As you might have learned in a trigonometry course, the number a determines the amplitude of the sine wave. From the graph of air temperature, it appears that the amplitude is around 20ºF. The number b determines the period of the wave; for temperatures, common sense suggests a cycle around 365 days long, in which case b would be 2π/365. The number c determines the horizontal shift of the wave, which I think is fairly uninteresting here, and d determines the vertical shift, which appears to be around 50ºF in the graph of air temperature.
To find the best-fitting sine function, we first define the general form of the sine function. This function will consider x to be the number of days after the earliest data point.
mindate = df.date.apply(date2num).min()

def sine(x, a, b, c, d):
    x_ = x - mindate
    return a * np.sin(b * (x_ - c)) + d
Now, we’ll create a list of x values from the data set’s dates, and find the best values for a, b, c, and d to fit air temperature (which I’ve named “Fahrenheit”).
x = [date2num(t) for t in df.date]
p, _ = curve_fit(
    sine, x, df.Fahrenheit,
    p0=[20, 2*np.pi / 365, 0, 50]
)
period = 2*np.pi / p[1]
The best-fit sine function has a period of 365.13 days. Below is a comparison of the data with that sine function, and it looks like a pretty good fit!
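As a sanity check on this fitting procedure, here is a self-contained sketch that runs the same curve_fit call on synthetic data (not the NOAA series) with known parameters, and confirms that the recovered period lands very close to the true 365 days:

```python
import numpy as np
from scipy.optimize import curve_fit

def sine(x, a, b, c, d):
    return a * np.sin(b * (x - c)) + d

rng = np.random.default_rng(0)
x = np.arange(0, 365 * 4, 1/24)          # four years of hourly samples, in days
# true parameters: amplitude 20, period 365 days, shift 10, mean 50, noisy
y = sine(x, 20, 2*np.pi/365, 10, 50) + rng.normal(0, 5, x.size)

p, _ = curve_fit(sine, x, y, p0=[20, 2*np.pi/365, 0, 50])
period = 2*np.pi / p[1]
print(round(period, 1))  # should land very close to 365
```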
Digging deeper for any patterns this basic model is missing, let’s examine the residuals. To calculate them:
resid = df[['date']].copy()
resid['Fahrenheit'] = df.Fahrenheit.values - sine(np.array(x), *p)
Here is a plot of the entire set of residuals over time:
There are a few subtleties that a simple sine function doesn’t capture. On the largest scale, there appears to have been a somewhat steady trend of rising temperature in the past two decades. This can be seen by plotting a 5-year rolling average of the residuals:
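A time-based rolling window does the smoothing. The sketch below is self-contained and stands in a synthetic zero-mean residual series for the real one; the series just needs a DatetimeIndex for the day-based window string to work.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the residual series: six years of hourly noise
idx = pd.date_range('2001-01-01', periods=24*365*6, freq='h')
rng = np.random.default_rng(1)
resid = pd.Series(rng.normal(0, 8, idx.size), index=idx)

# roughly five years, expressed as a day-based rolling window
smooth = resid.rolling('1826D').mean()
```

With pure noise the rolling mean hugs zero; on the real residuals, a drift in this curve is what reveals the long-term warming trend.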
Perhaps it would make sense to add a very small linear term to the sine model. However, the effect is so slight over the length of time we’re working with that I’ll neglect it here. There are far more dramatic patterns overlooked by our model so far. To see them, let’s plot the residuals from the warmest and coolest times of the day. For example, to plot the 6:00AM residuals:
resid[resid.date.dt.hour == 6].set_index('date').plot(
    marker='o', markersize=2, lw=0, alpha=.5
)
Plotting the 3:00PM residuals as well, we see the following:
The model is leaving out a huge source of variation by not considering daily temperature cycles. To see this more clearly, look at the average residual by hour of the day over the whole data set:
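Computing the average residual by hour of day is a one-line groupby. Here is a self-contained sketch on a synthetic residual series with a built-in daily cycle (peaking mid-afternoon), using the same 'date'/'Fahrenheit' column names as above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2001-01-01', periods=24*30, freq='h')
# synthetic residuals: a pure daily cycle peaking at 3:00PM
vals = 8 * np.sin(2*np.pi/24 * (dates.hour - 9))
resid = pd.DataFrame({'date': dates, 'Fahrenheit': vals})

hourly = resid.groupby(resid.date.dt.hour).Fahrenheit.mean()
print(hourly.idxmax())  # 15, i.e. 3:00PM
```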
On the other hand, our simple sine function does mostly account for patterns operating at the level of what day of the year it is:
There is something a little weird about the residuals around days 150 to 250. This roughly corresponds to the months of June, July, and August each year. There is noticeably less variation in residuals during these months, and less symmetry in the distribution of the residuals.
It turns out that this summertime effect is more pronounced for overnight hours, specifically from 11:00PM to 5:00AM.
Here is the distribution of residuals for the entire data set. Note the slight lack of symmetry:
This is mainly due to the very asymmetrical distribution of residuals in summer and at night:
Residuals at other times of the year, or strictly during daytime hours, have distributions much closer to bell-shaped:
In light of this, if we return to some of the earlier visualizations and look closely, we can see that there appears to be a kind of “hard floor” on summer heat in Asheville, with the temperature very rarely descending below about 65ºF in those months, and most nights in those months having a temperature between 65ºF and 70ºF.
Second level of sine modeling
The simplest improvement to be made to the sine model from before is to add a second sine function to it, with a forced period of 24 hours:
def sine_h(h, a, c, d):
    return a * np.sin(2*np.pi/24 * (h - c)) + d

p_h, _ = curve_fit(
    sine_h, resid.date.dt.hour, resid.Fahrenheit,
    p0=[5, 0, 0]
)
Here is the whole set of residuals from before, plotted by hour of the day, along with this second sine function:
Given the lesser variation in residuals in summer months compared to other months, I have wondered whether this second sine function’s best-fitting amplitude might be different depending on which months it is fit to. As it turns out, if we fit to residuals from summer months only, the amplitude is about 8.25ºF. For other months, the amplitude is hardly different, and in fact slightly smaller, at 8.15ºF.
Here is a look at the new two-level sine model (i.e. the first sine function added to this second sine function) along with the data set:
This graph isn’t very informative; it’s better to compare the model and data on shorter time scales, selected at random. Here are two examples:
The two-sine model provides a remarkably good fit on what I would call “typical” days, such as in the first graph, but it can’t foresee unusually cold spells like the one from December 19 to 21, 2003.
The RMSE of the first sine model was about 10.15ºF. For the two-sine model, it’s down to 8.34ºF. The residuals can be calculated as follows:
resid = df.set_index('date').Fahrenheit
resid -= sine(np.array(x), *p) + sine_h(df.date.dt.hour.values, *p_h)
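For completeness, the RMSE figures quoted above come from the usual root-mean-square computation on the residuals. A minimal sketch on a toy residual array:

```python
import numpy as np

# RMSE of a residual array: square, average, take the root
resid = np.array([3.0, -4.0, 0.0, 5.0])
rmse = np.sqrt(np.mean(resid ** 2))
print(rmse)  # ≈ 3.5355
```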
They now have a pleasantly symmetric distribution:
A more elegant approach
The idea we’ve pursued so far is to treat temperature as the sum of two cyclical functions, one with a period of 365 days (or so), and the other with period 24 hours. We have essentially been engaging in an intuitive but drawn-out process of Fourier analysis. An alternative approach that is cleaner and quicker but less intuitive would be to apply a discrete Fourier transform to our data set. In brief, this somewhat mystical tool will find the period and amplitude of every oscillatory component in our data, including the two we saw before.
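To demystify the transform a little, here is a self-contained toy demo: for a pure sine sampled hourly over exactly one cycle, np.fft.rfft puts all the energy into a single term, and that term's index and magnitude recover the period and amplitude exactly.

```python
import numpy as np

n = 24 * 365                                   # one year of hourly samples
t = np.arange(n)
y = 50 + 20 * np.sin(2*np.pi * t / n)          # exactly one cycle per record

fft = np.fft.rfft(y)
k = np.abs(fft[1:]).argmax() + 1               # strongest nonconstant term
amplitude = 2 * np.abs(fft[k]) / n             # rfft magnitude -> amplitude
period_hours = n / k
print(period_hours, round(amplitude, 2))       # 8760.0 20.0
```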
Perhaps the main drawback to using a discrete Fourier transform, like NumPy’s rfft function, is that it will work best if our data spans a whole number of cycles. Also, the data points should be spaced equally through time (i.e. there cannot be any missing data points). So now, we’ll go back to the original data set, with its missing rows still in place, and fill in those unknown temperature readings by interpolation. Then we’ll trim the interpolated data set so that it spans an integer number of years:
df = df.interpolate()
startdt = df.date.iloc[0]
starters = (df.date.dt.dayofyear == startdt.dayofyear)
starters &= (df.date.dt.hour == startdt.hour)
df = df.iloc[:np.flatnonzero(starters).max()]
For example, if the first available temperature reading was on a June 5 at 3:00PM, the last temperature reading in the trimmed data set will be several years later on a June 5 at 2:00PM. Now we can calculate the fast Fourier transform (FFT) and prepare to examine its strongest terms:
fft = np.fft.rfft(df.Fahrenheit)
itop = np.abs(fft).argsort()[::-1]
If we look at the terms of the FFT in the order given by the index “itop,” here are the first few:
- A constant component of 54.98°F
- A sinusoidal component with amplitude 17.79°F and period 365.25 days
- A sinusoidal component with amplitude 8.15°F and period 1.0 days
- A sinusoidal component with amplitude 1.81°F and period 0.5 days
- A sinusoidal component with amplitude 0.98°F and period 182.625 days
- A sinusoidal component with amplitude 0.97°F and period 1948.0 days
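The amplitudes and periods in this list are read off the FFT terms by a small conversion: term k of the rfft of n samples has amplitude 2|fft[k]|/n (the constant term k = 0 is just |fft[0]|/n), and its period in days is (n/k)/24 for hourly data. Here is a self-contained sketch on a synthetic two-year hourly signal with known components; the describe helper is an illustration, not part of the original code:

```python
import numpy as np

n = 24 * 365 * 2                               # two years of hourly samples
t = np.arange(n)
# constant 55, annual amplitude 18, daily amplitude 8 (degrees F)
y = 55 + 18*np.sin(2*np.pi*t/(24*365)) + 8*np.sin(2*np.pi*t/24)

fft = np.fft.rfft(y)
itop = np.abs(fft).argsort()[::-1]

def describe(k):
    """Amplitude (degrees F) and period (days) of rfft term k."""
    if k == 0:
        return np.abs(fft[0]) / n, np.inf      # the constant component
    return 2 * np.abs(fft[k]) / n, (n / k) / 24
```

Applied to the strongest terms, describe recovers (55, constant), then (18, 365 days), then (8, 1 day), mirroring the structure of the real list above.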
Notice that the three strongest components picked up by the FFT are basically identical to the two-sine model we worked out before. The following code pulls out just the top three components from the FFT and produces the corresponding model from them:
fft_trunc = [
    f if i in itop[:3] else 0
    for i, f in enumerate(fft)
]
ifft = np.fft.irfft(fft_trunc)
Here is a comparison of the model and data on one randomly-selected short time scale, showing its close resemblance to the previous model:
The residuals can be calculated as follows, and have a distribution quite similar to before:
resid = df.set_index('date').Fahrenheit - ifft
The RMSE is around 8.29ºF, comparable to our previous two-sine approach.
It is trivial to include another component from the FFT, and the results are interesting. The fourth strongest FFT component has a period of ½ day, so that when we include it, the effect is to alter the shape of the model’s daily temperature oscillations. Here is a look at this four-component model along with data on another randomly-selected short time scale:
We can see that this new model better matches the typical daily behavior of the temperature data, with a rapid rise followed by a more gradual decline:
Of course, it is also trivial to form a model that includes many more components. For example, here are some glimpses of the results of including the 1000 strongest FFT components:
As the last results above suggest, a multiple-sine type of model could be made to fit the available data as closely as desired by using a large number of terms. But if we wanted to put the model to some predictive purpose, it would be better to use a cross-validation scheme to try to select the optimal number of terms to include for best generalization beyond the data at hand. My main aim here has been exploratory, and I would imagine that the prediction of temperature is a topic well-developed by meteorologists, so I’ll refrain from delving much further, at least for now.
There is one last issue with the residuals of the two-sine temperature model that was not explored above: they are strongly autocorrelated at a lag of one day. We can see this in a plot of each residual versus the residual 24 hours later:
To help illustrate the simple reason for this autocorrelation in residuals, here are some plots of the data and two-sine model on a couple of longer timescales:
Notice how commonly it happens that when a residual is large positive or negative (i.e. when a temperature is much higher or lower than predicted), the residual one day before was also large in the same way. So if this model were to be used to make short-term (e.g. 10-day) predictions, it would clearly be wise to take recent residuals into account as well, as done in the ARIMA modeling paradigm, for example.
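The lag-24 autocorrelation itself is easy to measure with pandas. Here is a self-contained sketch on a synthetic hourly series built to have day-to-day persistence (an AR(1) process at the daily scale, held constant within each day); the series and its parameters are illustrative, not the real residuals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
daily = np.zeros(400)
for i in range(1, 400):
    # each day's anomaly carries 70% of the previous day's, plus noise
    daily[i] = 0.7 * daily[i-1] + rng.normal(0, 3)

resid = pd.Series(np.repeat(daily, 24))   # each day's value held for 24 hours
lag24 = resid.autocorr(lag=24)            # residual vs. residual one day later
print(lag24 > 0.4)  # True: strong one-day persistence
```

An ARIMA-style model would exploit exactly this statistic to sharpen short-term predictions.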
Title image: composite of winter and summer photographs of Asheville’s Biltmore Estate.