Data Homogeneity

It is often important to determine if a set of data is homogeneous before any statistical technique is applied to it. Homogeneous data are drawn from a single population. In other words, all outside processes that could potentially affect the data must remain constant for the complete time period of the sample. Inhomogeneities are caused when artificial changes affect the statistical properties of the observations through time. These changes may be abrupt or gradual, depending on the nature of the disturbance. Realistically, obtaining perfectly homogeneous data is almost impossible, as unavoidable changes in the area surrounding the observing station will often affect the data.

Interpreting climate data with unknown homogeneity:

Analyzing the Homogeneity of a Dataset

The following method may be used to determine whether a set of data can be considered homogeneous to a certain degree of accuracy. Example: Analyze the homogeneity of data by comparing the annual means of the daily minimum temperature time series for Sherbrooke, Quebec and Shawinigan, Quebec from 1920 to 1970.
Locate Dataset, Variable, and Station
  • Select the "Datasets by Category" link in the blue banner on the Data Library page.
  • Click on the "Atmosphere" link.
  • Select the NOAA NCDC GDCN dataset.
  • Click on the "searches" link to the right of the map.
  • In the Name text box under the Searches subheading, enter Sherbrooke.
  • Click the Search NOAA NCDC GDCN button.
  • Click on the number "7028120", the first entry that appears below the search text box.
    You have selected the station identification number for Sherbrooke. To get general information on finding station IDs, click the following link to the tutorial: How to Find A Station ID
  • Scroll down the page and select the "Min Temperature" link under the Datasets and Variables subheading.
Select Temporal Domain
  • Click on the "Data Selection" link in the function bar.
  • Enter the text 1 Jan 1920 to 1 Jan 1970 in the Time text box.
  • Press the Restrict Ranges button and then the Stop Selecting button.
Compute Yearly Mean Minimum Temperature
  • Click on the "Expert Mode" link in the function bar.
  • Enter the following line under the text already there:

    T 365 boxAverage
    
  • Press the OK button.
    This command computes the mean minimum temperature for each year by averaging the daily minimum temperature over consecutive 365-day blocks. The result is not an exact calendar-year average: a leap year has 366 days, so after each leap year the 365-day block starts one day earlier. Ignoring this drift, the result is still a good approximation of the mean minimum temperature per year.
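The leap-year drift described above can be illustrated offline. The following Python sketch is not the Data Library's implementation of boxAverage; it assumes consecutive, non-overlapping 365-day blocks beginning on 1 Jan 1920 and drops any incomplete block at the end. The function names are hypothetical.

```python
from datetime import date, timedelta


def box_average(values, width=365):
    """Mean of each consecutive, non-overlapping block of `width` values.

    Assumption: an incomplete block at the end of the series is dropped.
    """
    n_blocks = len(values) // width
    return [
        sum(values[i * width:(i + 1) * width]) / width
        for i in range(n_blocks)
    ]


def block_start_dates(first_day, n_blocks, width=365):
    """Calendar date on which each block begins."""
    return [first_day + timedelta(days=i * width) for i in range(n_blocks)]


# 1920 and 1924 are leap years, so the block boundary slips one day
# earlier each time a 29 February falls inside a block.
starts = block_start_dates(date(1920, 1, 1), 6)
for start in starts:
    print(start.isoformat())
```

Under these assumptions the second block begins on 1920-12-31 and the sixth on 1924-12-30. By the fiftieth block the start date has slipped to mid-December, which is small compared with the length of a year.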
View Yearly Mean Minimum Temperature Time Series
  • To see the result of this operation, choose the time series viewer.

Time Series of Average Minimum Temperature for Sherbrooke


A gradual upward trend is noticeable over the selected time range. The increase in temperature may have been caused by urbanization in the region surrounding the observing station. Has the urbanization made a sufficient impact on the data so that it may no longer be considered homogeneous over this time period? To answer this question, it is necessary to analyze the distribution of the data around the median.

Subtract Median From Dataset
  • Return to the dataset page by clicking on the right-most link in the blue source bar at the top of the page.
  • Enter Expert Mode and type the following command under the text already there:
    [T] 1 medianover
    
  • Press the OK button.
    The median should be located below the expert mode text box in bold: 0.8683567 degrees Celsius. Take note of this value. The medianover function is further explained in the Measures of Central Tendency section.
  • In the source bar, click on the T 365 0.0 boxAverage link.
    This will undo the medianover command.
  • Click on the Expert Mode link in the function bar if the text box is not shown.
  • In the Expert Mode text box, enter the following line under the text already there:
    0.8683567 sub
    
  • Press the OK button.
    The above command subtracts the median (0.8683567° Celsius) from each value in the dataset.
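For readers who have exported the yearly means and want to reproduce this step outside the Data Library, here is a minimal Python sketch. The values are hypothetical and chosen only for illustration; they are not the Sherbrooke data, and the function name is an assumption.

```python
from statistics import median


def deviations_from_median(values):
    """Return the median and each value minus that median."""
    med = median(values)
    return med, [v - med for v in values]


# Hypothetical yearly means in degrees Celsius, for illustration only.
yearly_means = [0.5, 1.5, -0.25, 1.0, 2.0, 0.75]
med, deviations = deviations_from_median(yearly_means)
print(med)
print(deviations)
```

With an even number of values and no ties, the median lies midway between the two middle values, so exactly half of the deviations are positive and half are negative.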
Analyze Homogeneity of Data
  • Select the "Tables" link in the function bar.
  • Read the licensing agreement and click the "I agree" button to continue.
  • Select the columnar table link.
    A table will appear with Time in one column and (Min Temp - 0.8683567) in the other column. The day in the Time column changes every four years because of the leap year issue mentioned earlier.
  • Count the number of runs above or below the median. A run is a sequence of consecutive values that all lie on the same side of the median.
    For example, if the value in the right column remains negative for three years and then becomes positive in the fourth year, those three years are considered one run. If in the fifth year the value becomes negative again, then the fourth year is a separate run. Homogeneity is tested by comparing the number of runs with the total number of elements in the sample.
  • Use the significance table below to help decide whether the minimum temperature data at Sherbrooke is homogeneous.
The table lists the limits on the number of runs for a given number of values above (NA) and below (NB) the median. For a 40-year series, for example, NA = NB = 20. If the number of runs falls between the .10 and .90 significance limits, there is a high probability that the data is homogeneous. Other significance tables can be obtained for sample sizes not contained in the table.
NA = NB    .10 significance level    .90 significance level
10         8                         13
11         9                         14
12         9                         16
13         10                        17
14         11                        18
15         12                        19
16         13                        20
17         14                        21
18         15                        22
19         16                        23
20         16                        25
25         22                        30
30         26                        36
35         31                        41
40         35                        47
45         40                        52
50         45                        57

Oliver, John E. Climatology: Selected Applications. p 7.
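The run-counting rule described in the steps above can be sketched in Python for data that has been exported from the columnar table. This is a minimal illustration, not part of the Data Library; the function name is hypothetical, and skipping values exactly equal to the median is an assumption.

```python
def count_runs(deviations):
    """Count runs of consecutive values on the same side of the median.

    Assumption: values exactly equal to the median (deviation 0) are
    skipped, since they lie neither above nor below it.
    """
    signs = [d > 0 for d in deviations if d != 0]
    if not signs:
        return 0
    # A new run starts wherever the sign differs from the previous one.
    changes = sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)
    return 1 + changes


# Three years below the median, one above, one below: three runs.
example = [-0.4, -0.1, -0.3, 0.2, -0.5]
print(count_runs(example))  # 3
```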

There are 18 runs in the Sherbrooke data from 1920 to 1970. The sample contains 50 elements (each yearly mean minimum temperature constitutes one element), so NA = NB = 25. According to the table, at the .10 significance limit there should be at least 22 runs. We can therefore conclude, with 90% confidence, that this data is not homogeneous. Is this inhomogeneity caused by a large-scale climatic change or by an inconsistency in the area surrounding the observing station? To answer this question, we analyze the mean minimum temperature at another station in the same region.
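The table lookup used in this conclusion can be written out as a short Python sketch. The limits are copied from the significance table above; the names RUN_LIMITS and within_limits are hypothetical, and treating both limits as inclusive is an assumption.

```python
# Lower (.10) and upper (.90) run limits keyed by NA = NB,
# copied from the significance table above.
RUN_LIMITS = {
    10: (8, 13), 11: (9, 14), 12: (9, 16), 13: (10, 17), 14: (11, 18),
    15: (12, 19), 16: (13, 20), 17: (14, 21), 18: (15, 22), 19: (16, 23),
    20: (16, 25), 25: (22, 30), 30: (26, 36), 35: (31, 41), 40: (35, 47),
    45: (40, 52), 50: (45, 57),
}


def within_limits(n_runs, n_each_side):
    """True if the run count lies between the tabulated limits.

    Assumption: both limits are treated as inclusive.
    """
    lower, upper = RUN_LIMITS[n_each_side]
    return lower <= n_runs <= upper


# Sherbrooke: 50 yearly values, so NA = NB = 25, with 18 runs counted.
print(within_limits(18, 25))  # False
```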

Repeat the same process for Shawinigan:
Locate Dataset and Variable
Select Temporal Domain and Station
  • Click on the "Data Selection" link in the function bar.
  • Enter the text 1 Jan 1920 to 1 Jan 1970 in the Time text box.
  • Enter the station identification 7018000 in the ISTA Station text box.
  • Press the Restrict Ranges button and then the Stop Selecting button.
    The station ID 7018000 is for Shawinigan. To get more information on finding station IDs, click the following link to the tutorial: How to Find A Station ID
Compute Yearly Mean Minimum Temperature
  • Click on the "Expert Mode" link in the function bar.
  • In the text box that appears, enter the following line under the text already there:

    T 365 boxAverage
    
  • Press the OK button.
View Yearly Mean Minimum Temperature Time Series
  • To see the results of this operation, choose the time series viewer.

Time Series of Average Minimum Temperature for Shawinigan
Based on visual inspection, these data appear to be more homogeneous than the data taken at Sherbrooke. There isn't a distinct upward trend in the minimum temperatures, as there was in the Sherbrooke data.
Subtract Median From Dataset
  • Return to the dataset page by clicking on the right-most link on the blue source bar at the top of the page.
  • Enter Expert Mode and type the following command under the text already there:

    [T] 1 medianover
    
  • Press the OK button.
    The median should be -0.6845208 degrees Celsius. Take note of this value.
  • In the source bar, click on the T 365 0.0 boxAverage link.
  • Click on the Expert Mode link in the function bar if the text box is not shown.
  • In the Expert Mode text box, enter the command:

    -0.6845208 sub
    
  • Press the OK button.
    The above command subtracts the median (-0.6845208° Celsius) from each value in the dataset.
Analyze Homogeneity of Data
  • Select the "Tables" link in the function bar.
  • Click the I agree button.
  • Select the columnar table link.
    A table will appear with Time in one column and (Min Temp - -0.6845208) in the other column.
  • Count the number of runs above or below the median.
    You should count 21 runs. According to the significance table above, this is still one run short of the 22 required, so we cannot conclude (with 90% confidence) that this data is homogeneous. Yet, because the Shawinigan data presents more runs than the Sherbrooke data, we may conclude that the Shawinigan data is more homogeneous than the Sherbrooke data.
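As a cross-check on the two run counts, the large-sample normal approximation to the runs test (the Wald-Wolfowitz test) can be computed directly. This is a swapped-in technique for illustration, not necessarily how the significance table was produced; the function name is hypothetical.

```python
from math import sqrt


def runs_z_score(n_runs, n_above, n_below):
    """Standardized run count under the normal approximation.

    Uses the standard mean and variance of the number of runs in a
    random sequence with n_above and n_below values on each side.
    """
    n = n_above + n_below
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    return (n_runs - expected) / sqrt(variance)


# 50 yearly values at each station, so NA = NB = 25.
for station, runs in [("Sherbrooke", 18), ("Shawinigan", 21)]:
    print(station, round(runs_z_score(runs, 25, 25), 2))
```

For NA = NB = 25 the expected number of runs is 26. The scores come out at about -2.29 for Sherbrooke and -1.43 for Shawinigan. Both lie below -1.28, the lower 10% point of the standard normal distribution, which agrees with the table: both stations fall short of the lower limit, Sherbrooke by a much wider margin.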
Shawinigan lies in the same region as Sherbrooke, to the northwest across the St. Lawrence River, yet the minimum temperature at Sherbrooke exhibited a noticeable upward trend over the time period while the minimum temperature at Shawinigan did not. Therefore, we can conclude that the inhomogeneity at Sherbrooke is not the result of large-scale climatic change. Instead, from 1920 to 1970, Sherbrooke was heavily affected by human development. The increased density and height of buildings surrounding the observing station in Sherbrooke created a small heat island, which in turn created an inhomogeneity in the data. Shawinigan, on the other hand, was not affected by development and, in turn, did not experience a gradual warming over the period.