Statistical Techniques in the Streaming Data Library (SDL): A Tutorial

Statistical techniques are essential tools for analyzing large datasets, so this tutorial covers skills that many Streaming Data Library (SDL) users will need.

    One of the most common quantities used to summarize a set of data is its center. The center is a single value, such as the mean or the median, chosen so that it gives a reasonable approximation of what is typical, or "normal," for the data.
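    As a minimal illustration of two common measures of center, the sketch below compares the mean and the median using Python's standard library. The temperature values are hypothetical, and nothing here is an SDL function.

```python
from statistics import mean, median

# Hypothetical daily temperatures (deg C); the last value is an outlier.
temps = [14.0, 15.0, 15.0, 16.0, 40.0]

center_mean = mean(temps)      # arithmetic mean, pulled upward by the outlier
center_median = median(temps)  # middle value, unaffected by the outlier

print(center_mean, center_median)
```

    The two values differ because a single extreme value shifts the mean but not the median, which is one reason to consider more than one measure of center.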
    Both running and weighted averages are important filtering methods for statistical analysis.
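    The two kinds of average can be sketched in plain Python as follows. The function names `running_mean` and `weighted_mean` are hypothetical helpers for illustration, not SDL functions, and the running mean here is assumed to be a simple unweighted boxcar that returns only full windows.

```python
def running_mean(values, window):
    """Unweighted running (moving) average over full windows only."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def weighted_mean(values, weights):
    """Average in which each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


smoothed = running_mean([1, 2, 3, 4, 5], window=3)
combined = weighted_mean([10, 20], weights=[1, 3])
```

    A three-point running mean of a five-point series yields three values, because the first and last points have no complete window centered on them.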
    Climatology is commonly known as the study of climate, but the term has other important meanings. It also refers to the long-term average of a given variable, often computed over periods of 20 to 30 years.
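    In the second sense, a climatology can be sketched as a per-calendar-month average taken across many years. The helper `monthly_climatology` and the `(year, month, value)` record layout are assumptions made for this example, and only two years of made-up data are used to keep it short.

```python
from collections import defaultdict


def monthly_climatology(records):
    """Return {month: mean value} from (year, month, value) records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _year, month, value in records:
        sums[month] += value
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sums}


# Hypothetical January and July temperatures for two years.
records = [
    (2000, 1, 2.0), (2001, 1, 4.0),
    (2000, 7, 20.0), (2001, 7, 22.0),
]
climatology = monthly_climatology(records)

# An anomaly is the departure of one observation from its climatology.
anomaly_jan_2001 = 4.0 - climatology[1]
```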
    It is often important to determine if a set of data is homogeneous before any statistical technique is applied to it. Homogeneous data are drawn from a single population.
    A random variable or random process is said to be stationary if all of its statistical parameters are independent of time. Although most statistical techniques require that the data be stationary, most atmospheric processes are visibly nonstationary.
    While measures of central tendency are used to estimate "normal" values of a dataset, measures of dispersion are important for describing the spread of the data, or its variation around a central value.
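    Three common measures of dispersion are the range, the variance, and the standard deviation. The sketch below computes them with Python's standard library. The sample values are hypothetical, and the population forms (`pvariance`, `pstdev`) are used as an assumption; the sample forms `variance` and `stdev` divide by n - 1 instead.

```python
from statistics import pstdev, pvariance

# Hypothetical sample values, chosen so the results are round numbers.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

data_range = max(data) - min(data)  # distance between the extremes
variance = pvariance(data)          # mean squared deviation from the mean
std_dev = pstdev(data)              # square root of the variance

print(data_range, variance, std_dev)
```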
    Correlation is a measure of the linear association between two variables. A single value, commonly referred to as the correlation coefficient, is often used to describe this association.
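    One widely used correlation coefficient is the Pearson product-moment coefficient, which ranges from -1 to 1. The following is a minimal sketch of that coefficient; `pearson_r` is a hypothetical helper name, and the choice of the Pearson form is an assumption, since the text does not say which coefficient is meant.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear, rising
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, falling
```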
    Indices are diagnostic tools used to describe the state of a climate system. Climate indices are most often represented with a time series; each point in time corresponds to one index value.
    A frequency distribution is one of the most common graphical tools used to describe a single population. It is a tabulation of the frequencies of each value (or range of values).
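    The tabulation step can be sketched as counting how many values fall into each equal-width range. The helper `frequency_table` and the half-open bin convention [lower, upper) are assumptions made for this example.

```python
from collections import Counter
from math import floor


def frequency_table(values, bin_width):
    """Count values per bin; keys are (lower, upper) with lower <= v < upper."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter(floor(v / bin_width) for v in values)
    return {
        (k * bin_width, (k + 1) * bin_width): counts[k]
        for k in sorted(counts)
    }


# Hypothetical observations grouped into bins of width 5.
table = frequency_table([1, 2, 2, 7, 12], bin_width=5)
for (lower, upper), count in table.items():
    print(f"{lower:>3} to {upper:<3} {'#' * count}")
```

    Plotting the counts as bars, as the text rendering above does crudely, gives the familiar histogram.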
    Singular value decomposition (SVD) is quite possibly the most widely used multivariate statistical technique in the atmospheric sciences. The technique was first introduced to meteorology in a 1956 paper by Edward Lorenz, in which he referred to the process as empirical orthogonal function (EOF) analysis. Today, it is also commonly known as principal-component analysis (PCA). All three names are still used, and refer to the same set of procedures within the Streaming Data Library (SDL).
    Interpolation is the process of using known data values to estimate unknown data values.
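    The simplest case is linear interpolation between two neighboring known points. The sketch below assumes the known x values are sorted in strictly increasing order and refuses to extrapolate; `linear_interpolate` is a hypothetical helper, not an SDL function, and other schemes (nearest neighbor, splines) follow the same idea with different weighting.

```python
from bisect import bisect_right


def linear_interpolate(x_known, y_known, x):
    """Estimate y at x from known points; x_known must be strictly increasing."""
    if len(x_known) != len(y_known) or len(x_known) < 2:
        raise ValueError("need at least two matching known points")
    if not x_known[0] <= x <= x_known[-1]:
        raise ValueError("x lies outside the known range (no extrapolation)")
    # Index of the known point at or just left of x, capped so that
    # x equal to the last known point uses the final interval.
    i = min(bisect_right(x_known, x) - 1, len(x_known) - 2)
    x0, x1 = x_known[i], x_known[i + 1]
    fraction = (x - x0) / (x1 - x0)
    return y_known[i] + fraction * (y_known[i + 1] - y_known[i])


# Hypothetical known values at x = 0, 10, 20.
xs = [0, 10, 20]
ys = [0.0, 100.0, 50.0]
estimate_mid = linear_interpolate(xs, ys, 5)
estimate_late = linear_interpolate(xs, ys, 15)
```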