Correlation Analysis (Draft)
The Correlation Analysis uses the copula technique to correlate multiple variables. A copula is a statistical tool used to model the dependence (correlation or other relationships) between random variables. The key strength of copulas is that they allow the joint behavior of multiple variables to be modeled separately from their marginal distributions:
- Marginals describe the behavior of each variable on its own.
- Dependence structure shows how the variables are related.
Hydrological processes often involve multiple interdependent variables. For example:
- Antecedent soil moisture surrounding a given rainfall event
- Time of concentration and storage coefficient in a unit hydrograph transform function
- Soil moisture and baseflow volume
By Sklar’s theorem, any joint distribution F(x_1,...,x_d) with marginals F_i(x_i) can be written as
| F(x_1,...,x_d)=C(F_1(x_1),...,F_d(x_d)) |
where C is the copula. Rank-based dependence measures such as Spearman’s rho and Kendall’s tau depend only on the copula and can be expressed directly in terms of it, while Pearson’s correlation coefficient also depends on the marginal distributions. Several copula types are widely used: the Gaussian copula, built from the multivariate Gaussian cumulative distribution function with a correlation matrix, exhibits no tail dependence; the t copula, based on the multivariate Student-t distribution, allows symmetric tail dependence; and Archimedean copulas, such as the Clayton, Gumbel, and Frank copulas, provide flexible one-parameter models with different tail properties.
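For a bivariate copula C, the rank correlations have standard textbook forms, shown here for context:
| \rho_S=12\int_0^1\int_0^1 C(u,v)\,du\,dv-3 |
| \tau=4\int_0^1\int_0^1 C(u,v)\,dC(u,v)-1 |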
HEC-HMS currently incorporates the Gaussian and t copulas. The software leverages the Apache Commons Math library's CorrelatedRandomVectorGenerator class to generate multivariate normal samples. The generator is initialized with a mean vector (all zeros for copula construction), a covariance or correlation matrix that defines the dependence structure, a random number generator for producing independent standard normal variates, and a small numerical tolerance to ensure stability in the Cholesky decomposition when the covariance matrix is nearly singular.
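The sketch below illustrates how such a generator can be initialized. It is a minimal example against Apache Commons Math 3 rather than the HEC-HMS source; the correlation values, tolerance, and seed are hypothetical.

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.random.CorrelatedRandomVectorGenerator;
import org.apache.commons.math3.random.GaussianRandomGenerator;
import org.apache.commons.math3.random.Well19937c;

public class CorrelatedSamplerSketch {
    public static void main(String[] args) {
        // Hypothetical 2x2 correlation matrix (rho = 0.7)
        RealMatrix correlation = new Array2DRowRealMatrix(new double[][] {
                {1.0, 0.7},
                {0.7, 1.0}
        });

        double[] mean = {0.0, 0.0};   // zero mean for copula construction
        double small = 1.0e-12;       // tolerance guarding the matrix decomposition

        CorrelatedRandomVectorGenerator generator = new CorrelatedRandomVectorGenerator(
                mean, correlation, small,
                new GaussianRandomGenerator(new Well19937c(42)));

        double[] x = generator.nextVector(); // one correlated standard normal vector
        System.out.printf("X = (%.4f, %.4f)%n", x[0], x[1]);
    }
}
```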
The HEC-HMS implementation starts with independent standard normal random numbers. Using the Cholesky decomposition of the correlation/covariance matrix, the independent numbers are then transformed to obtain a correlated pattern.
The Cholesky decomposition of \Sigma is
| \Sigma=LL^T |
where L is lower triangular. The transformation equation is
| X=LZ |
where L is the Cholesky lower-triangular factor and Z is the vector of independent standard normals. Each generated sample is a vector of correlated standard normal values X = (X_1,...,X_d). To obtain uniform margins, the standard normal cumulative distribution function is applied to each component: U_i=\Phi(X_i), which ensures that U_i \sim Uniform(0,1). The vector U=(U_1,...,U_d) represents a sample from the Gaussian copula with the specified correlation structure. Finally, each uniform value is transformed into the desired marginal distribution by applying the corresponding inverse CDF. The marginal distributions use the Simple Distribution methods described here. Each distribution must have its own parameters defined for the correlated sampling to function properly.
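Putting the steps together, the following is a minimal end-to-end sketch using Apache Commons Math 3. The correlation value and the gamma/normal marginals are hypothetical stand-ins for the Simple Distribution methods referenced above; this illustrates the technique, not the HEC-HMS code itself.

```java
import org.apache.commons.math3.distribution.GammaDistribution;
import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.CholeskyDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.random.Well19937c;

public class CopulaPipelineSketch {
    public static void main(String[] args) {
        // Hypothetical 2x2 correlation matrix (rho = 0.6)
        RealMatrix sigma = new Array2DRowRealMatrix(new double[][] {
                {1.0, 0.6},
                {0.6, 1.0}
        });

        // Step 1: Cholesky decomposition, Sigma = L * L^T
        RealMatrix l = new CholeskyDecomposition(sigma).getL();

        // Step 2: draw independent standard normals Z
        Well19937c rng = new Well19937c(1234);
        double[] z = {rng.nextGaussian(), rng.nextGaussian()};

        // Step 3: correlate, X = L * Z
        double[] x = l.operate(z);

        // Step 4: uniform margins, U_i = Phi(X_i)
        NormalDistribution phi = new NormalDistribution(0.0, 1.0);
        double[] u = new double[x.length];
        for (int i = 0; i < u.length; i++) {
            u[i] = phi.cumulativeProbability(x[i]);
        }

        // Step 5: inverse-CDF transform to the desired marginals
        // (hypothetical gamma and normal marginals, standing in for the
        //  Simple Distribution methods referenced in the text)
        double first = new GammaDistribution(2.0, 10.0).inverseCumulativeProbability(u[0]);
        double second = new NormalDistribution(0.3, 0.05).inverseCumulativeProbability(u[1]);
        System.out.printf("sampled values: %.3f, %.3f%n", first, second);
    }
}
```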
Positive Definite Matrix
The Gaussian and Student's t copulas use a correlation or covariance matrix to define the relationship between variables. When performing the Cholesky decomposition, the matrix must be positive definite. A positive definite matrix is a symmetric matrix whose eigenvalues are all positive. If the matrix were not positive definite, it would imply impossible relationships (such as negative variances or inconsistent correlation patterns), and the joint distribution could not be constructed. Positive definiteness guarantees that the Cholesky decomposition exists, which is the basis for transforming independent random variables into correlated ones. Without this property, simulation and likelihood evaluation would fail. HEC-HMS forces all matrices to be symmetric by only allowing users to enter correlation values on the lower triangle. The software has two methods for correcting matrices that are not positive definite: the Jitter method and the Nearest Positive Definite method.
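As a concrete illustration of the eigenvalue criterion, a check could look like the following sketch (using Apache Commons Math 3; the helper name is illustrative, and this is not the HEC-HMS check itself).

```java
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public class PositiveDefiniteCheck {

    /** A symmetric matrix is positive definite when every eigenvalue is strictly positive. */
    static boolean isPositiveDefinite(RealMatrix symmetric) {
        for (double eigenvalue : new EigenDecomposition(symmetric).getRealEigenvalues()) {
            if (eigenvalue <= 0.0) {
                return false;
            }
        }
        return true;
    }
}
```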
Jitter
The Jitter method adds a small value \epsilon to the diagonal entries of the correlation/covariance matrix, making it strictly positive definite:
| R^*=R+\epsilon I |
where I is the identity matrix. The adjusted matrix R^* is then re-normalized so that the diagonal entries are set back to 1:
| d_i=\sqrt{R_{ii}^*},\quad i=1,...,n |
| D=diag(d_1,d_2,...,d_n) |
| R_{corr}=D^{-1}R^*D^{-1} |
The adjusted matrix is then checked for positive definiteness. If it passes, the Cholesky decomposition is performed on the adjusted matrix. If not, another small value is added to the adjusted matrix and the process repeats until it is positive definite.
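A minimal sketch of this loop, assuming Apache Commons Math 3, might look as follows; the epsilon value, iteration cap, and helper name jitter are illustrative rather than taken from the HEC-HMS implementation.

```java
import org.apache.commons.math3.linear.CholeskyDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class JitterSketch {

    /** Add epsilon to the diagonal and re-normalize until Cholesky succeeds. */
    static RealMatrix jitter(RealMatrix r, double epsilon, int maxTries) {
        RealMatrix adjusted = r.copy();
        int n = adjusted.getRowDimension();
        for (int attempt = 0; attempt < maxTries; attempt++) {
            try {
                new CholeskyDecomposition(adjusted); // throws if not positive definite
                return adjusted;                     // success: matrix is usable
            } catch (RuntimeException notPositiveDefinite) {
                // R* = R + epsilon * I
                RealMatrix rStar = adjusted.add(
                        MatrixUtils.createRealIdentityMatrix(n).scalarMultiply(epsilon));
                // Re-normalize: R_corr = D^{-1} R* D^{-1}, so diagonals return to 1
                double[] d = new double[n];
                for (int i = 0; i < n; i++) {
                    d[i] = Math.sqrt(rStar.getEntry(i, i));
                }
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < n; j++) {
                        adjusted.setEntry(i, j, rStar.getEntry(i, j) / (d[i] * d[j]));
                    }
                }
            }
        }
        throw new IllegalStateException("Matrix could not be made positive definite");
    }
}
```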
Nearest Positive Definite
Nearest Positive Definite is a method that finds the closest valid positive definite matrix. The software uses the algorithm developed by Nicholas J. Higham (Higham, 1988). His method is described in more detail here, and the implementation of his method is described here.
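The heart of a Higham-style repair is a projection that raises negative eigenvalues to a small positive floor. The sketch below shows only that projection step, using Apache Commons Math 3; it is a simplification for illustration, not the full Higham algorithm or the HEC-HMS implementation. For a correlation matrix, the diagonal would afterwards be re-normalized to 1, as in the Jitter section.

```java
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class NearestPdSketch {

    /** Project a symmetric matrix toward the positive definite cone by clipping eigenvalues. */
    static RealMatrix clipEigenvalues(RealMatrix a, double floor) {
        // Symmetrize first: B = (A + A^T) / 2
        RealMatrix b = a.add(a.transpose()).scalarMultiply(0.5);
        EigenDecomposition eig = new EigenDecomposition(b);
        double[] values = eig.getRealEigenvalues();
        double[] clipped = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            clipped[i] = Math.max(values[i], floor); // raise negative eigenvalues to the floor
        }
        // Reconstruct: V * D_clipped * V^T
        return eig.getV()
                  .multiply(MatrixUtils.createRealDiagonalMatrix(clipped))
                  .multiply(eig.getVT());
    }
}
```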