R script: correlation.R

Mlxtran code: model/correlation1.txt ; model/correlation2.txt


$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}}$$

1 Introduction

Correlations can be introduced between random variables that are normally distributed.

Consider two scalar parameters \(\psi_{i1}\) and \(\psi_{i2}\) and two transformations \(h_1\) and \(h_2\) such that \(h_1(\psi_{i1})\) and \(h_2(\psi_{i2})\) are normally distributed:

\[ \begin{align} h_1(\psi_{i1}) &\sim {\cal N}(\mu_{i1} , \omega_1^2) \\ h_2(\psi_{i2}) &\sim {\cal N}(\mu_{i2} , \omega_2^2) \end{align} \] It is then possible to define the (linear) correlation between \(h_1(\psi_{i1})\) and \(h_2(\psi_{i2})\).

If we use instead the following representation for \(\psi_{i1}\) and \(\psi_{i2}\) \[ \begin{align} h_1(\psi_{i1}) &= \mu_{i1} + \eta_{i1} \\ h_2(\psi_{i2}) &= \mu_{i2} + \eta_{i2} \end{align} \] where \(\eta_{i1} \sim {\cal N}(0 , \omega_1^2)\) and \(\eta_{i2} \sim {\cal N}(0 , \omega_2^2)\), then, defining the correlation between \(h_1(\psi_{i1})\) and \(h_2(\psi_{i2})\) is equivalent to defining the correlation between the so-called random effects \(\eta_{i1}\) and \(\eta_{i2}\).
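As a minimal sketch (not part of the original script), correlated random effects can be drawn directly in R with MASS::mvrnorm; the values of \(\omega_1\), \(\omega_2\) and \(r\) below are illustrative, not taken from the examples:

```r
# Draw correlated random effects (eta_i1, eta_i2); omega1, omega2 and r
# are illustrative values, not taken from the examples below.
library(MASS)
set.seed(123)
omega1 <- 0.3; omega2 <- 0.5; r <- 0.7
# covariance matrix built from the standard deviations and the correlation
Sigma <- matrix(c(omega1^2,        r*omega1*omega2,
                  r*omega1*omega2, omega2^2), nrow=2)
eta <- mvrnorm(n=1000, mu=c(0, 0), Sigma=Sigma)
cor(eta[,1], eta[,2])   # close to r for a large sample
```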

Remark 1: any correlation which is not defined is assumed to be 0.

Remark 2: with simulx 1.2, it is only possible to define correlations in sections POPULATION, COVARIATE and INDIVIDUAL. It is not possible to define correlations between residual errors in section LONGITUDINAL.


2 Examples

2.1 Example 1

We consider the function of time \[ f(t ; a_i,b_i,c_i) = a_i + b_it + c_i t^2 \] where the three individual parameters \(a_i\), \(b_i\) and \(c_i\) are, respectively, log-normally, normally and logit-normally distributed:

\[ \begin{align} \log(a_i) &\sim {\cal N}(\log(a_{\rm pop}) , \omega_a^2) \\ b_i &\sim {\cal N}(b_{\rm pop} , \omega_b^2) \\ \logit(c_i) &\sim {\cal N}(\logit(c_{\rm pop}) , \omega_c^2) \end{align} \]

The correlation matrix of the Gaussian vector \((\log(a_i), b_i, \logit(c_i))\) is \[ R = \left( \begin{array}{ccc} 1 & r_{ab} & r_{ac} \\ r_{ab} & 1 & r_{bc} \\ r_{ac} & r_{bc} & 1 \end{array} \right) \]
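As an aside (not part of the original script), the correlation matrix \(R\) and the standard deviations combine into the covariance matrix \(\Omega = D\,R\,D\) of the Gaussian vector, where \(D={\rm diag}(\omega_a,\omega_b,\omega_c)\). A small R check, using the parameter values of the simulation below:

```r
# Covariance matrix Omega = D R D of (log(a_i), b_i, logit(c_i));
# correlations and standard deviations are those used in the simulation.
R <- matrix(c( 1.0, -0.6, -0.4,
              -0.6,  1.0,  0.7,
              -0.4,  0.7,  1.0), nrow=3, byrow=TRUE)
D <- diag(c(0.3, 0.5, 0.4))     # diag(omega_a, omega_b, omega_c)
Omega <- D %*% R %*% D
all(eigen(Omega)$values > 0)    # R must be positive definite: TRUE
```

A valid set of correlations must make \(R\) positive definite; the eigenvalue check above is a quick way to verify this before simulating.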

This model is implemented in the file correlation1.txt:

[INDIVIDUAL]
input = {a_pop, o_a, b_pop, o_b, c_pop, o_c, r_ab, r_ac, r_bc}
DEFINITION:
a = {distribution=lognormal,   reference=a_pop, sd=o_a}
b = {distribution=normal,      reference=b_pop, sd=o_b}
c = {distribution=logitnormal, reference=c_pop, sd=o_c}
correlation = {r(a,b)=r_ab, r(a,c)=r_ac, r(b,c)=r_bc}

[LONGITUDINAL]
input={a, b, c}
EQUATION:
f = a + b*t + c*t^2

Let us use simulx to simulate 100 vectors of individual parameters \(\psi_i=(a_i,b_i,c_i)\):

op <- list(name= c('a','b','c'))
of <- list(name='f', time=seq(0,4, by=0.1))

p <- c(a_pop=10,   b_pop=-5,   c_pop=0.8, 
       o_a  =0.3,  o_b  =0.5,  o_c  =0.4, 
       r_ab =-0.6, r_ac =-0.4, r_bc =0.7)

g <- list(size=100, level='individual')

res <- simulx(model     = "model/correlation1.txt",
              parameter = p, 
              group     = g, 
              output    = list(op,of))

Maximum likelihood estimates of the population parameters \(a_{\rm pop}\), \(b_{\rm pop}\) and \(c_{\rm pop}\) are the empirical medians of the simulated individual parameters:

p=res$parameter[,2:4]
print(sapply(p,median))
##          a          b          c 
##  9.8457754 -4.9920341  0.8096824

Estimates of the standard deviations \(\omega_a\), \(\omega_b\) and \(\omega_c\) are the empirical standard deviations of the transformed parameters \(z_i=(\log(a_i), b_i, \logit(c_i))\):

z=p
z[,"a"]=log(p[,"a"])
z[,"c"]=log(p[,"c"]/(1-p[,"c"]))
print(sapply(z,sd))
##         a         b         c 
## 0.2761735 0.4964489 0.4153632

An estimate of the correlation matrix \(R\) is the empirical correlation matrix of the transformed parameters \(z_i\):

print(cor(z))
##            a          b          c
## a  1.0000000 -0.6127309 -0.3631155
## b -0.6127309  1.0000000  0.7325058
## c -0.3631155  0.7325058  1.0000000


2.2 Example 2

We consider in this example a function of time \[ f(t ; a_i, b_i) = a_i + b_i t \] where \(\psi_i=(a_i,b_i)\) is defined by the following hierarchical model:

  1. Conditionally on \((w_i,h_i)\), \(a_i\) and \(b_i\) are normally distributed:

\[ \begin{align} a_i &= a_{\rm pop}\left(\frac{w_i}{w_{\rm pop}}\right)\left(\frac{h_i}{h_{\rm pop}}\right)^{0.5} + \eta_{a,i} \\ b_i &= b_{\rm pop} + \eta_{b,i} \end{align} \] where \(\eta_{a,i} \sim {\cal N}(0 , \omega_a^2)\), \(\eta_{b,i} \sim {\cal N}(0 , \omega_b^2)\) and where \({\rm corr}(\eta_{a,i},\eta_{b,i})=r_{ab}\).

  2. \(w_i\) and \(h_i\) are normally distributed:

\[ \begin{align} w_i &\sim {\cal N}(w_{\rm pop} , \omega_w^2) \\ h_i &\sim {\cal N}(h_{\rm pop} , \omega_h^2) \end{align} \] and \({\rm corr}(w_{i},h_{i})=r_{wh}\).
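This hierarchical model can also be sketched in plain R (without simulx) using MASS::mvrnorm; the parameter values match those of the example, and a larger sample size is used so that the empirical correlations are stable:

```r
# Plain-R sketch of the hierarchical model above (assumed equivalent to the
# Mlxtran implementation); parameter values are those of the example.
library(MASS)
set.seed(1)
n <- 1000
a_pop <- 10; b_pop <- -5; w_pop <- 70; h_pop <- 170
# correlated covariates (w_i, h_i), corr = r_wh = 0.8
S_wh <- matrix(c(10^2,      0.8*10*10,
                 0.8*10*10, 10^2), nrow=2)
wh <- mvrnorm(n, mu=c(w_pop, h_pop), Sigma=S_wh)
# correlated random effects (eta_a, eta_b), corr = r_ab = -0.6
S_ab <- matrix(c(0.3^2,         -0.6*0.3*0.5,
                 -0.6*0.3*0.5,   0.5^2), nrow=2)
eta <- mvrnorm(n, mu=c(0, 0), Sigma=S_ab)
a <- a_pop*(wh[,1]/w_pop)*(wh[,2]/h_pop)^0.5 + eta[,1]
b <- b_pop + eta[,2]
cor(eta[,1], eta[,2])   # close to r_ab = -0.6
cor(a, b)               # marginal correlation, much weaker than r_ab
```

Note that the marginal correlation between \(a_i\) and \(b_i\) is diluted by the covariate-driven variability of \(a_i\); this is the point developed at the end of this example.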

This model is implemented in the file correlation2.txt:

[COVARIATE]
input = {w_pop, o_w, h_pop, o_h, r_wh}
DEFINITION:
w = {distribution=normal, mean=w_pop, sd=o_w}
h = {distribution=normal, mean=h_pop, sd=o_h}
correlation = {r(w,h)=r_wh}

[INDIVIDUAL]
input = {w_pop, h_pop, w, h, a_pop, o_a, b_pop, o_b, r_ab}
EQUATION:
a_pred = a_pop*(w/w_pop)*(h/h_pop)^0.5
DEFINITION:
a = {distribution=normal, prediction=a_pred, sd=o_a}
b = {distribution=normal, prediction=b_pop,  sd=o_b}
correlation = {r(a,b)=r_ab}

[LONGITUDINAL]
input={a, b}
EQUATION:
f = a + b*t

Let us use simulx to simulate 100 vectors of individual covariates \((w_i,h_i)\) and one vector of individual parameters \((a_i,b_i)\) per individual,

op <- list(name= c('a_pred','a','b','w','h'))
of <- list(name='f', time=seq(0,4, by=0.1))

p <- c(a_pop=10,   b_pop=-5,   w_pop=70, h_pop=170,
       o_a  =0.3,  o_b  =0.5,  o_w  =10, o_h  =10, 
       r_ab =-0.6, r_wh =0.8)

g <- list(size=100, level='covariate')

res <- simulx(model     = "model/correlation2.txt",
              parameter = p, 
              group     = g, 
              output    = list(op,of))

and display the empirical correlation matrix of the vector \((a_i, b_i, w_i, h_i)\):

p=res$parameter[,2:6]
print(cor(p[,2:5]))  
##             a           b          w           h
## a  1.00000000 -0.06779814 0.97961318  0.84737877
## b -0.06779814  1.00000000 0.04196511 -0.02714122
## w  0.97961318  0.04196511 1.00000000  0.80116869
## h  0.84737877 -0.02714122 0.80116869  1.00000000

Here, the submatrix formed by the first two rows and first two columns of this matrix is the empirical correlation matrix of \((a_i,b_i)\); it is not the correlation matrix of \((\eta_{a,i},\eta_{b,i})\). The fact that \(a_i\) is a function of \(w_i\) and \(h_i\) also introduces some correlation between \(a_i\) and these covariates.

If we want to estimate the correlation \(r_{ab}\), we need to consider the conditional distribution of \((a_i,b_i)\) given \((w_i,h_i)\), not the marginal distribution of \((a_i,b_i)\). We therefore use \(\tilde{a}_i\) instead of \(a_i\), where \[ \tilde{a}_i = a_i - a_{\rm pop}\left(\frac{w_i}{w_{\rm pop}}\right)\left(\frac{h_i}{h_{\rm pop}}\right)^{0.5} \]

p[,"a"]=p[,"a"] - p[,"a_pred"]
print(cor(p[,2:5]))  #conditional correlation between a & b (given w & h)
##              a           b            w            h
## a  1.000000000 -0.55780457 -0.009337631 -0.006928539
## b -0.557804567  1.00000000  0.041965113 -0.027141221
## w -0.009337631  0.04196511  1.000000000  0.801168689
## h -0.006928539 -0.02714122  0.801168689  1.000000000