R script: correlation.R

Mlxtran code: model/correlation1.txt ; model/correlation2.txt


# 1 Introduction

Correlation can be introduced between random variables which are normally distributed.

Consider two scalar parameters $$\psi_{i1}$$ and $$\psi_{i2}$$ and two transformations $$h_1$$ and $$h_2$$ such that $$h_1(\psi_{i1})$$ and $$h_2(\psi_{i2})$$ are normally distributed:

\begin{align} h_1(\psi_{i1}) &\sim {\cal N}(\mu_{i1} , \omega_1^2) \\ h_2(\psi_{i2}) &\sim {\cal N}(\mu_{i2} , \omega_2^2) \end{align} It is then possible to define the (linear) correlation between $$h_1(\psi_{i1})$$ and $$h_2(\psi_{i2})$$.

If we use instead the following representation for $$\psi_{i1}$$ and $$\psi_{i2}$$ \begin{align} h_1(\psi_{i1}) &= \mu_{i1} + \eta_{i1} \\ h_2(\psi_{i2}) &= \mu_{i2} + \eta_{i2} \end{align} where $$\eta_{i1} \sim {\cal N}(0 , \omega_1^2)$$ and $$\eta_{i2} \sim {\cal N}(0 , \omega_2^2)$$, then, defining the correlation between $$h_1(\psi_{i1})$$ and $$h_2(\psi_{i2})$$ is equivalent to defining the correlation between the so-called random effects $$\eta_{i1}$$ and $$\eta_{i2}$$.

Remark 1: any correlation which is not defined is assumed to be 0.

Remark 2: with simulx 1.2, it is only possible to define correlations in sections POPULATION, COVARIATE and INDIVIDUAL. It is not possible to define correlations between residual errors in section LONGITUDINAL.

# 2 Examples

## 2.1 Example 1

We consider the function of time $f(t ; a_i,b_i,c_i) = a_i + b_it + c_i t^2$ where the three individual parameters $$a_i$$ $$b_i$$ and $$c_i$$ are, respectively, log-normally, normally and logit-normally distributed:

\begin{align} \log(a_i) &\sim {\cal N}(\log(a_{\rm pop}) , \omega_a^2) \\ b_i &\sim {\cal N}(b_{\rm pop} , \omega_b^2) \\ \logit(c_i) &\sim {\cal N}(\logit(c_{\rm pop}) , \omega_c^2) \end{align}

The correlation matrix of the Gaussian vector $$(\log(a_i), b_i, \logit(c_i))$$ is $R = \left( \begin{array}{ccc} 1 & r_{ab} & r_{ac} \\ r_{ab} & 1 & r_{bc} \\ r_{ac} & r_{bc} & 1 \end{array} \right)$

This model is implemented in the file correlation1.txt:

[INDIVIDUAL]
input = {a_pop, o_a, b_pop, o_b, c_pop, o_c, r_ab, r_ac, r_bc}
DEFINITION:
a = {distribution=lognormal,   reference=a_pop, sd=o_a}
b = {distribution=normal,      reference=b_pop, sd=o_b}
c = {distribution=logitnormal, reference=c_pop, sd=o_c}
correlation = {r(a,b)=r_ab, r(a,c)=r_ac, r(b,c)=r_bc}

[LONGITUDINAL]
input={a, b, c}
EQUATION:
f = a + b*t + c*t^2


Let us use simulx for simulating 100 vectors of individual parameters $$\psi_i(a_i,b_i,c_i)$$

op <- list(name= c('a','b','c'))
of <- list(name='f', time=seq(0,4, by=0.1))

p <- c(a_pop=10,   b_pop=-5,   c_pop=0.8,
o_a  =0.3,  o_b  =0.5,  o_c  =0.4,
r_ab =-0.6, r_ac =-0.4, r_bc =0.7)

g <- list(size=100, level='individual')

res <- simulx(model     = "model/correlation1.txt",
parameter = p,
group     = g,
output    = list(op,of))

Maximum likelihood estimates of the population parameters $$a_{\rm pop}$$, b_{rm pop}$and $$c_{\rm pop}$$ are the empirical medians of the simulated individual parameters: p=res$parameter[,2:4]
print(sapply(p,median))
##          a          b          c
##  9.8457754 -4.9920341  0.8096824

Estimates of the standard deviations of $$\omega_a$$, $$\omega_b$$ and $$\omega_c$$ are the empirical standard deviations of the transformed parameters $$z_i=(\log(a_i), b_i, \logit(c_i))$$:

z=p
z[,"a"]=log(p[,"a"])
z[,"c"]=log(p[,"c"]/(1-p[,"c"]))
print(sapply(z,sd))
##         a         b         c
## 0.2761735 0.4964489 0.4153632

Estimate of the correlation matrix $$R$$ is the empirical correlation of the transformed parameters $$z_i$$:

print(cor(z))
##            a          b          c
## a  1.0000000 -0.6127309 -0.3631155
## b -0.6127309  1.0000000  0.7325058
## c -0.3631155  0.7325058  1.0000000

## 2.2 Example 2

We consider in this example a function of time $f(t ; a_i, b_i) = a_i + b_i t$ where $$\psi_i=(a_i,b_i)$$ is defined by the following hierarchical model:

1. Conditionally to $$(w_i,h_i)$$, $$a_i$$ and $$b_i$$ are normally distributed:

\begin{align} a_i &= a_{\rm pop}\left(\frac{w_i}{w_{\rm pop}}\right)\left(\frac{h_i}{h_{\rm pop}}\right)^{0.5} + \eta_{a,i} \\ b_i &\ = b_{\rm pop} + \eta_{b,i} \end{align} where $$\eta_{a,i} \sim {\cal N}(0 , \omega_a^2)$$, $$\eta_{b,i} \sim {\cal N}(0 , \omega_b^2)$$ and where $${\rm corr}(\eta_{a,i},\eta_{b,i})=r_{ab}$$.

1. $$w_i$$ and $$h_i$$ are normally distributed:

\begin{align} w_i &\sim {\cal N}(w_{\rm pop} , \omega_w^2) \\ h_i &\sim {\cal N}(h_{\rm pop} , \omega_h^2) \end{align} and $${\rm corr}(w_{i},h_{i})=r_{wh}$$.

This model is implemented in the file correlation2.txt:

[COVARIATE]
input = {w_pop, o_w, h_pop, o_h, r_wh}
DEFINITION:
w = {distribution=normal, mean=w_pop, sd=o_w}
h = {distribution=normal, mean=h_pop, sd=o_h}
correlation = {r(w,h)=r_wh}

[INDIVIDUAL]
input = {w_pop, h_pop, w, h, a_pop, o_a, b_pop, o_b, r_ab}
EQUATION:
a_pred = a_pop*(w/w_pop)*(h/h_pop)^0.5
DEFINITION:
a = {distribution=normal, prediction=a_pred, sd=o_a}
b = {distribution=normal, prediction=b_pop,  sd=o_b}
correlation = {r(a,b)=r_ab}

[LONGITUDINAL]
input={a, b}
EQUATION:
f = a + b*t


Let us use simulx for simulating 100 vectors of individual covariates $$(w_i,h_i)$$ and one vector of individual parameters $$(a_i,b_i)$$ per individual,

op <- list(name= c('a_pred','a','b','w','h'))
of <- list(name='f', time=seq(0,4, by=0.1))

p <- c(a_pop=10,   b_pop=-5,   w_pop=70, h_pop=170,
o_a  =0.3,  o_b  =0.5,  o_w  =10, o_h  =10,
r_ab =-0.6, r_wh =0.8)

g <- list(size=100, level='covariate')

res <- simulx(model     = "model/correlation2.txt",
parameter = p,
group     = g,
output    = list(op,of))

and display the empirical correlation matrix of the vector $$(a_i, b_i, w_i, h_i)$$:

p=res\$parameter[,2:6]
print(cor(p[,2:5]))  
##             a           b          w           h
## a  1.00000000 -0.06779814 0.97961318  0.84737877
## b -0.06779814  1.00000000 0.04196511 -0.02714122
## w  0.97961318  0.04196511 1.00000000  0.80116869
## h  0.84737877 -0.02714122 0.80116869  1.00000000

Here, the submatrix composed by the first two rows and first two columns of this matrix is the empirical correlation matrix of $$(a_i,b_i)$$, it is not the correlation matrix of $$(\eta_{a,i},\eta_{b,i})$$. The fact that $$a_i$$ is function of $$w_i$$ and $$h_i$$ also introduces some correlation between these variables.

If we want to estimate the correlation $$r_{ab}$$ we need to consider the conditional distribution of $$(a_i,b_i)$$ given $$(w_i,h_i)$$ and not the marginal distribution of $$(a_i,b_i)$$. We therefore need to use $$\tilde{a}_i$$ instead of $$a_i$$, where $\tilde{a}_i = a_i - a_{\rm pop}\left(\frac{w_i}{w_{\rm pop}}\right)\left(\frac{h_i}{h_{\rm pop}}\right)^{0.5}$

p[,"a"]=p[,"a"] - p[,"a_pred"]
print(cor(p[,2:5]))  #conditional correlation between a & b (given w & h)
##              a           b            w            h
## a  1.000000000 -0.55780457 -0.009337631 -0.006928539
## b -0.557804567  1.00000000  0.041965113 -0.027141221
## w -0.009337631  0.04196511  1.000000000  0.801168689
## h -0.006928539 -0.02714122  0.801168689  1.000000000