R script: correlation.R
Mlxtran code: model/correlation1.txt ; model/correlation2.txt
Correlation can be introduced between random variables which are normally distributed.
Consider two scalar parameters \(\psi_{i1}\) and \(\psi_{i2}\) and two transformations \(h_1\) and \(h_2\) such that \(h_1(\psi_{i1})\) and \(h_2(\psi_{i2})\) are normally distributed:
\[ \begin{align} h_1(\psi_{i1}) &\sim {\cal N}(\mu_{i1} , \omega_1^2) \\ h_2(\psi_{i2}) &\sim {\cal N}(\mu_{i2} , \omega_2^2) \end{align} \] It is then possible to define the (linear) correlation between \(h_1(\psi_{i1})\) and \(h_2(\psi_{i2})\).
If we use instead the following representation for \(\psi_{i1}\) and \(\psi_{i2}\) \[ \begin{align} h_1(\psi_{i1}) &= \mu_{i1} + \eta_{i1} \\ h_2(\psi_{i2}) &= \mu_{i2} + \eta_{i2} \end{align} \] where \(\eta_{i1} \sim {\cal N}(0 , \omega_1^2)\) and \(\eta_{i2} \sim {\cal N}(0 , \omega_2^2)\), then, defining the correlation between \(h_1(\psi_{i1})\) and \(h_2(\psi_{i2})\) is equivalent to defining the correlation between the so-called random effects \(\eta_{i1}\) and \(\eta_{i2}\).
Remark 1: any correlation which is not defined is assumed to be 0.
Remark 2: with simulx 1.2, it is only possible to define correlations in sections POPULATION
, COVARIATE
and INDIVIDUAL
. It is not possible to define correlations between residual errors in section LONGITUDINAL
.
We consider the function of time \[ f(t ; a_i,b_i,c_i) = a_i + b_it + c_i t^2 \] where the three individual parameters \(a_i\) \(b_i\) and \(c_i\) are, respectively, log-normally, normally and logit-normally distributed:
\[ \begin{align} \log(a_i) &\sim {\cal N}(\log(a_{\rm pop}) , \omega_a^2) \\ b_i &\sim {\cal N}(b_{\rm pop} , \omega_b^2) \\ \logit(c_i) &\sim {\cal N}(\logit(c_{\rm pop}) , \omega_c^2) \end{align} \]
The correlation matrix of the Gaussian vector \((\log(a_i), b_i, \logit(c_i))\) is \[ R = \left( \begin{array}{ccc} 1 & r_{ab} & r_{ac} \\ r_{ab} & 1 & r_{bc} \\ r_{ac} & r_{bc} & 1 \end{array} \right) \]
This model is implemented in the file correlation1.txt
:
[INDIVIDUAL] input = {a_pop, o_a, b_pop, o_b, c_pop, o_c, r_ab, r_ac, r_bc} DEFINITION: a = {distribution=lognormal, reference=a_pop, sd=o_a} b = {distribution=normal, reference=b_pop, sd=o_b} c = {distribution=logitnormal, reference=c_pop, sd=o_c} correlation = {r(a,b)=r_ab, r(a,c)=r_ac, r(b,c)=r_bc} [LONGITUDINAL] input={a, b, c} EQUATION: f = a + b*t + c*t^2
Let us use simulx for simulating 100 vectors of individual parameters \(\psi_i(a_i,b_i,c_i)\)
op <- list(name= c('a','b','c'))
of <- list(name='f', time=seq(0,4, by=0.1))
p <- c(a_pop=10, b_pop=-5, c_pop=0.8,
o_a =0.3, o_b =0.5, o_c =0.4,
r_ab =-0.6, r_ac =-0.4, r_bc =0.7)
g <- list(size=100, level='individual')
res <- simulx(model = "model/correlation1.txt",
parameter = p,
group = g,
output = list(op,of))
Maximum likelihood estimates of the population parameters \(a_{\rm pop}\), b_{rm pop}$ and \(c_{\rm pop}\) are the empirical medians of the simulated individual parameters:
p=res$parameter[,2:4]
print(sapply(p,median))
## a b c
## 9.8457754 -4.9920341 0.8096824
Estimates of the standard deviations of \(\omega_a\), \(\omega_b\) and \(\omega_c\) are the empirical standard deviations of the transformed parameters \(z_i=(\log(a_i), b_i, \logit(c_i))\):
z=p
z[,"a"]=log(p[,"a"])
z[,"c"]=log(p[,"c"]/(1-p[,"c"]))
print(sapply(z,sd))
## a b c
## 0.2761735 0.4964489 0.4153632
Estimate of the correlation matrix \(R\) is the empirical correlation of the transformed parameters \(z_i\):
print(cor(z))
## a b c
## a 1.0000000 -0.6127309 -0.3631155
## b -0.6127309 1.0000000 0.7325058
## c -0.3631155 0.7325058 1.0000000
We consider in this example a function of time \[ f(t ; a_i, b_i) = a_i + b_i t \] where \(\psi_i=(a_i,b_i)\) is defined by the following hierarchical model:
\[ \begin{align} a_i &= a_{\rm pop}\left(\frac{w_i}{w_{\rm pop}}\right)\left(\frac{h_i}{h_{\rm pop}}\right)^{0.5} + \eta_{a,i} \\ b_i &\ = b_{\rm pop} + \eta_{b,i} \end{align} \] where \(\eta_{a,i} \sim {\cal N}(0 , \omega_a^2)\), \(\eta_{b,i} \sim {\cal N}(0 , \omega_b^2)\) and where \({\rm corr}(\eta_{a,i},\eta_{b,i})=r_{ab}\).
\[ \begin{align} w_i &\sim {\cal N}(w_{\rm pop} , \omega_w^2) \\ h_i &\sim {\cal N}(h_{\rm pop} , \omega_h^2) \end{align} \] and \({\rm corr}(w_{i},h_{i})=r_{wh}\).
This model is implemented in the file correlation2.txt
:
[COVARIATE] input = {w_pop, o_w, h_pop, o_h, r_wh} DEFINITION: w = {distribution=normal, mean=w_pop, sd=o_w} h = {distribution=normal, mean=h_pop, sd=o_h} correlation = {r(w,h)=r_wh} [INDIVIDUAL] input = {w_pop, h_pop, w, h, a_pop, o_a, b_pop, o_b, r_ab} EQUATION: a_pred = a_pop*(w/w_pop)*(h/h_pop)^0.5 DEFINITION: a = {distribution=normal, prediction=a_pred, sd=o_a} b = {distribution=normal, prediction=b_pop, sd=o_b} correlation = {r(a,b)=r_ab} [LONGITUDINAL] input={a, b} EQUATION: f = a + b*t
Let us use simulx for simulating 100 vectors of individual covariates \((w_i,h_i)\) and one vector of individual parameters \((a_i,b_i)\) per individual,
op <- list(name= c('a_pred','a','b','w','h'))
of <- list(name='f', time=seq(0,4, by=0.1))
p <- c(a_pop=10, b_pop=-5, w_pop=70, h_pop=170,
o_a =0.3, o_b =0.5, o_w =10, o_h =10,
r_ab =-0.6, r_wh =0.8)
g <- list(size=100, level='covariate')
res <- simulx(model = "model/correlation2.txt",
parameter = p,
group = g,
output = list(op,of))
and display the empirical correlation matrix of the vector \((a_i, b_i, w_i, h_i)\):
p=res$parameter[,2:6]
print(cor(p[,2:5]))
## a b w h
## a 1.00000000 -0.06779814 0.97961318 0.84737877
## b -0.06779814 1.00000000 0.04196511 -0.02714122
## w 0.97961318 0.04196511 1.00000000 0.80116869
## h 0.84737877 -0.02714122 0.80116869 1.00000000
Here, the submatrix composed by the first two rows and first two columns of this matrix is the empirical correlation matrix of \((a_i,b_i)\), it is not the correlation matrix of \((\eta_{a,i},\eta_{b,i})\). The fact that \(a_i\) is function of \(w_i\) and \(h_i\) also introduces some correlation between these variables.
If we want to estimate the correlation \(r_{ab}\) we need to consider the conditional distribution of \((a_i,b_i)\) given \((w_i,h_i)\) and not the marginal distribution of \((a_i,b_i)\). We therefore need to use \(\tilde{a}_i\) instead of \(a_i\), where \[ \tilde{a}_i = a_i - a_{\rm pop}\left(\frac{w_i}{w_{\rm pop}}\right)\left(\frac{h_i}{h_{\rm pop}}\right)^{0.5} \]
p[,"a"]=p[,"a"] - p[,"a_pred"]
print(cor(p[,2:5])) #conditional correlation between a & b (given w & h)
## a b w h
## a 1.000000000 -0.55780457 -0.009337631 -0.006928539
## b -0.557804567 1.00000000 0.041965113 -0.027141221
## w -0.009337631 0.04196511 1.000000000 0.801168689
## h -0.006928539 -0.02714122 0.801168689 1.000000000