R script: count.R
Mlxtran code: model/count.txt
Longitudinal count data is a special type of longitudinal data that can take only nonnegative integer values \(\{0, 1, 2,\ldots\}\) that come from counting something, e.g., the number of seizures, hemorrhages or lesions in each given time period.
Considering the observations \((y_{j},\, 1 \leq j \leq n)\) as a sequence of conditionally independent random variables, the model is completely defined by the probability mass functions \[\prob{y_{j}=k} \quad \text{for} \ k=0, 1, 2,\ldots \ \text{and} \ 1 \leq j \leq n\]
The distribution of count data can be defined in the block DEFINITION
of the Section [LONGITUDINAL]
using the probability mass functions.
In this example, the Poisson distribution is used for defining the distribution of \(y_{j}\): \[ y_j \sim \text{Poisson}(\lambda_j) \] where the Poisson intensity \(\lambda_j\) is function of time: \[ \lambda_j = a + b \, t_j \]
This model is first implemented using the Mlxtran library of distributions:
poissonModela <- inlineModel("
[LONGITUDINAL]
input = {a,b}
EQUATION:
lambda = a +b*t
DEFINITION:
y = {distribution=poisson, lambda=lambda}
")
We will use simulx for computing \(\lambda\) every hour between 0h and 100h and for generating observations \((y_j)\) every 2 hours between 0h and 100h, with \(a=10\) and \(b=0.5\).
p <- c(a=10, b=0.5)
l <- list(name='lambda', time=seq(0, 100, by=1))
y <- list(name='y', time=seq(0, 100, by=4))
res1a <- simulx(model=poissonModela,
parameter=p,
output=list(l, y))
we can now plot the Poisson intensity and the simulated data:
print(ggplot(aes(x=time, y=lambda), data=res1a$lambda) + geom_line(size=1) +
geom_point(aes(x=time, y=y), data=res1a$y, color="red") + ylab("") )
Remark: instead of using the library of distributions it is possible to define the probability mass function in the Mlxtran code:
poissonModelb <- inlineModel("
[LONGITUDINAL]
input = {a,b}
EQUATION:
lambda=a +b*t
DEFINITION:
y = {type=count, P(y=k)=exp(-lambda)*(lambda^k)/factorial(k)}
")
Both implementations are equivalent.
Suppose there is a sequence of independent Bernoulli trials, each trial having two potential outcomes called “success” and “failure”. In each trial the probability of success is \(p\) and of failure is \((1-p)\). We are observing this sequence until a predefined number \(n\) of failures has occurred. Then the random number of successes, \(y\), we have seen will have the negative binomial distribution.
For \(k=0,1,2,\ldots\),
\[\prob{y=k} = \left( \begin{array}{c} k+n-1 \\ k \end{array} \right)p^k (1-p)^n\]
Let us use simulx
for drawing a random sample of size \(N=10000\) from a negative binomial distribution:
negbinModela <- inlineModel("
[LONGITUDINAL]
input = {n,p}
DEFINITION:
y = {distribution=negativeBinomial, size=n, prob=p}
")
param <- c(n=10,p=0.4)
N <- 10000
y <- list(name='y', time=seq(1, N, by=1))
res2a <- simulx(model=negbinModela,
parameter=param,
output=y)
We can use equivalently define the probability mass function in the Mlxtran code:
negbinModelb <- inlineModel("
[LONGITUDINAL]
input = {n,p}
DEFINITION:
y = {type = count,
P(y=k) = factorial(k+n-1)/factorial(k)/factorial(n-1)*((1-p)^k)*(p^n)
}
")
res2b <- simulx(model=negbinModelb,
parameter=param,
output=y)
We can now compare the empirical distribution of these 2 samples with the theoretical negative binomial distribution:
library(plyr)
library(reshape2)
ca=count(res2a$y$y)
cb=count(res2b$y$y)
names(ca) <- c('k','obs.a')
names(cb) <- c('k','obs.b')
c <- merge(ca, cb)
c$theo <- dnbinom(c$k,size=param[1],p=param[2])*N
r <- melt(c, id = 'k', variable.name = 'freq')
print(ggplot(r, aes(k,value, colour=freq)) + geom_point())