R script: categorical.R
Mlxtran code: model/categoricalA.txt ; model/categoricalB.txt
Assume now that the observed data takes its values in a fixed and finite set of nominal categories \(\{c_1, c_2,\ldots , c_K\}\)
Considering the observations \((y_{j},\, 1 \leq j \leq n)\) as a sequence of conditionally independent random variables, the model is completely defined by the probability mass functions \[\prob{y_{j}=c_k} \quad \text{for} \ k=1,\ldots, K \ \text{and} \ 1 \leq j \leq n\]
For a given \(j\), the sum of the \(K\) probabilities is 1, so in fact only \(K-1\) of them need to be defined.
Ordinal data further assume that the categories are ordered: \[c_1 \leq c_2,\leq \ldots \leq c_K .\]
Instead of defining the probabilities of each category, it may be convenient to define the cumulative probabilities \[\prob{y_{j} \leq c_k}, \quad \text{for} \ k=1,\ldots ,K-1\] or the cumulative logits \[{\rm logit}\left(\prob{y_{j} \leq c_k}\right),\quad \text{for} \ k=1,\ldots ,K-1\]
where \({\rm logit}(p) = \log(p/(1-p))\).
The distribution of ordered categorical data can be defined in the block DEFINITION
of the Section [LONGITUDINAL]
using either the probability mass functions, the cumulative probabilities or the cumulative logits.
\(y_j\) takes its values in \(\{0, 1, 2\}\). We use cumulative logits to define the distribution of \(y_j\):
\[ \begin{align} {\rm logit}\left(\prob{y_{j} \leq 0}\right) &= a - b\, t_j \\ {\rm logit}\left(\prob{y_{j} \leq 1}\right) &= a - b\, t_j/2 \end{align} \] We can then derive the probabilities for each category: \[ \begin{align} \prob{y_{j} = 0} &= \frac{1}{1+e^{-(a - b\, t_j)}} \\ \prob{y_{j} = 1} &= \frac{1}{1+e^{-(a - b\, t_j/2)}} - \prob{y_{j} = 0} \\ \prob{y_{j} = 2} &= 1 - \prob{y_{j} = 0} - \prob{y_{j} = 1} \end{align} \]
This model is implemented in the file categoricalA.txt
:
[LONGITUDINAL] input = {a,b} EQUATION: lp0 = a-b*t lp1 = a-b*t/2 p0 = 1/(1+exp(-lp0)) p1 = 1/(1+exp(-lp1)) -p0 p2 = 1-p0-p1 DEFINITION: y = {type = categorical, categories = {0, 1, 2}, P(y=0) = p0, P(y=1) = p1}
We will use simulx for computing \(p_0\), \(p_1\) and \(p_2\) every hour between 0h and 100h and for generating observations \((y_j)\) every 2 hours between 0h and 100h, with \(a=8\) and \(b=0.2\).
seed <- 12345
p <- c(a=8,b=0.2)
pr <- list(name=c('p0','p1','p2'), time=0:100)
y <- list(name='y', time=seq(0, 100, by=2))
res <- simulx(model = 'model/categoricalA.txt',
parameter = p,
output = list(pr, y),
settings = list(seed=seed))
we can now plot the 3 probabilities and the simulated data:
library(gridExtra)
plot1=ggplot() + ylab("probabilities") +
geom_line(data=res$p0, aes(x=time, y=p0, colour="p0"),size=1) +
geom_line(data=res$p1, aes(x=time, y=p1, colour="p1"),size=1) +
geom_line(data=res$p2, aes(x=time, y=p2, colour="p2"),size=1) +
theme(legend.position=c(0.1,0.5),legend.title=element_blank())
plot2=ggplot() + geom_point(aes(x=time, y=y), data=res$y)
grid.arrange(plot1, plot2, ncol=2)
We can equivalently use categoricalB.txt
where the cumulative logits define the distribution of the \(y_j\)’s:
[LONGITUDINAL] input = {a,b} EQUATION: lp0 = a-b*t lp1 = a-b*t/2 DEFINITION: y = {type = categorical, categories = {0,1,2}, logit(P(y<=0)) = lp0, logit(P(y<=1)) = lp1}
We therefore use simulx for computing \(lp_0\) and \(lp_1\) and for generating observations \((y_j)\)
lpr <- list(name=c('lp0','lp1'), time=0:100)
res <- simulx(model = 'model/categoricalB.txt',
parameter = p,
output = list(y,lpr),
settings = list(seed=seed))
We can then derive the probability mass function from the cumulative logits
p0 <- 1/(1+exp(-res$lp0$lp0))
p1 <- 1/(1+exp(-res$lp1$lp1))-p0
p2 <- 1-p0-p1
and create a single data frame with these probabilities
library(reshape2)
pr <- data.frame(time=0:100,p0,p1,p2)
pr <- melt(pr , id = 'time', variable.name = 'proba')
# pr is a data frame with columns "id", "proba" and "value"
plot1=ggplot(pr, aes(time,value)) + geom_line(aes(colour = proba),size=1) +
ylab('probabilities') + theme(legend.position=c(.1, .5))
plot2=ggplot() + geom_point(aes(x=time, y=y), data=res$y)
grid.arrange(plot1, plot2, ncol=2)
When \(N\) replicates of the data are simulated, catplotmlx can be used to display the empirical distribution of the simulated data:
g <- list(size=1000)
res <- simulx(model = 'model/categoricalA.txt',
parameter = p,
output = y,
group = g)
plot3 <- catplotmlx(res$y, breaks=25)
print(plot3)