Dealing with rounded data

2016-03-24
Source

Real “continuous” data are always rounded. For instance, I already had to deal with these data:

head(dat, 15)
##    factor1 factor2   y
## 1       A1      B1 0.1
## 2       A1      B1 0.1
## 3       A1      B1 0.1
## 4       A1      B1 0.1
## 5       A1      B1 0.1
## 6       A1      B1 0.1
## 7       A1      B1 0.3
## 8       A1      B1 0.3
## 9       A1      B1 0.1
## 10      A1      B2 0.2
## 11      A1      B2 0.1
## 12      A1      B2 0.1
## 13      A1      B2 0.1
## 14      A1      B2 0.0
## 15      A1      B2 0.0

These data were recorded by a measurement device with one decimal precision. Thus, a value of \(0.1\) actually means the value lies between \(0.05\) and \(0.15\). A value of \(0\) actually means the value lies beteen \(0\) and \(0.05\) (these are nonnegative measurements).

In fact these are intervals data:

dat1 <- transform(dat, low = pmax(0,y-0.05), up = y+0.05)
head(dat1, 15)
##    factor1 factor2   y  low   up
## 1       A1      B1 0.1 0.05 0.15
## 2       A1      B1 0.1 0.05 0.15
## 3       A1      B1 0.1 0.05 0.15
## 4       A1      B1 0.1 0.05 0.15
## 5       A1      B1 0.1 0.05 0.15
## 6       A1      B1 0.1 0.05 0.15
## 7       A1      B1 0.3 0.25 0.35
## 8       A1      B1 0.3 0.25 0.35
## 9       A1      B1 0.1 0.05 0.15
## 10      A1      B2 0.2 0.15 0.25
## 11      A1      B2 0.1 0.05 0.15
## 12      A1      B2 0.1 0.05 0.15
## 13      A1      B2 0.1 0.05 0.15
## 14      A1      B2 0.0 0.00 0.05
## 15      A1      B2 0.0 0.00 0.05

Thus, assuming for instance that the true values of the measurements follow a log-normal distribution:

then one should use a rounded log-normal distribution to model the data:

By the way, one would get a problem if one intended to fit a log-normal distribution to the y values, because there are some zero values.

Using the `grouped` package

One way to deal with this issue is to use the grouped R package. It allows to fit linear regression models to grouped data. It is very easy to use:

library(grouped)
fit <- grouped(cbind(low, up) ~  factor1*factor2, link="log", data=dat1)
summary(fit)
## 
## Call:
## grouped(formula = cbind(low, up) ~ factor1 * factor2, link = "log",     data = dat1)
## 
## Model Summary:
##  log.Lik    AIC     BIC
##  -44.711 99.421 107.739
## 
## Coefficients:
##                     Esimate Std.error t.value p.value
## (Intercept)          -2.111     0.217   -9.73  <0.001
## factor1A2             0.551     0.289    1.90   0.065
## factor2B2            -0.574     0.301   -1.90   0.065
## factor1A2:factor2B2  -0.551     0.414   -1.33   0.193
## 
## Random-Effect:
##       value std.error link.distribution
## sigma 0.579    0.0821        log-normal
## 
## Optimization:
## Convergence: 0 
## max(|grad|): 0.00015 
##  Outer iter: 1

The grouped package provides confidence intervals “\(\textrm{estimate}\pm z_{1-\frac{\alpha}{2}}\textrm{stderr}\)”:

confint(fit)
##                           2.5 %     97.5 %
## (Intercept)         -2.53560037 -1.6854123
## factor1A2           -0.01637233  1.1174238
## factor2B2           -1.16437699  0.0168808
## factor1A2:factor2B2 -1.36285840  0.2617881

This method to get confidence intervals is an asymptotic one, and they are possibly too short for small sample sizes.

A Bayesian solution using STAN

With STAN, one can assign a rounded log-normal distribution to the observations with the help of the categorical distribution.

We use the cut function to create the classes of the measurements:

cuts <- c(0, seq(0.05, max(dat$y)+0.1, by=0.1), Inf)
dat2 <- transform(dat, class=cut(y, cuts, right=FALSE))
summary(dat2)
##  factor1 factor2       y                  class   
##  A1:19   B1:19   Min.   :0.0000   [0,0.05)   : 6  
##  A2:20   B2:20   1st Qu.:0.1000   [0.05,0.15):23  
##                  Median :0.1000   [0.15,0.25): 3  
##                  Mean   :0.1385   [0.25,0.35): 4  
##                  3rd Qu.:0.1500   [0.35,0.45): 2  
##                  Max.   :0.5000   [0.45,0.55): 1  
##                                   [0.55,Inf) : 0

There is no value beyond \(0.55\), hence we will fit such a categorical distribution:

where the probabilities of the classes are given by the cdf of the log-normal distribution: \[ \Pr\bigl([a,b)\bigr) = \Phi\left(\frac{\log(b)-\mu}{\sigma}\right) - \Phi\left(\frac{\log(a)-\mu}{\sigma} \right). \]

The support of the categorical distribution in STAN is \(1\), \(2\), \(\ldots\), \(K\), so we have to encode each class by an integer:

dat2 <- transform(dat2, ycat=as.integer(class))
head(dat2, 15)
##    factor1 factor2   y       class ycat
## 1       A1      B1 0.1 [0.05,0.15)    2
## 2       A1      B1 0.1 [0.05,0.15)    2
## 3       A1      B1 0.1 [0.05,0.15)    2
## 4       A1      B1 0.1 [0.05,0.15)    2
## 5       A1      B1 0.1 [0.05,0.15)    2
## 6       A1      B1 0.1 [0.05,0.15)    2
## 7       A1      B1 0.3 [0.25,0.35)    4
## 8       A1      B1 0.3 [0.25,0.35)    4
## 9       A1      B1 0.1 [0.05,0.15)    2
## 10      A1      B2 0.2 [0.15,0.25)    3
## 11      A1      B2 0.1 [0.05,0.15)    2
## 12      A1      B2 0.1 [0.05,0.15)    2
## 13      A1      B2 0.1 [0.05,0.15)    2
## 14      A1      B2 0.0    [0,0.05)    1
## 15      A1      B2 0.0    [0,0.05)    1

Now we write and compile the STAN model:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
stancode <- "
data {
  int<lower=1> N;       // number of observations
  int<lower=1> ycat[N]; // observations
  int<lower=1> P;       // number of parameters
  matrix[N,P] X;        // model matrix
  int<lower=1> K;       // number of categories
  vector[K-1] cuts;     // the cuts 0.05, 0.15, ..., 0.55
}
parameters {
  vector[P] beta; 
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  simplex[K] p[N];
  mu <- X*beta;
  for(i in 1:N){
    p[i][1] <- Phi((log(cuts[1])-mu[i])/sigma);
    for(j in 2:(K-1)){
      p[i][j] <-  Phi((log(cuts[j])-mu[i])/sigma) - Phi((log(cuts[j-1])-mu[i])/sigma);
    }
  p[i][K] <- 1.0 - sum(p[i][1:(K-1)]);
  }
}
model {
  for(i in 1:N) ycat[i] ~ categorical(p[i]);
  beta ~ normal(0, 20); // prior on the regression coefficients
  sigma ~ cauchy(0, 5); // prior on the standard deviation
}
"
# Compilation
stanmodel <- stan_model(model_code = stancode, model_name="rounded 2-way ANOVA")

And we run the STAN sampler:

# Stan data
K <- length(cuts)-1
X <- model.matrix(~factor1*factor2, data=dat2)
standata <- list(ycat=dat2$ycat, N=nrow(dat2), K=K, cuts=cuts[2:K], X=X, P=ncol(X))
# Run Stan
samples <- sampling(stanmodel, data = standata, 
                    iter = 11000, warmup = 1000, chains = 4)
# Outputs
library(coda)
codasamples <- do.call(mcmc.list, 
    plyr::alply(rstan::extract(samples, permuted=FALSE, 
                                pars=c("beta", "sigma")), 2, mcmc))
summary(codasamples)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean     SD  Naive SE Time-series SE
## beta[1] -2.1228 0.2411 0.0012055      0.0022252
## beta[2]  0.5479 0.3240 0.0016199      0.0029783
## beta[3] -0.5780 0.3369 0.0016843      0.0031244
## beta[4] -0.5470 0.4629 0.0023146      0.0042449
## sigma    0.6537 0.1032 0.0005162      0.0008849
## 
## 2. Quantiles for each variable:
## 
##             2.5%     25%     50%     75%    97.5%
## beta[1] -2.60292 -2.2815 -2.1209 -1.9626 -1.65339
## beta[2] -0.09319  0.3348  0.5483  0.7637  1.18396
## beta[3] -1.24907 -0.7990 -0.5791 -0.3557  0.08303
## beta[4] -1.46832 -0.8486 -0.5450 -0.2444  0.37006
## sigma    0.48257  0.5813  0.6429  0.7145  0.88888

The estimates of the regression parameters are almost the same as the ones provided by the grouped package, and confidence intervals are a bit larger. The estimate of \(\sigma\) is a bit different.

Of course, the major advantage of the Bayesian way is that it can be used for any parametric model, not only the linear regression models.

« Back Home

stlaPblog

Some links

▶ R links

▶ Blogs