Dealing with rounded data
2016-03-24
Real “continuous” data are always rounded. For instance, I once had to deal with these data:
```r
head(dat, 15)
##    factor1 factor2   y
## 1       A1      B1 0.1
## 2       A1      B1 0.1
## 3       A1      B1 0.1
## 4       A1      B1 0.1
## 5       A1      B1 0.1
## 6       A1      B1 0.1
## 7       A1      B1 0.3
## 8       A1      B1 0.3
## 9       A1      B1 0.1
## 10      A1      B2 0.2
## 11      A1      B2 0.1
## 12      A1      B2 0.1
## 13      A1      B2 0.1
## 14      A1      B2 0.0
## 15      A1      B2 0.0
```
These data were recorded by a measurement device with one-decimal precision. Thus, a value of \(0.1\) actually means the value lies between \(0.05\) and \(0.15\), and a value of \(0\) actually means the value lies between \(0\) and \(0.05\) (these are nonnegative measurements).
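For illustration, data rounded in this way are easy to simulate (a minimal sketch; the log-normal parameter values are arbitrary, not taken from these data):

```r
# simulate "true" log-normal values, then round them to one decimal,
# mimicking what the measurement device does
set.seed(314)
y_true <- rlnorm(5, meanlog = -2, sdlog = 0.6)
round(y_true, 1)  # the values that would actually be recorded
```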
In fact, these are interval data:
```r
dat1 <- transform(dat, low = pmax(0, y - 0.05), up = y + 0.05)
head(dat1, 15)
##    factor1 factor2   y  low   up
## 1       A1      B1 0.1 0.05 0.15
## 2       A1      B1 0.1 0.05 0.15
## 3       A1      B1 0.1 0.05 0.15
## 4       A1      B1 0.1 0.05 0.15
## 5       A1      B1 0.1 0.05 0.15
## 6       A1      B1 0.1 0.05 0.15
## 7       A1      B1 0.3 0.25 0.35
## 8       A1      B1 0.3 0.25 0.35
## 9       A1      B1 0.1 0.05 0.15
## 10      A1      B2 0.2 0.15 0.25
## 11      A1      B2 0.1 0.05 0.15
## 12      A1      B2 0.1 0.05 0.15
## 13      A1      B2 0.1 0.05 0.15
## 14      A1      B2 0.0 0.00 0.05
## 15      A1      B2 0.0 0.00 0.05
```
Thus, assuming for instance that the true values \(y^*\) of the measurements follow a log-normal distribution: \[ \log(y^*) \sim \mathcal{N}(\mu, \sigma^2), \] then one should use a rounded log-normal distribution to model the recorded data: \[ \Pr(y = r) = \Pr\bigl(\max(r - 0.05, 0) \leq y^* < r + 0.05\bigr). \]
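As a consequence, each observation contributes to the likelihood through the probability of its interval. Here is a minimal sketch of the corresponding log-likelihood in R, using the `low` and `up` bounds constructed above (the parameter values passed to it are arbitrary, for illustration only):

```r
# log-likelihood of the rounded data under a log-normal model:
# each observation contributes the probability of its interval
loglik <- function(meanlog, sdlog, low, up) {
  sum(log(plnorm(up, meanlog, sdlog) - plnorm(low, meanlog, sdlog)))
}
loglik(-2.1, 0.65, dat1$low, dat1$up)
```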
By the way, one would run into a problem when trying to fit a log-normal distribution directly to the `y` values, because some of them are zero.

Using the grouped package

One way to deal with this issue is to use the `grouped` R package. It allows one to fit linear regression models to grouped (interval) data, and it is very easy to use:

```r
library(grouped)
fit <- grouped(cbind(low, up) ~ factor1*factor2, link = "log", data = dat1)
summary(fit)
## 
## Call:
## grouped(formula = cbind(low, up) ~ factor1 * factor2, link = "log", data = dat1)
## 
## Model Summary:
##    log.Lik    AIC     BIC
##    -44.711 99.421 107.739
## 
## Coefficients:
##                     Estimate Std.error t.value p.value
## (Intercept)           -2.111     0.217   -9.73  <0.001
## factor1A2              0.551     0.289    1.90   0.065
## factor2B2             -0.574     0.301   -1.90   0.065
## factor1A2:factor2B2   -0.551     0.414   -1.33   0.193
## 
## Random-Effect:
##       value std.error link.distribution
## sigma 0.579    0.0821        log-normal
## 
## Optimization:
## Convergence: 0
## max(|grad|): 0.00015
## Outer iter: 1
```
The `grouped` package provides confidence intervals “\(\textrm{estimate}\pm z_{1-\frac{\alpha}{2}}\textrm{stderr}\)”:

```r
confint(fit)
##                           2.5 %     97.5 %
## (Intercept)         -2.53560037 -1.6854123
## factor1A2           -0.01637233  1.1174238
## factor2B2           -1.16437699  0.0168808
## factor1A2:factor2B2 -1.36285840  0.2617881
```
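As a quick check, the interval for the intercept can be reproduced by hand from the printed estimate and standard error (the small discrepancy with the `confint` output comes from the rounding of the printed values):

```r
# Wald-type interval: estimate ± z * stderr
-2.111 + c(-1, 1) * qnorm(0.975) * 0.217
## [1] -2.536312 -1.685688
```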
This method of getting confidence intervals is an asymptotic one, and the resulting intervals are possibly too short for small sample sizes.
A Bayesian solution using STAN
With STAN, one can assign a rounded log-normal distribution to the observations with the help of the `categorical` distribution. We use the `cut` function to create the classes of the measurements:

```r
cuts <- c(0, seq(0.05, max(dat$y) + 0.1, by = 0.1), Inf)
dat2 <- transform(dat, class = cut(y, cuts, right = FALSE))
summary(dat2)
##  factor1 factor2       y                  class   
##  A1:19   B1:19   Min.   :0.0000   [0,0.05)   : 6  
##  A2:20   B2:20   1st Qu.:0.1000   [0.05,0.15):23  
##                  Median :0.1000   [0.15,0.25): 3  
##                  Mean   :0.1385   [0.25,0.35): 4  
##                  3rd Qu.:0.1500   [0.35,0.45): 2  
##                  Max.   :0.5000   [0.45,0.55): 1  
##                                   [0.55,Inf) : 0
```
There is no value beyond \(0.55\), hence we will fit a categorical distribution on these \(K = 7\) classes,
where the probabilities of the classes are given by the cdf of the log-normal distribution: \[ \Pr\bigl([a,b)\bigr) = \Phi\left(\frac{\log(b)-\mu}{\sigma}\right) - \Phi\left(\frac{\log(a)-\mu}{\sigma} \right). \]
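These class probabilities are easy to compute in R; here is a minimal sketch (the values of \(\mu\) and \(\sigma\) are arbitrary, for illustration only):

```r
# probabilities of the classes [0,0.05), [0.05,0.15), ..., [0.55,Inf)
# under a log-normal distribution with parameters mu and sigma
class_probs <- function(mu, sigma, cuts) {
  diff(c(0, pnorm((log(cuts) - mu) / sigma), 1))
}
sum(class_probs(-2.1, 0.65, seq(0.05, 0.55, by = 0.1)))  # sums to 1
## [1] 1
```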
The support of the `categorical` distribution in STAN is \(1\), \(2\), \(\ldots\), \(K\), so we have to encode each class by an integer:

```r
dat2 <- transform(dat2, ycat = as.integer(class))
head(dat2, 15)
##    factor1 factor2   y       class ycat
## 1       A1      B1 0.1 [0.05,0.15)    2
## 2       A1      B1 0.1 [0.05,0.15)    2
## 3       A1      B1 0.1 [0.05,0.15)    2
## 4       A1      B1 0.1 [0.05,0.15)    2
## 5       A1      B1 0.1 [0.05,0.15)    2
## 6       A1      B1 0.1 [0.05,0.15)    2
## 7       A1      B1 0.3 [0.25,0.35)    4
## 8       A1      B1 0.3 [0.25,0.35)    4
## 9       A1      B1 0.1 [0.05,0.15)    2
## 10      A1      B2 0.2 [0.15,0.25)    3
## 11      A1      B2 0.1 [0.05,0.15)    2
## 12      A1      B2 0.1 [0.05,0.15)    2
## 13      A1      B2 0.1 [0.05,0.15)    2
## 14      A1      B2 0.0    [0,0.05)    1
## 15      A1      B2 0.0    [0,0.05)    1
```
Now we write and compile the STAN model:
```r
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

stancode <- "
data {
  int<lower=1> N;    // number of observations
  int<lower=1> ycat[N]; // observations
  int<lower=1> P;    // number of parameters
  matrix[N,P] X;     // model matrix
  int<lower=1> K;    // number of categories
  vector[K-1] cuts;  // the cuts 0.05, 0.15, ..., 0.55
}
parameters {
  vector[P] beta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  simplex[K] p[N];
  mu <- X*beta;
  for(i in 1:N){
    p[i][1] <- Phi((log(cuts[1])-mu[i])/sigma);
    for(j in 2:(K-1)){
      p[i][j] <- Phi((log(cuts[j])-mu[i])/sigma) - Phi((log(cuts[j-1])-mu[i])/sigma);
    }
    p[i][K] <- 1.0 - sum(p[i][1:(K-1)]);
  }
}
model {
  for(i in 1:N) ycat[i] ~ categorical(p[i]);
  beta ~ normal(0, 20); // prior on the regression coefficients
  sigma ~ cauchy(0, 5); // prior on the standard deviation
}
"
# Compilation
stanmodel <- stan_model(model_code = stancode, model_name = "rounded 2-way ANOVA")
```
And we run the STAN sampler:
```r
# Stan data
K <- length(cuts) - 1
X <- model.matrix(~ factor1*factor2, data = dat2)
standata <- list(ycat = dat2$ycat, N = nrow(dat2), K = K,
                 cuts = cuts[2:K], X = X, P = ncol(X))
# Run Stan
samples <- sampling(stanmodel, data = standata,
                    iter = 11000, warmup = 1000, chains = 4)
# Outputs
library(coda)
codasamples <- do.call(mcmc.list,
  plyr::alply(rstan::extract(samples, permuted = FALSE,
                             pars = c("beta", "sigma")), 2, mcmc))
summary(codasamples)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean     SD  Naive SE Time-series SE
## beta[1] -2.1228 0.2411 0.0012055      0.0022252
## beta[2]  0.5479 0.3240 0.0016199      0.0029783
## beta[3] -0.5780 0.3369 0.0016843      0.0031244
## beta[4] -0.5470 0.4629 0.0023146      0.0042449
## sigma    0.6537 0.1032 0.0005162      0.0008849
## 
## 2. Quantiles for each variable:
## 
##             2.5%     25%     50%     75%    97.5%
## beta[1] -2.60292 -2.2815 -2.1209 -1.9626 -1.65339
## beta[2] -0.09319  0.3348  0.5483  0.7637  1.18396
## beta[3] -1.24907 -0.7990 -0.5791 -0.3557  0.08303
## beta[4] -1.46832 -0.8486 -0.5450 -0.2444  0.37006
## sigma    0.48257  0.5813  0.6429  0.7145  0.88888
```
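The posterior quantiles directly provide 95% equal-tailed credible intervals, comparable with the output of `confint(fit)` above:

```r
# 95% credible intervals, from the posterior quantiles
summary(codasamples)$quantiles[, c("2.5%", "97.5%")]
```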
The estimates of the regression parameters are almost the same as the ones provided by the `grouped` package, and the credible intervals are a bit wider than the asymptotic confidence intervals; the estimate of \(\sigma\) is a bit different. Of course, the major advantage of the Bayesian way is that it can be used for any parametric model, not only for linear regression models.
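For instance, switching to a rounded gamma model only changes how the class probabilities are computed; a sketch, with hypothetical `shape` and `rate` parameters:

```r
# class probabilities under a rounded gamma model: only the cdf changes
class_probs_gamma <- function(shape, rate, cuts) {
  diff(c(0, pgamma(cuts, shape, rate), 1))
}
```

In the STAN code, this amounts to replacing the calls to `Phi` with the corresponding `gamma_cdf` calls.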