Dealing with rounded data

    2016-03-24

    Real “continuous” data are always rounded. For instance, I once had to deal with the following data:

    head(dat, 15)
    ##    factor1 factor2   y
    ## 1       A1      B1 0.1
    ## 2       A1      B1 0.1
    ## 3       A1      B1 0.1
    ## 4       A1      B1 0.1
    ## 5       A1      B1 0.1
    ## 6       A1      B1 0.1
    ## 7       A1      B1 0.3
    ## 8       A1      B1 0.3
    ## 9       A1      B1 0.1
    ## 10      A1      B2 0.2
    ## 11      A1      B2 0.1
    ## 12      A1      B2 0.1
    ## 13      A1      B2 0.1
    ## 14      A1      B2 0.0
    ## 15      A1      B2 0.0

    These data were recorded by a measurement device with one-decimal precision. Thus, a value of \(0.1\) actually means that the value lies between \(0.05\) and \(0.15\), and a value of \(0\) actually means that the value lies between \(0\) and \(0.05\) (these are nonnegative measurements).

    In fact, these are interval data:

    dat1 <- transform(dat, low = pmax(0,y-0.05), up = y+0.05)
    head(dat1, 15)
    ##    factor1 factor2   y  low   up
    ## 1       A1      B1 0.1 0.05 0.15
    ## 2       A1      B1 0.1 0.05 0.15
    ## 3       A1      B1 0.1 0.05 0.15
    ## 4       A1      B1 0.1 0.05 0.15
    ## 5       A1      B1 0.1 0.05 0.15
    ## 6       A1      B1 0.1 0.05 0.15
    ## 7       A1      B1 0.3 0.25 0.35
    ## 8       A1      B1 0.3 0.25 0.35
    ## 9       A1      B1 0.1 0.05 0.15
    ## 10      A1      B2 0.2 0.15 0.25
    ## 11      A1      B2 0.1 0.05 0.15
    ## 12      A1      B2 0.1 0.05 0.15
    ## 13      A1      B2 0.1 0.05 0.15
    ## 14      A1      B2 0.0 0.00 0.05
    ## 15      A1      B2 0.0 0.00 0.05

    Thus, assuming for instance that the true values \(y^\ast\) of the measurements follow a log-normal distribution: \[ \log(y^\ast) \sim \mathcal{N}(\mu, \sigma^2), \]

    then one should use a rounded log-normal distribution to model the observed values \(y\): \[ \Pr(y = r) = \Pr\bigl(\max(0, r-0.05) \leq y^\ast < r+0.05\bigr), \quad r = 0, 0.1, 0.2, \ldots \]

    By the way, one would run into a problem if one tried to fit a log-normal distribution directly to the \(y\) values, because some of them are zero.
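
    For illustration, this pmf is easy to compute in R with plnorm. Here is a minimal sketch, where the function name and the values of \(\mu\) and \(\sigma\) are arbitrary choices for the example:

    # Pmf of the rounded log-normal distribution:
    # Pr(y = r) = Pr(max(0, r - 0.05) <= y* < r + 0.05)
    pmf_rounded_lnorm <- function(r, mu, sigma) {
      plnorm(r + 0.05, meanlog = mu, sdlog = sigma) -
        plnorm(pmax(0, r - 0.05), meanlog = mu, sdlog = sigma)
    }
    pmf_rounded_lnorm(seq(0, 0.5, by = 0.1), mu = -2, sigma = 0.6)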

    Using the grouped package

    One way to deal with this issue is to use the grouped R package, which allows one to fit linear regression models to grouped data. It is very easy to use:

    library(grouped)
    fit <- grouped(cbind(low, up) ~  factor1*factor2, link="log", data=dat1)
    summary(fit)
    ## 
    ## Call:
    ## grouped(formula = cbind(low, up) ~ factor1 * factor2, link = "log",     data = dat1)
    ## 
    ## Model Summary:
    ##  log.Lik    AIC     BIC
    ##  -44.711 99.421 107.739
    ## 
    ## Coefficients:
    ##                    Estimate Std.error t.value p.value
    ## (Intercept)          -2.111     0.217   -9.73  <0.001
    ## factor1A2             0.551     0.289    1.90   0.065
    ## factor2B2            -0.574     0.301   -1.90   0.065
    ## factor1A2:factor2B2  -0.551     0.414   -1.33   0.193
    ## 
    ## Random-Effect:
    ##       value std.error link.distribution
    ## sigma 0.579    0.0821        log-normal
    ## 
    ## Optimization:
    ## Convergence: 0 
    ## max(|grad|): 0.00015 
    ##  Outer iter: 1

    The grouped package provides confidence intervals of the form \(\textrm{estimate} \pm z_{1-\frac{\alpha}{2}} \, \textrm{stderr}\):

    confint(fit)
    ##                           2.5 %     97.5 %
    ## (Intercept)         -2.53560037 -1.6854123
    ## factor1A2           -0.01637233  1.1174238
    ## factor2B2           -1.16437699  0.0168808
    ## factor1A2:factor2B2 -1.36285840  0.2617881
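
    These intervals can be recomputed by hand from the estimates and standard errors displayed in the summary; for instance, for the intercept (up to the rounding of the displayed values):

    # 95% interval for the intercept: estimate +/- z_{0.975} * stderr
    -2.111 + c(-1, 1) * qnorm(0.975) * 0.217
    # approximately (-2.536, -1.686): matches the confint row above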

    This method of deriving confidence intervals is asymptotic, and the intervals are possibly too short for small sample sizes.

    A Bayesian solution using STAN

    With STAN, one can assign a rounded log-normal distribution to the observations with the help of the categorical distribution.

    We use the cut function to create the classes of the measurements:

    cuts <- c(0, seq(0.05, max(dat$y)+0.1, by=0.1), Inf)
    dat2 <- transform(dat, class=cut(y, cuts, right=FALSE))
    summary(dat2)
    ##  factor1 factor2       y                  class   
    ##  A1:19   B1:19   Min.   :0.0000   [0,0.05)   : 6  
    ##  A2:20   B2:20   1st Qu.:0.1000   [0.05,0.15):23  
    ##                  Median :0.1000   [0.15,0.25): 3  
    ##                  Mean   :0.1385   [0.25,0.35): 4  
    ##                  3rd Qu.:0.1500   [0.35,0.45): 2  
    ##                  Max.   :0.5000   [0.45,0.55): 1  
    ##                                   [0.55,Inf) : 0

    There is no value beyond \(0.55\), hence we will fit a categorical distribution on these \(K = 7\) classes: \[ y \sim \mathrm{Categorical}(p_1, \ldots, p_K), \]

    where the probabilities of the classes are given by the cdf of the log-normal distribution: \[ \Pr\bigl([a,b)\bigr) = \Phi\left(\frac{\log(b)-\mu}{\sigma}\right) - \Phi\left(\frac{\log(a)-\mu}{\sigma} \right). \]
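
    As a sanity check, these class probabilities can be computed in R with plnorm, and they sum to \(1\). A minimal sketch, with arbitrary values of \(\mu\) and \(\sigma\):

    # Class probabilities Pr([a,b)) from the log-normal cdf,
    # for arbitrary illustrative values of mu and sigma
    mu <- -2; sigma <- 0.6
    p <- diff(plnorm(cuts, meanlog = mu, sdlog = sigma))
    sum(p) # equals 1: p is a simplex, as required by the categorical distribution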

    The support of the categorical distribution in STAN is \(1\), \(2\), \(\ldots\), \(K\), so we have to encode each class by an integer:

    dat2 <- transform(dat2, ycat=as.integer(class))
    head(dat2, 15)
    ##    factor1 factor2   y       class ycat
    ## 1       A1      B1 0.1 [0.05,0.15)    2
    ## 2       A1      B1 0.1 [0.05,0.15)    2
    ## 3       A1      B1 0.1 [0.05,0.15)    2
    ## 4       A1      B1 0.1 [0.05,0.15)    2
    ## 5       A1      B1 0.1 [0.05,0.15)    2
    ## 6       A1      B1 0.1 [0.05,0.15)    2
    ## 7       A1      B1 0.3 [0.25,0.35)    4
    ## 8       A1      B1 0.3 [0.25,0.35)    4
    ## 9       A1      B1 0.1 [0.05,0.15)    2
    ## 10      A1      B2 0.2 [0.15,0.25)    3
    ## 11      A1      B2 0.1 [0.05,0.15)    2
    ## 12      A1      B2 0.1 [0.05,0.15)    2
    ## 13      A1      B2 0.1 [0.05,0.15)    2
    ## 14      A1      B2 0.0    [0,0.05)    1
    ## 15      A1      B2 0.0    [0,0.05)    1

    Now we write and compile the STAN model:

    library(rstan)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())
    stancode <- "
    data {
      int<lower=1> N;       // number of observations
      int<lower=1> ycat[N]; // observations
      int<lower=1> P;       // number of parameters
      matrix[N,P] X;        // model matrix
      int<lower=1> K;       // number of categories
      vector[K-1] cuts;     // the cuts 0.05, 0.15, ..., 0.55
    }
    parameters {
      vector[P] beta; 
      real<lower=0> sigma;
    }
    transformed parameters {
      vector[N] mu;
      simplex[K] p[N];
      mu <- X*beta;
      for(i in 1:N){
        p[i][1] <- Phi((log(cuts[1])-mu[i])/sigma);
        for(j in 2:(K-1)){
          p[i][j] <-  Phi((log(cuts[j])-mu[i])/sigma) - Phi((log(cuts[j-1])-mu[i])/sigma);
        }
      p[i][K] <- 1.0 - sum(p[i][1:(K-1)]);
      }
    }
    model {
      for(i in 1:N) ycat[i] ~ categorical(p[i]);
      beta ~ normal(0, 20); // prior on the regression coefficients
      sigma ~ cauchy(0, 5); // prior on the standard deviation
    }
    "
    # Compilation
    stanmodel <- stan_model(model_code = stancode, model_name="rounded 2-way ANOVA")

    And we run the STAN sampler:

    # Stan data
    K <- length(cuts)-1
    X <- model.matrix(~factor1*factor2, data=dat2)
    standata <- list(ycat=dat2$ycat, N=nrow(dat2), K=K, cuts=cuts[2:K], X=X, P=ncol(X))
    # Run Stan
    samples <- sampling(stanmodel, data = standata, 
                        iter = 11000, warmup = 1000, chains = 4)
    # Outputs
    library(coda)
    codasamples <- do.call(mcmc.list, 
        plyr::alply(rstan::extract(samples, permuted=FALSE, 
                                    pars=c("beta", "sigma")), 2, mcmc))
    summary(codasamples)
    ## 
    ## Iterations = 1:10000
    ## Thinning interval = 1 
    ## Number of chains = 4 
    ## Sample size per chain = 10000 
    ## 
    ## 1. Empirical mean and standard deviation for each variable,
    ##    plus standard error of the mean:
    ## 
    ##            Mean     SD  Naive SE Time-series SE
    ## beta[1] -2.1228 0.2411 0.0012055      0.0022252
    ## beta[2]  0.5479 0.3240 0.0016199      0.0029783
    ## beta[3] -0.5780 0.3369 0.0016843      0.0031244
    ## beta[4] -0.5470 0.4629 0.0023146      0.0042449
    ## sigma    0.6537 0.1032 0.0005162      0.0008849
    ## 
    ## 2. Quantiles for each variable:
    ## 
    ##             2.5%     25%     50%     75%    97.5%
    ## beta[1] -2.60292 -2.2815 -2.1209 -1.9626 -1.65339
    ## beta[2] -0.09319  0.3348  0.5483  0.7637  1.18396
    ## beta[3] -1.24907 -0.7990 -0.5791 -0.3557  0.08303
    ## beta[4] -1.46832 -0.8486 -0.5450 -0.2444  0.37006
    ## sigma    0.48257  0.5813  0.6429  0.7145  0.88888
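
    Credible intervals can also be obtained as highest posterior density intervals with coda. A minimal sketch, pooling the four chains of the codasamples object defined above:

    # 95% highest posterior density intervals, pooling the chains
    HPDinterval(mcmc(do.call(rbind, codasamples)), prob = 0.95)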

    The estimates of the regression parameters are almost the same as the ones provided by the grouped package, and the credible intervals are a bit wider than the asymptotic confidence intervals. The estimate of \(\sigma\) is a bit different.

    Of course, the major advantage of the Bayesian approach is that it can be used for any parametric model, not only linear regression models.