Generalized confidence intervals in the balanced one-way random effect ANOVA model

    2017-03-26

    Let us consider the balanced one-way random effect ANOVA model, which we already discussed in two previous articles, here and here.

    In the former, I provided a confidence interval about the total coefficient of variation, and I claimed I didn’t remember how I derived it. Now I remember: it is a so-called generalized confidence interval.

    We denote by \(y_{ij}\) the \(j\)-th observation in group \(i\), for \(1 \leqslant i \leqslant I\) and \(1 \leqslant j \leqslant J\), and we denote by \(Y_{ij}\) the random variable modelling \(y_{ij}\).

    Recall that the three summary statistics \(\bar Y_{\bullet\bullet}\) (overall mean), \({\mathrm{SS}}_b\) (between sum-of-squares) and \({\mathrm{SS}}_w\) (within sum-of-squares) are independent and their distributions are given by \[ Z \sim {\cal N}(0,1), \quad U^2_b \sim \chi^2_{I-1}, \quad U^2_w \sim \chi^2_{I(J-1)} \] where \[ \begin{align} Z & = \sqrt{IJ} \frac{\bar Y_{\bullet\bullet}-\mu}{\sqrt{J\sigma^2_b+\sigma^2_w}},\\ U^2_b & = \frac{{\mathrm{SS}}_b}{J\sigma^2_b+\sigma^2_w},\\ U^2_w & = \frac{{\mathrm{SS}}_w}{\sigma^2_w}. \end{align} \]
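These distributional facts are easy to confirm by simulation. Below is a quick sketch (not from the original post; base R only, with illustrative parameter values) checking that \({\mathrm{SS}}_w/\sigma^2_w\) behaves like a \(\chi^2\) variable with \(I(J-1)\) degrees of freedom:

```r
# Sketch: empirical check that SSw/sigma^2_w ~ chi^2_{I(J-1)}.
# Parameter values are illustrative assumptions, not from the post.
set.seed(42)
I <- 2L; J <- 3L
mu <- 0; sigmab <- 1; sigmaw <- 2
nsims <- 10000L
U2w_sim <- replicate(nsims, {
  b <- rep(rnorm(I, sd = sigmab), each = J)  # random group effects
  y <- mu + b + rnorm(I * J, sd = sigmaw)    # observations
  means <- ave(y, gl(I, J))                  # group means, repeated
  sum((y - means)^2) / sigmaw^2              # SSw / sigma^2_w
})
mean(U2w_sim)  # should be close to I*(J-1) = 4, the chi^2 mean
```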

    We use the notations \(\bar y_{\bullet\bullet}\), \({\mathrm{ss}}_b\), \({\mathrm{ss}}_w\) for the observed values of these three summary statistics. They are treated as non-random.

    Define the three following strange random variables: \[ \begin{align} G_\mu & = \bar y_{\bullet\bullet} - \sqrt{IJ} \frac{\bar Y_{\bullet\bullet}-\mu}{\sqrt{{\mathrm{SS}}_b}} \sqrt{\frac{{\mathrm{ss}}_b}{IJ}}, \\ G_{\sigma^2_b} & = \frac{J\sigma^2_b+\sigma^2_w}{J {\mathrm{SS}}_b} {\mathrm{ss}}_b - \frac{\sigma^2_w}{J {\mathrm{SS}}_w}{\mathrm{ss}}_w, \\ G_{\sigma^2_w} & = \frac{\sigma^2_w}{{\mathrm{SS}}_w}{\mathrm{ss}}_w. \end{align} \]

    They are strange because they involve the actual observations.

    They enjoy the following two properties. First, it is easy to check that, if \((\bar Y_{\bullet\bullet}, {\mathrm{SS}}_b, {\mathrm{SS}}_w) = (\bar y_{\bullet\bullet}, {\mathrm{ss}}_b, {\mathrm{ss}}_w)\), then \[ (G_\mu, G_{\sigma^2_b}, G_{\sigma^2_w}) = (\mu, \sigma^2_b, \sigma^2_w). \]
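This substitution property can be verified numerically. The sketch below plugs the observed values in place of the random summary statistics and checks that each \(G\) reduces to its parameter; all the numbers are arbitrary illustrative values, not the ones from this post’s example:

```r
# Numerical check of the substitution property: with (Ybar, SSb, SSw)
# replaced by (ybar, ssb, ssw), each G equals its parameter.
# All values below are arbitrary illustrative assumptions.
I <- 2L; J <- 3L
mu <- 0.5; sigma2b <- 1.2; sigma2w <- 4
ybar <- 0.7; ssb <- 0.8; ssw <- 45
# Definitions of the G's with the random statistics set to their observed values:
Gmu      <- ybar - sqrt(I*J) * (ybar - mu) / sqrt(ssb) * sqrt(ssb/(I*J))
Gsigma2b <- (J*sigma2b + sigma2w)/(J*ssb) * ssb - sigma2w/(J*ssw) * ssw
Gsigma2w <- sigma2w/ssw * ssw
c(Gmu, Gsigma2b, Gsigma2w)  # equals c(mu, sigma2b, sigma2w)
```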

    Second, it is not difficult to check that, while the expressions of these three strange random variables involve the unknown parameters, the distributions of these random variables do not involve any unknown parameter. Indeed, with the previous notations, \[ \begin{align} G_\mu & = \bar y_{\bullet\bullet} -\frac{Z}{\sqrt{U^2_b}}\sqrt{\frac{{\mathrm{ss}}_b}{IJ}}, \\ G_{\sigma^2_b} & = \frac{1}{J}\left(\frac{{\mathrm{ss}}_b}{U^2_b} - \frac{{\mathrm{ss}}_w}{U^2_w}\right), \\ G_{\sigma^2_w} & = \frac{{\mathrm{ss}}_w}{U^2_w}. \end{align} \]

    Therefore, these variables can be easily simulated, and it is known (see Krishnamoorthy & Mathew’s book) that, if \(f\) is a suitable function, then the quantiles of \(f(G_\mu, G_{\sigma^2_b}, G_{\sigma^2_w})\) are confidence bounds for \(f(\mu, \sigma^2_b, \sigma^2_w)\). More precisely, if we denote by \(q_f(p)\) the \(100p\%\)-quantile of \(f(G_\mu, G_{\sigma^2_b}, G_{\sigma^2_w})\), then, for instance, \(\left[q_f\left(\frac{\alpha}{2}\right), q_f\left(1-\frac{\alpha}{2}\right)\right]\) is a \(100(1-\alpha)\%\)-confidence interval about \(f(\mu, \sigma^2_b, \sigma^2_w)\).

    This is a so-called generalized confidence interval. Such confidence intervals are not exact, but it is known that they are asymptotically exact, and their coverage is often close to the nominal level.

    Example

    Let’s write a function simulating a dataset.

    library(data.table)
    Sim <- function(I, J, mu, sigmab, sigmaw){
      group <- gl(I, J, labels=LETTERS[1:I])
      DT <- data.table(group = group, 
                       y = mu + rep(rnorm(I, sd=sigmab), each=J) + rnorm(I*J, sd=sigmaw),
                       key = "group")
      return(DT)
    }
    I <- 2L; J <- 3L
    mu <- 0; sigmab <- 1; sigmaw <- 2
    set.seed(666L)
    ( DT <- Sim(I, J, mu, sigmab, sigmaw) )
    ##    group           y
    ## 1:     A  0.04304213
    ## 2:     A  4.80964673
    ## 3:     A -3.68043786
    ## 4:     B  3.53114702
    ## 5:     B -0.59801585
    ## 6:     B  0.40931553

    And let’s write a function calculating the summary statistics from such a dataset.

    summaryStats <- function(DT){
      DT[, `:=`(means = rep(mean(y), each=.N)), by=group]
      ssw <- DT[, { squares = (y-means)^2
                    .(ssw = sum(squares))}]$ssw
      ybar <- mean(DT$y)
      DT[, `:=`(Mean = ybar)]
      ssb <- DT[, { squares = (means-Mean)^2
                    .(ssb = sum(squares))}]$ssb
      return(c(ybar=ybar, ssb=ssb, ssw=ssw))
    }
    ( stats <- summaryStats(DT) )
    ##       ybar        ssb        ssw 
    ##  0.7524496  0.7849582 45.4922978
    ybar <- stats["ybar"]
    ssb <- stats["ssb"]
    ssw <- stats["ssw"]

    Consider a crazy parameter of interest \[ \theta = \mu + \sigma^2_b + \log(\sigma^2_w). \]

    f <- function(mu, sigma2b, sigma2w) mu + sigma2b + log(sigma2w)

    In our example, the true value of \(\theta\) is:

    ( theta0 <- f(mu, sigmab^2, sigmaw^2) )
    ## [1] 2.386294

    We get a confidence interval about \(\theta\) as follows:

    n <- 50000L
    Z <- rnorm(n)
    U2b <- rchisq(n, I-1)
    U2w <- rchisq(n, I*(J-1))
    Gmu <- ybar - Z/sqrt(U2b)*sqrt(ssb/I/J)
    Gsigma2b <- 1/J*(ssb/U2b - ssw/U2w)
    Gsigma2w <- ssw/U2w
    quantile(f(Gmu, Gsigma2b, Gsigma2w), c(0.025, 0.975))
    ##      2.5%     97.5% 
    ## -23.55456 252.50211
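As a side check (not in the original post), note that for \(\sigma^2_w\) alone the generalized pivot \({\mathrm{ss}}_w/U^2_w\) simply reproduces the classical chi-square confidence interval, since its quantiles are \({\mathrm{ss}}_w\) divided by the opposite quantiles of \(\chi^2_{I(J-1)}\):

```r
# Side check: the quantiles of the pivot ssw/U2w match the classical
# chi-square interval for sigma^2_w. ssw is the value observed above.
set.seed(314)
I <- 2L; J <- 3L
ssw <- 45.4922978
U2w <- rchisq(1e6, I * (J - 1))  # many draws, for accurate quantiles
sim_bounds   <- quantile(ssw / U2w, c(0.025, 0.975))
exact_bounds <- ssw / qchisq(c(0.975, 0.025), I * (J - 1))  # same order
rbind(sim_bounds, exact_bounds)
```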

    Checking the coverage

    Let’s check the coverage probability of this interval for our scenario.

    nsims <- 20000
    SIMS <- t(vapply(1:nsims, function(i){
       summaryStats(Sim(I, J, mu, sigmab, sigmaw))
      }, 
      FUN.VALUE=numeric(3)))
    lower <- upper <- numeric(nsims)
    for(i in 1:nsims){
      ybar <- SIMS[i,"ybar"]
      ssb <- SIMS[i,"ssb"]
      ssw <- SIMS[i,"ssw"]
      Gmu <- ybar - Z/sqrt(U2b)*sqrt(ssb/I/J)
      Gsigma2b <- 1/J*(ssb/U2b - ssw/U2w)
      Gsigma2w <- ssw/U2w
      theta <- f(Gmu, Gsigma2b, Gsigma2w)
      lower[i] <- quantile(theta, 0.025)
      upper[i] <- quantile(theta, 0.975)
    }
    mean(theta0 > lower)
    ## [1] 0.98875
    mean(theta0 < upper)
    ## [1] 0.9728
    mean(theta0 > lower & theta0 < upper)
    ## [1] 0.96155

    Thus, the generalized confidence interval is a bit conservative in this example (with \(I=2\) and \(J=3\), we are really far from the asymptotic regime).
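To see the asymptotic behaviour kick in, one can rerun the same coverage simulation with a larger design, say \(I = J = 10\) (an assumption of this sketch, along with fewer simulation replicates; the summary statistics are recomputed in base R for self-containedness):

```r
# Sketch: coverage of the generalized CI for the same crazy parameter
# theta = mu + sigma^2_b + log(sigma^2_w), with a larger design I = J = 10.
set.seed(1)
I <- 10L; J <- 10L
mu <- 0; sigmab <- 1; sigmaw <- 2
n <- 5000L; nsims <- 500L
Z <- rnorm(n); U2b <- rchisq(n, I - 1); U2w <- rchisq(n, I * (J - 1))
f <- function(mu, sigma2b, sigma2w) mu + sigma2b + log(sigma2w)
theta0 <- f(mu, sigmab^2, sigmaw^2)
covered <- logical(nsims)
for (i in seq_len(nsims)) {
  # simulate a dataset and compute its summary statistics
  b <- rep(rnorm(I, sd = sigmab), each = J)
  y <- mu + b + rnorm(I * J, sd = sigmaw)
  means <- ave(y, gl(I, J))
  ybar <- mean(y)
  ssw <- sum((y - means)^2)
  ssb <- sum((means - ybar)^2)
  # generalized pivots and CI for theta
  Gmu <- ybar - Z / sqrt(U2b) * sqrt(ssb / (I * J))
  theta <- f(Gmu, (ssb / U2b - ssw / U2w) / J, ssw / U2w)
  q <- quantile(theta, c(0.025, 0.975))
  covered[i] <- q[1] < theta0 && theta0 < q[2]
}
mean(covered)  # should be much closer to the nominal 0.95
```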