Generalized confidence intervals in the balanced one-way random effect ANOVA model
2017-03-26
Let us consider the balanced one-way random effect ANOVA model, which we already discussed in two previous articles, here and here.
In the former, I provided a confidence interval about the total coefficient of variation, and I admitted that I didn’t remember how I derived it. Now I remember: it is a so-called generalized confidence interval.
We denote by \(y_{ij}\) the \(j\)-th observation in group \(i\), for \(1 \leqslant i \leqslant I\) and \(1 \leqslant j \leqslant J\), and we denote by \(Y_{ij}\) the random variable modelling \(y_{ij}\). Recall that the model assumes \[ Y_{ij} = \mu + b_i + \epsilon_{ij}, \quad b_i \overset{\text{iid}}{\sim} {\cal N}(0,\sigma^2_b), \quad \epsilon_{ij} \overset{\text{iid}}{\sim} {\cal N}(0,\sigma^2_w), \] with the \(b_i\) and the \(\epsilon_{ij}\) mutually independent.
Recall that the three summary statistics \(\bar Y_{\bullet\bullet}\) (overall mean), \({\mathrm{SS}}_b\) (between sum-of-squares) and \({\mathrm{SS}}_w\) (within sum-of-squares) are independent and their distributions are given by \[ Z \sim {\cal N}(0,1), \quad U^2_b \sim \chi^2_{I-1}, \quad U^2_w \sim \chi^2_{I(J-1)} \] where \[ \begin{align} Z & = \sqrt{IJ} \frac{\bar Y_{\bullet\bullet}-\mu}{\sqrt{J\sigma^2_b+\sigma^2_w}},\\ U^2_b & = \frac{{\mathrm{SS}}_b}{J\sigma^2_b+\sigma^2_w},\\ U^2_w & = \frac{{\mathrm{SS}}_w}{\sigma^2_w}. \end{align} \]
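These distributional facts are easy to check by simulation. Below is a minimal sketch (the values of \(I\), \(J\), the variances and the seed are arbitrary choices) verifying that \({\mathrm{SS}}_b/(J\sigma^2_b+\sigma^2_w)\) behaves like a \(\chi^2_{I-1}\) variable; it uses the fact that the group means are i.i.d. \({\cal N}(\mu, \sigma^2_b + \sigma^2_w/J)\).
set.seed(314159L)
I <- 4L; J <- 5L; sigmab <- 2; sigmaw <- 3
nsims <- 10000L
U2b_sims <- replicate(nsims, {
  # the group means are iid N(mu, sigmab^2 + sigmaw^2/J);
  # mu cancels in SSb, so we can take mu = 0 here
  groupMeans <- rnorm(I, mean = 0, sd = sqrt(sigmab^2 + sigmaw^2/J))
  ssb <- J * sum((groupMeans - mean(groupMeans))^2)
  ssb / (J*sigmab^2 + sigmaw^2)
})
# the empirical mean should be close to the chi^2_{I-1} mean, I-1 = 3
c(empiricalMean = mean(U2b_sims), chisqMean = I - 1)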
We use the notations \(\bar y_{\bullet\bullet}\), \({\mathrm{ss}}_b\), \({\mathrm{ss}}_w\) to denote the observed values of these three summary statistics. They are treated as non-random.
Define the three following strange random variables: \[ \begin{align} G_\mu & = \bar y_{\bullet\bullet} - \sqrt{IJ} \frac{\bar Y_{\bullet\bullet}-\mu}{\sqrt{{\mathrm{SS}}_b}} \sqrt{\frac{{\mathrm{ss}}_b}{IJ}}, \\ G_{\sigma^2_b} & = \frac{J\sigma^2_b+\sigma^2_w}{J {\mathrm{SS}}_b} {\mathrm{ss}}_b - \frac{\sigma^2_w}{J {\mathrm{SS}}_w}{\mathrm{ss}}_w, \\ G_{\sigma^2_w} & = \frac{\sigma^2_w}{{\mathrm{SS}}_w}{\mathrm{ss}}_w. \end{align} \]
They are strange because they involve the actual observations.
They enjoy the following two properties. Firstly, it is easy to check that, if \((\bar Y_{\bullet\bullet}, {\mathrm{SS}}_b, {\mathrm{SS}}_w) = (\bar y_{\bullet\bullet}, {\mathrm{ss}}_b, {\mathrm{ss}}_w)\), then \[ (G_\mu, G_{\sigma^2_b}, G_{\sigma^2_w}) = (\mu, \sigma^2_b, \sigma^2_w). \]
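For instance, for \(G_{\sigma^2_b}\), substituting \({\mathrm{SS}}_b = {\mathrm{ss}}_b\) and \({\mathrm{SS}}_w = {\mathrm{ss}}_w\) gives \[ G_{\sigma^2_b} = \frac{J\sigma^2_b+\sigma^2_w}{J\,{\mathrm{ss}}_b}\,{\mathrm{ss}}_b - \frac{\sigma^2_w}{J\,{\mathrm{ss}}_w}\,{\mathrm{ss}}_w = \frac{J\sigma^2_b+\sigma^2_w-\sigma^2_w}{J} = \sigma^2_b. \]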
Secondly, it is not difficult to check that, while the expressions of these three strange random variables involve the unknown parameters, their distributions do not involve any unknown parameter. Indeed, with the previous notations, \[ \begin{align} G_\mu & = \bar y_{\bullet\bullet} -\frac{Z}{\sqrt{U^2_b}}\sqrt{\frac{{\mathrm{ss}}_b}{IJ}}, \\ G_{\sigma^2_b} & = \frac{1}{J}\left(\frac{{\mathrm{ss}}_b}{U^2_b} - \frac{{\mathrm{ss}}_w}{U^2_w}\right), \\ G_{\sigma^2_w} & = \frac{{\mathrm{ss}}_w}{U^2_w}. \end{align} \]
Therefore, these variables can easily be simulated, and it is known (see Krishnamoorthy & Mathew’s book) that, if \(f\) is a suitable function, then the quantiles of \(f(G_\mu, G_{\sigma^2_b}, G_{\sigma^2_w})\) are confidence bounds for \(f(\mu, \sigma^2_b, \sigma^2_w)\). More precisely, if we denote by \(q_f(p)\) the \(100p\%\)-quantile of \(f(G_\mu, G_{\sigma^2_b}, G_{\sigma^2_w})\), then, for instance, \(\left[q_f\left(\frac{\alpha}{2}\right), q_f\left(1-\frac{\alpha}{2}\right)\right]\) is a \(100(1-\alpha)\%\)-confidence interval about \(f(\mu, \sigma^2_b, \sigma^2_w)\).
This is a so-called generalized confidence interval. Such confidence intervals are not exact, but it is known that they are asymptotically exact, and their coverage is often close to the nominal level.
Example
Let’s write a function simulating a dataset.
library(data.table)

Sim <- function(I, J, mu, sigmab, sigmaw){
  group <- gl(I, J, labels = LETTERS[1:I])
  DT <- data.table(group = group,
                   y = mu + rep(rnorm(I, sd=sigmab), each=J) + rnorm(I*J, sd=sigmaw),
                   key = "group")
  return(DT)
}

I <- 2L; J <- 3L
mu <- 0; sigmab <- 1; sigmaw <- 2
set.seed(666L)
( DT <- Sim(I, J, mu, sigmab, sigmaw) )
##    group           y
## 1:     A  0.04304213
## 2:     A  4.80964673
## 3:     A -3.68043786
## 4:     B  3.53114702
## 5:     B -0.59801585
## 6:     B  0.40931553
And let’s write a function calculating the summary statistics from such a dataset.
summaryStats <- function(DT){
  # group means, repeated within each group
  DT[, `:=`(means = rep(mean(y), each=.N)), by=group]
  # within sum-of-squares
  ssw <- DT[, {
    squares = (y-means)^2
    .(ssw = sum(squares))}]$ssw
  # overall mean
  ybar <- mean(DT$y)
  DT[, `:=`(Mean = ybar)]
  # between sum-of-squares
  ssb <- DT[, {
    squares = (means-Mean)^2
    .(ssb = sum(squares))}]$ssb
  return(c(ybar=ybar, ssb=ssb, ssw=ssw))
}

( stats <- summaryStats(DT) )
##       ybar        ssb        ssw
##  0.7524496  0.7849582 45.4922978
ybar <- stats["ybar"]
ssb <- stats["ssb"]
ssw <- stats["ssw"]
Consider a crazy parameter of interest \[ \theta = \mu + \sigma^2_b + \log(\sigma^2_w). \]
f <- function(mu, sigma2b, sigma2w) mu + sigma2b + log(sigma2w)
In our example, the true value of \(\theta\) is:
( theta0 <- f(mu, sigmab^2, sigmaw^2) )
## [1] 2.386294
We get a confidence interval about \(\theta\) as follows:
n <- 50000L
Z <- rnorm(n)
U2b <- rchisq(n, I-1)
U2w <- rchisq(n, I*(J-1))
# simulate the three pivotal quantities
Gmu <- ybar - Z/sqrt(U2b)*sqrt(ssb/I/J)
Gsigma2b <- 1/J*(ssb/U2b - ssw/U2w)
Gsigma2w <- ssw/U2w
quantile(f(Gmu, Gsigma2b, Gsigma2w), c(0.025, 0.975))
##      2.5%     97.5%
## -23.55456 252.50211
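For reuse, these steps can be packaged into a small helper function. This is a minimal sketch; the name gci, its argument order and its defaults are our choices, not from the original code.
# generalized confidence interval for f(mu, sigma2b, sigma2w),
# given the observed summary statistics (a sketch)
gci <- function(f, ybar, ssb, ssw, I, J, n = 10000L, alpha = 0.05){
  Z   <- rnorm(n)
  U2b <- rchisq(n, I - 1)
  U2w <- rchisq(n, I*(J - 1))
  Gmu      <- ybar - Z/sqrt(U2b)*sqrt(ssb/I/J)  # pivotal quantity for mu
  Gsigma2b <- (ssb/U2b - ssw/U2w)/J             # pivotal quantity for sigma2b
  Gsigma2w <- ssw/U2w                           # pivotal quantity for sigma2w
  quantile(f(Gmu, Gsigma2b, Gsigma2w), c(alpha/2, 1 - alpha/2))
}
# e.g. gci(f, ybar, ssb, ssw, I, J, n = 50000L)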
Checking the coverage
Let’s check the coverage probability of this interval for our scenario.
nsims <- 20000
SIMS <- t(vapply(1:nsims, function(i){
  summaryStats(Sim(I, J, mu, sigmab, sigmaw))
}, FUN.VALUE=numeric(3)))
lower <- upper <- numeric(nsims)
for(i in 1:nsims){
  ybar <- SIMS[i,"ybar"]
  ssb <- SIMS[i,"ssb"]
  ssw <- SIMS[i,"ssw"]
  # reuse the pivotal draws Z, U2b, U2w simulated above
  Gmu <- ybar - Z/sqrt(U2b)*sqrt(ssb/I/J)
  Gsigma2b <- 1/J*(ssb/U2b - ssw/U2w)
  Gsigma2w <- ssw/U2w
  theta <- f(Gmu, Gsigma2b, Gsigma2w)
  lower[i] <- quantile(theta, 0.025)
  upper[i] <- quantile(theta, 0.975)
}
mean(theta0 > lower)
## [1] 0.98875
mean(theta0 < upper)
## [1] 0.9728
mean(theta0 > lower & theta0 < upper)
## [1] 0.96155
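With 20000 simulations, the Monte Carlo error of these coverage estimates is small, so the excess over 95% is not simulation noise. A quick check with the usual binomial standard error (0.96155 is the two-sided coverage estimated above):
p <- 0.96155          # estimated two-sided coverage
sqrt(p*(1 - p)/nsims) # standard error, about 0.0014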
Thus, the generalized confidence interval is a bit conservative in this example: the estimated coverage is above the nominal 95% level. This is not surprising, since with \(I=2\) groups and \(J=3\) observations per group we are far from the asymptotic regime.