Inference in the balanced one-way random effect ANOVA
2015-03-12
The one-way random effect ANOVA model is the simplest linear Gaussian model with random effects: one random factor, and no other factor or covariate. However, the methods of inference for this model (point estimates, confidence intervals for various parameters such as the variance components and the coefficient of variation, prediction intervals, tolerance intervals) are not so well known. This is what we provide in this article, in the simplest situation: the case of balanced data (at the end we give a possible way to handle the unbalanced situation).
Recall that, denoting by \(y_{ij}\) the \(j\)-th response in group \(i\), this model assumes that the responses \(y_{ij}\) are generated in two steps. First, the group means \(\mu_i\) are independently generated according to a Gaussian distribution \({\cal N}(\mu, \sigma^2_b)\), where \(\mu\) is the overall mean and \(\sigma^2_b\) is the so-called between variance. Second, the responses \(y_{ij}\), \(j =1,\ldots,J\), for each group \(i\), are independently distributed according to a Gaussian distribution \({\cal N}(\mu_i, \sigma^2_w)\) with within variance \(\sigma^2_w\) and mean \(\mu_i\). In short, the model can be written: \[\begin{cases} \mu_i \sim_{\text{iid}} {\cal N}(\mu, \sigma^2_b) & i=1,\ldots,I \\ (y_{ij} \mid \mu_i) \sim_{\text{iid}} {\cal N}(\mu_i, \sigma^2_w) & j=1,\ldots,J \end{cases}.\]
If you like the moment of inertia representation of the variance on this picture (I do), see my article about it.
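Equivalently, marginalizing out the group means \(\mu_i\), all responses share the same Gaussian distribution, responses within a group are positively correlated, and responses from different groups are independent: \[ y_{ij} \sim {\cal N}(\mu, \sigma^2_b + \sigma^2_w), \qquad \operatorname{Cov}(y_{ij}, y_{ij'}) = \sigma^2_b \quad \text{for } j \neq j'. \] The sum \(\sigma^2_b+\sigma^2_w\) is the total variance appearing throughout the article.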
Throughout the article, we will use the following function to get the within sum-of-squares \(SS_w\) and the between sum-of-squares \(SS_b\):
```r
SOS <- function(y, group){
  require(dplyr)
  return(
    data.frame(y=y, group=group) %>%
      mutate(mean=mean(y)) %>%        # overall mean
      group_by(group) %>%
      mutate(groupmean=mean(y)) %>%   # group means
      ungroup %>%
      do(summarise(.,
                   SSw = crossprod(y - groupmean),          # within sum of squares
                   SSb = crossprod(groupmean - mean))) %>%  # between sum of squares
      as.list
  )
}
```
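As a quick sanity check of `SOS` on throwaway data (the toy values below are mine, not from the article), the two sums of squares add up to the total sum of squares:

```r
# toy data: two groups of three observations
y <- c(1, 2, 3, 7, 8, 9)
g <- rep(c("A", "B"), each = 3)
ss <- SOS(y, g)
c(SSw = ss$SSw, SSb = ss$SSb, total = sum((y - mean(y))^2))
##   SSw   SSb total
##     4    54    58
```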
Bayesian posterior distributions
All Bayesian methods we provide are based on the posterior distributions generated by this code. I'm using a `data.table` to store the simulations.

```r
ranova <- function(y, group, nsims=50000){
  require(data.table)
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group)) == length(y))){
    stop("balanced only!")
  }
  J <- length(y)/I
  SS <- SOS(y, group)  # sums of squares SSb, SSw
  SSb <- SS$SSb; SSw <- SS$SSw
  omean <- mean(y)  # overall mean
  k <- tot <- 0
  z <- rnorm(nsims)
  ETAb <- ETAw <- rep(NA, nsims)
  # sample (eta_b, eta_w) from the two inverse-Gamma distributions,
  # keeping only the draws with eta_b > eta_w
  while(k < nsims){
    tot <- tot + 1
    etab <- SSb/rchisq(1, I-1)
    etaw <- SSw/rchisq(1, I*(J-1))
    if(etab > etaw){
      k <- k+1
      ETAb[k] <- etab
      ETAw[k] <- etaw
    }
  }
  out <- data.table(mu = omean + sqrt(ETAb/I/J)*z,
                    sigma2w = ETAw,
                    sigma2b = (ETAb-ETAw)/J)
  reject <- 100*(tot-nsims)/tot
  cat("Rejection percentage: ", round(reject,1), "%")
  setattr(out, "reject", reject)
  return(out)
}
```
These posterior distributions can be found in Krishnamoorthy & Mathew's book. I believe they are derived in Box & Tiao's book.
As we can see in the body of the function, the algorithm first generates simulations of a couple \((\eta_w, \eta_b)\) whose joint distribution is given by two independent inverse-Gamma distributions, \({\cal IG}\left(\frac{I(J-1)}{2},\frac{SS_w}{2}\right)\) for \(\eta_w\) and \({\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right)\) for \(\eta_b\), conditioned on \(\eta_b > \eta_w\); it then sets \(\sigma^2_w=\eta_w\) and \(\sigma^2_b=\frac{\eta_b-\eta_w}{J}\). The `ranova` function also prints the percentage of rejected simulations (those for which `etab < etaw`) and returns it as an attribute of the `data.table` containing the simulations.

Note that \(\eta_b = J\tau^2\) where \(\tau^2=\sigma_b^2+\frac{\sigma_w^2}{J}\) is the variance of a group mean \(\bar y_{i\bullet}\). Given the sums of squares \(SS_w\) and \(SS_b\), the probability of rejection is, with abusive notation, \[\begin{align} R(SS_w,SS_b) & = \Pr\left(\frac{\chi^2_{I(J-1)}}{\chi^2_{I-1}} < \frac{SS_w}{SS_b}\right) \\ & = \Pr\left({\cal Beta}'\left(\frac{I(J-1)}{2}, \frac{I-1}{2}\right) < \frac{SS_w}{ SS_b} \right) \\ & = \Pr\left({\cal Beta}\left(\frac{I(J-1)}{2}, \frac{I-1}{2}\right) < \frac{SS_w}{SS_w + SS_b} \right). \end{align}\]
Moreover, as is well known (or as shown here), the sums of squares \(SS_w\) and \(SS_b\) are independent, with \(SS_w \sim \sigma^2_w\chi^2_{I(J-1)}\) and \(SS_b \sim J\tau^2\chi^2_{I-1}\).
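These two distributional facts are easy to check by simulation; here is a minimal sketch (the parameter values are arbitrary choices of mine):

```r
# check that SSb/(J*tau^2) behaves like a chisq(I-1) variable
set.seed(42)
I <- 3; J <- 6; mu <- 10; sigmab <- 2; sigmaw <- 1
tau2 <- sigmab^2 + sigmaw^2/J
ssb <- replicate(1e4, {
  y <- rnorm(I*J, mean = rep(rnorm(I, mu, sigmab), each = J), sd = sigmaw)
  g <- rep(seq_len(I), each = J)
  J * sum((tapply(y, g, mean) - mean(y))^2)  # between sum of squares
})
mean(ssb/(J*tau2))  # should be close to E[chisq(I-1)] = I-1 = 2
```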
Therefore, over repeated sampling, the rejection probability is \[ R = \Pr\left(\frac{{\cal Beta}'_1\left(\frac{I(J-1)}{2}, \frac{I-1}{2}\right)}{{\cal Beta}'_2\left(\frac{I(J-1)}{2}, \frac{I-1}{2}\right)} < \frac{\sigma^2_w}{J\tau^2} \right), \] the ratio involving two independent Beta prime variables. Let \(\rho:=\dfrac{\sigma^2_w}{\sigma^2_w+\sigma^2_b}\) be the ratio of the within variance to the total variance, and note that \(\dfrac{\sigma^2_w}{J\tau^2} = \dfrac{\rho}{\rho+J(1-\rho)}\), an increasing function of \(\rho\). There is no elementary cumulative distribution function for the ratio of two independent Beta prime distributions, but we can easily plot \(R\) as a function of \(\rho\) with the help of simulations. We use Javascript in order to include interactivity for the choice of \(I\) and \(J\), and the Javascript flot library to render the plot.
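If you cannot run the interactive plot, \(R\) is also easy to estimate in R itself; a minimal sketch (the function name `Rejection` is mine):

```r
# Monte Carlo estimate of the rejection probability R for given I, J, rho
Rejection <- function(I, J, rho, nsims = 1e5){
  # two independent Beta' draws, obtained as X/(1-X) with X ~ Beta
  B1 <- rbeta(nsims, I*(J-1)/2, (I-1)/2)
  B2 <- rbeta(nsims, I*(J-1)/2, (I-1)/2)
  mean((B1/(1-B1))/(B2/(1-B2)) < rho/(rho + J*(1-rho)))
}
Rejection(I = 3, J = 6, rho = 0.5)
```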
As an exercise, you can check why the probability of rejection is always \(50\%\) for \(\rho=1\).
Simulated data
Throughout the article, we will illustrate the methods on two datasets, `dat1` and `dat2`. I load the `magrittr` package to use the pipe operator `%>%`.

```r
library(magrittr)
set.seed(314)
I <- 3
J <- 6
mu <- 10
sigmab <- 2
sigmaw <- 1
dat1 <- data.frame(sapply(rnorm(I, mu, sigmab), function(mu) rnorm(J, mu, sigmaw))) %>%
  setNames(LETTERS[1:I]) %>% stack %>% setNames(c("y", "group"))
sigmab <- 1
dat2 <- data.frame(sapply(rnorm(I, mu, sigmab), function(mu) rnorm(J, mu, sigmaw))) %>%
  setNames(LETTERS[1:I]) %>% stack %>% setNames(c("y", "group"))
# stacked datasets
dat <- rbind(cbind(data="dat1", dat1), cbind(data="dat2", dat2))
```
The between-variance \(\sigma_b^2\) is the only difference between the parameters used to generate our two datasets (\(4\) vs \(1\)). We can already store the posterior simulations of the Bayesian method:
```r
sims1 <- with(dat1, ranova(y, group, nsims=1e6))
## Rejection percentage:  0 %
sims2 <- with(dat2, ranova(y, group, nsims=1e6))
## Rejection percentage:  19.2 %
# stacked simulations in a data.table
sims <- rbind(sims1[, "data":="data1"], sims2[, "data":="data2"])
setkey(sims, data)
```
Note that the theoretical rejection percentages can be calculated with our formula for \(R(SS_w,SS_b)\) above:
```r
by(dat, dat$data, FUN=function(data){
  with(data,
       with(SOS(y, group),
            round(100*pbeta(SSw/(SSw+SSb), I*(J-1)/2, (I-1)/2), 1)))
})
## dat$data: dat1
## [1] 0
## --------------------------------------------------------
## dat$data: dat2
## [1] 19.1
```
Because there is no rejection for the first dataset, the posterior distribution of \(\eta_b\) is \({\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right)\), as confirmed by the density plot below (see this stats.stackexchange discussion if you wonder why I use the `logspline` package):

```r
with(dat1, {
  ssb <- J*crossprod(aggregate(y~group, FUN=mean)$y - mean(y))  # SSb
  # density of 1/eta_b when eta_b ~ IG((I-1)/2, SSb/2), i.e. Gamma((I-1)/2, rate SSb/2)
  curve(dgamma(x, (I-1)/2, ssb/2), to=0.4, lwd=2, xlab=NA, ylab=NA, axes=FALSE)
})
axis(1)
library(logspline)
# posterior draws of 1/eta_b = 1/(J*sigma2b + sigma2w)
x <- 1/with(sims1[sample(nrow(sims1), 1e4)], J*sigma2b + sigma2w)
plot(logspline(x), add = TRUE, col = "red", lwd = 2)
```
But this is not the case for the second dataset:
```r
with(dat2, {
  ssb <- J*crossprod(aggregate(y~group, FUN=mean)$y - mean(y))
  curve(dgamma(x, (I-1)/2, ssb/2), to=1, ylim=c(0,2.6), lwd=2,
        xlab=NA, ylab=NA, axes=FALSE)
})
axis(1)
lines(density(1/with(sims2, J*sigma2b + sigma2w)), col="red", lwd=2)
```
Confidence interval about the overall mean
Frequentist method
There's a straightforward way to get a confidence interval about the overall mean \(\mu\): just take the group means and compute the usual Gaussian confidence interval about their mean:
```r
by(dat, dat$data, function(data){
  aggregate(y~group, data=data, FUN=mean) %>% lm(y~1, .) %>% confint
})
## dat$data: dat1
##                2.5 %   97.5 %
## (Intercept) 4.304683 13.03132
## --------------------------------------------------------
## dat$data: dat2
##                2.5 %   97.5 %
## (Intercept) 8.419673 11.37769
```
Bayesian method
Throughout the article we always use equi-tailed credible intervals.
```r
quantile(sims["data1"]$mu, c(2.5, 97.5)/100)
##     2.5%    97.5%
##  4.29467 13.03949
quantile(sims["data2"]$mu, c(2.5, 97.5)/100)
##     2.5%    97.5%
## 8.242078 11.563506
```
For the first dataset, the Bayesian interval is close to the frequentist interval. We can see why: when \(R(SS_w,SS_b)=0\), the posterior distribution of \(\eta_b\) is \({\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right)\), and then the posterior distribution of \(\mu\) is a non-standardized Student distribution with \(I-1\) degrees of freedom, location parameter \(\bar y_{\bullet\bullet}\), and scale \(\sqrt{\frac{SS_b}{IJ(I-1)}}\). Therefore the Bayesian interval exactly coincides with the frequentist interval when \(R(SS_w,SS_b)=0\), and this is the case for the first dataset. Otherwise, when \(R(SS_w,SS_b)>0\), the Bayesian interval is wider.
Yes, probabilists, I know \(R(SS_w,SS_b)\) is never exactly \(0\)!
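A quick numerical check of this coincidence on the first dataset (a minimal sketch; `I` and `J` hold the values set above):

```r
# non-standardized Student interval: location = overall mean,
# scale = sqrt(SSb/(I*J*(I-1))), I-1 degrees of freedom
with(dat1, {
  ssb <- c(SOS(y, group)$SSb)  # between sum of squares, as a plain number
  mean(y) + qt(c(.025, .975), I-1) * sqrt(ssb/(I*J*(I-1)))
})
# agrees with the frequentist interval: 4.304683 13.03132
```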
Confidence interval about variance components
Frequentist method
The \(\chi^2\) distribution of \(SS_w\) makes it easy to derive a confidence interval about the within variance \(\sigma^2_w\); incidentally, it is the same interval as in the fixed effects model.
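For instance, here is a minimal sketch of this interval (the helper name `ci_sigma2w` is mine; it matches the `within` row of the `confintVC` output below):

```r
# 95% CI for sigma2w, from SSw ~ sigma2w * chisq(I*(J-1))
ci_sigma2w <- function(SSw, I, J, alpha = 5/100){
  DFw <- I*(J-1)
  c(estimate = SSw/DFw,
    lwr = SSw/qchisq(1 - alpha/2, DFw),
    upr = SSw/qchisq(alpha/2, DFw))
}
with(dat1, ci_sigma2w(c(SOS(y, group)$SSw), I, J))
```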
Things are more complicated for the between variance \(\sigma^2_b\) and the total variance \(\sigma^2_w+\sigma^2_b\). The function `confintVC` below provides confidence intervals following the method given by Burdick & al, based on Graybill & Wang's confidence intervals for linear combinations of \(\chi^2\) distributions.

```r
confintVC <- function(y, group, alpha=5/100){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group)) == length(y))){
    stop("balanced only!")
  }
  J <- length(y)/I
  DFb <- I-1      # between degrees of freedom
  DFw <- I*(J-1)  # within degrees of freedom
  SS <- SOS(y, group)  # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  MSSb <- SSb/DFb; MSSw <- SSw/DFw  # mean sums of squares
  # Confidence intervals
  a <- alpha/2
  ## Within variance confidence interval
  sigma2w <- MSSw  # estimate
  withinLCB <- sigma2w/qf(1-a, DFw, Inf)  # within lwr
  withinUCB <- sigma2w/qf(a, DFw, Inf)    # within upr
  ## Between variance confidence interval
  sigma2b <- (MSSb-MSSw)/J  # estimate
  G1 <- 1-(1/qf(1-a, DFb, Inf))
  G2 <- 1-(1/qf(1-a, DFw, Inf))
  H1 <- (1/qf(a, DFb, Inf))-1
  H2 <- (1/qf(a, DFw, Inf))-1
  G12 <- ((qf(1-a, DFb, DFw)-1)^2 - (G1^2*qf(1-a, DFb, DFw)^2) - (H2^2))/qf(1-a, DFb, DFw)
  H12 <- ((1-qf(a, DFb, DFw))^2 - H1^2*qf(a, DFb, DFw)^2 - G2^2)/qf(a, DFb, DFw)
  Vu <- H1^2*MSSb^2 + G2^2*MSSw^2 + H12*MSSb*MSSw
  Vl <- G1^2*MSSb^2 + H2^2*MSSw^2 + G12*MSSw*MSSb
  betweenLCB <- (MSSb-MSSw-sqrt(Vl))/J  # between lwr
  betweenUCB <- (MSSb-MSSw+sqrt(Vu))/J  # between upr
  ## Total variance confidence interval
  sigma2tot <- sigma2w + sigma2b  # estimate
  totalLCB <- sigma2tot - (sqrt(G1^2*MSSb^2 + G2^2*(J-1)^2*MSSw^2)/J)  # total lwr
  totalUCB <- sigma2tot + (sqrt(H1^2*MSSb^2 + H2^2*(J-1)^2*MSSw^2)/J)  # total upr
  # Output
  out <- data.frame(
    estimate = c(sigma2w, sigma2b, sigma2tot),
    lwr = c(withinLCB, betweenLCB, totalLCB),
    upr = c(withinUCB, betweenUCB, totalUCB))
  rownames(out) <- c("within", "between", "total")
  return(out)
}
```
```r
with(dat1, confintVC(y, group))
##          estimate       lwr        upr
## within  0.9678172 0.5281232   2.318259
## between 2.9238944 0.6600113 121.680179
## total   3.8917116 1.6132104 122.670629
with(dat2, confintVC(y, group))
##          estimate        lwr       upr
## within  1.1487720  0.6268675  2.751709
## between 0.1630157 -0.2117912 13.789360
## total   1.3117877  0.8059042 15.023651
```
The same estimates are provided by `lmer(y ~ (1|group))` with the `lme4` package for mixed models; a quick check is sketched below.
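A minimal check with `lme4` (assuming it is installed); for balanced data with a non-negative between estimate, the REML variance components should coincide with the estimates above:

```r
library(lme4)
fit <- lmer(y ~ (1 | group), data = dat1)
# "group" row = between variance, "Residual" row = within variance
as.data.frame(VarCorr(fit))[, c("grp", "vcov")]
```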
Bayesian method
First, let us construct the simulations of the total variance:
sims[, "sigma2tot":=sigma2w+sigma2b]
Let's see the estimates before looking at the credible intervals:
```r
sapply(sims["data1", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) c(mean=mean(x), median=median(x)))
##         sigma2w   sigma2b sigma2tot
## mean   1.116632 40.565856 41.682488
## median 1.012645  4.268724  5.408561
sapply(sims["data2", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) c(mean=mean(x), median=median(x)))
##         sigma2w   sigma2b sigma2tot
## mean   1.248280 6.1369897  7.385270
## median 1.144814 0.4706216  1.794074
```
Clearly, the posterior distributions of the variance components have heavy right tails, which inflate the means: the median is a better point estimate than the mean. Now let's have a look at the credible intervals.
For the first dataset, the credible interval about the within variance is the same as the frequentist confidence interval, as expected, and it is interesting to see that the credible intervals are close to the frequentist confidence intervals for the other variances:
```r
sapply(sims["data1", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) quantile(x, c(2.5, 97.5)/100))
##         sigma2w      sigma2b  sigma2tot
## 2.5%  0.5282047   0.6471158   1.636684
## 97.5% 2.3164339 121.5854008 122.691925
```
For the second dataset, we do not expect results close to the frequentist ones. We get a shorter interval for the within variance, and wider intervals for the other variances:
```r
sapply(sims["data2", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) quantile(x, c(2.5, 97.5)/100))
##         sigma2w     sigma2b sigma2tot
## 2.5%  0.6109852  0.01482579  0.810674
## 97.5% 2.4954109 17.13539103 18.486391
```
Confidence interval about coefficient of total variation
Frequentist method
A confidence interval about the coefficient of total variation \(\dfrac{\sqrt{\sigma^2_w+\sigma^2_b}}{\mu}\) is provided by the function `ranovaCV` below. The method is based on the generalized \(p\)-value approach, using a generalized pivotal quantity. The procedure performed by this function is an original one: I did it myself, though I don't really know how: I naively followed a procedure given in Krishnamoorthy & Mathew's book for another parameter of interest.

```r
ranovaCV <- function(y, group, level=0.95, nsims=100000){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group)) == length(y))){
    stop("balanced only!")
  }
  J <- length(y)/I
  SS <- SOS(y, group)  # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  Z <- rnorm(nsims)
  Ub <- rchisq(nsims, I-1)
  Ue <- rchisq(nsims, I*(J-1))  # ! J*(I-1) in K & M's book is an error
  mu.gpq <- mean(y) - Z/sqrt(Ub)*sqrt(SSb/I/J)           # GPQ for mu
  sigmatot.gpq <- 1/sqrt(J)*sqrt(SSb/Ub + (J-1)*SSw/Ue)  # GPQ for sqrt(sigma2tot)
  G <- sigmatot.gpq/mu.gpq
  low <- as.numeric(quantile(G, (1-level)/2))       # lower bound
  up <- as.numeric(quantile(G, level+(1-level)/2))  # upper bound
  return(list(lwr=low, upr=up))
}
```
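Explicitly, the generalized pivotal quantity simulated in the body of the function is \[ G = \dfrac{\sqrt{\dfrac{1}{J}\left(\dfrac{SS_b}{U_b} + (J-1)\dfrac{SS_w}{U_e}\right)}}{\bar y_{\bullet\bullet} - \dfrac{Z}{\sqrt{U_b}}\sqrt{\dfrac{SS_b}{IJ}}}, \qquad Z \sim {\cal N}(0,1), \quad U_b \sim \chi^2_{I-1}, \quad U_e \sim \chi^2_{I(J-1)}, \] with \(Z\), \(U_b\) and \(U_e\) independent; the confidence bounds are the quantiles of the simulated values of \(G\).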
```r
with(dat1, ranovaCV(y, group))  # true value: sqrt(5)/10 ≈ 0.224
## $lwr
## [1] 0.1403256
##
## $upr
## [1] 1.488213
with(dat2, ranovaCV(y, group))  # true value: sqrt(2)/10 ≈ 0.1414
## $lwr
## [1] 0.08956391
##
## $upr
## [1] 0.3931898
```
Bayesian method
As for any other parameter of interest, one first constructs the simulations of the coefficient of total variation:
sims[, "CV":=sqrt(sigma2tot)/mu]
and then one takes the quantiles:
```r
sims[, list(lwr=quantile(CV, 2.5/100), upr=quantile(CV, 97.5/100)), by="data"]
##      data        lwr       upr
## 1: data1 0.14091359 1.4463289
## 2: data2 0.09033098 0.4414735
```
Prediction interval
Frequentist method
This method is given in Lin & Liao. Denote by \(y^{\text{new}}\) a new observation. Note that \[ y^{\text{new}} - \bar{y}_{\bullet\bullet} \sim {\cal N}\left(0, \sigma^2_{\text{tot}} + \frac{\tau^2}{I}\right),\] and the variance of \(y^{\text{new}} - \bar{y}_{\bullet\bullet}\) can be written \[ \left(1+\frac{1}{I}\right)\tau^2 + \left(1-\frac{1}{J}\right)\sigma^2_w. \] Then, knowing that \(SS_w \sim \sigma^2_w\chi^2_{I(J-1)}\) and \(SS_b \sim J\tau^2\chi^2_{I-1}\), an unbiased estimate of the variance of \(y^{\text{new}} - \bar{y}_{\bullet\bullet}\) is \[ \left(1+\frac{1}{I}\right)\frac{SS_b}{J(I-1)} + \left(1-\frac{1}{J}\right)\frac{SS_w}{I(J-1)} = \frac{I+1}{I(I-1)}\frac{SS_b}{J} + \frac{SS_w}{IJ}. \] This estimate does not follow a Chi-square distribution, but we approximate its distribution by a scaled Chi-square whose degrees of freedom \(\nu\) are given by the Satterthwaite formula \[ \nu = \frac{(a\,SS_b + b\,SS_w)^2}{\dfrac{(a\,SS_b)^2}{I-1} + \dfrac{(b\,SS_w)^2}{I(J-1)}}, \qquad a = \frac{I+1}{I(I-1)J}, \quad b = \frac{1}{IJ}. \] In addition, since we know (or as shown here) that \(SS_b\), \(SS_w\) and \(y^{\text{new}} - \bar{y}_{\bullet\bullet}\) are independent, \[ \dfrac{y^{\text{new}} - \bar{y}_{\bullet\bullet}}{\sqrt{\dfrac{I+1}{I(I-1)}\dfrac{SS_b}{J} + \dfrac{SS_w}{IJ}}} \approx \mathrm{t}_\nu, \] from which we get a prediction interval about \(y^{\text{new}}\).

```r
ranovapred <- function(y, group, conf=0.95){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group)) == length(y))){
    stop("balanced only!")
  }
  J <- length(y)/I
  SS <- SOS(y, group)  # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  a <- (1/J*(1+1/I))/(I-1)
  b <- 1/I/J
  v <- a*SSb + b*SSw  # estimates the variance of (Ynew - Ybar)
  nu <- v^2/((a*SSb)^2/(I-1) + (b*SSw)^2/I/(J-1))  # Satterthwaite degrees of freedom
  alpha <- 1-conf
  bounds <- mean(y) + c(-1,1)*sqrt(v)*qt(1-alpha/2, nu)
  return(list(bounds=bounds, std.error=sqrt(v)))
}
```
```r
by(dat, dat$data, function(data){
  with(data, ranovapred(y, group)$bounds)
})
## dat$data: dat1
## [1]  1.388652 15.947355
## --------------------------------------------------------
## dat$data: dat2
## [1]  7.289183 12.508177
```
Bayesian method
The Bayesian prediction interval is obtained by taking the quantiles of the posterior predictive distribution. One first simulates this distribution:
```r
sims[, ynew := rnorm(nrow(sims), mu, sqrt(sigma2tot))]
```
One takes the quantiles:
```r
sims[, list(lwr=quantile(ynew, 2.5/100), upr=quantile(ynew, 97.5/100)), by="data"]
##      data        lwr      upr
## 1: data1 -0.2360107 17.55792
## 2: data2  6.0875652 13.72911
```
For both datasets, the Bayesian prediction interval is wider than the frequentist one.
One-sided tolerance interval
We provide methods to get lower and upper tolerance limits. Recall that a one-sided tolerance limit is nothing but a lower/upper confidence bound about a quantile; as an example, we take a lower confidence bound about the \(10\%\) quantile and an upper confidence bound about the \(90\%\) quantile. The theoretical values of these quantiles are:
```r
# first dataset
qnorm(c(10, 90)/100, 10, sqrt(5))
## [1]  7.134364 12.865636
# second dataset
qnorm(c(10, 90)/100, 10, sqrt(2))
## [1]  8.187612 11.812388
```
Frequentist method
The `ranovatol` function provides Krishnamoorthy & Mathew's tolerance limits, as given in their book.

```r
ranovatol <- function(y, group, p=0.9, conf=0.95, nsims=50000){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group)) == length(y))){
    stop("balanced only!")
  }
  J <- length(y)/I
  SS <- SOS(y, group)  # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  Z <- rnorm(nsims)
  Ub <- rchisq(nsims, I-1)
  Ue <- rchisq(nsims, I*(J-1))  # ! J*(I-1) in K & M's book is an error
  # lower tolerance limit
  G <- mean(y) - Z/sqrt(Ub)*sqrt(SSb/I/J) - qnorm(p)/sqrt(J)*sqrt(SSb/Ub + (J-1)*SSw/Ue)
  low <- as.numeric(quantile(G, 1-conf))
  # upper tolerance limit
  G <- mean(y) + Z/sqrt(Ub)*sqrt(SSb/I/J) + qnorm(p)/sqrt(J)*sqrt(SSb/Ub + (J-1)*SSw/Ue)
  up <- as.numeric(quantile(G, conf))
  return(c(lower=low, upper=up))
}
```
```r
by(dat, dat$data, function(data){
  with(data, ranovatol(y, group))
})
## dat$data: dat1
##     lower     upper
## -2.355055 19.691062
## --------------------------------------------------------
## dat$data: dat2
##     lower     upper
##  5.870706 13.926654
```
Bayesian method
We proceed as for any parameter of interest. First:
sims[, c("tol_lower","tol_upper"):=list(qnorm(.1, mu, sqrt(sigma2tot)), qnorm(.9, mu, sqrt(sigma2tot)))]
then:
```r
sims[, list(lower=quantile(tol_lower, 5/100), upper=quantile(tol_upper, 95/100)), by="data"]
##      data     lower    upper
## 1: data1 -2.242143 19.58247
## 2: data2  5.534809 14.28729
```
What about the unbalanced situation?
I have never tried it, but replacing the sums of squares \(SS_w\) and \(SS_b\) everywhere with the unweighted sums of squares should give good results in the unbalanced case. This is known to work well for variance components (see Burdick & al); a sketch follows.
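To make this suggestion concrete, here is a hedged sketch of such unweighted sums of squares; the function name `uSOS` and the use of the harmonic mean of the group sizes in place of \(J\) are my assumptions (untested, in the spirit of the unweighted-means analysis in Burdick & al), and the sketch reduces to `SOS` in the balanced case:

```r
# unweighted sums of squares for unbalanced data -- an untested sketch
uSOS <- function(y, group){
  group <- factor(group)
  ni <- as.numeric(table(group))    # group sizes
  gmeans <- tapply(y, group, mean)  # unweighted group means
  Jh <- length(ni)/sum(1/ni)        # harmonic mean of group sizes, replacing J
  list(SSw = sum((y - gmeans[group])^2),
       SSb = Jh * sum((gmeans - mean(gmeans))^2),
       Jh  = Jh)
}
```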