In the preceding diary entry, I showed how to generate a confidence interval for the unbiased sample variance, and finally derived formulas that reduce all computations to an O(n) affair. However, it was somewhat of a head-through-the-wall exercise, so here I'll try to write the whole thing down in a clear and straightforward way.

To get a confidence interval for the unbiased sample variance, we need to estimate the variance of the sample variance. People have figured out formulas for the variance of the sample variance, for instance here. The correct formula is: $$ \mathbb{V}s^2 = \frac{\mu_4}{n} - \frac{(n-3)\sigma^4}{n(n-1)} $$ where $\mu_4 = \mathbb{E}(X - m)^4$ is the fourth central moment, a.k.a. the (centralized but non-standardized) kurtosis, and $\sigma^4$ is the square of the variance of $X$. (By the way, there is an older reference for this, namely Cramér: Mathematical Methods of Statistics.) So, what is the takeaway message from that formula? Well, on a large sample it is $(\mu_4 - \sigma^4)/n$, which is the surviving leading term if it is non-zero. That term can indeed vanish: for the fair coin toss, both $\mu_4$ and $\sigma^4$ are $1/16$. In that case, the leading term disappears and $\mathbb{V}s^2$ is of smaller order. Let us expand both terms in raw moments because that will later allow us to write more efficient code (and personally, I like them more). Using $EX = m$, $$ \mu_4 = E(X-m)^4 = EX^4 - 4mEX^3 + 6m^2EX^2 - 4m^3EX + m^4 = EX^4 - 4mEX^3 + 6m^2 EX^2 - 3m^4 $$ and $$ \sigma^4 = (EX^2 - m^2)^2 = (EX^2)^2 - 2m^2EX^2 + m^4. $$ To gain insight into the leading term, we could directly subtract one from the other, but that does not immediately lead to terms cancelling out.
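Before going on, here is a quick way to convince oneself of that variance formula (this check is my own addition, not part of the derivation): for a Bernoulli sample we can enumerate all $2^n$ outcomes and compute $\mathbb{V}s^2$ exactly, then compare against $\mu_4/n - (n-3)\sigma^4/(n(n-1))$. The function names are made up for this sketch.

```python
# Exact check of V(s^2) = mu4/n - (n-3)*sigma^4/(n*(n-1)) by brute-force
# enumeration of all Bernoulli(p) samples of size n.
from itertools import product
from statistics import variance  # unbiased sample variance (divides by n-1)

def exact_var_of_s2(p, n):
    """Enumerate all 2^n outcomes and return the exact variance of s^2."""
    e_s2 = e_s2_sq = 0.0
    for xs in product((0, 1), repeat=n):
        prob = p ** sum(xs) * (1 - p) ** (n - sum(xs))
        s2 = variance(xs)
        e_s2 += prob * s2
        e_s2_sq += prob * s2 ** 2
    return e_s2_sq - e_s2 ** 2

p, n = 0.3, 5
sigma2 = p * (1 - p)
mu4 = p * (1 - p) ** 4 + (1 - p) * p ** 4
formula = mu4 / n - (n - 3) * sigma2 ** 2 / (n * (n - 1))
print(exact_var_of_s2(p, n), formula)  # the two numbers agree
```

For the fair coin with $n=4$, both sides come out as $1/96$, which also illustrates the vanishing leading term: there, $\mu_4 = \sigma^4$, so everything left is of order $1/n^2$.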

Anyway! What interests us is the question: What are the best estimators for these two ingredients, $\mu_4$ and $\sigma^4$?

To make a long story short, let me just give you the correct formulas. Ultimately, they come from the theory of symmetric polynomials. Denote by $\bar{x}$ the sample mean, and by $$ y_j := \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^j $$ the usual version of the $j$-th empirical central sample moment. Caution: These are not unbiased estimators! The reason is that $x_i$ occurs in the sample mean, so there are cross-terms whose expectations introduce a bias.
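To see the bias concretely, here is a small sketch (my own illustration, with helper names of my choosing): by exact enumeration over Bernoulli samples, $\mathbb{E}y_2 = \frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.

```python
# Exact illustration of the bias: for y_2 (the n-denominator central moment),
# E[y_2] = (n-1)/n * sigma^2, not sigma^2.
from itertools import product

def y_j(xs, j):
    """The j-th empirical central moment with denominator n."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** j for x in xs) / len(xs)

def expect(f, p, n):
    """Exact expectation of f(sample) over all Bernoulli(p) samples of size n."""
    return sum(p ** sum(xs) * (1 - p) ** (n - sum(xs)) * f(xs)
               for xs in product((0, 1), repeat=n))

p, n = 0.3, 5
sigma2 = p * (1 - p)
print(expect(lambda xs: y_j(xs, 2), p, n))  # equals (n-1)/n * sigma^2
print((n - 1) / n * sigma2)
```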

*But* it turns out that the expression $$ \biggl(\frac{3}{2(n-1)} + \frac{6}{n-2} - \frac{27}{2(n-3)}\biggr)y_2^2 + \biggl(1+\frac{1}{n-1}- \frac{6}{n-2}+\frac{9}{n-3}\biggr) y_4 $$ is indeed an unbiased estimator of $\mu_4 = \mathbb{E}(X - m)^4$ for all $n\ge 4$. As $n$ grows large, it tends to $y_4$, which fits our intuitive understanding that $y_4$ estimates $\mu_4$.

As for the other ingredient $\sigma^4$, it turns out that $$ \biggl(1 + \frac{1}{2(n-1)} + \frac{5}{2(n-2)} + \frac{9}{2(n-2)(n-3)}\biggr)y_2^2 - \biggl(\frac{1}{n-2} + \frac{3}{(n-2)(n-3)}\biggr) y_4 $$ is indeed an unbiased estimator of our $\sigma^4$, the square of the population variance.
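As a sanity check (my own addition; the entry itself only states the result), both displayed combinations can be verified by exact enumeration over all Bernoulli samples of size $n = 5$:

```python
# Exact unbiasedness check for the two stated estimators of mu4 and sigma^4,
# by enumerating all Bernoulli(p) samples of size n.
from itertools import product

def y_j(xs, j):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** j for x in xs) / len(xs)

def mu4_hat(xs):
    n = len(xs)
    a = 3 / (2 * (n - 1)) + 6 / (n - 2) - 27 / (2 * (n - 3))
    b = 1 + 1 / (n - 1) - 6 / (n - 2) + 9 / (n - 3)
    return a * y_j(xs, 2) ** 2 + b * y_j(xs, 4)

def sigma4_hat(xs):
    n = len(xs)
    a = 1 + 1 / (2 * (n - 1)) + 5 / (2 * (n - 2)) + 9 / (2 * (n - 2) * (n - 3))
    b = 1 / (n - 2) + 3 / ((n - 2) * (n - 3))
    return a * y_j(xs, 2) ** 2 - b * y_j(xs, 4)

def expect(f, p, n):
    """Exact expectation over all 2^n Bernoulli(p) samples."""
    return sum(p ** sum(xs) * (1 - p) ** (n - sum(xs)) * f(xs)
               for xs in product((0, 1), repeat=n))

p, n = 0.3, 5
sigma2 = p * (1 - p)
mu4 = p * (1 - p) ** 4 + (1 - p) * p ** 4
print(expect(mu4_hat, p, n), mu4)            # agree exactly (up to rounding)
print(expect(sigma4_hat, p, n), sigma2 ** 2)
```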

So, we can plug these two into the formula for $\mathbb{V}s^2$ and arrive at an unbiased estimator for that quantity. Let's do that! The result is $$ \frac{2}{n - 2} + \frac{3}{2(n - 1)} + \frac{1}{(n - 1)^2} - \frac{9}{2(n - 3)} $$ as the coefficient of $y_2^2$, and $$ \frac{3}{n - 3} - \frac{2}{n - 2} $$ as the coefficient of $y_4$. There we are! The resulting linear combination is an unbiased estimator of the variance of the unbiased sample variance.
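Since these coefficient manipulations are easy to get wrong, here is a sketch (my own) that re-does the plug-in with exact rational arithmetic and confirms the two stated coefficients for a range of $n$:

```python
# Exact-arithmetic check: combining the two unbiased estimators via
# V(s^2) = mu4/n - (n-3)*sigma^4/(n*(n-1)) yields the stated coefficients.
from fractions import Fraction as F

for n in range(4, 30):
    n = F(n)
    # coefficients (of y_2^2 and y_4) of the unbiased mu_4 estimator
    a_mu4 = F(3, 2) / (n - 1) + 6 / (n - 2) - F(27, 2) / (n - 3)
    b_mu4 = 1 + 1 / (n - 1) - 6 / (n - 2) + 9 / (n - 3)
    # coefficients of the unbiased sigma^4 estimator
    a_s4 = 1 + F(1, 2) / (n - 1) + F(5, 2) / (n - 2) + F(9, 2) / ((n - 2) * (n - 3))
    b_s4 = -(1 / (n - 2) + 3 / ((n - 2) * (n - 3)))
    # plug into V(s^2) = mu4/n - (n-3)*sigma^4/(n*(n-1))
    a = a_mu4 / n - (n - 3) * a_s4 / (n * (n - 1))
    b = b_mu4 / n - (n - 3) * b_s4 / (n * (n - 1))
    assert a == 2 / (n - 2) + F(3, 2) / (n - 1) + 1 / (n - 1) ** 2 - F(9, 2) / (n - 3)
    assert b == 3 / (n - 3) - 2 / (n - 2)
print("coefficients verified exactly for n = 4..29")
```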

One last bit of polishing streamlines the direct implementation in R: Instead of using the biased empirical second moment $y_2$, we might as well use its unbiased sister, the unbiased sample variance $s^2 = ny_2/(n-1)$, and plug $y_2 = (n-1)s^2/n$ into our formula, i.e., multiply the $y_2^2$ coefficient by $(n-1)^2/n^2$: $$ \biggl( \frac{2}{n - 2} + \frac{3}{2(n - 1)} + \frac{1}{(n - 1)^2} - \frac{9}{2(n - 3)}\biggr) \cdot \frac{(n-1)^2}{n^2} = \frac{1}{2(n-2)} + \frac{1}{2n} - \frac{2}{n-3} $$ The coefficient of $y_4$ remains as it is; there is not much to be done about $y_4$. That's it! Putting both terms together, we arrive at $$ \biggl(\frac{1}{2(n-2)} + \frac{1}{2n} - \frac{2}{n-3}\biggr)(s^2)^2 + \biggl(\frac{3}{n-3} - \frac{2}{n-2}\biggr)y_4 $$ as the most straightforward formula for the unbiased estimator of the variance of the unbiased sample variance. This is actually the formula that is used in the R-package. (The R-package contains a file with a few lines that empirically check this formula.) To round it off, here's a quick plausibility check: For large $n$, the estimation target is $(\mu_4 - \sigma^4)/n$, as noted above. The expression $(s^2)^2$ is asymptotically equal to $y_2^2$, which asymptotically estimates $\sigma^4$; its coefficient is $-1/n$ plus higher-order terms. On the other hand, $y_4$ asymptotically estimates $\mu_4$, and its coefficient is $1/n$ plus higher-order terms. So their sum does indeed estimate $(\mu_4 - \sigma^4)/n$ asymptotically. Yikes!
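The final formula is straightforward to implement. Here is a minimal Python version (a sketch of my own, not the actual code of the R-package), together with an exact unbiasedness check by enumeration:

```python
# Unbiased estimator of V(s^2) per the final formula, plus an exact check:
# its expectation over all Bernoulli(p) samples of size n must equal
# mu4/n - (n-3)*sigma^4/(n*(n-1)).
from itertools import product
from statistics import variance

def var_of_s2_hat(xs):
    """Unbiased estimator of the variance of the unbiased sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = variance(xs)                             # unbiased sample variance
    y4 = sum((x - xbar) ** 4 for x in xs) / n     # biased 4th central moment
    return ((1 / (2 * (n - 2)) + 1 / (2 * n) - 2 / (n - 3)) * s2 ** 2
            + (3 / (n - 3) - 2 / (n - 2)) * y4)

p, n = 0.3, 5
exact = sum(p ** sum(xs) * (1 - p) ** (n - sum(xs)) * var_of_s2_hat(xs)
            for xs in product((0, 1), repeat=n))
sigma2 = p * (1 - p)
mu4 = p * (1 - p) ** 4 + (1 - p) * p ** 4
target = mu4 / n - (n - 3) * sigma2 ** 2 / (n * (n - 1))
print(exact, target)  # equal up to rounding
```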

As described in the previous diary entry, this gives a confidence interval for the population variance, by surrounding the usual sample variance with the magic number (the t-distribution fractile) times the square root of the variance estimate discussed in this entry. There is much more to say about all this; it is an exciting topic!
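As a rough sketch of that recipe (my own, using the stdlib normal quantile as a stand-in for the t fractile, so the function names and the quantile choice are assumptions, not the R-package's code):

```python
# Sketch of the confidence interval: s^2 +/- quantile * sqrt(estimated V(s^2)).
from math import sqrt
from statistics import NormalDist, variance

def variance_ci(xs, level=0.95):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = variance(xs)
    y4 = sum((x - xbar) ** 4 for x in xs) / n
    # unbiased estimate of V(s^2), per the formula in this entry
    v_hat = ((1 / (2 * (n - 2)) + 1 / (2 * n) - 2 / (n - 3)) * s2 ** 2
             + (3 / (n - 3) - 2 / (n - 2)) * y4)
    q = NormalDist().inv_cdf(0.5 + level / 2)  # normal stand-in for t fractile
    half = q * sqrt(max(v_hat, 0.0))           # guard: estimate can dip below 0
    return s2 - half, s2 + half

lo, hi = variance_ci([2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.5])
print(lo, hi)  # interval centered at the sample variance
```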