Simple explanation of meta-analysis; below is a copy of my attempt to explain basic meta-analysis on the DNB ML. I thought I might reuse it elsewhere, and I'd like to know whether it really is a good explanation or needs fixing.
---
Hm, I don't really know of any such explanation; there's Wikipedia, of course: http://en.wikipedia.org/wiki/Meta-analysis
Meta-analyses usually presume you know what an 'effect size' is. An effect size is different from a p-value: roughly, a p-value says whether there is a difference between the control and experimental groups at all, while an effect size says how big the difference is.
Each study gives you an effect size, based on the averages and standard deviation (how variable or jumpy the data is). What do you do with 10 effect sizes? How do you combine or add or aggregate them? That's where meta-analysis comes in.
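To make 'effect size' concrete, here is a toy R sketch (all numbers made up) computing one common effect size, Cohen's _d_: the difference between the group averages, measured in units of the pooled standard deviation:

    # hypothetical raw scores from one small study (all numbers made up)
    control    <- c(100, 105, 98, 102, 101, 99, 103, 97, 104, 100)
    experiment <- c(104, 108, 101, 106, 103, 102, 107, 100, 109, 105)
    # Cohen's d: mean difference / pooled standard deviation (equal group sizes)
    pooled.sd <- sqrt((var(control) + var(experiment)) / 2)
    (mean(experiment) - mean(control)) / pooled.sd  # ~1.3: a large effect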
Well, you could just treat each study as a vote: if 6 of the effect sizes are positive and 4 are negative, then declare victory - there's a positive effect. (Notice that voting only tells you the direction, not the size.)
But what if some of the effects are huge, like 0.9, and all the others are 0.1? If we just vote, we get 0.1 since that's the majority. But is 0.1 really the right answer here? Doesn't seem like it.
So instead of voting, let's average! We add up the 10 effect sizes and get something like +5; divide by 10 and get 0.5 as our estimate. Much more reasonable: 0.9 seems too high (those studies may be outliers), but 0.1 seems too low since we did get some 0.9s; splitting the difference is a sensible compromise.
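To see the two naive approaches in R (with made-up effect sizes matching the scenario above):

    # 5 studies find d=0.9 and 5 find d=0.1 (made-up numbers)
    d <- c(rep(0.9, 5), rep(0.1, 5))
    sum(d > 0) # vote count: 10 of 10 positive - but silent about size
    sum(d)     # +5
    mean(d)    # 0.5: the simple unweighted average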
But studies don't always have the same number of subjects, and as we all know, the more subjects or data you have, the better an estimate you have of the true value. A study with 10 students in it is worth much less than a study which used 10,000 students! A simple average ignores this truth.
So let's weight each effect size by how many subjects/datapoints went into it: the effect size from the study with 10 students counts for much less\* than the one from the study with 10,000 students. So now if the first 9 studies have ~10 datapoints each, and the 10th study has 1,000 datapoints, those 9 together count as, say, 1/10th\* of the last study, since they total ~100 datapoints to its 1,000.
So each effect size gets weighted by how many datapoints went into making it, and then they're essentially averaged together to give One Effect Size To Rule Them All.
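A sketch of that weighting in R (made-up numbers again; real meta-analyses weight by the inverse of each study's variance, which for these effect sizes grows roughly in proportion to sample size):

    # 9 small studies find d=0.9; 1 big study finds d=0.1 (made-up numbers)
    d <- c(rep(0.9, 9), 0.1)
    n <- c(rep(10, 9), 1000)
    mean(d)                 # 0.82: unweighted, the 9 small studies dominate
    weighted.mean(d, w = n) # ~0.17: weighted, the big study dominates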
Then you can start looking at other questions, like confidence intervals (this One Effect Size is not exactly right, of course, but how far away is it likely to be from the true effect size?), heterogeneity (are we comparing apples with apples, or did we include some oranges?), or biases (funnel plots and trim-and-fill: does it look like some studies are missing?).
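There are canned tools for all of this; a minimal sketch using the metafor R package (one standard choice), with hypothetical per-study effect sizes and sampling variances:

    library(metafor)
    d <- c(0.9, 0.8, 0.7, 0.2, 0.1, 0.15)       # hypothetical effect sizes
    v <- c(0.40, 0.35, 0.30, 0.02, 0.01, 0.015) # hypothetical sampling variances
    res <- rma(yi = d, vi = v) # random-effects model (the default)
    summary(res) # pooled estimate, confidence interval, heterogeneity (Q, I^2)
    funnel(res)  # funnel plot: asymmetry hints at missing studies
    trimfill(res) # trim-and-fill: impute the 'missing' studies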
In the case of the [DNB meta-analysis](http://www.gwern.net/DNB%20FAQ#meta-analysis), we can look at the One Effect Size over all studies which was something like 0.5. But some studies are high and some are low; is there any way to predict which are high and low? Is there some characteristic that might cause the effect sizes to be high or low? I suspected that there was: the methodological critique of active vs passive control groups. (I actually suspected this before the Melby meta-analysis came out, which did the same thing over a larger selection of WM-related studies.)
So I split the effect sizes into those from studies with active control groups and those with passive control groups, and do 2 smaller separate meta-analyses, one on each category. Did the 2 smaller meta-analyses spit out roughly the same answer as the full meta-analysis? No, they did not! They spat out quite different answers: studies with passive control groups found a large effect size, and studies with active control groups found a small one. This is good evidence that yes, the critique is right, since it's not likely that a random split of the studies would separate them so neatly.
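In metafor terms, this split is a subgroup or moderator analysis; a hypothetical sketch:

    library(metafor)
    d <- c(0.9, 0.8, 0.7, 0.2, 0.1, 0.15)       # made-up effect sizes
    v <- c(0.40, 0.35, 0.30, 0.02, 0.01, 0.015) # made-up variances
    control <- c("passive", "passive", "passive", "active", "active", "active")
    rma(yi = d, vi = v, subset = control == "passive") # large pooled estimate
    rma(yi = d, vi = v, subset = control == "active")  # small pooled estimate
    rma(yi = d, vi = v, mods = ~ control) # or test the moderator in one model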
And that's the meat of my meta-analysis. I hope this was helpful?
\* how much smaller? Well, that's where statistics comes in. It's not a simple linear sort of thing: 100 subjects is not 10x better than 10 subjects, but less than 10x better. Diminishing returns.
---
For the intended audience, I wonder if this might be too simplistic? I suggest compressing the first 8 paragraphs into 1 that presents the "naive" approach of just adding together all numerators and all denominators. Perhaps include a plausible scenario with made-up numbers for 2 or 3 studies, so you can give explicit results. (Or use your own meta-study as an example throughout.)
Then your value-added could be detailing the ways to be not naive. You could categorize them by practicality, degree of automation (e.g., outlier analysis vs. your control group comparison), and degree of judgement call (which candidate papers to include or exclude; treatment of file-drawer effect).
I wish the footnote had a link to a longer explanation of non-linear weighting; and more links in general would be helpful. (Michael O'Kelly, Aug 29, 2012)
+Michael O'Kelly I've found that by and large, there's no such thing as 'too simplistic'. By aiming as low as you can, you reach more of your audience and really test your own understanding. http://lesswrong.com/lw/kh/explainers_shoot_high_aim_low/ comes to mind.
Ways not to be naive: the problem is, I want to explain meta-analysis at its core, which is just ways to aggregate the effect-size numbers; things like moderator analyses or selection criteria are orthogonal to that.
Although I can give a better explanation of the weighting: for _n_ independent datapoints, the improvement is basically a square root. http://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_.28Bienaym.C3.A9_formula.29
The variance of the average of _n_ datapoints is (the standard deviation of the datapoints)^2/_n_; you want the true average, so this variance is the error - how far away from the true average your sample average is expected to be. To undo the ^2 and get back to the original units, we take a square root, so the error shrinks with the square root of _n_. Hence, _n_=1000 vs _n_=100 is sqrt(1000) vs sqrt(100), or ~31 vs 10 - about 3x better, not 10x better.
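A quick R simulation confirms the square-root shrinkage (rnorm draws from a standard normal, so the true standard deviation is 1):

    # the standard error of a sample mean shrinks as 1/sqrt(n)
    set.seed(1)
    sd(replicate(10000, mean(rnorm(100))))  # ~0.100 = 1/sqrt(100)
    sd(replicate(10000, mean(rnorm(1000)))) # ~0.032 = 1/sqrt(1000)
    # 10x the data buys only ~3.2x less error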
Different tests shrink at different rates depending on what question you ask. If we fire up R and look at t-tests, going from _n_=100 to _n_=1000 gets us a factor of ~6 improvement in 'power':
    > library(pwr)
    > pwr.t.test(d=0.1, n=100); pwr.t.test(d=0.1, n=1000)

         Two-sample t test power calculation

                  n = 100
          sig.level = 0.05
              power = 0.1083718

         Two-sample t test power calculation

                  n = 1000
          sig.level = 0.05
              power = 0.6083667
So with _n_=100, we have an 11% chance of reaching statistical significance, while _n_=1000 increases that ~6-fold to 61%. (Aug 31, 2012)