A primer for how not to respond when someone fails to replicate your work
with a discussion of why replication failures happen

In the linked post, John Bargh responds to a paper published in PLoS ONE that failed to replicate his finding that priming people with terms related to aging led them to walk more slowly to the elevator afterward. His post is a case study of what NOT to do when someone fails to replicate one of your findings.


Replication failures happen. In fact, they should happen some of the time even if the effect is real and the replication attempt was conducted exactly like the original study. For any effect, especially a small one, you would expect some failures to replicate. Failures to replicate could occur for all of the following reasons (and maybe others):

(1) chance -- the effect is real, but this particular test of it didn't find the effect. With small effects, you expect some percentage of exact replication attempts to fail to find effects as big as the original (a small simulation sketch follows this list). Remember, measurements of behavior are inherently noisy, and it's rare to find exactly the same effect size every time. In fact, finding exactly the same effect every time is a sign of bias (and sometimes a sign of fraud).

(2) seemingly arbitrary design differences that contributed to the discrepancy -- these can be informative, helping to constrain the generalizability of the conclusions. They are grounds for further studies.

(3) poor methodology on the part of those trying to replicate the study -- it's easy to produce a null result by conducting shoddy research. On this account, the failure to replicate is a false negative due to poor design, not to subtle but reasonable design differences.

(4) poor methodology in the original research -- the original finding was a false positive due to poor controls and design. False positives can also result from design or analysis decisions that lead to reporting only the significant findings or variants of a study.

(5) chance, but for the original finding -- some published effects might be false positives even when the original studies were conducted competently. That's especially true for underpowered studies.
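To make point (1) concrete, here is a minimal simulation sketch (mine, not part of the original studies or posts); the effect size (Cohen's d = 0.4) and the group size of 15 are hypothetical values chosen only for illustration:

```python
# A minimal sketch of point (1): even when a small effect is real, many exact
# replications will "fail" simply because behavioral measurements are noisy.
# The effect size (d = 0.4) and group size (15) are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n_per_group, n_replications = 0.4, 15, 10_000

failures = 0
for _ in range(n_replications):
    control = rng.normal(0.0, 1.0, n_per_group)  # standardized walking times, say
    primed = rng.normal(d, 1.0, n_per_group)     # the true effect really is present
    _, p = stats.ttest_ind(primed, control)
    if p >= 0.05:
        failures += 1

print(f"Share of exact replications that miss p < .05: {failures / n_replications:.2f}")
# With these numbers, well over half of honest, exact replications "fail".
```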

Given the strong bias to publish only positive results (see work by Ioannidis, for example), an original false positive seems at least as likely as a false negative, especially when there are few if any direct replications of a published result. Given the difficulty of publishing replication failures, it's important to realize that there might be other failures to replicate that were not published (see the comment from +Alex Holcombe on the post, noting another failure to replicate this particular study).
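One way to see why an original false positive is a live possibility is to work through the arithmetic behind the Ioannidis argument. The sketch below is mine, not anything from Bargh's post or the PLoS ONE paper, and the prior probabilities and power values are illustrative guesses:

```python
# A hedged sketch of the Ioannidis-style point above: when only positives get
# published, the chance that a published positive is false can be substantial.
# The priors (fraction of tested hypotheses that are true) and power values
# below are illustrative assumptions, not estimates for any real literature.
def positive_predictive_value(prior, power, alpha=0.05):
    """P(effect is real | a significant result was obtained)."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.25, 0.1):
    for power in (0.8, 0.35):            # well-powered vs. typically underpowered
        ppv = positive_predictive_value(prior, power)
        print(f"prior={prior:.2f}, power={power:.2f} -> "
              f"P(false positive | significant) = {1 - ppv:.2f}")
# With modest priors and low power, a sizable fraction of published positives
# are expected to be false -- before adding any selective reporting at all.
```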

Rather than dispassionately considering all of these possibilities, including that the original research might be a false positive, Bargh chose to:
1) Dismiss the journal in which the replication failure was published in an uninformed way. Bargh claims that PLoS ONE is a for-profit journal that is effectively a vanity press with a pay-to-publish model, one that doesn't do thorough peer review and doesn't rely on expert editors. In reality, PLoS ONE is run by a non-profit organization that selects expert editors, reviews papers just like any other journal, and never rejects a paper because the authors can't pay (it waives the fee upon request). It is one of the fastest-growing open-access journals and has a roughly 30% rejection rate. It differs from other journals in that it publishes empirically solid work regardless of the perceived theoretical impact.

2) Accuse the authors of the critique of incompetence with an unjustified, ad-hominem attack. This group of authors has extensive expertise in consciousness research. For example, Cleeremans was an editor of the Oxford Companion to Consciousness and is well-respected in the field.

3) Describe method details from the replication attempt (and the original study) inaccurately. See the first comment on the post for a detailed discussion.

4) Reveal a lack of familiarity with science blogging by blaming one of the most careful and thoughtful science writers working today (+Ed Yong) for publicizing the replication failure. It seems somewhat disingenuous to fault Ed for "swallowing their conclusions whole" when Bargh declined to respond to the request for comment that Ed sent several days before his article went live.

An effective response to a failure to replicate would be to identify ways in which the studies differed and then to test whether those differences explain the discrepancy. Acknowledging that replication failures happen and pushing for more direct replication attempts rather than just conceptual ones might help too. But, assuming that a failure to replicate must have been due to incompetence or the shoddy standards of a journal is a pretty brash response.

The comments on Bargh's post (from +Ed Yong, Neuroskeptic, +Peter Binfield, +Alex Holcombe, and many others) are interesting and informative, an example of how science bloggers work to correct the record.
 
Okay, clearly the subtlety is lost on me. Why is Bargh's post an example of what not to do?
 
A fascinating set of comments to this post on the PT site as well.

I am surprised that John took off after PLoS One.

It is a huge pain to publish failed replications, even for studies that are widely discussed. Journal editors hate them, because failed replications rarely get cited. If they are successful, their best outcome is that they cut down on citations to the original study as well as to the failed replication itself. They are like anti-matter in that way.
 
+Peter Gunn He called the journal crap, asserted the researchers must have screwed up, and even complained about people who tweeted about the study. Nothing subtle about it. It's a classic case study of what not to do.
 
Thanks for the vote of confidence, Daniel.
 
+Ed Yong - absolutely appropriate. I've had these sorts of tirades directed at me from time to time. It doesn't feel good. But, from a PR perspective, they're almost always worse for the tirader than the tiradee.

Just to point out to those who didn't read +Ed Yong's original article and the comments on it: On those rare occasions when +Ed Yong makes a mistake, he admits it and corrects it publicly. That's what good journalists, scientists, and writers do. Unlike most science writers these days, Ed seeks comments from all interested parties in advance and tries to communicate the reasons for disagreements. How he finds the time to do that given his productivity, I'll never understand...
 
It doesn't seem like a particularly difficult study to conduct; probably the best response would have been to provide a positive replication.

Pashler couldn't replicate either: Pashler, H., Harris, C., & Coburn, N. (2011, September 15). Elderly-related words prime slow walking. Retrieved March 8, 2012, from http://www.PsychFileDrawer.org/replication.php?attempt=MTU%3D
 
+Walter Boot -- I agree. What's really surprising is that, as best I can tell, nobody has ever published a direct replication of this study. It has been cited hundreds of times, and it isn't that hard to do. You'd think someone in social would have tried a direct replication before extending it to something else. The fact that Pashler failed to replicate as well makes me wonder if people have tried to replicate before... (I'd love to see the fireworks if Bargh tried to suggest that Pashler wasn't a competent experimenter!)
 
If I had to bet, there are a lot of failed replications in file drawers on this one. It is too easy to run and too attractive to run for there not to be a number of versions out there. I have no knowledge of that, though. Just a feeling.
 
+Art Markman -- If a lot of people have tried and failed, I'd think there'd be some buzz about that in the social psych community. It seems like most people in a field know when something doesn't really hold up (at least that's true in the vision sciences where people run home to mock up the displays right after seeing them at a conference, and people can experience many of the effects for themselves). I haven't heard such buzz about this study, but I also might not be privy to it given that I'm not a social psychologist.

Any social psychologists out there know if there's any "common understanding" of this sort about this study? If you have or haven't heard any such water cooler talk, I think a lot of psychologists might like to know. If you're not comfortable posting what would amount to hearsay or rumors, please message me directly whether you have or haven't heard anything, and I'll compile and anonymize anything I hear and post a summary count. This would be an admittedly unscientific poll, but it seems important to know whether or not such attempts exist.
 
My sense is that the reason that there isn't more buzz is that the basic phenomenon of priming is sound. You can clearly use the Wyer/Srull unscrambling paradigm to prime stereotypes and mindsets. If the observation that people don't walk slower doesn't replicate, that negates a sexy finding, but it doesn't really change the broad view of the field. That may be why there isn't more buzz.
 
+Art Markman -- but the walking bit was the crucial finding. It showed that the priming had behavioral consequences, affecting actions via stereotype activation. On its own, the fact that you can activate stereotypes with an unscrambling task is far less interesting, especially given that most such studies do not adequately control for awareness (pretty much impossible to do with an explicit task like that, something folks in the implicit perception world have discussed for more than 30 years now). They might just be showing that if you do a task that activates a stereotype, you activate the stereotype. What was remarkable about that Bargh et al. study was that it led to a direct effect on actions that were associated with that stereotype. That's a level of spreading activation from a prime that is much more powerful than most shown in the cognitive literature. If that part turns out to be wrong more broadly, it puts a fairly large damper on that priming literature.
 
+Daniel Simons But, there are a lot of influences of that kind of priming on all kinds of behaviors. This is particularly true in decision making studies. And I do think that Bargh is right in his blog entry that the stereotype threat work is a close cousin. You can use these same priming tasks to get people to do worse on math tests. What was really attractive about the walking studies was that it was an unexpected bodily behavior that was influenced. I think there was a reason it was unexpected, and that is because it isn't really a reaction we have to thinking about getting older.
 
+Art Markman -- Agreed that stereotype threat is a close cousin, but it does seem fundamentally different in the way you say. In stereotype threat cases, people are thinking about their identity, and that leads to different performance. In this case, and other cases of implicit goal priming, people purportedly aren't thinking about it and don't even know they've been primed, and it has a direct effect on their actions. That strikes me as different in a fundamental way.
 
+Daniel Simons I don't think that people are overtly thinking about the stereotype in stereotype threat studies. We have done a few of these studies in my lab (Grimm et al., 2009, JPSP), and it is pretty clear that the subjects do not know what the study is about. Most of these studies make a particular dimension salient, but the subjects aren't really that focused on the stereotype dimension throughout the study.
 
+Art Markman -- Right, but they are aware that it's salient. In the Bargh study, the claim is that they were completely unaware of the prime. It was unconscious spreading activation of some sort. In stereotype threat, the stereotype is made salient in a way that people are aware of. They might not figure out the hypothesized effect and they might not be thinking about it throughout the study, but it typically is an explicit manipulation, right?
 
+Daniel Simons I guess. In the first study of our paper, all we did was ask participants to enter their gender in a box on the computer screen before starting the test. That is a pretty weak manipulation. We get asked to enter gender information on demographic forms all the time.

In the golf studies, people are explicitly told that the test is one of 'sports intelligence' or 'sports ability,' but race is made salient for White participants by having posters of African Americans hanging on the wall. Again, that is a pretty subtle manipulation. It is not obvious that people are really aware of that.
 
+Art Markman -- subtle, yes. Implicit, probably not. I'm not completely up on that literature and haven't read your papers on it (sorry!), so I'm curious whether you included a check to see if the subjects were aware of the primes or were thinking about them. I could imagine that if gender were the only demographic question, that might stand out as odd. And there aren't too many psych labs with lots of posters of people of one race, so that might stand out too.
 
+Daniel Simons In the study I just described, we didn't ask for awareness, because it wasn't the focus of the study. I do agree that the stereotype threat studies are not identical to the unscrambling task studies.

My main point is that there seems to be a lot of evidence that you can prime all sorts of aspects of people's self-concept and get influences on behavior. The work by your UIUC colleagues CY Chiu and Ying-Yi Hong on biculturalism is another good example there. You can get influences on behavior and decision making by priming one or another cultural identity using broad cultural symbols.

The work on goal contagion is a similar kind of effect (Aarts, Gollwitzer, & Hassin, 2004). If you observe people being helpful or if you read about people being helpful, then you end up being helpful later in the study. People don't seem to be aware that what they saw or read about influenced their later behavior.

On my reading of all this work, there are lots of influences on people's behavior that they are unaware of. So, the main thing about Bargh's study was that it was a sexy dependent measure that was easy to describe to people. If it turns out that the result doesn't hold up, it doesn't change much of the empirical base of the field.
 
+Art Markman -- I think it's more than just a sexy dependent measure. Unlike all of the stereotype threat work, the stereotype wasn't about the subjects themselves. The subjects were college students, not elderly folks. The prime didn't activate something about how people viewed themselves. Instead, the argument was that it caused them to behave as if they were elderly, presumably because of some spreading activation of the stereotype to actions consistent with that stereotype, even though the stereotype doesn't apply to those subjects at all.

As a general rule, I tend to be highly skeptical of big effects from small manipulations. Maybe I'm overly skeptical. But, there are far too few direct replications in these literatures, and the appeal to conceptual replications leaves me somewhat unsatisfied. It makes me wonder how much of a file drawer problem might exist (just imagine trying to publish a failure to find an effect of stereotype threat...)
 
+Daniel Simons I think we're in general agreement on this. I wish there was more replication. I wish that there were fewer brief reports. On balance, I think the Psychological Science model has been bad for the field, because it has proliferated short papers that aim for big and counterintuitive findings. There are some (the change blindness studies are a great example of that), but they are few and far between.

I also agree that most of the priming studies are about factors that are probably true of people to some degree. Though, I think the goal contagion work is interesting in that regard. It suggests that people are adopting goals of other people that may not be things that are chronically active for themselves.

At any rate, I'd like to see more replication. And I think that sites like psychfiledrawer will play an important role in providing an outlet for failures to replicate that may help the field to track which findings are worth paying attention to.
 
+Art Markman -- I agree for the most part, but I disagree on the brief report point. The problem isn't short reports, it's underpowered studies. Using an adequate sample size to test an effect doesn't add to the length of the paper. The problem, as I see it, is the desire to publish and make claims about flashy, underpowered studies with small effects that have a high probability of being a false positive. It's easy to write a concise report, even with an added exact replication, with adequate power to test the hypothesis. I'd like to see journals like Psych Science be a little more skeptical about sensational findings from underpowered studies, so I think we agree that there have been some problems with that emphasis. But, I think the "short report" complaint is a red herring.

In some ways, I think the field has been hurt by having larger papers with massive introductions that review entire literatures for every empirical paper. Mature sciences don't do that. Look at the length of empirical papers in physics or biology—they're consistently shorter than psychology papers. I think we need more separation of empirical reports and literature reviews/overviews. There's no need for a 10 page introduction to an empirical paper.

Moreover, the danger of larger papers with many experiments is that they rarely involve direct replications, and they sometimes are an excuse to compile a series of underpowered studies with small effects, giving the impression of solidity when it might not be justified. In those cases, I worry that we're only seeing the studies that happened to produce p<.05 results, and not all the variants that didn't. Greg Francis has a couple of nice in-press papers showing just that sort of publication bias within individual, multi-experiment papers (e.g., Bem's paper, the verbal overshadowing literature, etc.); a rough sketch of that kind of check follows this comment.

I'd be far more confident in the results of a short report with an adequately powered single experiment (ideally with an exact replication added in) than a long report with many underpowered conceptual "replications."
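For readers unfamiliar with the Francis papers mentioned above, the flavor of that kind of check can be sketched roughly as follows. This is a simplified approximation of the idea, not Francis's exact procedure, and every number in it is made up:

```python
# A rough sketch (assumed numbers, not from any real paper) of an
# "excess success" check: if every experiment in a multi-experiment paper is
# underpowered, the chance that all of them would come out significant is
# low, which hints at selective reporting of studies or analyses.
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample t-test."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    noncentrality = d * np.sqrt(n_per_group / 2)
    return 1 - stats.norm.cdf(z_crit - noncentrality)

# Hypothetical five-experiment paper: (observed effect size, n per group).
experiments = [(0.45, 20), (0.50, 18), (0.40, 25), (0.55, 16), (0.48, 22)]
powers = [approx_power(d, n) for d, n in experiments]
p_all_significant = np.prod(powers)

print("Estimated powers:", [round(p, 2) for p in powers])
print(f"Probability that all five would be significant: {p_all_significant:.3f}")
# A probability this low (well under .10) is the kind of red flag the
# excess-success logic points to.
```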
 
The issue with Psychological Science (and brief reports more generally) is not that they can't be done well. It is that they are not done well within the psychology community.

The focus on flashy results is a problem. It also seems to lead to some of the p-value fishing techniques that we all discussed a few months back. Those techniques can create the illusion of studies that have sufficient statistical power, but are in fact flawed.

I do agree that we should promote more direct replications of previous research.

There is a happy middle ground, though, between 2500 word papers that present one or perhaps two experiments and bloated papers with 5 pages of lit review before the experiments start. As a field though, we need to create a set of standards and expectations and stick with them. Right now, it is a free-for-all.
 
+Art Markman -- completely agree. Of course, the p-value fishing is just as bad in multi-experiment papers, and sometimes is worse (multi-experiment papers often have a series of poorly designed studies and argue for converging evidence. The convergence of crap is still crap, though).

I don't mind the focus on flashy/interesting results as long as they're robust ones. The real problem, in my view, is finding a way to avoid the poor methodological and statistical techniques that lead to over-publication of false positives (or to detect them before publication of fishy papers and to demand better methods before publication). If we do that, the short/long debate falls away, and we can focus on whether the studies were well-conducted, sufficiently powered, and replicable.
 
Agreed. One reason why I keep harping on the short/long distinction, though, is that it has created a CV arms race. I'm seeing job candidates come on the market with a huge number of publications, but a lot of them are short reports. That creates pressure on people to try to pump out papers to get a job, which can lead people to cut corners without really thinking about the implications of the corners they cut.

Thanks for posting this article, though; the discussion has been really valuable for me. Sometimes it helps to articulate these vaguer feelings about what is going on in the field.
 
+Art Markman But the causal factor behind the "CV arms race" is a scarcity of jobs due to overproduction of PhDs for 30+ years. Come to think of it, this could be easily mitigated. Count pages, instead of (or in addition to) counting pubs and citations.
 
+Art Markman -- good point about the vita arms race. I think there is some bias against the short report too. I gathered that one of my letter writers for tenure thought less of my contributions because I don't write a lot of JEP papers. For me, at least, I think I put more thought into shorter papers. It's harder to write a compelling short paper. I often wonder, were I to put my original vita on the job market now, whether I would even get a look.

That said, the arms race is a real problem. I'd much rather see a few really thoughtful papers where the applicant was the clear lead author than a dozen for which they were 5th author. The big-lab everyone's an author model is another contributor to the arms race (a worse one, I think), and those papers are often long ones.

More generally, I wonder whether the relationship between publication pressure and short reports is causal in the direction you infer, or whether it's just a coincidental change in the field resulting from the proliferation of journals and greater online accessibility of articles.
 
As a much belated explanation of my remark, which was not meant sarcastically, understand that when I view your post in Google+, your expanded remarks, Dan, are entirely unavailable. Try as I might, I can't get the ellipses to expand to show them. I can expand the comments on your posting but not your posting itself.

The full posting is, however, visible in Google's Notifications menu, so I've now, finally read it.

Since your link to Bargh's essay was visible, I read it, and it came across as snarky but, to a complete outsider like myself, not as inflammatory as it clearly seemed to anyone more familiar with research methodology generally and with the specific journals and blogs in particular. As I said, I needed much more context -- which you'd actually provided in detail but which Google+ is inexplicably hiding from me. Apologies for any confusion.

(If anyone would like to troubleshoot what I'm doing wrong in trying to read Dan's initial posting, I'd appreciate it. User interfaces are usually what I do well, but this problem is baffling to me.)
 
+Peter Gunn -- That's really bizarre (but explains a lot :-). I have no problem expanding posts on multiple browsers. By default, I use Chrome. I have a few plugins that I use occasionally that condense posts into list view, but I've never had the problem you're describing. Any G+ experts out there have this issue with expanding long posts?
 
As additional info (on the troubleshooting thread), on my browser (Chrome on a Mac running OS X 10.6.x) Dan's posting is clearly collapsed. I see the first three lines of his first paragraph followed by an ellipsis, the link to the Bargh essay, then the options to +1, Comment, Hang out, and Share, then the stats on +1's, shares, and comments, and then the comments themselves. The comments do have a link to 'Expand this comment >>'

Frustrating.
 
+Peter Gunn -- one trick that might help: If you click on the time-stamp at the top of a post, it will take you to the permanent link for that post (control click to open in a new window). I believe that will be fully expanded by default. That might be a workaround until you can figure out the problem.
 
That does indeed do the trick. And might even be what Google is expecting me to do.
 
+Daniel Simons and +Art Markman : I agree with most of what you've written above. With respect to the lack of a large number of direct replications being published, a few points seem relevant:

1. It's not really much easier to publish a simple direct replication than it is to publish a failed replication. So the incentives might be just as low for a researcher to try to publish a successful replication as a failed replication.

2. The issue of what counts as a direct replication can sometimes be tricky. In our work (Cesario, Plaks, & Higgins, 2006) we had a direct replication attempt set up for the elderly walking study, which included a control (no prime) condition and a youth prime condition. The means were all in the predicted directions, with the elderly prime resulting in the slowest walking speed, the youth prime resulting in the fastest speed, and control in the middle. Planned contrasts showed elderly and youth were significantly different from one another, but neither differed from the control. Does that count as a successful replication? Furthermore, although the elderly prime was not significantly different from control, the difference between those two might not have differed from the difference between Bargh's elderly and control prime conditions. Does that count as a successful replication? It would count toward a meta-analysis, but not if all you look at is the significance test on that one study.

3. With respect to Art's comment that "the basic priming effect is sound," it really depends on the kind of priming effect you're talking about, does it not? For instance, with respect to the Higgins, Rholes, and Jones (1977) unobtrusive priming of trait words (which then influences impression formation), I replicate that effect in my undergrad social psych class every semester, and every semester the results are almost exactly the same in terms of the magnitude of the effect. It's extremely robust. But with priming of social categories and subsequent behavioral effects, it's not at all clear that it is so robust. In fact, part of what we've been trying to do is identify some factors that contribute to the context-sensitivity of priming social categories and measuring behavioral reactions -- for example, whether people like or dislike the elderly (as assessed with an indirect measure) produces exactly opposite effects of the elderly prime on walking speed, so the relative numbers of participants with positive and negative attitudes in your sample could push the effect around quite substantially. The same goes for whether priming "black male" results in aggression -- whether people feel trapped or not can influence that effect (as people are basically being primed with a threatening outgroup male and showing defensive threat responding). I'm not sure these same contingencies would apply for something like the Higgins, Rholes, and Jones effect. I guess I agree with Daniel that lumping every type of priming effect together is probably not reasonable.

In any event, Daniel, you are absolutely right that this is a textbook example of how not to respond to criticism.
 
+Joseph Cesario I'm starting to see what the issue is about. The basic idea of priming is robust, while the extreme cases are harder to prove and are much more sensitive to conditions, etc.

The trend of Bargh and others was to push the envelope in all directions and to show the most extreme examples of priming effects. It is intuitive that the more bizarre unconscious effects are much more dependent on fine-tuning and optimizing each and every parameter of the manipulation. These effects are almost expected to work only when everything is done "perfectly right," so the failure to replicate is almost expected, as are the weaker significance values in your three-condition study -- which is, in my view, a perfect replication of the effect, because the youth-old difference reflects the very same idea as the old-control difference (even though technically it is not strictly the same measure).

I now see that a pairwise comparison gives p = 0.02 for one of the various measures, so there is borderline significance, so to speak. It would be interesting to run a higher-powered experiment with the very same settings. I feel a little foolish suggesting it, but who knows -- now, with all this uproar, it has become a highly central research question. (I am sure there will now be many more replications; give it a year or two.)
 
Thanks for the thoughtful comments, +Joseph Cesario. I'm not sure I agree that it's just as hard to publish a replication as a failure to replicate, especially if there are other studies in the package. It's easier for a hostile reviewer to explain away a null result. But, point taken that a straight replication doesn't excite reviewers either.

From my perspective, a direct replication is one in which all of the critical parameters and comparisons of conditions are the same. On that view, your study would not be a direct replication, since you did not replicate the original comparison to a control condition. (Of course, it's really the effect size we're talking about, not the p value. Whether or not your comparison reached significance at .05 is really not relevant, because that depends on power. The question is whether your result produced the same effect size in the exact comparison that's relevant.)

Assuming the effect size was not comparable to the one in the Bargh study, one problem with treating the old/young comparison in your study as a replication is that there are additional statistical tests implied, each of which could constitute a replication. Just as one example, had you found no difference between the control group and the youth group, but still found a difference between the control and old group, you still could have called that a replication. Even if all of the comparisons were ones you planned, you're still giving yourself multiple chances to "replicate" in that several different analyses would produce results of the same sort. That's not necessarily a bad thing, but in the worst of cases, it can lead to the sort of investigator degrees of freedom from multiple tests that inflate the significance of the comparisons (that Simmons and colleagues highlighted in Psych Science recently).

I think your third point is the crux of the problem. As soon as we move from direct replication to conceptual replication (or replication with many possible comparisons, any one of which would "count") we risk lumping together studies that don't really measure the same thing. The further danger is that, if you are always looking for moderator variables, it becomes possible to explain away any null result as a failure to identify the right moderators, when the original result could have been a false positive. If you keep trying over and over with different moderators, you'll eventually get significant results by chance. I think that happens more than we'd like to admit.
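To put a number on the "several analyses would count" worry, here is a small simulation sketch of mine (the condition names and sample sizes are hypothetical): with no real effect anywhere, letting any one of three pairwise comparisons count as a "replication" noticeably inflates the chance of declaring success beyond the nominal 5%.

```python
# A simulation sketch of the multiple-comparisons worry above: under the null
# (no effect at all), allowing any of several comparisons to "count" as a
# replication inflates the false "success" rate. Condition names and n are
# hypothetical illustrations, not parameters from any actual study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sims = 20, 10_000
single_hits = any_hits = 0

for _ in range(sims):
    elderly, youth, control = (rng.normal(0, 1, n) for _ in range(3))  # null world
    p_elderly_vs_control = stats.ttest_ind(elderly, control)[1]
    p_youth_vs_control = stats.ttest_ind(youth, control)[1]
    p_elderly_vs_youth = stats.ttest_ind(elderly, youth)[1]

    single_hits += p_elderly_vs_control < 0.05                 # one planned test
    any_hits += min(p_elderly_vs_control, p_youth_vs_control,
                    p_elderly_vs_youth) < 0.05                 # any test "counts"

print(f"False 'replication' rate, single planned comparison: {single_hits / sims:.3f}")
print(f"False 'replication' rate, any of three comparisons:  {any_hits / sims:.3f}")
# The second rate sits well above the nominal 5% -- the flexibility Simmons
# et al. describe as researcher degrees of freedom.
```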
 
+Jazi Zilber -- I don't think it's unreasonable to ask for stronger power. Most of these original studies were somewhat underpowered given that the effects aren't large. Really, the measure of replication is not the p value, it's the effect size. With different sample sizes, the same effect would be non-significant (or significant). That's one reason why replication attempts typically should have enough power to detect a smaller effect than was found in the original study. More often than not, the first published result for a new sexy finding will have a spuriously large effect size, if for no other reason than the bias to publish significant results. The effect might be real, but a true replication of it should assume that it will be somewhat smaller, and the field shouldn't be surprised if the effect is weaker than the original result suggested.
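As a back-of-the-envelope illustration of that last point (the effect sizes below are hypothetical, not values from the Bargh or Doyen studies), the standard normal-approximation formula for a two-sample design shows how quickly the required sample grows when you power a replication for a smaller effect:

```python
# A back-of-the-envelope sketch: if the published effect size is likely
# inflated, power the replication for a smaller effect. Uses the standard
# approximation n per group ~= 2 * (z_{alpha/2} + z_{power})^2 / d^2.
# The effect sizes are illustrative assumptions.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.9):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

original_d = 0.6     # hypothetical (possibly inflated) published effect size
assumed_d = 0.3      # more conservative target for the replication attempt

print(f"n per group to detect d = {original_d}: {n_per_group(original_d):.0f}")
print(f"n per group to detect d = {assumed_d}:  {n_per_group(assumed_d):.0f}")
# Halving the assumed effect size roughly quadruples the required sample.
```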
 
+Daniel Simons What you've written is exactly right, and part of what I meant to point out--the reliance on p-values to determine replication success can be problematic. This causes me some degree of pessimism when thinking about replication, and how we respond to failures to replicate.

In the ideal world, we take the long view: over time we will have enough replication attempts that we can ask about effect size differences and, with meta-analysis, get a better sense of the effect.

In contrast, what often happens is exactly what we're seeing: some (one or two) replication attempts fail to achieve statistical significance and people (incl. commentators on science) immediately jump to "the effect wasn't real! ha! lies or experimenter effects were responsible!" Even though, in the long term, we might discover that those statistically non-significant findings contribute positively to the effect. I think the "psych file drawer" website, while generally a good and useful thing, contributes to this mentality by having, for instance, a "success vs. failure" table in which every replication attempt must be placed into one of these two categories based on the significance test, which allows people to quickly look at it and say, "5 failures to 1 success... I don't need to think about this any more, the effect isn't real." Again, it isn't inherently a bad thing but in my mind it allows people to take a short and narrow view of a finding.

The problem comes from both ends, of course: when people respond to failures to replicate their effects like how Bargh responded, this is just as counter-productive because it also is a short-term (defensive) response. At some point 10 years down the road he might be proven correct, but my guess is people will remember the aggressive and defensive response more.
 
+Joseph Cesario -- I completely agree. The emphasis on p values and significant/non-significant replications is problematic. It should be about the size of the effect (and the reliability of the effect size measurement, which depends on the sample size). I do think failures to replicate should perhaps be given more weight than an original publication, if only because of publication bias and the many ways that investigators can inflate the significance of their results through analytical decisions and underreporting. It seems to me that false positives are disproportionately more likely to appear in print, and we just don't see how many attempts went into each false positive.

That said, I agree with your concern about tallying successful and failed replications based on significance. +Alex Holcombe -- what do you think about adding a column to psychfiledrawer for the original effect size and the replication effect size? If that were required, and the effect size estimates standardized, that would go a long way to providing a meta-analytic, cumulative sense of how big the effect actually is.
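For what it's worth, here is a minimal sketch of what such a column could feed into: Cohen's d computed from each entry's summary statistics, combined with inverse-variance (fixed-effect) weights into one cumulative estimate. Every number below is invented for illustration; none come from the actual studies discussed here.

```python
# A minimal sketch of an effect-size-based, cumulative view of a finding.
# All summary statistics are made up; this is not data from any real study.
import math

def cohens_d_and_variance(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d for two independent groups, plus its approximate variance."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    var = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))   # standard approximation
    return d, var

# Hypothetical entries: (mean_primed, mean_control, sd_primed, sd_control, n1, n2)
studies = [
    (8.3, 7.3, 1.6, 1.5, 15, 15),   # original-style small study
    (7.5, 7.4, 1.7, 1.6, 60, 60),   # larger replication attempt
    (7.6, 7.3, 1.8, 1.7, 40, 40),   # another replication attempt
]

weighted_sum = weight_total = 0.0
for study in studies:
    d, var = cohens_d_and_variance(*study)
    weight = 1 / var                 # inverse-variance (fixed-effect) weighting
    weighted_sum += weight * d
    weight_total += weight
    print(f"d = {d:+.2f}")

print(f"Fixed-effect combined d = {weighted_sum / weight_total:+.2f}")
# The combined estimate, not any single study's p value, is what a cumulative,
# meta-analytic sense of the effect would rest on.
```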
 
+Daniel Simons I only meant that higher-powered data reduce the noise element, which is a central problem anyway. Weak power, by the way, is a source of very many problematic results and interpretations.

In a way, I do not care whether any particular study is true. I just want to know what effects are there.
 
+Joseph Cesario Will there be many replications in the long term? I'm skeptical. Unless a topic generates continuing follow-up research, findings can carry on and get cited forever.

The rare "failure to replicate" publications target highly cited, and sometimes quite old, studies. So there is no forcing corrective mechanism, and falsities may stay around forever just because nobody happened to bother to refute them, nobody heard about the refutation, or the known result is relatively easy to believe.

Note that many "failure to replicate" discussions concern "hard to believe" findings, whereas believable findings get little testing.
 
+Jazi Zilber re: your comment "The rare 'no replication' publications are in highly cited and sometimes even old studies"... I would love to see some statistics about the most highly cited social/personality studies and the number of direct replications of each. One that immediately comes to mind is Schwarz and Clore's weather misattribution study, which is a classic and has been cited thousands of times but which, to the best of my knowledge, has never been directly replicated. There are, of course, decades of research on misattribution and excitation transfer more generally, but I do not believe there has been a direct replication of that effect in particular.

There is the interesting, and highly troublesome, possibility that there are large research areas which do not have direct replications but for which there are many conceptual replications, with a vast percentage of those conceptual replications obtained under questionable statistical practice (as +Daniel Simons notes in the post above--multiple comparisons and multiple moderator tests, etc... the 'researcher degrees of freedom' problem). In the behavioral priming literature, for instance, I wonder how many of the obtained effects have had direct replications. We point to the many publications showing various effects across social categories and behaviors; I have never done a survey determining how many of these have had direct replications.

Just to be clear, now that we've gotten this far into it, I feel compelled to note that I do not believe the Doyen logic makes any sense whatsoever for explaining priming effects. The failure to replicate may be noteworthy, but whatever is going on in priming it is most certainly not experimenter expectation.
 
I agree with the principle that effect sizes should be king. With regard to PsychFileDrawer.org, however, the #1 goal is to make reporting non-replications and replications so quick and easy that people will do it despite the lack of career incentives to do so. And we still haven't entirely succeeded, in that we still don't have that many postings. Hopefully the greater attention the site is receiving thanks to this Bargh affair will increase postings, and the feeling people have that posting is worthwhile. However, until we see real willingness for lots of people to spend time on postings, we have to resist the urge to add additional requirements. But you can help convince us that people will keep contributing by uploading some unpublished replications/nonreplications from your own lab :) People are asked to give a detailed description of the results, as Pashler did -- he showed a graph with error bars for his result (which went in the opposite direction to the original Bargh result).
 
+Alex Holcombe, the main barrier is the accessibility of replications. In my dream, Google Scholar would show, for every paper, the list of replications, just like the "cited by" list. And every published paper would contain, beside its references, a list of "replication of" entries.
 
+Joseph Cesario No data, but my impression is that people replicate when they have a new "interesting" variation to show, especially when it contains a new "theoretical" concept.

Parameter research is unappealing; new theories are more "interesting."

For example, there is little work on optimizing exercise (how many minutes? at what intensity? etc.), even though optimization is very useful.

Also ego depletion: very interesting and easy to manipulate experimentally. There are hundreds of papers, but almost none did parameter work (individual differences, how long exactly does it take?) or examined other parameters that would have elucidated what exactly goes on. Instead people only publish distinctly new ideas that do not go far toward explaining the main effect. You see a million replications that are there en route to showing the new theory.

By the way, I remember seeing more studies on the effects of question order in life-satisfaction surveys, which support the Schwarz study. A faint memory.
 
+Alex Holcombe -- I agree that making it easy for researchers to add their studies is priority #1. Maybe an optional column for effect sizes? I worry that there just aren't that many direct replications...
 
Yes, ok, we're going to work on fitting something like that in :) I certainly don't want to be guilty of helping perpetuate psychology's fixation on p-values instead of effects!