An interesting experiment: we worry about the 'garden of forking paths' in analysis where the results depend on the exact distributions, covariates, priors etc we choose, with unknown and hidden degrees of freedom; so generate a new virgin dataset and give it to a hundred analysts or so, ask them to answer the same question, and see how much they agree and whether they can come to a consensus. They can. Sort of.
http://andrewgelman.com/2014/04/28/crowdstorming-dataset/ https://osf.io/gvm2z/ https://osf.io/w6pi5/ http://home.uchicago.edu/~/npope/crowdsourcing_paper.pdf "Crowdsourcing data analysis: Do soccer referees give more red cards to dark skin toned players?"
"29 teams involving 61 analysts used the same data set to address the same research questions: whether soccer referees are more likely to give red cards to dark skin toned players than light skin toned players and whether this relation is moderated by measures of explicit and implicit bias in the referees' country of origin. Analytic approaches varied widely across teams. For the main research question, estimated effect sizes ranged from 0.89 to 2.93 in odds ratio units, with a median of 1.31. 20 teams (69%) found a significant positive effect and 9 teams (31%) observed a non-significant relationship. The causal relationship however remains unclear. No team found a significant moderation between measures of bias of referees' country of origin and red card sanctionings of dark skin toned players. Crowdsourcing data analysis highlights the contingency of results on choices of analytic strategy, and increases identification of bias and error in data and analysis. Crowdsourcing analytics represents a new way of doing science; a data set is made publicly available and scientists at first analyze separately and then work together to reach a conclusion while making subjectivity and ambiguity transparent.
After the first round of reporting, the 29 teams of analysts reported results with highly varying effect sizes and only moderate consensus. After feedback rounds and discussions, teams submitted their final reports. Analytical strategies still varied, yet 69% of teams reported a significant result and 78% of the researchers concluded that the dataset suggests a positive association.
Data Analysts.
Seventy-seven researchers expressed initial interest in participating and were given access to the Open Science Framework project page to obtain the data (https://osf.io/47tnc/). Individual analysts were welcome to form teams. Of the initial inquiries, 33 teams submitted a report in the first round, and 29 teams submitted a final report. In total, the project involved 61 data analysts plus the four authors who organized the project. Team leaders worked in 13 different countries and came from a variety of research backgrounds including Psychology, Statistics, Research Methods, Economics, Sociology, Linguistics, and Management. Of the 61 data analysts, 38 hold a PhD (62%) and 17 a Master's degree (28%). Researchers came from various ranks and included 8 Full Professors (13%), 9 Associate Professors (15%), 13 Assistant Professors (22%), 8 Post-Docs (13%) and 17 Doctoral students (28%). In addition, 27 participants (46%) have taught at least one undergraduate statistics course, 22 (37%) have taught at least one graduate statistics course, and 24 (39%) have published at least one methodological/statistical article.
Data set.
From a company for sports statistics, we obtained player demographics from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France, and Spain in the 2012-2013 season. We also took from this source data about the interactions of those players with all referees (N = 3,147) that they encountered in their professional career. Thus the data spans multiple years, from a player's first professional match until the date the data was acquired (June 2014). This data included the number of matches in which players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee. The data set was made available as a list of 146,028 player-referee dyads (https://osf.io/47tnc/).
Players' photos were available from the source for 1,586 out of 2,053 players. Profiles for which no photo was available tended to be relatively new players or players who had just moved up from a team in a lower league. The variable player skin tone was coded by two independent raters, blind to the research question, who, based on the profile photo, categorized players on a 5-point scale ranging from 1 = very light skin to 5 = very dark skin, with 3 = neither dark nor light skin as the center value (r = 0.92; rho = 0.86). This variable was re-scaled to be bounded by 0 (very light skin) and 1 (very dark skin) prior to the final analysis to ensure consistency among effect sizes between teams and to reflect the largest possible effect.
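That rescaling and inter-rater agreement step is easy to make concrete. A minimal Python sketch, where the rating lists and rater names are invented for illustration; only the 1-5 scale, the [0,1] rescaling, and the use of Pearson's r come from the text:

```python
from statistics import mean

def rescale(rating, lo=1, hi=5):
    """Map a rating on the [lo, hi] scale onto [0, 1]."""
    return (rating - lo) / (hi - lo)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented ratings from two hypothetical raters on the 1-5 scale
rater1 = [1, 2, 5, 4, 3, 5, 1]
rater2 = [1, 3, 5, 4, 3, 4, 1]
agreement = pearson_r(rater1, rater2)

# One possible way to combine the two ratings into a single [0, 1]
# skin-tone variable; the paper does not fix how the two raters are combined.
skin_tone = [mean((rescale(a), rescale(b))) for a, b in zip(rater1, rater2)]
```

Rescaling to [0,1] is why a team's odds ratio can be read as the contrast between the lightest and darkest rated players.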
A range of potential independent variables concerning the player, the referee, or the dyad was included in the data. The complete codebook is available at: https://osf.io/9yh4x/. For players, data included their typical position, weight, and height at the time of data collection, and for referees, their country of origin. For each dyad, data included the number of games in which referees and players encountered each other and the number of yellow and red cards awarded. The variables of age, club, and league were only available for players at the time of data collection, not at the time of the particular red card sanctioning. To protect their identities given the sensitivity of the research topic, referees were anonymized and listed by a numerical identifier for each referee and for each country of origin. For the country of each referee, we included average scores of implicit and explicit preferences for light vs. dark skin tone that had been gathered in independent research by Project Implicit (30, 31). Implicit preference scores for each referee country had been calculated using a skin tone Implicit Association Test (IAT) (32), a speeded response task that assesses strength of associations. Higher scores on the IAT reflect a stronger automatic association between dark skin, relative to light skin, and negative valence. Explicit preference scores for each referee country were calculated using a feeling thermometer task, with higher values corresponding to greater self-reported feelings of positivity toward light skin tone versus dark skin tone. Both these national-level measures were created by aggregating data from many online users from referees' countries taking these tests on Project Implicit (https://implicit.harvard.edu/; see also (33)).
At registration we asked team leaders for their present opinion regarding the research questions, with a single question for each hypothesis, e.g. "How likely do you think it is that soccer referees tend to give more red cards to dark skinned players?"
After removing descriptions of the results, the structured summaries were collated into a single questionnaire and distributed to all the teams for peer review. The analytic approaches were presented in a random order and researchers were instructed to provide feedback on at least the first three approaches that they examined. Researchers were asked for both qualitative feedback as well as the assessment: "How confident are you that the described approach below is suitable for analyzing the research questions?", measured on a 7-point scale from 1 = Unconfident to 7 = Confident (see S3). Each team received feedback from an average of about 5 other teams (M = 5.32, SD = 2.87). The qualitative and quantitative feedback was aggregated into a single report and shared with all team members. As such, each team received peer review commentaries about their own and other teams' analysis strategies. Notably, these commentaries came from reviewers who were highly familiar with the data set, yet at this point teams were unaware of others' results (see https://osf.io/evfts/ and https://osf.io/ic634/ for the complete survey and round-robin feedback)...When researchers scrutinized others' results, it became apparent that differences in results may not only have been due to variations in statistical models, but also to variations in the choice of certain covariates. In a preliminary reanalysis, the leader of team 10 discovered that the controversial covariates league and country may be responsible for making some results appear non-significant. A debate emerged regarding whether the inclusion of these covariates was quantitatively defensible (see https://osf.io/2prib/). The project coordinators thus asked the 10 teams who had included these variables in their final models to re-run their models without said covariates. Additionally, we asked these teams to decide whether to keep their prior version or use the results from the updated analysis.
The results displayed in the manuscript reflect teams' choices of their final model.
From the 79 researchers who initially registered for the crowdstorming project, 33 teams were formed and submitted an initial analytical approach. Of those, 29 teams also submitted a final report. Submitted analytical approaches were diverse, ranging from simple linear regression techniques to complex multilevel regression techniques and Bayesian approaches. Table 1 shows each team's analytic technique, reported effect size, and a number of characteristics describing how their model was specified (e.g., the number of covariates used in the analysis). In total, there were 21 unique combinations of covariates among the 29 teams. Apart from the variable 'games', which was used by all teams, just one covariate (player position, 62%) was used in more than half of the analytic strategies, and three were used in just one analysis. Two sets of covariates were used by three teams each, and four sets of covariates were used by two teams each. Each of the other 15 teams used a combination of covariates unique to that team. Table 1 shows variation in analytic strategies for number of covariates (M = 2.83, SD = 2.05), treatment of the non-independent structure of the data, statistical distribution chosen for the outcome, and reported effect sizes. More detail regarding specific covariates chosen by each team can be seen in Table 2. Reasons that teams gave for their initial inclusion/exclusion of particular covariates can be found at https://osf.io/sea6k/.
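The "unique combinations of covariates" bookkeeping above can be sketched in a few lines; the team names and covariate sets here are made up for illustration (the real per-team choices are in Table 2):

```python
from collections import Counter

# Hypothetical team -> covariate-set mapping, for illustration only
team_covariates = {
    "team_a": {"games", "position"},
    "team_b": {"games", "position"},
    "team_c": {"games", "league", "club"},
    "team_d": {"games"},
}

# Frozensets make covariate combinations hashable, so identical sets
# collapse into one Counter key regardless of insertion order.
combo_counts = Counter(frozenset(c) for c in team_covariates.values())
n_unique = len(combo_counts)                            # distinct covariate sets
shared = {c for c, n in combo_counts.items() if n > 1}  # sets used by >1 team
```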
For the primary research question, researchers' conclusions varied regarding whether or not soccer referees were more likely to give red cards to dark skin toned players than light skin toned players. Fig. 1 shows the effect sizes and 95% confidence intervals alongside the description of the analytic approach provided by each team. Statistical results ranged from 0.89 (slightly and non-significantly negative) to 2.93 (moderately positive) in odds ratio units, with a median of 1.31. From a null hypothesis significance testing standpoint, twenty teams (69%) found a significant positive effect and nine teams (31%) observed a non-significant relationship. No team reported a significant negative relationship.
Overall, teams who employed logistic or Poisson models reported estimates that were somewhat larger than teams using linear models. More specifically, 15 teams used logistic models (11/15 significant, median OR = 1.34, MAD = 0.07), six teams used Poisson models (4/6 significant, median OR = 1.36, MAD = 0.08), six teams used linear models (3/6 significant, median OR = 1.21, MAD = 0.05), and two teams used models classified as miscellaneous (2/2 significant).
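The MAD figures above are median absolute deviations of the reported odds ratios within each model class, a robust measure of spread. A minimal implementation, where the example ORs are invented rather than the teams' actual estimates:

```python
from statistics import median

def mad(xs):
    """Unscaled median absolute deviation from the median."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

# Invented odds ratios standing in for one model class's reported estimates
reported_ors = [1.20, 1.28, 1.34, 1.36, 1.48]
spread = mad(reported_ors)
```

Unlike a standard deviation, the MAD is barely moved by one team reporting an extreme OR, which is why it suits a small, heterogeneous set of estimates.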
Teams also varied in their approaches to handling the non-independence of players and referees, which resulted in variability in both median estimates and rates of significance. In total, 15 teams used random effects (12/15 significant, median OR = 1.32, MAD = 0.12), eight teams used clustered standard errors (4/8 significant, median OR = 1.28, MAD = 0.13), five teams did not account for this non-independence (4/5 significant, median OR = 1.39, MAD = 0.28), and one team used fixed effects for the referee variable (0/1 significant, OR = 0.89).
After the discussion, and before seeing the draft of this report, most teams agreed moderately that the data showed a positive relationship between number of red cards and player skin tone. In this final survey, a set of supplementary items assessing agreement with more nuanced beliefs (e.g., "There is little evidence for an effect," "The effect is positive and due to referee bias") revealed greatest endorsement (78% agreement) of the position that "The effect is positive and the mechanism is unknown" (M = 5.32, SD = 1.47 on a scale ranging from 1 = strongly disagree to 7 = strongly agree; see S7 for more details).
Here, we demonstrate that variation in effect size is also present in the same data, contingent on choices and assumptions in the analysis process. We observed variation in the effect estimates of whether soccer referees gave more red cards to dark skin toned players. We also observed convergence on the discrete judgment of whether there was a positive effect in the data. These crowdsourcing results illustrate both the contingency of effects as a function of analytic choices, and the opportunity for converging beliefs through shared examination and evaluation of a research question using a shared data set. The median result (OR = 1.31) indicated that the odds were 31% higher for players rated as having the darkest skin tone to receive a red card when compared to players rated as having the lightest skin tone. Assuming a 40 game season, these results suggest that the probability of receiving at least one red card over a season is 15.2% for a player with the darkest skin tone and 11.8% for a player with the lightest skin tone. The estimated effects ranged from 0.89 to 2.93 in odds ratio units (1.0 indicates a null effect), with zero teams finding a negative effect, nine teams finding no relationship, and twenty teams finding a positive effect. If, as in virtually all other research projects, a single team had conducted the study, selecting randomly from the present teams, there would have been a 69% likelihood of reporting a positive result and a 31% likelihood of reporting a null effect.
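The seasonal probabilities quoted above can be reproduced approximately, assuming the median OR of 1.31 applies to the per-game odds and taking the 11.8% lightest-skin-tone season figure as the given baseline (both of those modeling choices are assumptions of this sketch, not stated in the text):

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

SEASON_GAMES = 40
MEDIAN_OR = 1.31
p_light_season = 0.118  # from the text: >=1 red card per season, lightest tone

# Per-game probability implied by the seasonal baseline
p_light_game = 1 - (1 - p_light_season) ** (1 / SEASON_GAMES)

# Apply the odds ratio at the per-game level, then re-aggregate to a season
p_dark_game = prob(odds(p_light_game) * MEDIAN_OR)
p_dark_season = 1 - (1 - p_dark_game) ** SEASON_GAMES  # ~0.152
```

The conversion through odds rather than raw probabilities matters: multiplying a probability by 1.31 directly would slightly overstate the effect, though at per-game rates this small the two are nearly identical.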
Crowdsourcing of data analysis is inefficient in that numerous analysts conduct multiple rounds of data analysis to answer a single research question. But, consider that inefficiency in comparison to the status quo in which a research question is examined and reported using a single analysis strategy. Conventional practice makes little accommodation for the possible contingency of the results on the analytic method (2, 6) . Moreover, misspecification of results via analysis strategy is virtually undetectable without an ethic of open data and community review of analytic strategies. It is conceivable that the relative inefficiency trade-offs would actually produce a net benefit by having many independent analysts for a complex data set compared to the currently prevalent practice of individual analysis teams providing stand-alone analyses of privately held data. Further, the use of 29 independent teams helped us illustrate the variation in analytic strategies and conclusions, but - in practice - fewer independent teams may be needed to assess robustness of conclusions."
#statistics #replication