If science is an objective means of seeking truth, it’s also one that requires human judgments. Let’s say you’re a psychologist with a hypothesis: People understand that they may be biased in unconscious ways against stigmatized groups; they will admit this if you ask them. That might seem like a pretty straightforward idea—one that’s either true or not. But the best way to test it isn’t necessarily obvious. First, what do you mean by negative stereotypes? Which stigmatized groups are you talking about? How would you measure the extent to which people are aware of their implicit attitudes, and how would you gauge their willingness to disclose them?

These questions could be answered in many different ways, and those choices, in turn, can lead to vastly different findings. A new crowdsourced experiment—involving more than 15,000 subjects and 200 researchers in more than two dozen countries—demonstrates that point. When various research teams designed their own means of testing the very same set of research questions, they came up with divergent, and in some cases opposing, results.

The crowdsourced study is a dramatic demonstration of an idea that’s been widely discussed in light of the reproducibility crisis—the notion that subjective decisions researchers make while designing their studies can have an enormous impact on their observed results. Whether through p-hacking or via the choices they make as they wander the garden of forking paths, researchers may intentionally or inadvertently nudge their results toward a particular conclusion.

The new paper’s senior author, psychologist Eric Uhlmann at INSEAD in Singapore, had previously spearheaded a study that gave a single data set to 29 research teams and asked them to use it to answer a simple research question: “Do soccer referees give more red cards to dark-skinned players than light-skinned ones?” Despite analyzing identical data, no two teams came up with exactly the same answer. In that case, though, the groups’ findings did generally point in the same direction.

The red card study showed how decisions about how to analyze data could influence the results, but Uhlmann also wondered about the many other decisions that go into a study’s design. So he initiated this latest study, an even larger and more ambitious one, which will be published in Psychological Bulletin (the data and materials are shared openly online). The project started with five hypotheses that had already been tested experimentally but whose results had not yet been published.

Aside from the hypothesis about implicit associations described above, the others concerned things like how people respond to aggressive negotiating tactics or what factors could make them more willing to accept the use of performance-enhancing drugs among athletes. Uhlmann and his colleagues presented the same research questions to more than a dozen research teams without telling them anything about the original study or what it had found.

The teams then independently created their own experiments to test the hypotheses under some common parameters. The studies had to be carried out online, with participants in each drawn at random from a common pool. Each research design was run twice: once on subjects pulled from Amazon’s Mechanical Turk and then again on a fresh set of subjects recruited through a survey company called Pureprofile.

The published study materials show how much variation there was across research designs. To test the first hypothesis, for example, which held that people are aware of their unconscious biases, one team simply asked participants to rate their agreement with the following statement: “Regardless of my explicit (i.e. conscious) beliefs about social equality, I believe I possess automatic (i.e. unconscious) negative associations towards members of stigmatized social groups.” Based on responses to this question, the team concluded that the hypothesis was false: People do not report an awareness of having implicit negative stereotypes.