The release of large language models like ChatGPT (a question-answering chatbot) and Galactica (a tool for scientific writing) has revived an old conversation about what these models can do. Their capabilities have been presented as extraordinary, mind-blowing, autonomous; fascinated evangelists have claimed that these models contain “humanity’s scientific knowledge,” are approaching artificial general intelligence (AGI), and even resemble consciousness. However, such hype is not much more than a distraction from the actual harm perpetuated by these systems. People get hurt by the very practical ways such models fall short in deployment, and these failures are the result of their builders’ choices, decisions we must hold them accountable for.
Among the most celebrated AI deployments is that of BERT, one of the first large language models developed by Google, to improve Google search engine results. However, when a user searched for how to handle a seizure, they received answers on exactly what not to do, including being told to inappropriately “hold the person down” and “put something in the person’s mouth.” Anyone following the directives Google provided would thus be instructed to do the exact opposite of what a medical professional would recommend, potentially resulting in death.
The Google seizure error makes sense, given that one of the known vulnerabilities of LLMs is their failure to handle negation. Allyson Ettinger demonstrated this years ago with a simple study. When asked to complete a short sentence, the model would answer 100 percent correctly for affirmative statements (“a robin is …”) and 100 percent incorrectly for negative statements (“a robin is not …”). In fact, it became clear that the models could not actually distinguish between the two scenarios and provided the exact same responses (using nouns such as “bird”) in both cases. This remains an issue today, and it is one of the rare linguistic skills that does not improve as the models increase in size and complexity. Such errors reflect broader concerns raised by linguists about how much such artificial language models effectively operate via a trick mirror: they learn the form of what the English language might look like without possessing any of the inherent linguistic capabilities that would demonstrate actual understanding.
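The kind of check described above can be sketched in a few lines. This is only an illustration of the idea, not Ettinger's actual experimental setup: `negation_insensitive` and `toy_complete` are hypothetical names, and `toy_complete` is a hand-written stand-in for a real model that simply ignores the word “not.”

```python
from typing import Callable, List


def negation_insensitive(
    complete: Callable[[str], List[str]],
    affirmative: str,
    negated: str,
) -> bool:
    """Return True if the model gives identical completions for both prompts."""
    return complete(affirmative) == complete(negated)


def toy_complete(prompt: str) -> List[str]:
    # Hypothetical stand-in for a language model (an assumption for
    # illustration only): it drops "not" before choosing a completion,
    # so it cannot tell the two prompts apart.
    words = [w for w in prompt.lower().split() if w != "not"]
    if "robin" in words:
        return ["bird"]
    return ["thing"]


if __name__ == "__main__":
    same = negation_insensitive(toy_complete, "A robin is a", "A robin is not a")
    print("Same completions for both prompts:", same)
```

A model that handled negation would return different completions for the two prompts; the toy model here, by construction, does not.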
Additionally, the creators of such models confess to the difficulty of addressing inappropriate responses that “do not accurately reflect the contents of authoritative external sources.” Galactica and ChatGPT have generated, for example, a “scientific paper” on the benefits of eating crushed glass (Galactica) and a text on “how crushed porcelain added to breast milk can support the infant digestive system” (ChatGPT). In fact, Stack Overflow had to temporarily ban the use of ChatGPT-generated answers as it became evident that the LLM generates convincing but wrong responses to coding questions.
Several of the potential and realized harms of these models have been exhaustively studied. For instance, these models are known to have serious issues with robustness. The sensitivity of the models to simple typos and misspellings in the prompts, and the differences in responses caused by even a simple rewording of the same question, reveal an inconsistency that makes them unreliable for actual high-stakes use, such as translation in medical settings or content moderation, especially for those with marginalized identities. This is in addition to a slew of now well-documented roadblocks to safe and effective deployment, such as how the models memorize sensitive personal information from the training data, or the societal stereotypes they encode. At least one lawsuit has been filed, claiming harm caused by the practice of training on proprietary and licensed data. Dishearteningly, many of these “recently” flagged issues are actually failure modes we’ve seen before: the problematic prejudices being spewed by the models today were seen as early as 2016, when Tay the chatbot was released, and again in 2019 with GPT-2. But as models get larger over time, it becomes harder and harder to document the details of the data involved and justify their environmental cost.
Yet ongoing asymmetries of blame and praise persist. Model builders and tech evangelists alike attribute impressive and seemingly flawless output to a mythically autonomous model, a technological marvel. The human decision-making involved in model development is erased, and a model’s feats are observed as independent of the design and implementation choices of its engineers. But without naming and recognizing the engineering choices that contribute to the outcomes of these models, it’s almost impossible to acknowledge the related responsibilities. As a result, both functional failures and discriminatory outcomes are also framed as devoid of engineering choices—blamed on society at large or supposedly “naturally occurring” datasets, factors the companies developing these models claim they have little control over. But they do have control, and none of the models we are seeing now are inevitable. It would have been entirely feasible to make different choices that resulted in entirely different models being developed and released.