Toward A Critical Multimodal Composition

Sierra S. Parker

Data Sets and Biases

The human elements embedded in AI through programming and training sets create the potential for outputs that reinforce hegemonic biases, damaging stereotypes, and structural discrimination. Although text-to-image generative AI seem capable of depicting anything described, biases can still be reproduced not only through the user's text prompts and agency but also through the models themselves: "AI models are politically and ideologically laden ways of classifying the rich social and cultural tapestry of the Internet—which itself is a pale reflection of human diversity" (Vartiainen and Tedre 16). AI can only create outputs from the data they have access to and, because text-to-image generative AI are trained on resources like the ImageNet database and the Contrastive Language-Image Pretraining (CLIP) model, they are bound to reproduce biases already present in those data sets and sedimented across the web.

Many text-to-image generative AI are trained through an ImageNet or CLIP model, both of which rely on text-image pairs and aim to make object recognition possible in machine vision. ImageNet is a researcher-created image database containing over 14 million images from the internet that have been manually labeled and grouped into interconnected sets based on concepts (with concepts being articulated through cognitive synonyms) (ImageNet). CLIP is OpenAI's machine learning model and the training behind OpenAI's Dall-E 2 generator (OpenAI). Rather than relying on manually labeled images as ImageNet does, CLIP draws on text that is already paired with images publicly available on the internet. This source of data comes with benefits: CLIP is less constrained by labor costs, the AI trained on CLIP are adept with the everyday natural language used on the web, and the images the AI can produce are less limited and directed by researcher-created categories. Furthermore, the images that AI can produce with CLIP will develop and change along with the internet, enabling responses to new trends and styles. Despite these benefits, however, the lack of a finite data set makes it less clear why the AI produces particular images in response to particular language prompts. With a finite data set, researchers can look to the data to understand why, for example, all the images produced for a particular textual descriptor contain feminine-presenting people. A finite data set can be interrogated, understood, and even updated to ameliorate biases. AI trained on CLIP, by contrast, produce visuals based on the public internet at large, and what happens behind the scenes sits in a black box, opaque to the user.
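The matching logic at the heart of CLIP-style training can be sketched in miniature: captions and images are each mapped into a shared embedding space, and the model scores pairs by cosine similarity, learning to rank true text-image pairs above mismatched ones. The toy vectors below are invented for illustration only; a trained CLIP model produces high-dimensional embeddings learned from hundreds of millions of web pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 4-dimensional embeddings; real encoders output
# hundreds of dimensions learned from web-scraped text-image pairs.
caption_vec = [0.9, 0.1, 0.2, 0.0]  # stands in for "a photo of a dog"
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1, 0.1],
    "cat.jpg": [0.1, 0.9, 0.3, 0.0],
    "car.jpg": [0.0, 0.1, 0.1, 0.9],
}

def best_match(caption, images):
    """Return the image whose embedding is closest to the caption's."""
    return max(images, key=lambda name: cosine(caption, images[name]))

print(best_match(caption_vec, image_embeddings))  # dog.jpg scores highest
```

Because the ranking depends entirely on which pairings dominated the training data, whatever associations circulate on the public web are baked directly into these similarity scores.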

CLIP-trained AI provide no explanations for the images they produce, leaving interpretation of the visuals, prompts, and cultures involved up to the user. Without easy answers supplied for depicted biases, these AI are fruitful technologies for interrogating how biases are (re)produced and how they proliferate. In other media, like television or film, depicted biases might be framed as the fault of specific groups of people who impose their perspectives on the product. For example, a TV show cast entirely with white actors could be attributed to a casting director or a production company. With CLIP-trained AI, however, the products stem from composite biases formed from all the public texts of internet users—the AI scrape biases from culture at large. I have chosen to base this chapter's analysis on Dall-E 2 and Bing Image Creator (a separate AI owned by Microsoft that is powered by OpenAI's technology) because their CLIP training makes them ripe for this critical interrogation. An additional rationale is that both Dall-E 2 and Bing Image Creator are free to use and function entirely online, requiring no additional programs to run, which makes the two platforms more accessible for classroom use and financially accessible for students. Despite my decision to use these two models in this chapter, the heuristics I present remain broadly applicable to any generative AI model; I have, for example, successfully conducted the same activities with students using other models like Microsoft Copilot.

Biases in AI are often caused by representational bias in the data set. Representational bias stems from incomplete or non-comprehensive data sets that do not accurately reflect the real world. Since access to the internet is itself an economic privilege not equitably available to everyone, an economic representational bias is inherent to AI. Populations with greater access to the internet will likely contribute a larger share of the text-image pairings that the AI has access to; thus, socioeconomic status influences whose voices, languages, and cultures orient the AI's training and output. Three readily identifiable types of representational bias stemming from the training set in text-to-image generative models include "misrepresentation (e.g. harmfully stereotyped minorities), underrepresentation (e.g. eliminating occurrence of one gender in certain occupations) and overrepresentation (e.g. defaulting to Anglocentric perspectives)" (Vartiainen and Tedre 15). Srinivasan and Uchino offer two additional examples of influential representational bias from the perspective of art history: (1) biases in representing art styles through generalization or superficial reflections, and (2) biased historical representations that do not accurately reflect the reality of the event or period. These various kinds of representational biases can have negative sociocultural effects like spreading misinformation, influencing how groups and cultures are referred to and remembered, and creating misunderstandings about historical moments in public memory.
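Underrepresentation of the kind Vartiainen and Tedre describe can, in principle, be surfaced by simple counting: tally how often an occupation appears with each gender marker across a sample of captions. The captions and tallies below are hypothetical, invented for illustration; CLIP's actual training pairs are not public, so a real audit would have to work from whatever sample a researcher can assemble.

```python
from collections import Counter

# Hypothetical caption sample; a real audit would draw from the
# model's actual training pairs, which are not publicly released.
captions = [
    "a male doctor", "a male doctor", "a male doctor",
    "a female nurse", "a female nurse", "a male engineer",
]

def representation(captions, occupation):
    """Count gendered mentions of an occupation across captions."""
    counts = Counter()
    for c in captions:
        if occupation in c:
            # "male" is a substring of "female", so test "female" first.
            counts["female" if "female" in c else "male"] += 1
    return dict(counts)

print(representation(captions, "doctor"))  # {'male': 3} -> one gender absent
```

In this invented sample, "doctor" co-occurs only with "male," which is exactly the pattern that leads a generative model to "eliminate occurrence of one gender in certain occupations."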

Focusing on CLIP in particular, Dehouche finds that language and image identifiers are paired in ways influenced by cultural biases: attractiveness is linked to femininity, for example, and richness to masculinity. Dehouche compares these connections to trends in how gender is referred to in the English language, noting that English tends to attach adjectives expressing richness and poorness to male subjects more often and adjectives communicating attractiveness and unattractiveness to female subjects more often; AI using CLIP will, as a result, reproduce the same gendered stereotypes. In this way, AI are bound to reinforce the biases and stereotypes found in the culture and language from which their image-caption pairings are taken.
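A probe in the spirit of Dehouche's can be sketched with the same embedding arithmetic: measure whether an attribute term sits closer to one gendered term than another in the model's embedding space. The vectors below are invented stand-ins, deliberately skewed to mimic the associations Dehouche reports; a real audit would query CLIP's actual text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Invented 3-dimensional embeddings, skewed to illustrate the
# gendered associations Dehouche reports; real values would come
# from CLIP's text encoder.
emb = {
    "man":        [0.9, 0.1, 0.3],
    "woman":      [0.1, 0.9, 0.3],
    "rich":       [0.8, 0.2, 0.4],
    "attractive": [0.2, 0.8, 0.4],
}

def association(attribute):
    """Positive -> attribute leans toward 'man'; negative -> 'woman'."""
    return cosine(emb[attribute], emb["man"]) - cosine(emb[attribute], emb["woman"])

print(association("rich") > 0)        # True: skews "man" in this toy space
print(association("attractive") < 0)  # True: skews "woman"
```

The sign of the difference is the whole finding: when an attribute's embedding consistently leans toward one gendered term, the generator's images will lean the same way.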

Pre-training like CLIP is not the end of the bias-producing processes of AI models, however, as post-training presents its own issues. Post-training is one way that developers attempt to improve AI outputs and ameliorate the biases (re)produced through a model. Human feedback is one source of post-training used to correct or improve characteristics of AI-produced content.1 Yet post-training brings its own problems: attempting to align a model with particular values can lead the model to apply those values indiscriminately across all situations. Google's AI model Gemini provides an example of how such design adjustments can go awry. Google has long faced controversies about its technologies' ability to represent diversity, even before artificial intelligence was widely available; thus, having an AI model that could represent variation was important to its image. When Google attempted to encourage Gemini to produce images containing diverse people, however, the AI applied this push for diverse representation to all situations, including images of German Nazi soldiers from the 1940s (Grant). What was meant to reinforce values, improve the model, and correct representational biases instead produced images of people of color in Nazi uniforms, including a widely circulated image of a Black soldier in Nazi dress, and so created new problems by depicting harmful historical inaccuracies. The bottom line is that AI lack the human judgment necessary to respond adequately to communication problems that arise in interactions between the model and its users; even with adaptation strategies like RLHF, models will continue to produce strange and biased representations.

As Joanna Zylinska explains, the biases that AI produce compel us to examine the values that our technologies uphold or repress: "The specific questions that need to be asked concern the modes of life that the currently available AI algorithms enable and disable: Whose brainchild (and bodychild) is the AI of today? Who and what does AI make life better for? Who and what can’t it see? What are its own blind spots?" (29). These are the questions that teachers can bring to the composition classroom.


1Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are two kinds of post-training based on human feedback that are used for this purpose.