The Black-Boxed Ideology of AWE
Antonio Hamilton and Finola McMahon
Conclusion: Unpacking the Black-Box
Recall Ernst's (2020) assertion that "AWE used for formative assessment does not improve student writing and may even worsen it." That finding underscores how, because the AWE is black-boxed, we do not know concretely what guides its assessment choices. In our study, we found it difficult to know how the AWE assesses writing (and when we consider this assessment being applied differently across programs, the difficulty increases exponentially). This lack of knowledge could be why AWE may worsen an individual's writing. Some may argue that this black-boxing simply reflects rubric grading or a writing instructor's unwritten grading criteria. The difference, however, is that a writer can engage an instructor in dialogue about writing as a socially embedded practice. With AWE, the writer is instead pressured to accept the assessment, while the programmers conceal themselves behind a digital wall of unknown code enacted by an untalkative system of rules. This situation forces the writer not only to understand the writing they are doing, but also to work out why the AWE assessed them the way it did. How often have we sat next to a person (or been the person ourselves) playing a digital game and yelling when the software glitched or disrupted the player unexpectedly? The same frustration, we believe, arises when an AWE gives a writer an unexplained, unwanted assessment.
The companies behind these AWE would likely cite proprietary reasons for keeping their algorithms occluded from the public. But in doing so, they create technical illiteracy not just about their algorithms but about the output of those algorithms. These systems ask the writer to trust blindly and to treat the programs as purveyors of writing knowledge. While completely "unboxing" the algorithm may not be a realistic possibility in the near future, AWE developers should at the very least be forthcoming about the writing assessment research that guided the building of the algorithm and the types of data informing the system's feedback. Christian (2020) aptly recognizes the potential danger when we do not know who or what is represented in the training data algorithms rely on to perform their functions. And because the history of writing is so vast, programmers are necessarily making choices about what they find useful for the AWE. That is, if they are considering the history of writing at all rather than programming from their own writing education. Knowing those choices would help the writer understand what a particular AWE can actually provide. This knowledge would rest not just on what the AWE advertises, but on the true capabilities and effectiveness of the algorithm's assessment. It matters all the more when the social engagement of the writing process is shifted from human to computer.

This study was preliminary research meant to contribute to conversations about AWE's functions in writing assessment. These programs are often regarded as beneficial for their utility in labor management or their instant accessibility. Feedback and assessment, however, are an integral part of the writing process, and it would be remiss to ignore their shortcomings or to decline to critique them closely, especially when these programs are being used in subject fields beyond English.
These conversations must also be understood in connection with those around large language models (LLMs), such as GPT-3 and ChatGPT, which would also help us understand how writing is conducted in online environments beyond assessment. Similar concerns remain: what writing styles do these models value? With LLMs, algorithmic black-boxing is still a major issue because we do not explicitly know what data informs the system or what writing histories shaped the algorithm's output, even if some of these systems provide general information about their data sources, such as OpenAI's scraping of the internet.
Inevitably, these programs are becoming increasingly prevalent in everyday life. Think of how many people now use Grammarly in academic and other contexts. Given that prevalence, we cannot simply accept AWE as they are or trust that they function as they claim. We must take the time to truly unpack and understand these algorithms, or at the very least the information that informs them. What standards and norms of writing are we enforcing? Whom might those standards exclude, and will they reinforce harmful power structures that devalue a student's language? And how do these AWE capture or reject students' individual writing styles and choices? To know, we need to unpack the black-box.