Large language models (LLMs) and generative artificial intelligence (GenAI) have a plagiarism problem. And it’s not just confined to individuals seeking an unfair advantage. The problem has been baked in since the beginning of GenAI.
As third-level colleges wind down and assignments are graded, it is universally acknowledged that students are using chatbots and virtual assistants much more often. The most common use seems to be deploying GenAI (artificial intelligence capable of generating text, images and other data, typically in response to prompts) to tweak and improve a completed essay.
Alternatively, people use AI-generated content and then fact-check and paraphrase it. Both methods avoid the hallucination problem, that is, a virtual assistant merrily making up plausible answers and references. Is this just an updated version of using essays from a student a year or two ahead of you? It is still cheating. But the long-term consequences of relying on AI are much more far-reaching.
The same plagiarism problem exists with coursework at Leaving Cert level. There is significant disquiet about senior-cycle reform, which mandates that every subject will have what are called additional assessment components – that is, assessments that are not traditional written exams. Currently, before senior-cycle reform is completed, only 12 of the 41 subjects offered at Leaving Cert level do not have an additional assessment component. (Many of these components, such as oral exams, are not susceptible to the use of AI but others, such as research projects and essays, definitely are.)
Undisclosed use of GenAI has also infested scientific research. One study (where the researchers obviously had a highly developed sense of irony) analysed scientific peer review in AI conferences after the advent of ChatGPT. The results showed that somewhere between 6.5 per cent and 16.9 per cent of text submitted as peer reviews to these conferences could have been substantially modified by LLMs.
The greatly increased use of adjectives such as “commendable”, “meticulous”, and “intricate” was one of the giveaways. At the moment, AI flattens language to blandness but that will soon change.
The actor Scarlett Johansson was outraged recently when one of OpenAI’s new voice assistants for GPT-4o allegedly sounded so much like her that even friends and family were fooled. She had been approached twice for permission to use her voice by Sam Altman, OpenAI’s chief executive – you know, the guy who got fired and reinstated within a week for allegedly being less than candid with his board?
She said no both times. OpenAI denies that Sky, a breathy, flirtatious voice which is one of five options for conversing with GPT-4o, has anything to do with Johansson but has still withdrawn the voice. Johansson was the voice of Samantha in Spike Jonze’s 2013 movie Her, where Theodore, a lonely introvert played by Joaquin Phoenix, falls in love with the near-omniscient virtual assistant. Spoiler alert – Samantha is carrying on conversations with 8,316 other people as she talks to Theodore and is in love with 641 of them. Altman loves the movie.
Johansson is a powerful, rich person but other less well-known voice actors making a modest living allege their voices have been copied and used by GenAI companies. The New York Times reported the case of a couple named Paul Skye Lehrman and Linnea Sage, who got most of their voice acting gigs on Fiverr, a low-cost freelance site. The couple alleges that Lovo, a Californian start-up, created clones of their voices illegally and that this threatens their ability to make a living.
The New York Times itself has to date spent $1 million suing OpenAI and Microsoft. It claims the companies went beyond fair use not only by training their LLMs on New York Times articles but also by reproducing pieces virtually word for word in chatbot answers.
The mechanisms used by GenAI companies to train their LLMs are like something from science fiction. Immense, unfathomable amounts of data are needed. A team led by New York Times technology correspondent Cade Metz discovered that by 2021 OpenAI had already used every respectable source of English-language text on the internet – Wikipedia, Reddit and millions of websites and digital books. So OpenAI invented technology to transcribe video and audio, including YouTube.
You might think that Google, which owns YouTube, would have objected to this blatant breach of YouTube’s own rules, but no. Metz alleges that Google could not point the finger at OpenAI because it was busy doing the same thing. Authors, visual artists, musicians and computer programmers are just some of the groups suing GenAI companies for using their work without permission and thereby threatening their livelihoods.
Ireland depends on big tech for economic viability. GenAI is the latest profit-seeking battleground but appears to be based on levels of plagiarism that make students furtively trying to improve essays by using ChatGPT look like rank amateurs.
GenAI is an extraordinary, world-changing technological development but is built on unpaid, stolen labour. AI is currently mostly feeding off the creativity and livelihoods of the vulnerable and powerless. But cheating has a way of coming back to bite the cheater. Could our sanguine embrace of AI’s cheating heart hasten the redundancy not just of the vulnerable but also of countless roles that once were the exclusive province of human beings?