Despite bold claims that generative models could ace law exams, new research has revealed the technology still struggles to compete in the classroom.
When ChatGPT was first unveiled to the world back in late 2022, educators and students alike were rattled by big claims from its developer, OpenAI, that the generative tech could ace notoriously tough law exams at the press of a simple prompt.
And these boasts weren’t pulled out of thin air. Supposedly, the generative AI went head-to-head against human test takers in a simulated version of the US bar exam, only to emerge victorious by scoring higher than 90 per cent of the cohort (or so the story goes).
It’s an interesting concept, and one that fascinated University of Wollongong (UOW) law expert Dr Armin Alimardani so much that he wanted to investigate further, to find the answers to some important questions.
“The OpenAI claim was impressive and could have significant implications in higher education. For instance, does this mean students can just copy their assignments into generative AI and ace their tests?”
Alimardani decided to put the AI tools to the test to see how they would fare in a real classroom setting. The experiment now forms the basis of his new paper, Generative Artificial Intelligence vs. Law Students: An Empirical Study on Criminal Law Exam Performance, published earlier this week in the Journal of Law, Innovation and Technology.
“Many of us have played around with generative AI models and they don’t always seem that smart, so I thought why not test it out myself with some experiments,” he says.
The experiment
In his role as subject coordinator of Criminal Law during the second semester of 2023, Alimardani was in charge of posing questions for the final exam that department tutors would mark.
But unbeknownst to the educators, ten ‘AI students’ would be joining the mix: half of the answers were generated by simply pasting the question into different versions of ChatGPT, while the other half used a variety of prompt engineering techniques to enhance the responses.
“My research assistant and I hand-wrote the AI-generated answers in different exam booklets and used fake student names and numbers. These booklets were indistinguishable from the real ones,” says Alimardani.
“Each tutor unknowingly received and marked two AI papers and my mission impossible was accomplished.”
The results
Of the 225 students who took the test, Alimardani found the average score was 40 out of 60 possible marks (around 66 per cent), with the AI papers typically performing in the lower percentiles.
“For the first lot of AI papers, which didn’t use any special prompt techniques, only two barely passed and the other three failed,” he says.
“The best performing paper was only better than 14.7 per cent of the students. So this small sample suggests that if the students simply copied the exam question into one of the OpenAI models, they would have a 50 per cent chance of passing.”
The other five papers that used prompt engineering tricks performed better, although they were still far from the results seen in OpenAI’s own testing.
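The article doesn’t specify which prompt engineering tricks were used, but a common approach in legal-exam settings is to wrap the raw question in a role, a required answer structure (such as IRAC: Issue, Rule, Application, Conclusion), and an instruction to stick to the facts given. The sketch below is purely illustrative — the function names and wording are hypothetical, not taken from the study:

```python
# Hypothetical illustration of prompt engineering vs. plain copy-paste.
# Neither function is from the study; they only show how a structured
# prompt differs from pasting the exam question in verbatim.

def plain_prompt(question: str) -> str:
    """The 'copy-paste' baseline: the exam question, verbatim."""
    return question


def engineered_prompt(question: str) -> str:
    """Wraps the question in a role, an IRAC answer structure,
    and a constraint to use only the facts provided."""
    return (
        "You are a law student sitting a criminal law exam.\n"
        "Answer using the IRAC structure: Issue, Rule, Application, "
        "Conclusion.\n"
        "Rely only on the facts and legal principles stated in the "
        "question; do not invent authorities.\n\n"
        f"Exam question:\n{question}"
    )


if __name__ == "__main__":
    q = "Discuss the criminal liability of the parties in the scenario."
    print(engineered_prompt(q))
```

Either string would then be sent to the model in place of the bare question; the point is that the engineered version constrains both the format and the sources the model may draw on.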
“Three of the papers weren’t that impressive but two did quite well. One of the papers scored 44 (about 73 per cent) and the other scored 47 (about 78 per cent),” Alimardani says.
“Overall, these results don’t quite match the glowing benchmarks from OpenAI’s United States bar exam simulation and none of the 10 AI papers performed better than 90 per cent of the students.”
What does this mean for education?
While human-generated content might still win out in the classroom (for now), Dr Alimardani says a notable takeaway was that none of the tutors suspected any of the papers were AI-generated; all were genuinely surprised when they found out.
“Three of the tutors admitted that even if the submissions were online, they wouldn’t have caught it,” he says.
“So if academics think they can spot an AI-generated paper, they should think again.”
He also initially thought ‘hallucination’ – the fabrication of information – would crop up in the generated papers. However, the testing found the models performed well at sticking to the legal principles and facts provided in the exam.
Instead, Alimardani says the real problem was ‘alignment’ – the degree to which AI-generated outputs match the user’s intentions.
“The AI-generated answers weren’t as comprehensive as we expected. It seemed to me that the models were fine-tuned to avoid hallucination by playing it safe and providing less detailed answers,” he says.
“My research shows that people shouldn’t get too excited about the performance of GenAI models in benchmarks. The reliability of benchmarks may be questionable, and the way they evaluate models could differ significantly from how we evaluate students.”
Implying that graduates who know how to work with AI could have an advantage in the job market, Alimardani says it is up to educators to teach students how to use generative tools effectively in their writing.
“Prompt engineering can significantly enhance the performance of GenAI models, and therefore it is more likely that future employers would have higher expectations regarding students’ GenAI proficiency.
“It’s likely students will be increasingly assessed on their ability to collaborate with AI to complete tasks more efficiently and with higher quality.
“Law schools and educators from other disciplines must focus on developing critical analysis skills to help students collaborate more effectively with AI, but it’s ultimately up to universities to make sure students learn the necessary knowledge and skills before they collaborate with AI.”