
Could generative AI improve the REF?

The task of reading and rating the thousands of outputs submitted to the UK’s Research Excellence Framework is notoriously Herculean. Could AI ease the burden – or would its use undermine the whole point of having REF panels? As Jisc consults on that question, four writers offer their views

June 5, 2025
A toy robot game advert: the amazing REF robot “always gives the right answer”, pointing to scores of 1 to 4.
Source: My Childhood Memories/Alamy (edited)

‘Generative AI is not thinking and knows nothing – it must be approached with caution’

Should we use artificial intelligence to assist both submission and panel assessments in REF 2029? What a question. The suggestion that we might, recently raised by a survey, has caused quite the kerfuffle.

What has been exposed in the fuss is the extent to which academic perspectives on AI and its uses diverge, and how little contact or understanding there appears to be between those varying perspectives.

Of course, generative AI has been causing a stir since the first versions of what is now ChatGPT were launched into the world in late 2022. The large language models (LLMs) that lie behind the interfaces were launched amid a rapture of prophesying; this was a technology that was either going to propel a great leap forward for humanity or herald a new age of enslavement, with the machines finally pushing their former masters aside. Businesses were told to embrace the future or risk extinction, while even some of the researchers involved in creating the tools warned gravely of risks to the survival of civilisation as we know it. Tech companies themselves made ever grander claims for the transformational power of their products, wooing politicians desperate to bask in a warming blast of technological white heat, scooping up huge quantities of investors’ money, and carving out unique exemptions from such trifling matters as copyright law and government regulation. Meanwhile, larger and larger data centres spring up, and energy demands skyrocket as environmental concerns are lost in the rush.

Inevitably, GenAI has already had a big impact on university life, but the responses from within different parts of academia have varied significantly. One of the most immediate reactions was a belief that all the safeguards against cheating in assessments put in place over the years, including plagiarism-detection software, were now basically useless: students would be able to type their essay questions directly into an AI interface and get back a fully formed assignment within seconds. Some initial studies suggested ChatGPT could easily ace all manner of assessments, provoking calls for a return to in-person, exam-based assessment and away from the reliance on essays established over previous decades.


Three years down the line, such fears look overblown – those headline-grabbing studies were not as robust as they first appeared, and frontline experience suggests that AI-written essays are bland, banal, repetitive and unoriginal, not as a result of current limitations of the technology but precisely because of its probabilistic workings. Hallucinations and absurdities abound because GenAI is not thinking and knows nothing, making it a resource to be approached with caution and care – a lesson that seems to have been learned by many students even as it continues to evade, say, some lawyers.

At the same time, there is a growing, but perhaps still furtive, sense that some GenAI tools and capabilities might be useful for research and teaching, even in the humanities. While many scholars, especially in such disciplines, wish to hold the line against any academic accommodation of AI – particularly because training LLMs has involved theft of copyright works on an epic scale and a potentially mind-boggling environmental price – there is every likelihood that the stable door is banging in the wind and its GPT Image 1-generated occupant long since departed.

The important thing will be to remain alert to the ethical dangers and sceptical of potential uses, while acknowledging that the tech is going to be hard to avoid. Academic publishers are licensing their titles en masse to AI companies, and the capabilities are now embedded in much of the software academics and students use on a daily basis. A flat refusal to countenance AI, or a blanket prohibition on doing so, will not help lecturers or students work out how to engage with it safely.

A REF robot operating a bulldozer, pushing aside books: the concern, especially in the humanities, that generative AI in REF assessment means the “machines pushing aside their masters”.
Source: Buyenlarge/Getty Images (edited)

This is all the more important because there are some areas of the university, and some disciplines, which have been less concerned about the advent of the AI age. Perhaps because the humanities, in particular, are focused on writing as a mode of thinking, and on intellectual originality as a core virtue and axis of assessment, machine writing appears an obvious threat. In areas with legitimately different disciplinary commitments, it can look much more like a potentially useful aid.

Such assumptions appear to underpin proposals for the deployment of AI tech not just in teaching or research but also in some of the key processes that govern university life. Hence the Bristol survey’s question of whether GenAI might be used not only in the writing of the narrative components of REF submissions, but also in the processes of panel assessment that will determine scores and results.

Nobody likes the REF – but it is the Procrustean bed we must apparently lie in if we are to have the QR funding on which so much research activity depends. Precisely because of the REF’s obvious limitations, reviewers and panellists try to bring both care and rigour to the impossible job of grading outputs, a process that leans heavily on the exercise of ineliminably human judgement.

To outsource any of that to AI would be to place far too great a trust in a technology that is not, in fact, intelligent, and whose workings remain in some ways opaque even to its creators. Here, at least, our way forward should be clear.

James Loxley is a professor of early modern literature at the University of Edinburgh.


‘A REF AI platform would free scholars for true intellectual appraisal’

The UK’s REF has become a monumental undertaking, requiring expert review of nearly 190,000 research outputs in the 2021 cycle, at substantial sector-wide cost. Unsurprisingly, the advent of generative AI has sparked considerable interest as institutions seek ways to reduce this burden.

Against this backdrop, a recent survey by the University of Bristol and Jisc asked, “Should generative AI tools be used for the REF?” We argue here that the question is already obsolete; AI use is widespread and accelerating. The real challenge is to ensure that GenAI is applied responsibly so that it strengthens the REF process and serves the research community.

Today, GenAI permeates many aspects of research. Commercial language models and AI tools routinely summarise manuscripts, identify citations, draft narratives and develop research arguments. Many academics and students already use these tools. Attempting to ban them simply drives their use underground, suppressing open discussion and stifling the sharing of best practice.

Yet rejecting a ban is not the same as endorsing uncontrolled adoption. Careless use of AI poses significant risks, as demonstrated when US attorneys were sanctioned for filing a brief containing six non-existent ChatGPT-generated precedents. Without a common framework, researchers, departments and institutions will each choose their own model, prompts, level of human review and disclosure. The result is a patchwork of practices, enabling well-resourced universities to establish responsible and technically advanced methods while less-funded peers fall behind, magnifying institutional inequities.

For REF 2029, the real choice facing the academic community is therefore this: rally behind the adoption of a commonly agreed platform, or accept unchecked proliferation of ad hoc, opaque tools. The only credible response is to seize the initiative through strategic development of a custom AI system underpinned by the REF’s stated principles: inclusion, equity and transparency. This way we can ensure that AI is used responsibly, with proper oversight and rigorous evaluation, and designed to complement human intelligence rather than replace it.

A purpose-built REF AI system would restrict algorithms to the tasks they perform reliably – ie, data aggregation, pattern detection and process assurance – while preserving core qualitative judgements for human experts. By delegating repetitive, data-heavy tasks to algorithms, REF panellists can focus on the nuanced assessments of originality, significance and rigour that define research excellence. Research excellence is multidimensional and contextual, best judged by experts who retain control over substantive evaluations while AI handles the clerical heavy lifting.
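To make that division of labour concrete, here is a minimal, purely illustrative Python sketch: the AI layer produces a clerical dossier (aggregation and process assurance only), while the star rating remains a human act. Every class, field and function name here is our own invention, not part of any actual REF system.

```python
from dataclasses import dataclass

@dataclass
class Dossier:
    """Clerical material the hypothetical AI layer aggregates for one output."""
    title: str
    citation_count: int         # data aggregation
    integrity_flags: list[str]  # process assurance, e.g. retraction checks
    machine_summary: str        # clearly labelled machine-generated precis

def clerical_pass(record: dict) -> Dossier:
    """Hypothetical AI stage: aggregation and assurance only; it emits no score."""
    flags = []
    if record.get("retracted"):
        flags.append("retracted")
    return Dossier(
        title=record["title"],
        citation_count=int(record.get("citations", 0)),
        integrity_flags=flags,
        machine_summary=record.get("abstract", "")[:300],  # stand-in for an LLM summary
    )

def panel_score(dossier: Dossier, judgement: int) -> int:
    """The 1* to 4* rating remains an unassisted human judgement."""
    if not 1 <= judgement <= 4:
        raise ValueError("REF scores run from 1* to 4*")
    return judgement
```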

A robot driving a REF car, analysing papers: generative AI handling the clerical heavy lifting of REF submissions.
Source: Buyenlarge/Getty Images (edited)

Unlike previous attempts to use automated tools to assign evaluation scores directly, the REF AI platform we propose is more technologically advanced, built on the most recent AI developments. It deploys a sophisticated ecosystem of specialised AI agents with authenticated access to paywalled journals, citation databases and scholarly archives. These purpose-built agents work collaboratively, each handling specific academic evaluation tasks while operating within a secure, auditable framework designed exclusively for research assessment. Everything is constantly evaluated, and all aspects of the system, including its system prompts, are exposed to users.
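The authors do not publish an architecture, but one hedged reading of “specialised agents operating within a secure, auditable framework” is a wrapper that records every agent operation before its result can be used. The sketch below, with invented agent names and stubbed behaviour, shows the shape of such a trail:

```python
import hashlib
import json
import time
from typing import Callable

AuditLog = list[dict]

def audited(task_name: str, log: AuditLog) -> Callable:
    """Wrap an agent so that every operation is recorded and traceable."""
    def decorator(agent: Callable[[str], str]) -> Callable[[str], str]:
        def wrapped(payload: str) -> str:
            result = agent(payload)
            log.append({
                "task": task_name,
                "input_sha256": hashlib.sha256(payload.encode()).hexdigest(),
                "output": result,
                "timestamp": time.time(),
            })
            return result
        return wrapped
    return decorator

audit_log: AuditLog = []

@audited("citation_lookup", audit_log)
def citation_agent(doi: str) -> str:
    # Stub: a real agent would query a licensed citation database.
    return f"citation report for {doi}"

@audited("summary", audit_log)
def summary_agent(text: str) -> str:
    # Stub: a real agent would call an LLM whose system prompt is exposed to users.
    return text[:200]

citation_agent("10.1234/example-doi")
summary_agent("An output's full text would go here...")
print(json.dumps(audit_log, indent=2))  # the full trail is inspectable after the fact
```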

The integration of AI into the REF must adhere to strong governance principles, necessitating independent oversight by ethics and responsibility experts. The system would provide comprehensive uncertainty metrics and hallucination detection, with each automated operation recorded, bias-tested and fully traceable. A sandbox environment with a public API would permit external audits and let institutions test submissions ahead of time, removing the incentive to develop unverified in-house tools and reducing sector costs.
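As one hedged illustration of what hallucination detection could mean in research assessment, a minimal checker might cross-reference every citation an AI-generated summary makes against the references that actually appear in the submitted output. The function and examples below are ours, not a described component of the platform:

```python
def hallucinated_citations(summary_refs: set[str], source_refs: set[str]) -> set[str]:
    """Return references the model cites that the source document never lists."""
    return summary_refs - source_refs

# The six non-existent precedents in the US court filing mentioned above are
# exactly the kind of output this cross-check would flag.
flagged = hallucinated_citations(
    summary_refs={"Smith (2019)", "Invented v. Case (2021)"},
    source_refs={"Smith (2019)"},
)
assert flagged == {"Invented v. Case (2021)"}
```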

Does this sound like wishful thinking or science fiction? We, the authors, have already built such regulated AI platforms for pharmaceutical research and regulatory dossiers that satisfied auditors and scientists alike. The engineering components could be in place within months. The much greater challenge is the consultation that defines the detailed scope, risk tolerance, audit trails, access rights and performance metrics. Early engagement with users, independent regulators and data-governance teams would anchor the platform in policy and ensure that it withstands scrutiny. If any AI platform is to have relevance for REF 2029, the time to act is now.

A sector-wide, audited AI platform would uphold REF 2029’s values, cut administrative costs and free scholars for true intellectual appraisal. The technology is ready; what remains is collective resolve and disciplined governance. As Christian Lous Lange warned in 1921: “Technology is a useful servant but a dangerous master.”

Caroline Clewley is an AI futurist at Imperial College London, advising on the integration of generative AI into education, and leads Imperial’s flagship I-Explore programme. Lee Clewley is vice-president of AI at eTherapeutics, a drug discovery company, and was formerly head of applied AI at GlaxoSmithKline and a postdoctoral researcher at the University of Oxford.


‘GenAI’s “black-box” nature stands in opposition to the transparency essential to legitimate academic evaluation’

As the UK higher education community looks ahead to REF 2029, discussion has perhaps inevitably turned to the possibility of incorporating generative AI tools into the process. While some commentators argue that using GenAI for the REF is a “no-brainer” that could help to reduce the substantial financial and labour burdens that submission currently imposes on the sector, this argument does not adequately address the significant concerns such use of AI raises.

The previous REF exercise cost universities an average of £3 million each on preparations. The investment in staff time was equally vast: reviewing nearly 190,000 outputs demanded countless hours from academics and professional services staff.

With such intensive resource demands, it’s understandable that institutions preparing for REF 2029 might consider whether GenAI could streamline this process to reduce costs and staff time. Given the vast quantity of research such AI tools have already been trained on, the need to ensure a standardised, criteria-driven reviewing process free from subjectivity and applied consistently, not to mention the efficiency of AI in comparison with a human reviewer, it may indeed seem like a “no-brainer”.

However, the apparent benefits GenAI may bring to institutions in terms of speed, efficiency and cost reduction are overshadowed by potential harm to those already marginalised within institutions and the risk of entrenching existing biases.

A REF robot with “see-through gear action”: if generative AI is used for the REF, transparency is essential to legitimate academic evaluation.
Source: Buyenlarge/Getty Images (edited)

The use of GenAI also raises profound questions about authenticity and the value of the human dimension of research evaluation. What happens to considerations of lived experience, positionality and self-awareness when AI becomes the evaluator? How do we account for the nuanced understanding of unconscious bias in both the conduct and review of research?

For GenAI to work effectively in the REF submission processes, several critical conditions would need to be met:

  1. Universities would need to develop purpose-built LLMs rather than relying on commercial GenAI tools, ensuring alignment with specific REF objectives and academic standards.
  2. Training data would require meticulous curation to provide diverse knowledge bases, acknowledging and adjusting for inconsistencies and bias in scholarly communication.
  3. Algorithms would need to be explicitly designed to account for systemic inequity in research funding, promotion decisions, seniority, workload allocations, and institutional non-research commitments, all of which influence output and hence whose work is submitted for the REF in the first place.
  4. Systems would need to recognise and account for documented biases in academic evaluation, including citation patterns favouring certain demographics and disciplines.
  5. Programming would need to disregard potentially biasing factors like metrics, publication venue, gender, departmental affiliation and academic position, and also prevent indirect inference through writing-style analysis, linguistic patterns or topic selection that might serve as proxy signals for demographic information (a minimal sketch of such redaction follows this list).
  6. Design would need to avoid temporal bias, where topics with less historical representation would be assessed as less significant simply because they have fewer precedents in the literature, disadvantaging emerging topics or previously marginalised research areas.
  7. Assessment criteria would need to acknowledge the varying suitability of AI evaluation across disciplines – recognising that STEM research might be more straightforwardly assessed than humanities scholarship or interdisciplinary work requiring nuanced contextual understanding.
  8. Implementation would need to balance transparency with opacity to maintain trust while preventing “gaming” of the system.
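Condition 5 is the most mechanically concrete of these. A minimal sketch, assuming a hypothetical submission schema of our own devising, shows what stripping direct signals looks like, and why it leaves the much harder problem of indirect, proxy-based inference untouched:

```python
# Fields assumed, for illustration only, to carry the direct biasing signals
# listed in condition 5; a real submission schema would differ.
BIASING_FIELDS = {
    "author_name", "gender", "metrics", "publication_venue",
    "department", "academic_position",
}

def redact_for_review(submission: dict) -> dict:
    """Strip explicit biasing metadata before any text reaches an evaluator.

    This removes only direct signals. Writing style, topic choice and citation
    habits can still act as demographic proxies, which is why condition 5 also
    demands (much harder) safeguards against indirect inference.
    """
    return {k: v for k, v in submission.items() if k not in BIASING_FIELDS}

submission = {
    "title": "A study of...",
    "full_text": "...",
    "author_name": "Dr A. N. Other",
    "publication_venue": "Prestigious Quarterly",
    "metrics": {"journal_impact_factor": 12.3},
}
print(redact_for_review(submission))  # only title and full_text survive
```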

While theoretically possible, such sophisticated AI systems for REF 2029 present nearly insurmountable challenges. Developing bespoke, bias-mitigating AI tools would require immediate substantial investment from universities. The financial resources needed would likely exceed the very cost savings that make GenAI initially appealing, even with sector-wide collaboration.

Using commercial AI tools trained on existing academic literature risks amplifying inequalities in academic evaluation and compromising the REF’s legitimacy as an academically driven process. This approach would effectively outsource academic responsibility to external corporations whose priorities diverge from the values underpinning research assessment.

There is a fundamental misalignment between commercial AI tools and academic assessment needs. Commercial tools prioritise user engagement, personalisation, generalisability and commercial application rather than academic rigour, disciplinary nuance and transparency. Their training data includes vast amounts of internet content unlikely to reflect academic standards (to say nothing of its biases, or even outright illegality), and their “black-box” nature stands in opposition to the transparency and accountability essential to legitimate academic evaluation.

While financial and time pressures make AI-based solutions appealing, the key elements required for an effective, fair GenAI implementation in REF processes are currently missing. Without considerable investment in purpose-built systems designed specifically to counteract academic biases, GenAI tools risk accelerating and entrenching a conservative system that will further privilege established research traditions while systematically disadvantaging innovation and diversity in academic enquiry – precisely the opposite of what research assessment should encourage.

Caroline Ball is academic librarian (business, law and social sciences) at the University of Derby.


Readers’ comments (11)

Oh yes! I think AI could do this very well and certainly as well as the present set of 'expert panels'. But those academics who are on the panels and run the exercise won't be prepared to give up their power and influence over the rest of us without a serious fight, so expect this proposal to receive heavy criticism as they protect their privileged positions with the career advancement and extra earnings that they command. The REF, as we all know, is a political exercise and a racket, and AI will make it harder for them to operate. They will find ways to justify the obscene amounts that are spent on this exercise both directly and indirectly.
I agree. Once personal gain is built into REF, it is hard for those benefiting from it to relinquish it.
I think it was always the Arts and Humanities sector that, for very good reasons, made the strong case for peer review which, in theory if not in actual practice, is the most appropriate way of determining research quality. Yes of course, a lot of people have a strong vested interest in this: notably the academics on the panel (very much power on display) and the universities which employ them and get the obvious advantages from this (universities with panel members historically tend to do rather better than those that do not, haha!), but that does not mean the methodology is not sound. But it does seem that the Arts and Humanities are destined to become less significant in the light of the current financial crisis, and maybe it's time to think about metrics (informed by AI) if it can be done well. It will save valuable resource that can be used elsewhere. Perhaps UKRI might run a pilot to see if this could work; they are quick enough to run pilots on EDI-related policy?
Before obsessing about the barn door, first ask was there even a horse in the barn? Given a set of papers graded by human assessors, what results does AI give? Humanities texts are hard enough; what about mathematical sciences, where key parts of texts are symbols? Matching a paper on symplectic geometry against examples trained on [say] calculus might yield some matches but tells you nothing about whether the former is 2* or 4* work. And the article conveniently overlooks the narrative that can accompany papers precisely to provide the human panel with context and evidence. On what basis would any LLM be able to interpret the narrative and use that to influence the evaluation? It is fine to raise concerns, but let's be realistic and base the analysis on hard evidence else you risk being the author or cried "wolf"
"Before obsessing about the barn door, first ask was there even a horse in the barn?" Eh? Do you mean the "Stable Door"? I associate the "barn door" with the phrase about someone being so useless they could not hit a "barn door", it being proverbially a large and easy target.
Well yes exactly, that's what I was thinking. After all, you would keep a horse in a stable wouldn't you. So it's no surprise there's not one in the barn? And if you were an author, you would be writing 'wolf' not crying 'wolf'?
Thank you. Your human identification and correction of the error illustrates my points nicely. I Rest My Case.
At last, a thoughtful and nuanced assessment of the use of AI. If we are going to use AI (i.e. machine learning) for such tasks, then it really needs to be purpose-built and trained on quality data that is specific to the task it needs to perform, not reliant on the commercial AI tools that have been so badly trained that their output is untrustworthy and their negative environmental impact huge (not to mention being ethically unsound). Such a tool could usefully support the gathering of data and evidence for the assessment, overseen by staff in the institution, although I would rather the assessment exercise itself were undertaken by human panels. Should this be the direction that the HE sector chooses to take, then I would like to see such a tool community-owned and assessed for its environmental impact, in addition to accounting for the biases mentioned in the article.
Great to see a nuanced discussion of where and how AI might be implemented in the REF. On the topic of bias, it's worth noting that human assessors are already far from objective. For instance: https://www.pnas.org/doi/epub/10.1073/pnas.2205779119. We also know that people with foreign-sounding names often face disadvantages in publishing and job applications. So while concerns about biased AI are valid, they shouldn’t be a blanket argument against its use. If anything, this could be an opportunity to confront and mitigate systemic biases.
Good points!! I personally am a bit cynical about the so-called objectivity or indeed expertise of our human assessors, and I think there is a wealth of anecdotal evidence now to support this. Hard evidence is difficult to obtain; for one thing, they destroy all their data on the panels so their 'judgments' cannot be challenged. But they seem more focused on matters relating to EDI/DEI these days than on research quality. Everyone who has been involved in the process is very cynical as far as I know.
The REF process is not thinking and knows nothing – it must be approached with caution.
The Future Research Assessment Programme (FRAP) acknowledged (in its initial decisions report from 2023) the rapid development of AI but concluded, based on research conducted in 2022, that it was too early to adopt AI in the REF (particularly for assessing the quality of research outputs), effectively parking further exploration until after REF 2029. Of course, assessment of outputs is only one part of the framework, and three years is a long time, let alone a further four or five. Research England still has an opportunity to set out policy and clear guidance on AI as it relates to the REF, much as UKRI has done for the use of AI in funding applications.