LLMs for Bibliometrics and Research Summaries: Where They Help, Where They Mislead

By Discover RIMS Admin · June 9, 2026 · Updated June 13, 2026

Large language models (LLMs) have entered every part of the research workflow, including bibliometrics and research summarisation. Some applications are genuinely useful; others are dangerous. Telling them apart is a research-office responsibility, because the line between "helpful summary tool" and "uninterpretable evaluation engine" is exactly the line institutional policy needs to draw. This article makes the distinction concrete.

Where LLMs help bibliometrics

LLMs are at their best when they accelerate work humans validate. Useful applications inside a RIMS include: generating plain-language summaries of researcher output (for public profiles, where researchers can review and correct), clustering publications into topical themes for strategic planning, drafting first-pass impact narratives that humans then refine, and triaging which records need closer human attention. None of these involves the LLM producing a number that drives a decision; all involve the LLM speeding up human work.

Where LLMs are dangerous in bibliometrics

It is equally important to be clear about where LLMs should not be used. LLM-derived research quality scores (a model trained to "rate" outputs on a scale) are opaque, hard to audit, and biased in ways the research community is only beginning to characterise. An influential 2025 Scientometrics opinion paper argues that LLM-based research evaluation, if adopted casually, may change researcher behaviour in ways that erode the integrity of the evidence base. Predictive metrics, such as "this researcher is likely to publish a high-impact paper next year", combine the worst of opaque AI with the worst of premature quantification. And ranking researchers against one another via LLM is the kind of comparative scoring that DORA and CoARA caution against.

The transparency problem

The honest case against LLM-based evaluation is not that the models are bad. It is that the scores are uninterpretable. A citation count can be checked: look the paper up in the citation index and count the citing works. A field-weighted citation impact can be reproduced: same data, same algorithm, same result. An LLM-derived score cannot be reproduced without running the same prompt against the same model version, and even then the answer can drift. A research office cannot defend a decision in front of a promotion panel using a number it cannot explain.
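To make "same data, same algorithm, same result" concrete, here is a minimal sketch of a field-weighted citation impact calculation. It is a simplification, not any provider's official method: it assumes "comparable" means same field, year and document type, it ignores citation windows, and the record layout and the `fwci` function name are hypothetical.

```python
from statistics import mean
from typing import Optional


def fwci(paper: dict, corpus: list) -> Optional[float]:
    """Citations of `paper` divided by the mean citations of comparable papers.

    Simplified for illustration: "comparable" means same field, year and
    document type; real providers also apply a fixed citation window.
    """
    key = (paper["field"], paper["year"], paper["doc_type"])
    peer_citations = [
        p["citations"]
        for p in corpus
        if (p["field"], p["year"], p["doc_type"]) == key
    ]
    if not peer_citations:
        return None  # no comparable papers in the corpus
    expected = mean(peer_citations)
    if expected == 0:
        return None  # undefined when no comparable paper has been cited
    return paper["citations"] / expected


# Hypothetical records, for illustration only.
corpus = [
    {"id": "p1", "field": "ecology", "year": 2022, "doc_type": "article", "citations": 30},
    {"id": "p2", "field": "ecology", "year": 2022, "doc_type": "article", "citations": 10},
    {"id": "p3", "field": "ecology", "year": 2022, "doc_type": "article", "citations": 20},
    {"id": "p4", "field": "ecology", "year": 2022, "doc_type": "article", "citations": 0},
    {"id": "p5", "field": "chemistry", "year": 2022, "doc_type": "article", "citations": 100},
]

score = fwci(corpus[0], corpus)  # 30 citations against a peer mean of 15
```

Every input to the result is visible in the records, so anyone holding the same data can recompute the figure and get the same answer. That is the property an LLM-derived score lacks.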

The bias problem

LLMs encode the biases of their training data. Research that is well represented in the training corpus (English-language, well-cited, in high-prestige journals) is over-weighted; research that is underrepresented (regional journals, non-English work, newer venues) is silently penalised. For institutions whose researchers do important work outside the LLM's comfort zone, including many emerging-economy universities, an LLM-derived score systematically understates their output. The remedy is comprehensive, reconciled coverage: understanding how OpenAlex and Scopus differ in what they index, and capturing open-science outputs, is not optional for a global research record.
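One way to see what "reconciled coverage" involves is to compare which outputs each source knows about. The sketch below assumes the DOI lists have already been exported from each source. The function names, the report layout and the DOIs are hypothetical, and records without a DOI would need separate matching that is not shown here.

```python
def normalise_doi(doi: str) -> str:
    """Lower-case a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def coverage_report(sources: dict) -> dict:
    """Compare DOI coverage across sources given {source_name: [doi, ...]}."""
    normalised = {
        name: {normalise_doi(d) for d in dois} for name, dois in sources.items()
    }
    union = set().union(*normalised.values())
    report = {"total_unique": len(union), "share": {}, "only_in": {}}
    for name, dois in normalised.items():
        # DOIs held by every other source, to find what is unique to this one.
        others = set().union(
            *(d for other, d in normalised.items() if other != name)
        )
        report["share"][name] = len(dois) / len(union) if union else 0.0
        report["only_in"][name] = sorted(dois - others)
    return report


# Hypothetical exports, for illustration only.
report = coverage_report({
    "openalex": ["https://doi.org/10.1234/A", "10.1234/b", "10.1234/c"],
    "scopus": ["10.1234/a", "doi:10.1234/B"],
})
```

In this example the `only_in` entry for `openalex` contains one DOI that `scopus` does not hold. Outputs like that one disappear from any evaluation that relies on a single source.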

Where LLMs and RIMS fit together responsibly

The pattern: use LLMs for description and discovery; use deterministic metrics on reconciled data for evaluation. Surface LLM outputs visibly (labelled as AI-generated, easy for researchers to correct). Audit periodically. Never let an LLM be the only signal driving a decision. This is the practical operationalisation of responsible AI in research evaluation.
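Two of these rules, visible labelling and never relying on an LLM alone, can be enforced in software rather than left to convention. The following is a minimal sketch under assumed names: `Signal`, `can_support_decision` and `render_summary` are hypothetical and do not describe any particular RIMS product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One piece of evidence attached to a researcher record."""
    name: str
    value: str
    source: str  # "deterministic" (reproducible metric) or "llm"


def can_support_decision(signals: list) -> bool:
    """Allow a decision only if at least one deterministic signal is present."""
    return any(s.source == "deterministic" for s in signals)


def render_summary(text: str, reviewed_by_researcher: bool) -> str:
    """Prefix an LLM-written summary with a visible provenance label."""
    status = (
        "reviewed by researcher"
        if reviewed_by_researcher
        else "awaiting researcher review"
    )
    return f"[AI-generated, {status}] {text}"


# Hypothetical signals, for illustration only.
llm_only = [Signal("impact narrative", "draft text", "llm")]
mixed = llm_only + [Signal("citation count", "42", "deterministic")]

blocked = can_support_decision(llm_only)  # LLM output alone is not enough
allowed = can_support_decision(mixed)     # a reproducible metric is present
label = render_summary("Works on coastal ecology.", reviewed_by_researcher=False)
```

The guard is deliberately crude: it checks only that a reproducible signal exists, not how much weight each signal carries. Weighting remains a matter for human judgement and institutional policy.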

Frequently asked questions

Are LLM summaries on public profiles a good idea? Yes, with researcher review and correction. Visible labelling is essential.

Will LLM-based research evaluation replace citation metrics? Probably not — both because of transparency limits and because citation metrics, used carefully, give institutions a reproducible signal that LLMs cannot.

How does this fit with our DORA commitment? Using LLMs as the primary evaluation signal contradicts DORA. Using them for summarisation and discovery aligns with it.

Where to start

Discover RIMS keeps the reconciled output record on which any responsible AI feature depends — across ORCID, Scopus, OpenAlex, Crossref, and Scimago — so that institutions can adopt AI capabilities incrementally without compromising the integrity of their research-information evidence base.
