Evaluating Chatbot-Generated Summaries of Academic Papers

Part 1 of 3

Given interest in using AI tools for summarization, combined with uncertainty about which of the many available AI models are best suited to the task, the AI4LAM Evaluation Working Group (WG) set out to develop recommendations for researchers about which models are suitable for summarizing research publications. WG members include a college librarian, university technologists, a public media archivist, and a university postdoctoral researcher, all based in the United States. Members of the WG chose seven AI language models to summarize research papers and then manually evaluated the models’ summaries against the papers’ original abstracts. The evaluations were conducted from May through August 2025, using models that were available to WG members as of early 2025. The evaluation scores showed noticeable differences in average summary quality by model and by paper. On average, ChatGPT 4 Turbo and Microsoft Copilot Chat (which uses GPT-4) produced the highest-quality summaries under our scoring method.

The evaluation process surfaced larger challenges and open questions about evaluating the outputs of generative AI models, which we think are relevant beyond summarization use cases. We’re sharing them in this three-part blog series in the hope of sparking discussion and feedback from interested readers! This first post introduces what we did and why. The second post presents the quantitative results of our evaluation of the AI-generated summaries, and the third presents a qualitative analysis of our process of evaluating them.

First, we selected five published, peer-reviewed papers from different disciplines:

We selected these open access papers to cover a variety of disciplines: medicine, computer science, history, philosophy, and education. To make sure none of these papers (and, with them, the papers’ original abstracts) appeared in the models’ training data, we chose papers that were published after the models had been trained and deployed. The papers ranged in length from 16 to 30 pages. We created a version of each paper that excluded the title, abstract, and author names; the rest of each paper’s content was unchanged.
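To give a concrete picture of that redaction step, here is a minimal sketch that operates on a paper’s extracted plain text. It is purely illustrative: the text-extraction step and the assumption that the paper’s body begins at an “Introduction” heading are simplifications, not a description of our actual workflow.

```python
# Illustrative sketch only: strip front matter (title, authors, abstract)
# from a paper's extracted plain text. Assumes the body begins at an
# "Introduction" heading; our actual redaction process may differ.
def redact_front_matter(paper_text: str, body_marker: str = "Introduction") -> str:
    """Return the paper text starting at the first body heading."""
    idx = paper_text.find(body_marker)
    if idx == -1:
        raise ValueError("Body marker not found; redact this paper manually.")
    return paper_text[idx:]
```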

Each WG member chose a different AI model, uploaded a PDF of each paper to it, and prompted it for a summary. The AI models chosen were:

All models received the following prompt to generate a summary:

Write an academic abstract for the attached research paper. Do not include any information that is not explicit in the research paper.

The abstract should be a single paragraph of approximately 150–250 words.  Include the following elements:

– *Research context*:  If discussed in the research paper, please mention related previous research and importance.

– *Subject area and topic*:  Briefly mention the subject area and the topic of this research paper.

– *Key findings or claims*:  Describe main thesis, claims, and quantitative or qualitative findings of this research paper.

– *Argument or evidence*:  Describe how the author(s) support their key findings or claims. If discussed, note the methodology.

Aim for clarity and information density.  Do not include citations or references. 
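We prompted each model interactively, uploading the PDFs by hand. For readers who would rather script this step, here is a hedged sketch of sending the same prompt through an API. The model name, the client setup, and passing the paper as plain text (rather than as a PDF upload, as we did) are all assumptions for illustration, not our actual procedure.

```python
# Hedged sketch: sending the summarization prompt programmatically.
from openai import OpenAI

PROMPT = (
    "Write an academic abstract for the attached research paper. "
    "Do not include any information that is not explicit in the research paper. ..."
)  # abridged; see the full prompt above

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(paper_text: str) -> str:
    """Ask the model for an abstract of the given paper text."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # one of the models we evaluated
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{paper_text}"}],
    )
    return response.choices[0].message.content
```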

Every model summarized every paper, with one exception: HuggingChat with Llama-3.3-70B did not summarize the “Investigating New Drugs…” paper by Islam, Ahmed, Mahfuj, et al., because HuggingChat was taken offline during our experiments. This gave us 34 unique AI-generated summaries to evaluate.
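As a quick sanity check on that count:

```python
# Seven models x five papers, minus the one missing HuggingChat summary.
n_models, n_papers, n_missing = 7, 5, 1
assert n_models * n_papers - n_missing == 34
```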

Next, we created a list of key points from each paper’s original abstract (ranging from 9 to 17 key points per paper). Each key point was written to contain a single piece of information from the abstract. These key points were the basis of the scores assigned to each AI model: we scored each summary key point by key point, giving the model a 1 if the key point was present in the summary and a 0 if it was not. We collaboratively evaluated one summary per paper to ensure agreement on the key points and alignment among WG members in interpreting the paper’s abstract. Then three WG members independently scored the key points for each remaining summary.
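To make the scoring concrete, here is a minimal sketch of how one summary’s score can be computed from binary key-point judgments. The reviewer labels and the averaging across reviewers and key points are illustrative assumptions; part 2 presents the actual scores by model and paper.

```python
from statistics import mean

# Illustrative sketch: three reviewers each mark every key point 1 (present)
# or 0 (absent) for one AI-generated summary. The reviewer labels and the
# averaging below are assumptions, not a description of our exact aggregation.
reviewer_marks = {
    "reviewer_a": [1, 0, 1, 1, 0, 1, 1, 0, 1],  # 9 key points for this paper
    "reviewer_b": [1, 0, 1, 1, 1, 1, 1, 0, 1],
    "reviewer_c": [1, 0, 1, 0, 0, 1, 1, 0, 1],
}

def summary_score(marks: dict[str, list[int]]) -> float:
    """Fraction of key points credited, averaged across reviewers."""
    return mean(sum(m) / len(m) for m in marks.values())

print(round(summary_score(reviewer_marks), 2))  # 0.67
```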

We also recorded notes about our experience conducting the evaluations as “Qualitative Observations,” guided by four questions:

  1. What was the most difficult point to score and why?
  2. What was the easiest point to score and why?
  3. What did you and your co-reviewers have the most trouble agreeing on, if anything?
  4. Has this process given you any insights on the inner workings of the model(s)?  If any inferences came to mind, please explain.

Check out part 2 of this series for the quantitative results of our evaluation and part 3 for a qualitative analysis of the evaluation process!