Evaluating Chatbot-Generated Summaries of Academic Papers: A Quantitative Analysis of the Summaries

Part 2 of 3

One of the major uses of large language models (LLMs) is to summarize text. We wanted to understand how well model-generated summaries convey the ideas of a paper, rather than measuring linguistic similarity with metrics such as ROUGE and BLEU.

LLM-generated answers are evaluated in a number of setups. A common one is to have a model generate two responses, or two models each generate a response side by side, and have the user evaluate the given text. However, in those cases the users are the ones asking the question, so they may well lack the expertise to judge which response is better. This setup also yields a relative rather than an absolute rubric for model performance.

We wanted to create a scoreboard for common LLMs and a consistent dataset against which newer models can be tested. We wanted to determine how useful LLMs are in summarizing peer-reviewed literature, one of the most-used resources that libraries provide to their patrons. When testing the summarization of a peer-reviewed, published paper, in theory we are already provided with a gold standard: the author-written abstract. That became the keystone of our strategy.

The methodology is described in part 1 of this blog post. To illustrate the process of normalizing scores as fractions, consider this example: if a summary included seven of the 10 key points identified from the abstract, it was scored as 0.7; if it included 10 of 15 key points, it was scored as 0.67. We compared the averages of different models across the summarized papers to determine how they performed. We also looked at how well the different papers were summarized overall across models, and provide our thinking on the differences there as well.
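The normalization described above can be sketched in a few lines of code. The marks below are hypothetical, not taken from our scoring spreadsheet:

```python
# Minimal sketch of the key-point scoring described above.
# Each summary is scored as the fraction of abstract key points it
# contains; marks are 1 (point present) or 0 (point absent).

def summary_score(key_point_marks):
    """Fraction of key points present in the summary."""
    return sum(key_point_marks) / len(key_point_marks)

# Seven of 10 key points present:
print(round(summary_score([1] * 7 + [0] * 3), 2))   # 0.7

# Ten of 15 key points present:
print(round(summary_score([1] * 10 + [0] * 5), 2))  # 0.67
```

The per-paper averages in Table III are then simple means of these fractions across the tested models.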

Table I contains the evaluation of ChatGPT 4 Turbo accessed through the web interface. The control abstract was used to generate the Original Abstract Key Points. The content of the Summary column was evaluated for the presence of these points, each marked 1 (present) or 0 (absent). In this example, nine of the 11 key points were present in the summary, resulting in a model score of 0.82 for this paper. The generated summaries and scoring dataset are available at [Dryad DOI goes here.]

Table I. Summary of “PaperBench: Evaluating AI’s Ability to Replicate AI Research” (Starace et al., 2025) generated by ChatGPT 4 Turbo via the web interface, assessed against the key points the working group identified in the paper’s abstract, which served as the gold standard for this paper.

Summary | Original Abstract Key Points | Control (paper-provided) abstract
This research paper introduces PaperBench, a benchmark designed to evaluate AI agents’ ability to autonomously replicate cutting-edge machine learning (ML) research. Situated within the context of accelerating ML progress through AI autonomy, the study addresses a gap in prior benchmarks, which often rely on agents using existing codebases rather than generating full replications from scratch. PaperBench comprises 20 empirically grounded research papers from ICML 2024, each accompanied by a detailed rubric co-developed with the original authors to assess replication success. These rubrics decompose replication tasks into 8,316 granular outcomes across categories like code development, execution, and result matching. The benchmark also includes JudgeEval, a dataset used to validate automated grading by large language models (LLMs). Evaluation results show that current frontier models, such as Claude 3.5 Sonnet and OpenAI’s o1, demonstrate some capability, with top models achieving replication scores up to 21.0% under default settings and 26.0% under more guided agent scaffolding. However, performance still lags behind human ML PhD baselines, which reach 41.4% on a subset of tasks. The paper employs a rigorous experimental setup, including isolated execution environments and rule-based restrictions, to ensure fair assessment of agent autonomy. PaperBench thus provides a robust and challenging benchmark for measuring AI models’ readiness to perform long-horizon, real-world ML research tasks, with implications for both AI capability forecasting and safety evaluation.
1 – PaperBench is introduced as a benchmark to evaluate how well AI agents can replicate state-of-the-art AI research.
1 – The task for the AI agents is to replicate 20 specific ICML 2024 Spotlight and Oral papers from scratch.
1 – Replication involves understanding them, coding, and successfully executing experiments.
1 – To evaluate objectively, rubrics were developed that break down each replication task into smaller, clearly defined subtasks for grading.
1 – PaperBench includes a large number of evaluation points: 8,316 individually gradable tasks.
1 – The rubrics were created with the help of the original authors of the ICML papers to ensure they are accurate and realistic.
0 – An LLM-based AI judge was developed for scalable, automatic grading of replication attempts.
1 – This AI judge’s accuracy was also tested using a separate benchmark.
1 – When tested, the best AI model Claude 3.5 Sonnet (new) scored an average of 21.0% on PaperBench.
1 – Top ML PhD students were also tested and AI models currently do not perform better than these human experts.
0 – The code for PaperBench is being made publicly available (open-sourced) to help future research on AI agents’ engineering abilities.
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller subtasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLMbased judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents.

While the model summaries did not include all key points from the original abstracts, ChatGPT scored highest, with Claude and Gemini performing close together (Table II); Llama ranked last. The overall impression of the group was that the summaries generated through the different interfaces to ChatGPT, Gemini, and Claude were clear, contained sufficient information, and in some cases made the content clearer to readers.

Table II. Large language models’ performance, within their own ecosystems, in summarizing the selected papers. The average summary score represents the fraction of key points from the original abstract present in the LLM-generated summary.

Model | Average summary score (0 to 1) | Number of papers assessed per model
Claude Sonnet 3.7 | 0.58 | 5
Gemini 2.0-Flash-001 | 0.54 | 5
Microsoft Copilot chat (based on GPT-4) | 0.64 | 5
ChatGPT 4 Turbo (web interface) | 0.69 | 5
HuggingChat with Llama-3.3-70B | 0.46 | 4

We also discovered that the mode of access can greatly affect a model’s performance. For instance, our access to a hosted DeepSeek-R1:70b was mediated via a university-hosted Open WebUI interface. Even after changing the parameters for document ingest, the model would at times respond: “Unfortunately, I cannot access or review attached documents, including research papers.” We knew the issue wasn’t the model itself, because we were able to generate a high-scoring summary of the same paper using the DeepSeek chat interface. We confirmed with the IT department that PDF parsing was silently failing, without informing the user that this was the case. We therefore removed the DeepSeek-R1:70b score from this report, as the setup did not allow appropriate evaluation. We also tested DeepSeek-R1:70b through the chat interface and found that setup robust; however, since we did not test it on multiple works, that result is also not reported. The issue we encountered highlights that differences between systems running LLMs can lead to wide variations in results, and that opacity about how they process uploaded documents can be misleading or otherwise problematic. It is very easy to experiment with default parameters, as we did, and not realize that some reconfiguration is required.

Finally, there was a difference in how well the models performed in summarizing each paper. We agreed that even though the models’ summaries of “The Impact of Large Language Models on Computer Science Student Writing” did not match the original abstract, they were of better quality and readability than the original abstract. This challenges the use of the abstract as the gold standard when abstract quality is variable. On the other hand, this perhaps supports a previous observation that LLMs are useful tools that can enhance human performance in an area where improvement is needed. The method could potentially serve as a diagnostic tool: if, for example, the overlap between the model summary and the human abstract is around 30% or less, the abstract may need to be reviewed with an eye for improvements.
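The diagnostic idea above can be sketched as a simple filter. The threshold and paper scores here are illustrative, not a tool we built:

```python
# Hypothetical diagnostic sketch: flag papers whose average model-summary
# score falls at or below a chosen threshold, suggesting the abstract
# itself may deserve review. Scores below are illustrative.

def flag_low_overlap(scores_by_paper, threshold=0.3):
    """Return titles of papers whose average score is <= threshold."""
    return [paper for paper, score in scores_by_paper.items()
            if score <= threshold]

scores = {
    "Marine seaweed metabolites": 0.68,
    "PaperBench": 0.68,
    "LLM impact on student writing": 0.25,
}
print(flag_low_overlap(scores))  # ['LLM impact on student writing']
```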

Table III. Summarization scores per paper. Each LLM-generated summary was scored as the fraction of key points from the original abstract it contained; those fractions were then averaged across the tested models listed in Table II.

Paper | Average performance across all models
“Investigating New Drugs from Marine Seaweed Metabolites for Cervical Cancer Therapy by Molecular Dynamic Modeling Approach” (Islam, Ahmed, and Mahfuj et al., 2025) | 0.68
“PaperBench: Evaluating AI’s Ability to Replicate AI Research” (Starace et al., 2025) | 0.68
“‘A Womb of My Own’: Women’s Bodies and Medicine in Early Modern England” (Black, 2025) | 0.72
“Strong Social Anti-Reductionism Reexamined” (Matsumoto, 2025) | 0.56
“The Impact of Large Language Models on Computer Science Student Writing” (Zdravkova and Ilijoski, 2025) | 0.25

We speculated as to what could account for the gap between the LLM summaries and the original abstracts. One possibility is that abstracts written by humans for humans set up “the big picture,” giving context for the importance and relevance of the work. That big picture, however, may be a small fraction of the paper’s text and therefore excluded from the generated summary, despite our prompt requesting inclusion of “Research context: If discussed in the research paper, please mention related previous research and importance.”

Another area to discuss is subject-matter expertise. The models appear to have worked comparably well on papers in chemical biology, computer science, and history. However, the work in philosophy (“Strong Social Anti-Reductionism Reexamined” (Matsumoto, 2025)) was somewhat harder for models to summarize. We think this could be due to its multiple layers and logical relationships among points. Given that we only had one paper per subject, it is premature to draw conclusions; further study could be very telling. We also want to note that interpreting a paper’s results based on a model-generated summary may in some cases be inappropriate. For instance, one of the summaries names RL379 as the most promising compound in the Islam et al., 2025 paper, which seems to misinterpret RMSD results.

On the flip side of subject expertise, members of the group several times asked themselves whether they were interpreting the content of the original abstract correctly. We tried to address that concern by analyzing at least one summary per paper as a group. Some of us considered Gell-Mann amnesia: whether we dismiss the implications of our critical judgments in the areas we know when evaluating information in areas we don’t. The scores are available for full transparency.

We also calculated ROUGE scores to see what an automated comparison between the abstract and the LLM summary looks like. In Table IV we show ROUGE-1 precision, as those were among the highest scores observed in the calculation, even though it is our understanding that the ROUGE-L and ROUGE-Lsum metrics are better suited to reflect summary performance. However, we do not see the pattern of our scoring reflected in the ROUGE scores. This leads us to conclude that the key-point dataset we developed can be used to assess automatic summarization metrics.
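For readers unfamiliar with the metric, ROUGE-1 precision is the fraction of the candidate summary’s unigrams that also appear in the reference. Below is a simplified sketch with made-up sentences; real ROUGE implementations (and our calculation) add stemming and other preprocessing:

```python
# Simplified ROUGE-1 precision: unigram overlap between a candidate
# summary and a reference abstract, normalized by candidate length.
# Illustrative only; production ROUGE tools add stemming and tokenization rules.
from collections import Counter

def rouge1_precision(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped counts: each candidate token counts at most as often
    # as it appears in the reference.
    overlap = sum(min(count, ref[token]) for token, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference = "the benchmark evaluates ai agents on replication tasks"
candidate = "the benchmark evaluates agents on twenty replication tasks"
print(round(rouge1_precision(candidate, reference), 2))  # 0.88
```

Precision rewards summaries that stay within the reference’s vocabulary; it does not check whether the reference’s ideas are covered, which is part of why it can diverge from our key-point scoring.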

Table IV. ROUGE-1 precision scores per model per paper.

Paper | Claude Sonnet 3.7 | Gemini 2.0-Flash-001 | Microsoft Copilot chat (based on GPT-4) | ChatGPT 4 Turbo (web interface) | HuggingChat with Llama-3.3-70B
HPV Seaweed | 0.47 | 0.53 | 0.57 | 0.54 | N/A
PaperBench | 0.55 | 0.52 | 0.43 | 0.46 | 0.28
Abortion history | 0.52 | 0.44 | 0.44 | 0.47 | 0.46
Presumptivism | 0.65 | 0.75 | 0.49 | 0.54 | 0.51
Large language models | 0.55 | 0.55 | 0.47 | 0.45 | 0.49

Conclusion

We found that the tested large language models generate summaries helpful for understanding the articles we used, in some cases adding clarity to the content provided by the authors. Based on the results of our experiments, we recommend that, when applying our scoring method to other model-generated summaries, a score of 0.3 or lower should prompt both a check of the model environment settings for failures and a review of the original abstract. At the same time, it is premature to use summaries generated by the models we tested conclusively and independently of human judgment and expertise. Future experiments could compare model-generated summaries to the full text of the papers. While the AI field moves quickly, our scoring method will remain relevant for assessing new generations of LLMs.