Evaluating Chatbot-Generated Summaries of Academic Papers: A Qualitative Analysis of the Evaluation Process

Part 3 of 3

This is the third (and final!) post in a blog series about the AI4LAM Evaluation Working Group’s evaluation of chatbot-generated summaries of academic research papers.  In the first post, we explained why we were interested in chatbot-generated summaries and how we set up an experiment to evaluate them.  In the second post, we reported on the results of that evaluation.  In this post, we focus on the process of conducting the evaluation, reflecting on challenges we faced and questions the process raised.

As a reminder, our evaluation of chatbot-generated summaries involved comparing the summaries of papers to the papers’ abstracts.  We defined around 10 key points in each abstract and, for each key point, gave a summary a 1 if it contained that point and a 0 otherwise.  We also recorded qualitative observations about our experience using this scoring method.  During weekly meetings, we discussed the scores we’d given to our assigned summaries, adding to our qualitative observations with notes about our agreements and disagreements.
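For readers who like to see the arithmetic spelled out, here is a minimal sketch of that scoring method in Python.  We did all of our scoring by hand, so the function name, key-point flags, and example data below are purely illustrative assumptions rather than tooling we actually used.

```python
# A minimal sketch of the binary key-point scoring described above.
# We scored summaries manually; the flags and function below are
# hypothetical and only illustrate the arithmetic involved.

def key_point_coverage(key_points_found: list[int]) -> float:
    """Return the fraction of an abstract's key points found in a summary.

    Each entry is 1 if the summary contained that key point, 0 otherwise.
    """
    if not key_points_found:
        return 0.0
    return sum(key_points_found) / len(key_points_found)

# Example: an abstract with 10 key points, 7 of which appear in the summary.
scores = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
print(f"Key-point coverage: {key_point_coverage(scores):.0%}")  # 70%
```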

We recorded qualitative observations about 32% of the scored summaries.  Four of the five WG members recorded qualitative observations, ranging from 3 to 14 per member.  In total, we recorded 33 observations across our 87 summary scores (3 WG members scored each summary).  However, 5 of those observations noted that the wrong paper had been uploaded to a chatbot for summarizing, so we excluded them from our qualitative analysis and reflection.

While the quantitative analysis of our evaluation suggests that chatbots can be promising tools for summarizing research papers, the qualitative analysis captures nuances about the evaluation process that aren’t reflected in our quantitative scores.  Some of these nuances relate to the nature of manual work: during weekly meetings, we sometimes recognized inconsistencies in how strictly we’d interpreted summaries and abstracts when applying our scoring method.  Largely, though, the nuances stem from the subjective aspects of language and from divergent interpretations.

Sometimes, chatbot-generated summaries included key points from paper abstracts in unexpected ways.  For example, some summaries provided a higher-level overview that addressed the idea of a key point but didn’t contain the same level of detail or specificity.  Others included a key point but spread it across multiple sentences of the summary rather than expressing it in a single sentence as it appeared in the paper’s abstract.

We also encountered scoring difficulties when chatbot summaries included but reworded a key point.  Sometimes the rewording seemed useful because it made the purpose of the paper understandable to a broader audience by using less domain-specific terminology.  Other times, though, the rewording made what had been explicit in an abstract more implicit, de-emphasizing an idea or argument that seemed significant to the paper.  In certain cases we felt the rewording was acceptable or even beneficial; in others, we felt the abstract’s original terminology should have been kept because it was foundational to the paper.

Of course, we also didn’t always agree on how important certain terminology was!  We had a long discussion about summaries of the “‘A Womb of My Own’” paper (Black, 2025), debating whether “bodily autonomy” and “reproductive autonomy” were the same.

Lastly, there were times we felt that the chatbot-generated summaries misrepresented a paper.  For example, one key point was “This AI judge’s accuracy was also tested using a separate benchmark,” but in the paper, the AI judge’s accuracy was tested using the same benchmark that had been used in an earlier part of the paper.  Though these sorts of misrepresentations occurred in a minority of cases, they pose a notable risk.  If a person uses a chatbot to summarize text-based documents they haven’t read, as a way to quickly understand their purpose or significance, they may come away with a slightly or entirely different idea of a paper’s findings or arguments.

The significance of certain key points being absent or misrepresented in a summary was not always reflected in the quantitative scores.  For example, one summary of the “‘A Womb of My Own’” paper (Black, 2025) covered 72% of the key points from the paper’s abstract, but a qualitative observation noted, “I think it’s pretty significant that it misses explicitly stating that women’s voices were left out of Alito’s Majority Opinion!”

Though we had gone through several iterations of evaluation approaches before settling on the process of scoring abstracts’ key points with 0s and 1s, the challenges we encountered using this scoring method raised questions about how it might be improved:

The challenge with any evaluation of generative AI tools, of course, is the pace at which they’re being developed.  This is further complicated by the lack of transparency about how and when the tool or its underlying model will be updated or replaced.  During our summarization experiment, the HuggingChat platform was taken offline (seemingly permanently), and when we tried to confirm whether the correct PDF had been uploaded to a chatbot for summarizing, we could not always trace back how we’d performed the uploading and prompting processes.

One of the Working Group members also had difficulty finding out exactly what their institution had purchased in terms of tool functionality and capabilities, eventually learning about customizable features that led to better results in our summarization experiment.  As another Working Group member put it, chatbots and other generative AI platforms provide a path into an information ecosystem.  It’s generally not clear what that information ecosystem includes, though, or what the path through it might be missing.

Our evaluation experiment brought home the disconnect between the capabilities of so-called “state-of-the-art” AI models and the capabilities that are useful in real-world use cases of AI.  This raised two questions:

Thinking about the implications of broken evaluation approaches and ever-changing research tools led us to four final questions:

If you’ve conducted similar experiments or faced similar challenges, we’d love to hear from you!  You can also join our AI4LAM Slack group if you’re interested in taking part in future evaluations of AI tools or would like to keep up with our discussion of the topic.