Post by Nick Byrd, Ph.D. - NeoDB

1d

Most #LLMs over-generalized scientific results beyond the original articles

...even when explicitly prompted for accuracy!

The #AI summaries were nearly 5x more likely than human-written ones to contain broad generalizations, on average!

Newer models were the worst.🤦‍♂️

🔓 Accepted in #RoyalSociety Open #Science: https://doi.org/10.48550/arXiv.2504.00025

Figure 2. Forest plot (based on Table 1) displaying odds ratios (OR) and their 95% confidence intervals for comparisons between LLM-generated summaries, original texts, and human-written summaries (NEJM JW). The plot shows the likelihood of generalized (vs. restricted) conclusions in LLM summaries compared to the corresponding reference texts. Higher ORs reflect stronger overgeneralization tendency. The vertical line at OR = 1 represents no difference from the reference text, indicating the benchmark for fully faithful LLM summaries. Comparisons where error bars overlap this line are not statistically significant.
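To make the forest-plot reading concrete, here is a minimal sketch of how an odds ratio and a Wald-style 95% confidence interval can be computed from a 2x2 table of generalized vs. restricted conclusions. The counts are hypothetical, the function name `odds_ratio_ci` is my own, and the paper may well have estimated its ORs differently (for example with a regression model), so treat this as an illustration of the concept rather than the authors' method.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: summaries with generalized conclusions
    b: summaries with restricted conclusions
    c: reference texts with generalized conclusions
    d: reference texts with restricted conclusions
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual Wald approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, NOT taken from the paper.
or_, lower, upper = odds_ratio_ci(60, 40, 30, 70)
# An interval that excludes 1 corresponds to error bars that do not
# overlap the vertical OR = 1 line in the forest plot.
significant = not (lower <= 1 <= upper)
print(f"OR = {or_:.2f}, 95% CI [{lower:.2f}, {upper:.2f}], significant: {significant}")
```

With these made-up counts the OR is 3.5, meaning the odds of a generalized conclusion are 3.5 times higher in the summaries than in the reference texts.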

Figure 3. Comparisons between the raw proportions of scientific articles and human-authored as well as LLM-generated article summaries that contain generalized conclusions, overall algorithmic overgeneralizations, and specific algorithmic overgeneralizations, presented by text source and test condition. Error bars represent standard errors.
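For readers unfamiliar with the error bars in Figure 3, the following sketch shows the standard way to get a raw proportion and its standard error from a count. The numbers are hypothetical and `proportion_with_se` is an illustrative helper, not something from the paper; I am assuming the usual binomial formula, which the authors may or may not have used.

```python
import math

def proportion_with_se(k, n):
    """Raw proportion k/n and its binomial standard error."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical: 30 of 100 texts contain a generalized conclusion.
p, se = proportion_with_se(30, 100)
print(f"proportion = {p:.2f}, standard error = {se:.3f}")
```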

".... Original texts and summaries were coded based on whether their result claims contained one or more of the following three types of generalizations:

(1) Generic generalizations (generics). These are present tense generalizations that do not have a quantifier (e.g. ‘many’, ‘75%’) in the subject noun phrase and describe study results as if they apply to whole categories of people, things, or abstract concepts (e.g. ‘parental warmth is protective’) instead of specific or quantified sets of individuals (e.g. study participants). ....

(2) Present tense generalizations. ... When past tense result claims from an original text are turned into present tense in the summary, a broader generalization is conveyed than the author(s) of the original text may have intended.

(3) Action guiding generalizations. ...result claims ... often underlie recommendations ... (e.g. ‘CBT should be recommended for OCD patients’) [that involve] broader generalization than that found in the summarized text because researchers may have deliberately avoided such recommendations due to insufficient evidence to support them.

We tested whether the outputs of the 10 LLMs mentioned above retained the quantified, past tense, or descriptive generalizations of the scientific texts that they summarized, or transitioned to unquantified (generic), present tense, or action guiding generalizations. We defined the latter kind of conclusions collectively as generalized and the former as restricted conclusions."
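The coding scheme quoted above can be sketched as a small data structure. This is only one plausible reading of the excerpt: the three flags are assumed to be assigned by a human coder (nothing here detects them automatically), and the names `Coding`, `generalized`, and `overgeneralizes` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Coding:
    """Hand-assigned codes for the result claims of one text."""
    generic: bool         # (1) unquantified generic generalization
    present_tense: bool   # (2) present tense generalization
    action_guiding: bool  # (3) action guiding generalization

    @property
    def generalized(self) -> bool:
        # "Generalized" if the text contains one or more of the three
        # types; otherwise its conclusions count as "restricted".
        return self.generic or self.present_tense or self.action_guiding

def overgeneralizes(original: Coding, summary: Coding) -> bool:
    """Assumption: a summary overgeneralizes when it is generalized
    while the text it summarizes is restricted."""
    return summary.generalized and not original.generalized

# Hypothetical example: a past tense, quantified original whose
# summary restates the result in the present tense.
original = Coding(generic=False, present_tense=False, action_guiding=False)
summary = Coding(generic=False, present_tense=True, action_guiding=False)
print(overgeneralizes(original, summary))
```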

