Evaluating firearm examiner testimony using large language models: a comparison of standard and knowledge-enhanced AI systems
Pompedda, F., Santtila, P., Di Maso, E., Nyman, T. J.
Abstract

Objective: This study evaluated the decision-making of Large Language Models (LLMs) in interpreting firearm examiner testimony by comparing a standard LLM to one enhanced with forensic science knowledge. Method: Following the experimental paradigm of Garrett et al. (2020), we assessed whether LLMs mirrored human decision patterns and whether specialized knowledge led to more critical evaluations of forensic claims. We employed a 2 × 2 × 7 between-subjects design with three independent variables: LLM configuration (standard vs. knowledge-enhanced), cross-examination presence (yes vs. no), and conclusion language (seven variations). Each of the resulting 28 conditions was run 200 times, yielding a total of 5,600 measures comprising binary verdicts, guilt probability ratings, and credibility assessments. Results: LLMs showed a low overall conviction rate (9.4%) across conditions, which varied logically with how the firearm expert's conclusion was formulated. Cross-examination produced lower guilt assessments and scientific credibility ratings. Importantly, knowledge-enhanced LLMs evaluated firearm evidence significantly more conservatively than standard LLMs across all match conditions. Conclusions: LLMs, particularly when enhanced with domain-specific knowledge, showed advantages over the human jurors in Garrett et al. (2020) when evaluating complex scientific evidence, suggesting potential applications for AI systems in supporting legal decision-making.
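As a rough illustration of the factorial design described above, the following Python sketch enumerates the 2 × 2 × 7 conditions and the 200 repetitions per condition to confirm the reported total of 5,600 measures. The variable names and the conclusion-language placeholders are hypothetical, not taken from the authors' materials.

```python
from itertools import product

# Hypothetical labels for the three independent variables described in the abstract;
# the seven conclusion-language entries are placeholders, not the authors' exact wording.
llm_configs = ["standard", "knowledge_enhanced"]               # 2 levels
cross_examination = [True, False]                              # 2 levels
conclusion_language = [f"variation_{i}" for i in range(1, 8)]  # 7 levels
repetitions_per_condition = 200

# Full crossing of the three factors: 2 x 2 x 7 = 28 conditions.
conditions = list(product(llm_configs, cross_examination, conclusion_language))
total_trials = len(conditions) * repetitions_per_condition

print(len(conditions))   # 28 conditions
print(total_trials)      # 5600, matching the 5,600 measures reported in the abstract
```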