LLMs as jurors: assessing the potential of Large Language Models in legal contexts

Yongjie, S., Zappalà, A., Di Maso, E., Pompedda, F., Nyman, T. J. ORCID: https://orcid.org/0000-0002-6409-2528 and Santtila, P. (2025) LLMs as jurors: assessing the potential of Large Language Models in legal contexts. Law and Human Behavior. ISSN 1573-661X (In Press)

Text (Accepted Version) · Restricted to Repository staff only · 1MB
The copyright of this document has not been checked yet; this may affect its availability.

It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing.

Abstract/Summary

Objectives: We explored the potential of Large Language Models (LLMs) in legal decision-making by replicating Fraser et al.'s (2023) mock jury experiment using LLMs (GPT-4o, Claude 3.5 Sonnet, and GPT-o1) as decision-makers. We investigated LLMs' reactions to factors that influenced human jurors, including defendant race, social status, number of allegations, and reporting delay in sexual assault cases.

Hypotheses: We hypothesized that LLMs would show higher consistency than humans, with no explicit but potential implicit biases. We also examined potential mediating factors (race-crime congruence, credibility, Black Sheep Effect) and moderating effects (beliefs about traumatic memory, ease of reporting) explaining LLM decision-making.

Methods: Using a 2 × 2 × 2 × 3 factorial design, we manipulated defendant race (Black/White), social status (low/high), number of allegations (one/five), and reporting delay (5/20/35 years), collecting 2,304 responses across conditions. LLMs were prompted to act as jurors, providing probability of guilt assessments (0-100), dichotomous verdicts, and responses to mediator and moderator variables.

Results: LLMs showed higher average probability of guilt assessments than humans (63.56 vs. 58.82) but were more conservative in rendering guilty verdicts (21% vs. 49%). Like humans, LLMs demonstrated bias against White defendants and increased guilt attributions with multiple allegations. Unlike humans, who showed minimal effects of reporting delay, LLMs assigned higher guilt probabilities to cases with shorter reporting delays. Mediation analyses revealed that race-crime stereotype congruency and the Black Sheep Effect partially mediated the racial bias effect, while perceived memory strength mediated the reporting delay effect.

Conclusion: While LLMs may offer more consistent decision-making, they are not immune to biases and may interpret certain case factors differently than human jurors.
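Purely as an illustration of the design's size: the 2 × 2 × 2 × 3 factorial described in the Methods yields 24 unique conditions, so the 2,304 collected responses correspond to 96 per condition. A minimal sketch (the factor labels are taken from the abstract; the enumeration itself is illustrative, not the authors' code):

```python
from itertools import product

# Factors from the abstract's 2 x 2 x 2 x 3 factorial design
race = ["Black", "White"]
status = ["low", "high"]
allegations = [1, 5]
delay_years = [5, 20, 35]

# Enumerate every combination of factor levels
conditions = list(product(race, status, allegations, delay_years))
print(len(conditions))          # 24 conditions
print(2304 // len(conditions))  # 96 responses per condition
```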

Item Type: Article
Refereed: Yes
Divisions: Life Sciences > School of Psychology and Clinical Language Sciences > Department of Psychology
ID Code: 122859
Publisher: American Psychological Association
