Transformer-decoder GPT models for generating virtual screening libraries of HMG-Coenzyme A reductase inhibitors: effects of temperature, prompt-length and transfer-learning strategies
Cafiero, M.
DOI: 10.1021/acs.jcim.4c01309

Abstract

Attention-based decoder models were used to generate libraries of novel inhibitors of the HMG-Coenzyme A reductase (HMGCR) enzyme. These deep neural network models were pre-trained on previously synthesized drug-like molecules from the ZINC15 database to learn the syntax of SMILES strings, and then fine-tuned on a set of ~1,000 molecules that inhibit HMGCR. The numbers of layers used for pre-training and fine-tuning were varied to find the optimal balance for robust library generation. Virtual screening libraries were also generated with different temperatures and numbers of input tokens (prompt-lengths) to find the most desirable molecular properties. The resulting libraries were screened against several criteria: IC50 values predicted by a Dense Neural Network (DNN) trained on experimental HMGCR IC50 values, docking scores from AutoDock Vina (via Dockstring), the calculated Quantitative Estimate of Druglikeness (QED), and Tanimoto similarity to known HMGCR inhibitors. Models with a 50/50 or 25/75 pre-trained/fine-tuned layer split, a non-zero sampling temperature, and shorter prompt-lengths produced the most robust libraries, and the DNN-predicted IC50 values correlated well with docking scores and statin similarity. Of the generated molecules, 42% were classified as statin-like by k-means clustering, with the rosuvastatin-like cluster showing the lowest IC50 values and the lowest (most favorable) docking scores.
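To make the layer-split transfer-learning strategy concrete, below is a minimal PyTorch sketch in which a chosen fraction of decoder layers keeps its pre-trained weights frozen while the remaining layers are updated during fine-tuning. The helper name and the assumption that the layers live in an nn.ModuleList are illustrative, not the paper's implementation.

```python
import torch.nn as nn

def freeze_pretrained_layers(decoder_layers: nn.ModuleList, frozen_fraction: float) -> None:
    """Freeze the first `frozen_fraction` of decoder layers so that only the
    remaining layers receive gradient updates during fine-tuning.

    frozen_fraction=0.5 gives a 50/50 pre-trained/fine-tuned split;
    frozen_fraction=0.25 gives 25/75. (Hypothetical helper; the paper's
    exact layer-assignment scheme may differ.)
    """
    n_frozen = int(len(decoder_layers) * frozen_fraction)
    for layer in decoder_layers[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
```

For a GPT-2-style Hugging Face model, for example, this could be called as freeze_pretrained_layers(model.transformer.h, 0.5) to realize the 50/50 split.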
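The temperature and prompt-length effects can be illustrated with a generic autoregressive sampling loop. The model and tokenizer below are stand-ins, assumed to return next-token logits of shape (batch, sequence, vocab) and to follow the usual encode/decode/eos_token_id conventions; they are not the paper's trained artifacts.

```python
import torch

@torch.no_grad()
def generate_smiles(model, tokenizer, prompt: str,
                    temperature: float = 0.8, max_new_tokens: int = 100) -> str:
    """Sample one SMILES string token-by-token from a decoder-only model."""
    ids = tokenizer.encode(prompt)                      # prompt-length = number of seed tokens
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1, :]   # logits for the next token
        if temperature <= 0:
            next_id = int(torch.argmax(logits))         # temperature 0: greedy, deterministic
        else:
            probs = torch.softmax(logits / temperature, dim=-1)  # flatter as T grows
            next_id = int(torch.multinomial(probs, num_samples=1))
        if next_id == tokenizer.eos_token_id:           # end-of-SMILES token
            break
        ids.append(next_id)
    return tokenizer.decode(ids)
```

Lower temperatures concentrate probability on high-likelihood tokens (reaching the deterministic limit at T=0), while non-zero temperatures and shorter prompts leave more room for diverse completions, matching the abstract's finding that these settings gave the most robust libraries.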
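Three of the screening criteria (all except the DNN-predicted IC50) can be approximated with RDKit and dockstring, as in the sketch below. The load_target/dock calls are the documented dockstring API; treating "HMGCR" as an available target name is an assumption based on the abstract's use of Dockstring for this enzyme.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED
from dockstring import load_target

def screen(smiles: str, reference_smiles: str, target_name: str = "HMGCR"):
    """Score one generated molecule on QED, Tanimoto similarity to a known
    inhibitor, and an AutoDock Vina docking score via dockstring."""
    mol = Chem.MolFromSmiles(smiles)
    ref = Chem.MolFromSmiles(reference_smiles)
    if mol is None or ref is None:
        return None                                     # discard invalid SMILES

    qed = QED.qed(mol)                                  # druglikeness, 0 (poor) to 1 (drug-like)

    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    ref_fp = AllChem.GetMorganFingerprintAsBitVect(ref, radius=2, nBits=2048)
    tanimoto = DataStructs.TanimotoSimilarity(fp, ref_fp)

    score, _aux = load_target(target_name).dock(smiles)  # lower (more negative) = better
    return {"QED": qed, "Tanimoto": tanimoto, "VinaScore": score}
```

A generated library would be filtered by calling screen on each candidate against one or more known HMGCR inhibitors as the reference.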
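Finally, a sketch of the statin-likeness classification step: k-means clustering of the generated library on Morgan fingerprints. The fingerprint parameters and the number of clusters are assumptions; the paper's descriptor set and cluster count may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans

def cluster_library(smiles_list, n_clusters=4, n_bits=2048):
    """Group generated molecules by k-means on Morgan fingerprint vectors."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue                                    # skip invalid SMILES
        bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(bv, arr)        # bit vector -> numpy row
        fps.append(arr)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(np.array(fps))
    return labels                                       # cluster index per valid molecule
```

Clusters could then be labeled statin-like (or rosuvastatin-like) by comparing their members to known statins, along the lines of the abstract's 42% statin-like classification.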