Crystal structure generation with autoregressive large language modeling

Antunes, L. M., Butler, K. T. and Grau-Crespo, R. ORCID: https://orcid.org/0000-0001-8845-1719 (2024) Crystal structure generation with autoregressive large language modeling. Nature Communications. ISSN 2041-1723 (In Press)

It is advisable to refer to the publisher's version if you intend to cite from this work.

Abstract/Summary

The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective models of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.