MOLLEO
Efficient Evolutionary Search Over Chemical Space with Large Language Models

Haorui Wang^*,1, Marta Skreta^*,2,3, Cher-Tian Ser², Wenhao Gao⁴, Lingkai Kong¹, Felix Streith-Kalthoff⁵, Chenru Duan⁶, Yuchen Zhuang¹, Yue Yu¹, Yanqiao Zhu⁷, Yuanqi Du^†,8, Alán Aspuru-Guzik^†,2,3, Kirill Neklyudov^†,9,10, Chao Zhang^†,1

¹Georgia Institute of Technology, ²University of Toronto, ³Vector Institute, ⁴Massachusetts Institute of Technology, ⁵University of Wuppertal, ⁶Deep Principle Inc., ⁷University of California, Los Angeles, ⁸Cornell University, ⁹Université de Montréal, ¹⁰Mila - Quebec AI Institute
^*Indicates Equal Contribution
^†Indicates Equal Senior-Authorship

Abstract

Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware Large Language Models (LLMs) into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations.
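The idea of placing a proposal model inside the evolutionary loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: molecules are plain strings standing in for SMILES, `evolve`, `toy_propose`, and `toy_fitness` are hypothetical names, and `toy_propose` is a deterministic placeholder for where an LLM-backed crossover or mutation call would go.

```python
import random
from typing import Callable, List

Molecule = str  # assumption: a string standing in for a SMILES representation


def evolve(
    population: List[Molecule],
    fitness: Callable[[Molecule], float],
    propose: Callable[[Molecule, Molecule], Molecule],
    n_generations: int = 10,
    n_offspring: int = 20,
    seed: int = 0,
) -> List[Molecule]:
    """Elitist evolutionary loop with a pluggable proposal operator.

    `propose` is the slot where an LLM-based crossover/mutation would sit
    (hypothetical interface, not taken from the paper's code).
    """
    rng = random.Random(seed)
    pool_size = len(population)
    # Score each molecule in the pool once; oracle calls are the expensive part.
    scores = {m: fitness(m) for m in population}
    for _ in range(n_generations):
        parents = list(scores)
        for _ in range(n_offspring):
            a, b = rng.sample(parents, 2)
            child = propose(a, b)
            if child not in scores:  # skip molecules already in the pool
                scores[child] = fitness(child)
        # Keep only the best `pool_size` molecules for the next generation.
        best = sorted(scores, key=scores.get, reverse=True)[:pool_size]
        scores = {m: scores[m] for m in best}
    return sorted(scores, key=scores.get, reverse=True)


def toy_fitness(m: Molecule) -> float:
    # Toy objective (assumption): count carbon characters in the string.
    return float(m.count("C"))


def toy_propose(a: Molecule, b: Molecule) -> Molecule:
    # Placeholder for an LLM call: splice two parents and append one atom.
    return a[: len(a) // 2] + b[len(b) // 2 :] + "C"


start = ["CCO", "CCN", "CO", "CCCC"]
final = evolve(start, toy_fitness, toy_propose, n_generations=5, n_offspring=8)
```

Because selection is elitist, the best score in the pool never decreases between generations; how quickly it rises depends on how useful the proposals are, which is the part the LLM operators are meant to improve.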

Experiments

Single-objective optimization

Employing any of the three LLMs we tested as genetic operators improves performance over the default Graph-GA and all other baselines. Notably, MOLLEO(GPT-4) outperforms all other models in 9 out of 12 tasks, demonstrating its utility in molecular optimization. MOLLEO(BioT5) achieves the second-best results of all the models tested, obtaining a total score close to that of MOLLEO(GPT-4), and has the benefit of being free to use.

Effectiveness of LLMs in GA

The figure above shows the fitness distribution of an initial pool of random molecules on the JNK3 inhibition task. We then perform a single round of edits on all molecules in the pool using each LLM and plot the resulting fitness distribution of the edited molecules. For each LLM, the distribution shifts to slightly higher fitness values, indicating that the LLMs do provide useful modifications. However, the overall objective scores remain low, so single-step editing alone is not sufficient.
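The single-round editing experiment can be pictured with a small sketch that scores a pool before and after one pass of edits. Everything here is an assumption for illustration: `edit_once`, `toy_edit`, and `toy_score` are hypothetical names, the edit function stands in for an LLM call, and the score is a toy function rather than the JNK3 oracle.

```python
import statistics
from typing import Callable, Dict, List


def edit_once(
    pool: List[str],
    edit: Callable[[str], str],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Apply one round of edits to every molecule and summarize the shift."""
    before = [score(m) for m in pool]
    after = [score(edit(m)) for m in pool]
    improved = sum(a > b for a, b in zip(after, before))
    return {
        "mean_before": statistics.mean(before),
        "mean_after": statistics.mean(after),
        "frac_improved": improved / len(pool),
    }


def toy_score(m: str) -> float:
    # Toy objective (assumption): fraction of nitrogen characters.
    return m.count("N") / len(m)


def toy_edit(m: str) -> str:
    # Placeholder for a single LLM edit: append one nitrogen character.
    return m + "N"


summary = edit_once(["CCO", "CCN", "CNC"], toy_edit, toy_score)
```

A higher `mean_after` than `mean_before` corresponds to the rightward shift described above; a mean that is still far from the maximum score is the reason repeated rounds inside an evolutionary loop are needed.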

We also conduct convergence analysis on several more optimization objectives.

Structure-based Drug Design

Structure-based design aims to design small molecule ligands based on a specific protein target. The evaluation is based on computationally calculated docking scores.

Multi-objective optimization

Multi-objective optimization is a more challenging setting. The tasks are inspired by goals in drug discovery and aim to optimize several objectives simultaneously. We find that MOLLEO(GPT-4) consistently outperforms the baseline Graph-GA in all three tasks in terms of hypervolume and objective summation.
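For readers unfamiliar with the two metrics, the sketch below computes them for a set of two-objective scores. It is an illustration only, assuming both objectives are to be maximized and restricting hypervolume to two dimensions with a reference point at the origin; `hypervolume_2d` and `objective_sum` are hypothetical helper names, and the numbers are made up for the example, not results from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def objective_sum(points: Sequence[Point]) -> List[float]:
    """Sum of objective values for each candidate (both maximized)."""
    return [sum(p) for p in points]


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Only points strictly better than the reference contribute area.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective to the smallest, adding the
    # strip of new area each time the second objective improves.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Made-up scores for three candidates on two objectives.
front = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.9)]
hv = hypervolume_2d(front)
sums = objective_sum(front)
```

Hypervolume rewards a set of candidates that covers the trade-off between objectives, while objective summation scores each candidate by a single aggregate value, so the two metrics can rank methods differently.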

Case Study

Here is a case study of MOLLEO(GPT-4) on the deco_hop task. We display the top-10 candidate molecules across all runs.


BibTeX

@misc{wang2024efficientevolutionarysearchchemical,
  title={Efficient Evolutionary Search Over Chemical Space with Large Language Models},
  author={Haorui Wang and Marta Skreta and Cher-Tian Ser and Wenhao Gao and Lingkai Kong and Felix Streith-Kalthoff and Chenru Duan and Yuchen Zhuang and Yue Yu and Yanqiao Zhu and Yuanqi Du and Alán Aspuru-Guzik and Kirill Neklyudov and Chao Zhang},
  year={2024},
  eprint={2406.16976},
  archivePrefix={arXiv},
  primaryClass={cs.NE},
  url={https://arxiv.org/abs/2406.16976},
}

MOLLEO uses chemistry-aware LLMs inside mutation and crossover operations to propose new molecules during the evolutionary search process.
