AI-Powered Drug Discovery Breakthrough: Universal Molecular Fingerprint Encoding Enables Dramatic Acceleration in Virtual Screening

Jul 18, 2023 11:25 CST Updated Jul 19, 11:08

The possibilities in drug discovery are endless, but the high attrition rate during drug development is one of the foremost challenges in biomedical science. As of 2020, the average development cost of each drug approved by the US FDA was estimated to be as high as 1.3 billion US dollars.

In recent years, Computer-Aided Drug Design (CADD) has become a crucial part of drug development, enabling the initial screening of candidate molecules and their further optimization. The promising leads it identifies at an early stage are vital to the rest of the discovery process. However, CADD faces three major challenges. First, its computational cost is very high. Second, whether a drug reaches the market depends not only on its efficacy against the intended disease but also on many other factors: 90% of drugs fail in clinical trials, and properties such as ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity), Drug-Drug Interactions (DDI), and side effects significantly influence a drug's success. Third, current CADD workflows often rely on numerous specialized models, each predicting a single chemical property, and when multiple models are stacked, the required computational cost increases exponentially.

Recently, Professor Yuguang Mu from Nanyang Technological University, Dr. Liangzhen Zheng from Zelixir, and Professor Weifeng Li from Shandong University published a paper titled "Application of variational graph encoders as an effective generalist algorithm in computer-aided drug design" in Nature Machine Intelligence, a Nature Portfolio journal.

The study proposes a single unified model that uses graph convolutional variational encoders to simultaneously predict multiple properties of small-molecule drugs, including absorption, distribution, metabolism, excretion, and toxicity (ADMET), target-specific docking scores, and drug-drug interactions. The model enables state-of-the-art virtual screening with a speed-up of up to two orders of magnitude. Minimization in the latent space of the variational graph encoder can also be combined with the Pareto optimality principle to accelerate the development of specific drugs, with the additional benefit of interpretability.

Despite significant progress in molecular property prediction within CADD, there remains an urgent need for fast and accurate models. Most currently available methods specialize in predicting a single property, so stacking multiple models leads to very high computational costs.

In this study, the research team proposed a Variational Graph Encoder to address these challenges in CADD. It is a graph convolutional neural network that incorporates elements of a variational autoencoder and is trained to predict simple molecular descriptors and binary molecular fingerprints rather than to reconstruct its input.
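To make the training objective more concrete, here is a minimal sketch of what such a loss could look like. It is an illustration under assumptions, not the paper's implementation: the function names, the equal weighting of the terms, and the `beta` parameter are all hypothetical. It combines a binary cross-entropy on fingerprint bits, a mean squared error on descriptors, and the usual variational KL term, in place of a reconstruction loss.

```python
import math

def reparameterize(mu, logvar, eps):
    """Sample z = mu + sigma * eps (standard VAE reparameterization)."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def binary_cross_entropy(probs, bits, tiny=1e-12):
    """Mean BCE between predicted bit probabilities and 0/1 fingerprint bits."""
    total = sum(b * math.log(p + tiny) + (1 - b) * math.log(1.0 - p + tiny)
                for p, b in zip(probs, bits))
    return -total / len(bits)

def mean_squared_error(pred, target):
    """Mean squared error between predicted and reference descriptors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def encoder_loss(mu, logvar, fp_probs, fp_bits, desc_pred, desc_true, beta=1.0):
    """Hypothetical combined loss: fingerprint BCE + descriptor MSE + beta * KL.
    The weighting is an assumption made for illustration only."""
    return (binary_cross_entropy(fp_probs, fp_bits)
            + mean_squared_error(desc_pred, desc_true)
            + beta * kl_to_standard_normal(mu, logvar))
```

The point of the sketch is only that the decoder side predicts fixed-length targets (fingerprint bits and descriptors), so nothing in the objective depends on regenerating the input molecule.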

Through the intermediate mathematical representation (latent space) of variational graph encoders, surrogate models can be trained to predict more complex properties. Previous work utilizing latent space includes sampling in variational autoencoders to generate effective and selective RIPK1 inhibitors and BRAF inhibitors.

Molecules are encoded as graphs and passed through the variational graph encoder; the resulting latent space is then used for property prediction via a surrogate model.
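The surrogate-model idea can be sketched in a few lines. The example below is an assumption-laden toy, not the authors' code: the latent vectors are synthetic stand-ins rather than real encoder output, the property is an invented linear function, and a plain linear model fitted by gradient descent stands in for whatever surrogate the paper actually uses.

```python
def fit_linear_surrogate(latents, targets, lr=0.05, epochs=2000):
    """Fit y ~ w . z + b by full-batch gradient descent on mean squared error."""
    dim = len(latents[0])
    n = len(latents)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for z, y in zip(latents, targets):
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - y
            for k in range(dim):
                grad_w[k] += 2.0 * err * z[k] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, z):
    """Predict the property for one latent vector."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

# Synthetic stand-ins for latent vectors (NOT real encoder output)
latents = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
# Invented toy property, used only so the fit can be checked
targets = [2.0 * z0 - 1.0 * z1 + 0.5 for z0, z1 in latents]

w, b = fit_linear_surrogate(latents, targets)
```

Because the expensive encoding step is shared, adding another predicted property in this setup means fitting one more small model on the same latent vectors rather than training a new specialized network.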

In terms of methods, earlier work that encoded and decoded SMILES strings with recurrent neural networks unlocked the potential of molecular latent spaces for optimization and prediction. Subsequent work using variational autoencoders built on graph features and adjacency matrices also showed promising results. However, the latter restricts the size of the molecules that the autoencoder can accept or generate, while the former is affected by the fact that multiple SMILES strings can encode the same molecule. Neither approach involves any node-level convolution.
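The SMILES ambiguity is easy to see on a small case: ethanol can be written as either "CCO" or "OCC". The sketch below uses a deliberately tiny, hypothetical parser (unbranched chains of single-letter atoms only) and a crude neighbourhood signature, which is not a full canonicalization algorithm, to show that the two strings differ while the underlying graphs agree.

```python
def parse_linear_smiles(smiles):
    """Toy parser: unbranched chains of single-letter atoms joined by single bonds.
    Real SMILES (branches, rings, bond orders, charges) is NOT supported."""
    if not (smiles.isalpha() and smiles.isupper()):
        raise ValueError("toy parser only handles strings like 'CCO'")
    atoms = list(smiles)
    bonds = [(i, i + 1) for i in range(len(atoms) - 1)]
    return atoms, bonds

def neighbourhood_signature(atoms, bonds):
    """Order-independent summary: each atom paired with its sorted neighbour elements.
    This is a simple graph invariant, not a complete isomorphism test."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(atoms[b])
        neighbours[b].append(atoms[a])
    return sorted((atoms[i], tuple(sorted(neighbours[i]))) for i in range(len(atoms)))

ethanol_a = "CCO"
ethanol_b = "OCC"
dimethyl_ether = "COC"

sig_a = neighbourhood_signature(*parse_linear_smiles(ethanol_a))
sig_b = neighbourhood_signature(*parse_linear_smiles(ethanol_b))
sig_ether = neighbourhood_signature(*parse_linear_smiles(dimethyl_ether))
```

A string-based encoder sees `ethanol_a` and `ethanol_b` as different inputs, whereas a graph-based view treats them as one molecule and still separates them from dimethyl ether.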

The solution is to use an edge-conditioned graph convolutional neural network, which learns the bonds and neighboring atoms of each molecule through convolution. Training the encoder to predict fingerprints and chemical descriptors takes the place of the reconstruction step in a traditional autoencoder, and allows molecules of any size to be encoded while preserving bond and connectivity information. With these limitations overcome, virtual screening models can be accurate and interpretable, can make predictions across datasets drawn from different databases, and can integrate target-specific scoring functions.
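As a rough illustration of what "edge-conditioned" means, the sketch below performs one message-passing step in which the weight matrix applied to a neighbor's features depends on the bond type. Everything here is an assumption made for illustration: the weights are hand-picked rather than learned, a lookup table stands in for the small network that would normally generate weights from edge features, and the layer is not the architecture from the paper.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def edge_conditioned_layer(node_feats, edges, edge_weights, self_weight):
    """One step: h_i' = relu(W_self h_i + sum over bonded j of W[bond(i, j)] h_j)."""
    out = [matvec(self_weight, h) for h in node_feats]
    for i, j, bond in edges:
        w = edge_weights[bond]  # the weight matrix is chosen by the bond type
        msg_to_i = matvec(w, node_feats[j])
        msg_to_j = matvec(w, node_feats[i])
        out[i] = [a + b for a, b in zip(out[i], msg_to_i)]
        out[j] = [a + b for a, b in zip(out[j], msg_to_j)]
    return [[max(0.0, x) for x in h] for h in out]

# Heavy atoms of acetaldehyde, C-C=O, with one-hot features [is_C, is_O]
node_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1, "single"), (1, 2, "double")]

# Hand-picked illustrative weights (a trained model would learn these)
self_weight = [[1.0, 0.0], [0.0, 1.0]]
edge_weights = {
    "single": [[0.5, 0.0], [0.0, 0.5]],
    "double": [[2.0, 0.0], [0.0, 2.0]],
}

updated = edge_conditioned_layer(node_feats, edges, edge_weights, self_weight)
```

Since the layer loops over whatever atoms and bonds it is given, the same weights apply to molecules of any size, which is the property the text contrasts with fixed-size adjacency-matrix autoencoders.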

As the overall diversity and accuracy of surrogate models have greatly improved, more challenging multi-objective optimization can be performed using the Pareto optimality principle. When combined with structure-based virtual screening, the method speeds up the initial screening by up to two orders of magnitude, making it feasible to screen molecule libraries that were previously too large to handle. The research team stated that this method transforms computer-aided drug design (CADD) into computer-aided drug engineering (CADE).
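The Pareto optimality principle mentioned here can be illustrated with a small filter: a candidate is kept if no other candidate is at least as good on every objective and strictly better on one. The molecule names and scores below are invented for the example, and both objectives are assumed to be "lower is better".

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (lower values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Hypothetical surrogate predictions: (docking score, predicted toxicity)
predicted = {
    "mol_A": (-9.0, 0.40),
    "mol_B": (-7.5, 0.10),
    "mol_C": (-7.0, 0.30),
    "mol_D": (-9.5, 0.80),
}

front = pareto_front(predicted)
```

In this toy set, `mol_C` is dropped because `mol_B` is better on both objectives, while the other three each represent a different trade-off between binding and toxicity.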

Desired molecular properties can be designed through optimization of the surrogate model, which can also explain why one molecule outperforms another in property prediction.

Overall, the latent space of the variational graph encoder developed in this study is surprisingly versatile and can be used to predict properties across highly diverse datasets. Further work will address the method's limiting factors and apply the algorithm in drug discovery pipelines, including subsequent experimental validation.