
Evolutionary Scale Files IPO Prospectus: Pioneering AI-Driven Protein Design with ESM3, Simulating Over 500 Million Years of Evolution

Oct 10, 2024 07:59 CST Updated 08:00
EvolutionaryScale

Developer of Bio-AI Models

AI-powered large biological models are emerging as a pivotal force driving advancements in the life sciences. By providing in-depth analysis and predictions of biological data, these models significantly accelerate processes in drug discovery, genomics research, and disease diagnosis. They are capable of processing and interpreting vast datasets of biological information, uncovering the patterns and mechanisms underlying biological phenomena, thereby offering new perspectives and strategies for disease treatment and biotechnological innovation.

 

Currently, in drug development, artificial intelligence models can predict the activity of drug molecules and optimize drug design, thereby shortening the time required to bring drugs from the laboratory to the market. In the field of synthetic biology, large AI models can design and optimize novel proteins that may not exist in nature, thus providing new biological components for synthetic biology. They enhance the performance of existing proteins through machine learning algorithms, which is crucial for developing efficient biocatalysts and precise biosensors.

 

In June 2024, the artificial intelligence company EvolutionaryScale released its first large AI model, ESM3, which is widely regarded in the field as transformative. As EvolutionaryScale’s flagship product, ESM3 simulates over 500 million years of natural evolution to generate functional proteins with novel sequences. This result demonstrates ESM3’s potential for generating new proteins and opens up new possibilities for research and applications in synthetic biology.

 

Model Training with More Than 10^24 FLOPs and 98 Billion Parameters


EvolutionaryScale was founded in July 2023 to develop artificial intelligence technologies that advance research and applications in the life sciences, in particular the ESM3 model for understanding and generating protein sequences, structures, and functions.

 

EvolutionaryScale is led by Alexander Rives, who previously headed the protein folding research project at Meta AI. After Meta shut down the project, he left with his team to found EvolutionaryScale; all eight founding employees came from the same department. Shortly after its establishment, EvolutionaryScale secured $40 million in seed funding. In June 2024, the company raised an additional $142 million, backed by prominent venture capital firms and tech giants. The round was led by Nat Friedman, Daniel Gross, and Lux Capital, with participation from Amazon, NVentures (NVIDIA’s venture capital arm), and angel investors.

 

EvolutionaryScale aims to become a cutting-edge AI research laboratory focused on the life sciences, dedicated to launching large language models at the forefront of biology. The company has developed an artificial intelligence model based on the Transformer architecture, similar to OpenAI’s GPT-4, but trained on protein molecular data to predict the structures of other unknown proteins. Furthermore, they have used this model to create a database that now contains 700 million possible 3D protein structures, providing a key piece of the puzzle for developing disease-curing drugs, pollution-clearing microbes, or alternative methods for producing industrial chemicals.

 

One year after its founding, the company’s model had taken shape: ESM3, a generative biological model for reasoning over protein sequences, structures, and functions.

 

ESM3 was trained on Earth’s natural diversity, with training data encompassing billions of proteins from diverse environments ranging from the Amazon rainforest to the deep ocean and extreme habitats such as hydrothermal vents, as well as microbes found in soil.

 

Built on one of the world’s highest-throughput GPU clusters, ESM3 leads in parameters, compute, and data. The model was trained with over 1×10^24 FLOPs of compute (a total count of operations, not a per-second rate) and has 98 billion parameters. This capacity enables it to capture and learn even very subtle patterns in the data. ESM3 can therefore be regarded as the most compute-intensive biological model trained to date.
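
To give a feel for what these two figures imply together, the following back-of-envelope sketch applies the common rule of thumb that training a dense Transformer costs roughly 6 FLOPs per parameter per token. That rule is an assumption brought in for illustration, not a figure from EvolutionaryScale, and the function and constant names are hypothetical.

```python
# Back-of-envelope only: the 6 * N * D rule is a common approximation for
# dense Transformer training compute, not a figure from EvolutionaryScale.
TRAINING_FLOPS = 1e24          # "over 1x10^24 FLOPs" quoted in the article
PARAMETERS = 98e9              # 98 billion parameters quoted in the article
FLOPS_PER_PARAM_PER_TOKEN = 6  # assumed heuristic (forward + backward pass)


def implied_training_tokens(total_flops: float, n_params: float) -> float:
    """Tokens D such that 6 * N * D equals the compute budget."""
    return total_flops / (FLOPS_PER_PARAM_PER_TOKEN * n_params)


tokens = implied_training_tokens(TRAINING_FLOPS, PARAMETERS)
print(f"Implied training tokens: {tokens:.2e}")
```

Under that assumption the budget corresponds to something on the order of a trillion training tokens; the real figure depends on architecture and training details the article does not give.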

 

According to reports, ESM3 is the first AI model created by EvolutionaryScale, an order of magnitude larger than the previous ESM2 model, and inherently multimodal and generative.

 

The goal of ESM3 is to reason about protein sequences, structures, and functions. To achieve this, EvolutionaryScale converts three-dimensional structures and functions into discrete alphabets, so that each three-dimensional structure can be represented as a sequence of letters. This makes large-scale training of ESM3 possible and unlocks emergent generative capabilities. ESM3’s vocabulary connects sequences, structures, and functions within the same language model.
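
One way to picture "representing a structure as a sequence of letters" is nearest-neighbour quantization against a codebook. The toy sketch below is not EvolutionaryScale's tokenizer: the codebook is random rather than learned, the 8-number per-residue descriptor is hypothetical, and reading the 4096 in the article's search-space estimate as the structure alphabet size is an assumption.

```python
import math
import random

CODEBOOK_SIZE = 4096  # assumed structure alphabet size (see lead-in)
DESCRIPTOR_DIM = 8    # hypothetical size of a per-residue geometry descriptor

rng = random.Random(0)
# Hypothetical codebook: in a real system these vectors would be learned.
codebook = [[rng.gauss(0.0, 1.0) for _ in range(DESCRIPTOR_DIM)]
            for _ in range(CODEBOOK_SIZE)]


def nearest_code(descriptor):
    """Return the index of the codebook vector closest to the descriptor."""
    return min(range(CODEBOOK_SIZE),
               key=lambda i: math.dist(codebook[i], descriptor))


def tokenize_structure(descriptors):
    """Map one descriptor per residue to one discrete structure token."""
    return [nearest_code(d) for d in descriptors]


# A made-up 5-residue "protein": random descriptors stand in for real geometry.
toy_descriptors = [[rng.gauss(0.0, 1.0) for _ in range(DESCRIPTOR_DIM)]
                   for _ in range(5)]
structure_tokens = tokenize_structure(toy_descriptors)
print(structure_tokens)
```

Each residue ends up as one integer token, so a structure becomes a token sequence of the same length as the amino-acid sequence and can sit alongside it in a language model's vocabulary.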

 

The training objective of ESM3 is relatively straightforward. For each protein, EvolutionaryScale extracts, annotates, and partially masks its sequence, structure, and function. ESM3’s task is to predict the masked positions using a masked language modeling objective inspired by natural language processing models. To accomplish this, ESM3 must learn to deeply understand the interplay among sequence, structure, and function within the data. When scaled to billions of proteins and billions of parameters, ESM3 learns to simulate protein evolution and make predictions.
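
The masking step described above can be sketched in a few lines. This is a minimal illustration, not ESM3's training code: the `<mask>` symbol, the 50% mask rate, the track names, and the toy token values are all made up.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def mask_track(tokens, mask_rate, rng):
    """Hide a random subset of positions; return (masked_tokens, targets).

    targets maps each masked position to the original token that the
    model would be asked to recover.
    """
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


rng = random.Random(42)
# Toy example: three aligned tracks for a 6-residue protein (values made up).
protein = {
    "sequence": list("MSKGEE"),
    "structure": [17, 905, 33, 4001, 256, 12],
    "function": ["none", "none", "binding", "binding", "none", "none"],
}
masked_protein = {}
targets = {}
for track, tokens in protein.items():
    masked_protein[track], targets[track] = mask_track(tokens, 0.5, rng)

for track in protein:
    print(track, masked_protein[track], targets[track])
```

Because each track is masked independently, a position hidden in one track can remain visible in another, which is what forces a model trained this way to relate sequence, structure, and function to one another.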

 

Generating Novel Protein Sequences, Simulating Over 500 Million Years of Natural Evolution

 

Trained at a scale rarely seen in biology, ESM3’s first achievement was to reason through the design process and generate new green fluorescent protein (GFP) candidates that would have taken 500 million years to emerge through natural evolution.

 

Green Fluorescent Protein (GFP) is a protein of significant importance in biological scientific research, originally discovered in a jellyfish named Aequorea victoria. GFP is renowned for its unique intrinsic fluorescence properties, enabling it to emit green light when exposed to ultraviolet or blue light. This fluorescence is caused by its internal chromophore, which emits green light after absorbing light energy.

 

GFP has a wide range of applications, particularly in the fields of molecular and cell biology. By fusing the GFP gene with other target genes, scientists can produce GFP-tagged proteins, enabling real-time observation and tracking of protein expression and localization in living cells. This technology has greatly advanced our understanding of cellular processes and biomolecular interactions.

 

To accommodate diverse research needs, scientists have engineered GFP using genetic engineering techniques to create a series of fluorescent protein variants in different colors. These variants span the spectrum from blue to red, enabling researchers to simultaneously track multiple distinct biological markers in a single experiment.

 

The discovery and application of Green Fluorescent Protein (GFP) have had a profound impact on the field of biological sciences, earning its contributors—Osamu Shimomura, Martin Chalfie, and Roger Y. Tsien—the 2008 Nobel Prize in Chemistry. As a biological marker, GFP has played a positive role in reducing the use of laboratory animals by enabling research in vitro or in simpler model organisms, and it has now become an indispensable tool in modern biotechnological research.

 

GFP consists of a barrel-shaped structure composed of 11 β-strands with an α-helix running through the center. Upon folding, GFP undergoes a spontaneous reaction to form a fluorescent chromophore. Currently, no other known proteins exhibit this unique mechanism.

 

Even for nature, producing fluorescence is difficult. Evolving a new fluorescent protein takes a long time: natural fluorescent proteins diverged from their ancestral sequences over hundreds of millions of years to become the proteins we see today.

 

Using the structure of certain residues in the natural GFP core as a starting point, EvolutionaryScale had ESM3 reason through the design and generate new GFP candidates. Finding such a candidate purely by chance is virtually impossible: the space of possible sequences and structures has 20^229 × 4096^229 members. In its initial experiment, EvolutionaryScale tested 96 protein candidates and identified one whose chromophore matured within a week, albeit with a brightness 50-fold lower than that of natural GFP. Building on this result, the team generated another set of 96 proteins; subsequent testing revealed several candidates with brightness comparable to natural GFP, the brightest of which was named esmGFP. esmGFP differs from its closest naturally occurring fluorescent protein by 96 mutations, with only 58% sequence identity across its 229 amino acids.
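
The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that the 96 mutations are substitutions at 96 distinct positions and that 4096 is the number of possible structure tokens per residue; the variable names are illustrative.

```python
LENGTH = 229             # residues in esmGFP, per the article
AMINO_ACIDS = 20         # sequence alphabet size
STRUCTURE_TOKENS = 4096  # assumed structure alphabet size per residue
MUTATIONS = 96           # differences from the closest natural protein

# Python integers are arbitrary precision, so the count is exact.
search_space = AMINO_ACIDS ** LENGTH * STRUCTURE_TOKENS ** LENGTH
order_of_magnitude = len(str(search_space)) - 1

# Fraction of positions left unchanged by 96 substitutions.
identity = (LENGTH - MUTATIONS) / LENGTH

print(f"Search space is about 10^{order_of_magnitude}")
print(f"Sequence identity: {identity:.0%}")
```

The combined space comes out at roughly 10^1125 possibilities, and 133 unchanged positions out of 229 gives the 58% figure quoted above.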

 

Through further analysis, the EvolutionaryScale team estimated that esmGFP, generated by ESM3 through simulated evolution, would have required over 500 million years of natural evolution to emerge in nature.


Open ESM3 Model: Empowering the Industry to Explore the Frontiers of Protein Design and Synthetic Biology Technologies


ESM3, a multimodal generative language model, can understand and generate protein sequences, structures, and functions, providing scientists with a powerful platform for designing and engineering biological components. This not only accelerates the process of protein design and discovery but also offers new perspectives for studying the principles of evolutionary biology.

 

The emergence of such large AI models has the potential to transform the biotechnology industry, particularly in areas such as drug development, biocatalyst engineering, and biomaterials design.

 

In light of this achievement, Tom Ellis, a prominent synthetic biologist in the UK, commented that the new model means “Design” is no longer the rate-limiting step in the Design-Build-Test-Learn (DBTL) paradigm; future efforts should focus on the bottlenecks in the “Build” and “Test” phases.

 

Currently, ESM models have achieved influential applications in both research and industry. For instance, scientists have utilized ESM-1v and ESM-1b to evolve antibodies, thereby improving therapeutic properties. BioNTech and InstaDeep fine-tuned ESM language models on the COVID spike protein to detect variants posing higher risks to public health, successfully flagging all 16 Variants of Concern prior to their designation by the World Health Organization. Additionally, ESM-1b has been employed to predict the clinical effects of mutations and to design novel anti-CRISPR protein variants that maintain on-target editing functionality while reducing off-target side effects.

 

Furthermore, ESM models have accelerated scientific research and development in the industry by guiding the evolution of multiple proteins, predicting enzyme functions, and designing antibodies with high binding affinity.

 

To support industry development and continue exploring the frontiers of protein design and synthetic biology, EvolutionaryScale has released portions of its ESM3 model. The company is also developing specialized versions of ESM3 to unlock cutting-edge applications in drug design, with the aim of assisting scientists in creating new therapeutics through protein design.