Natural language model jump-starts protein design with creation of working enzymes

Scientists have created an AI system capable of generating artificial enzymes from scratch. In laboratory tests, some of these enzymes performed as well as those found in nature, even when their artificially generated amino acid sequences diverged significantly from any known natural protein.

The experiment demonstrates that natural language processing, although developed for reading and writing text, can learn at least some of the underlying principles of biology. Salesforce Research developed the AI program, called ProGen, which uses next-token prediction to assemble amino acid sequences into artificial proteins.
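To make "next-token prediction" concrete, here is a minimal sketch of the idea applied to amino acid letters. It is not ProGen: it assumes a toy bigram model that looks only at the previous residue, and the training sequences and the start and end markers are made up for illustration.

```python
import random
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
START, END = "^", "$"  # hypothetical start/end markers, not part of any real model


def train_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = START
        for tok in seq + END:
            counts[prev][tok] += 1
            prev = tok
    return counts


def generate(counts, rng, max_len=50):
    """Build a sequence one residue at a time, each sampled given the previous one."""
    out = []
    prev = START
    while len(out) < max_len:
        options = counts[prev]
        tokens = list(options)
        weights = [options[t] for t in tokens]
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        if tok == END:
            break
        out.append(tok)
        prev = tok
    return "".join(out)


# Made-up training sequences for illustration only (not real lysozymes).
toy_training_set = ["MKALIVLGL", "MKVLAAGIV", "MKALLVAGL"]
model = train_bigram(toy_training_set)
sample = generate(model, random.Random(0))
print(sample)
```

A large language model replaces the bigram counts with a neural network conditioned on the whole preceding sequence, but the generation loop has the same shape: predict a distribution over the next token, sample from it, append, and repeat.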

The scientists said the new technology could become more powerful than directed evolution, the Nobel Prize-winning protein design technology, and that it will energize the 50-year-old field of protein engineering by speeding the development of new proteins that can be used for almost anything, from therapeutics to degrading plastic.

“The artificial designs perform much better than designs that were inspired by the evolutionary process,” said James Fraser, PhD, professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy and an author of the work, which was published January 26 in Nature Biotechnology.

“The language model learns aspects of evolution, but it’s different from the normal evolutionary process,” Fraser said. “We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that is incredibly thermostable or likes acidic environments or won’t interact with other proteins.”

To create the model, the scientists simply fed the amino acid sequences of 280 million different proteins of all kinds into the machine learning model and let it digest the information for a few weeks. Next, they fine-tuned the model by priming it with 56,000 sequences from five lysozyme families, along with contextual information about these proteins.

The model quickly generated one million sequences, and the research team selected 100 to test, based on how closely they resembled the sequences of natural proteins, as well as how naturalistic the AI proteins’ underlying amino acid “grammar” and “semantics” were.

From this first batch of 100 proteins, which were screened in vitro by Tierra Biosciences, the team made five artificial proteins to test in cells and compared their activity to that of an enzyme found in chicken egg whites, known as hen egg white lysozyme (HEWL). Similar lysozymes are found in human tears, saliva, and milk, where they defend against bacteria and fungi.

Two of the artificial enzymes were able to break down bacterial cell walls with activity comparable to HEWL, but their sequences were only about 18% identical to each other. The two sequences were approximately 90% and 70% identical to any known protein.

A single mutation in a naturally occurring protein can stop it from working, but in another round of screening the team found that AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein.
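As a rough illustration of what a "percent identical" figure measures, the sketch below counts matching positions between two sequences. It assumes the sequences have already been aligned to equal length, with "-" marking gaps, and it divides by the number of compared columns; real comparisons run an alignment first, and tools differ in the denominator they use. The function name and example sequences are hypothetical, not taken from the study.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned columns where both sequences carry the same residue.

    Assumes the inputs are already aligned (equal length, '-' for gaps).
    Columns that are gaps in both sequences are ignored; a gap opposite a
    residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Made-up six-residue example: 4 of 6 positions match.
identity = percent_identity("MKALIV", "MKVLAV")
print(f"{identity:.1f}% identical")
```

Under this definition, a generated enzyme at 31.4% identity to its closest natural relative would differ from it at more than two-thirds of the compared positions.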

The AI was even able to learn how the enzymes should be shaped, simply by studying the raw sequence data. Measured with X-ray crystallography, the atomic structures of the artificial proteins looked just as they should, although the sequences were unlike anything seen before.

Salesforce Research developed ProGen in 2020, based on a kind of natural language processing that their researchers originally developed to generate English text.

They knew from their previous work that the AI system could teach itself grammar and the meaning of words, as well as other underlying rules that make for well-composed writing.

“When you train sequence-based models with lots of data, they’re really powerful in learning structure and rules,” said Nikhil Naik, PhD, director of AI research at Salesforce Research and senior author of the paper. “They learn which words can co-occur, as well as compositionality.”

With proteins, the design choices were almost limitless. Lysozymes are small as proteins go, with up to about 300 amino acids. But with 20 possible amino acids at each position, there are an enormous number (20^300) of possible combinations. That is more than all the humans who have ever lived, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
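The size of that number is easy to check with a few lines of arithmetic. The sketch below computes 20 raised to the 300th power exactly and compares it with the product described above. The three comparison figures are rough, commonly quoted order-of-magnitude estimates chosen here as assumptions; they do not come from the article or the paper.

```python
SEQUENCE_LENGTH = 300  # approximate upper length of a lysozyme, per the article
ALPHABET_SIZE = 20     # number of standard amino acids

# Python integers have arbitrary precision, so this value is exact.
total_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH
order_of_magnitude = len(str(total_sequences)) - 1

# Rough order-of-magnitude estimates (assumptions for illustration only).
HUMANS_EVER_LIVED = 10**11
GRAINS_OF_SAND = 10**19
ATOMS_IN_UNIVERSE = 10**80
comparison = HUMANS_EVER_LIVED * GRAINS_OF_SAND * ATOMS_IN_UNIVERSE

print(f"20^300 is on the order of 10^{order_of_magnitude}")
print("Exceeds the humans x sand x atoms product:", total_sequences > comparison)
```

Under these estimates the comparison product is about 10^110, while 20^300 is about 10^390, so the space of possible sequences is larger by roughly 280 orders of magnitude.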

Considering the limitless possibilities, it is remarkable that the model can generate working enzymes so easily.

“The ability to generate functional proteins from scratch demonstrates that we are entering a new era of protein design,” said Ali Madani, PhD, founder of Profluent Bio, former research scientist at Salesforce Research, and the paper’s first author. “This is a versatile new tool available to protein engineers, and we look forward to seeing the therapeutic applications.”

Further information: https://github.com/salesforce/progen
