Proteins: The Foundation of Disease Understanding and Treatment
Proteins are essential biological molecules that perform critical functions in our bodies and are fundamental to understanding diseases. By characterizing proteins, researchers can uncover disease mechanisms, identify strategies to slow progression, and even explore possibilities for reversal. Furthermore, the creation of new proteins can pave the way for innovative classes of drugs and therapeutics.
However, current methods for designing proteins in the lab are expensive, in both computational power and human expertise. The process typically involves proposing a protein structure that could plausibly perform a specific task within the body, then finding a protein sequence (the chain of amino acids that makes up the protein) likely to fold into that structure. Proteins must fold correctly into three-dimensional shapes to carry out their intended function.
Fortunately, it doesn’t have to be this complex.
This week, Microsoft launched EvoDiff, a general-purpose framework that the company claims can generate "high-fidelity" and "diverse" proteins given only a protein sequence. Unlike other protein-generating frameworks, EvoDiff does not require any structural information about the target protein, which removes one of the most labor-intensive steps in protein design.
Available as open-source software, EvoDiff has the potential to create enzymes for novel therapeutics, enhance drug delivery systems, and even develop enzymes for industrial chemical processes, according to Kevin Yang, a senior researcher at Microsoft.
“We envision EvoDiff expanding protein engineering capabilities beyond conventional structure-function paradigms towards programmable, sequence-first design,” Yang said in an email interview. “With EvoDiff, we’re showing that we may not need structural information, but rather that ‘protein sequence is all you need’ to systematically create new proteins.”
At the core of the EvoDiff framework is a 640-million-parameter model trained on data from many species and functional classes of proteins. Parameters are the values an AI model learns from its training data, and they largely determine how well the model performs its task, in this case generating proteins. The training data came from the OpenFold dataset for sequence alignments and from UniRef50, a subset of UniProt, the well-known database of protein sequences and functional information.
EvoDiff is a diffusion model, similar in architecture to modern image-generating models such as Stable Diffusion and DALL-E 2. It learns to gradually remove noise from a starting sequence made up almost entirely of noise, moving it step by step toward a coherent protein sequence.
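To make the step-by-step idea concrete, here is a minimal toy sketch in Python. It is not EvoDiff's code or API: the names `toy_denoiser` and `generate` are hypothetical, and the "denoiser" simply picks residues at random where a trained model would predict them from the rest of the sequence. The sketch also assumes that "noise" means masked positions, which is only one way to add noise to a sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one letter each
MASK = "#"  # marks a position that is still "noise"


def toy_denoiser(sequence, position, rng):
    """Hypothetical stand-in for a trained model.

    A real model would look at the partly filled sequence and predict a
    likely residue for `position`. This placeholder samples uniformly.
    """
    return rng.choice(AMINO_ACIDS)


def generate(length, seed=0):
    """Start from an all-noise sequence and denoise one position per step."""
    rng = random.Random(seed)
    sequence = [MASK] * length      # fully masked starting point
    order = list(range(length))
    rng.shuffle(order)              # fill positions in a random order
    for position in order:
        sequence[position] = toy_denoiser(sequence, position, rng)
    return "".join(sequence)


if __name__ == "__main__":
    print(generate(30))
```

Each pass through the loop replaces one noisy position with a residue, so after `length` steps no noise is left. In a real system, the quality of the result would depend entirely on the trained model that takes the place of `toy_denoiser`.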
EvoDiff can do more than create new proteins from scratch; it can also fill in the gaps in an existing protein design. For instance, given the part of a protein that binds to another protein, EvoDiff can generate an amino acid sequence around that part that meets a set of criteria.
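The gap-filling idea can be sketched as a toy Python example. This illustrates the general technique only, not EvoDiff's implementation: `scaffold_motif` is a hypothetical name, the motif string in the usage line is made up, and residues are sampled at random where a trained model would predict them.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # marks a position that still has to be generated


def scaffold_motif(motif, start, length, seed=0):
    """Keep `motif` fixed at index `start` and fill in every other position.

    Random sampling below is a placeholder for a model's prediction.
    """
    end = start + len(motif)
    if start < 0 or end > length:
        raise ValueError("motif does not fit in the requested length")
    rng = random.Random(seed)
    sequence = [MASK] * length
    sequence[start:end] = list(motif)  # the known region is never overwritten
    open_positions = [i for i in range(length) if not start <= i < end]
    rng.shuffle(open_positions)        # fill the gaps in a random order
    for i in open_positions:
        sequence[i] = rng.choice(AMINO_ACIDS)
    return "".join(sequence)


if __name__ == "__main__":
    # "HEHH" is an invented motif used purely for illustration.
    print(scaffold_motif("HEHH", start=5, length=20))
```

The fixed region acts as a constraint: only the positions outside it are generated, so the known fragment appears unchanged in the final sequence.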
Because EvoDiff designs proteins in "sequence space" rather than working from protein structure, it can also synthesize "disordered proteins," which do not fold into a final three-dimensional structure. Like their structured counterparts, disordered proteins play important roles in biology and disease, including influencing the activity of other proteins.
It’s important to note that the research underpinning EvoDiff has not yet undergone peer review. Sarah Alamdari, a data scientist at Microsoft involved in the project, acknowledged that there’s significant scaling work ahead before the framework can be considered for commercial use.
“This is just a 640-million-parameter model, and scaling up to billions of parameters could enhance generation quality,” Alamdari noted in an email. “While we demonstrated some initial strategies, achieving even more precise control will require conditioning EvoDiff on text or chemical information, as well as specifying desired functions.”
Looking ahead, the EvoDiff team intends to validate the proteins generated by the model through laboratory testing. If these proteins prove viable, they'll embark on developing the next generation of this innovative framework.