To create proteins with useful functions, researchers typically start with a natural protein that has a desired function, like emitting fluorescent light, and subject it to multiple rounds of random mutation to eventually produce an optimized version of the protein.
This process has successfully generated optimized versions of many important proteins, such as green fluorescent protein (GFP). However, for some proteins, generating an optimized version has been challenging. MIT scientists have now developed a computational approach that makes it easier to predict, from a small amount of data, mutations that can lead to improved proteins.
Using this model, the researchers created proteins with mutations predicted to enhance GFP and a protein from adeno-associated virus (AAV), which is used to deliver DNA for gene therapy. They believe this approach could also help develop additional tools for neuroscience research and medical applications.
“Designing proteins is complex because the mapping from DNA sequence to protein structure and function is intricate. There may be a superior protein 10 changes away in the sequence, but each intermediate change could result in a nonfunctional protein. It’s like navigating a mountain range to find a river basin, with obstacles along the way. Our work aims to simplify finding the ‘riverbed’,” explains Ila Fiete, a professor at MIT.
Fiete is a senior author of the study, along with Regina Barzilay and Tommi Jaakkola. MIT graduate students Andrew Kirjner and Jason Yim also helped develop the approach. The research will be presented at the International Conference on Learning Representations in May.
The researchers initially aimed to develop proteins for use as voltage indicators in living cells. These proteins emit fluorescent light when they detect an electric potential and, if engineered for use in mammalian cells, could allow researchers to measure neuron activity without electrodes.
Using computational modeling, the researchers sought to predict better versions of GFP. They trained a convolutional neural network (CNN) on experimental GFP data and used it to build a fitness landscape, a map from protein sequence to predicted brightness, which they could then search for brighter variants.
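To make the idea of searching a fitness landscape concrete, the following is a minimal toy sketch, not the researchers' method. The scoring function `toy_fitness`, the `HIDDEN_OPTIMUM` sequence, and the greedy `hill_climb` search are all hypothetical stand-ins invented for illustration; in the study, the fitness estimate comes from a CNN trained on experimental brightness data.

```python
# Toy illustration only: toy_fitness is a made-up stand-in for a trained
# model's brightness prediction, not the researchers' CNN.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

HIDDEN_OPTIMUM = "ACDEFGHIKL"  # hypothetical sequence that scores highest


def toy_fitness(seq: str) -> int:
    """Score a sequence by how many positions match the hidden optimum."""
    return sum(a == b for a, b in zip(seq, HIDDEN_OPTIMUM))


def single_mutants(seq: str):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield seq[:i] + aa + seq[i + 1:]


def hill_climb(start: str, max_steps: int = 20) -> str:
    """Greedily accept the best single mutation until none improves the score."""
    current = start
    for _ in range(max_steps):
        best = max(single_mutants(current), key=toy_fitness)
        if toy_fitness(best) <= toy_fitness(current):
            break  # local peak: no single change helps
        current = best
    return current


start = "ACAEFGYIKA"  # differs from the hidden optimum at three positions
result = hill_climb(start)
print(start, toy_fitness(start))
print(result, toy_fitness(result))
```

On this deliberately smooth toy landscape, every beneficial mutation helps on its own, so the greedy walk reaches the peak. The difficulty Fiete describes arises on real landscapes, where the intermediate mutants on the way to a better protein may score poorly or be nonfunctional, and a greedy walk like this one would stall.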
Using this method, the researchers identified optimized GFP sequences that differ from the original sequence by up to seven amino acids and are estimated to be 2.5 times fitter than the original.
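The "number of different amino acids" between two sequences of equal length is the Hamming distance. A minimal sketch of how it can be counted follows; the function name `count_substitutions` and the two short peptides are hypothetical examples, not sequences from the study.

```python
def count_substitutions(original: str, variant: str) -> int:
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    if len(original) != len(variant):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(original, variant))


# Made-up ten-residue peptides for illustration only.
original = "ACDEFGHIKL"
variant = "ACDQFGHLKV"
print(count_substitutions(original, variant))  # prints 3
```

This count only covers substitutions; comparing sequences with insertions or deletions would require an alignment step first.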
The approach also identified new sequences for the viral capsid of adeno-associated virus (AAV), in that case optimizing the capsid for its ability to package DNA. The researchers now plan to apply the technique to data on voltage indicator proteins.
The research was supported by funders including the U.S. National Science Foundation, DARPA, and the NIH.