Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC
Patrick Emami, Aidan Perreault, Jeff Law, David Biagioni, Peter C. St. John. Machine Learning: Science and Technology, 2023. Presented at NeurIPS’22 Workshop on Machine Learning in Structural Biology.
[DOI] [arxiv] [Paper code] [EvoProtGrad library]
Announcing EvoProtGrad
7/21/23: We have released a Python package accompanying this work, with HuggingFace integration to support a wide range of protein language models. Check it out on GitHub.
pip install evo_prot_grad
Quick demo
Evolve the avGFP protein with Rostlab/prot_bert.
import evo_prot_grad
prot_bert_expert = evo_prot_grad.get_expert('bert', temperature = 1.0)
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',    # path to the wild-type FASTA file
    output = 'best',                # return 'best', 'last', or 'all' variants
    experts = [prot_bert_expert],   # list of experts to compose
    parallel_chains = 1,            # number of parallel chains to run
    n_steps = 20,                   # number of MCMC steps per chain
    max_mutations = 10,             # maximum number of mutations per variant
    verbose = True                  # print debug info to the command line
)()
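The call returns the sampled variants and their scores. A quick follow-up to inspect them (assuming variants is a list of sequence strings and scores holds one score per variant; this is my reading of the demo above rather than a documented guarantee):

# Print each returned variant alongside its expert score (higher is better).
for variant, score in zip(variants, scores):
    print(f"score = {score:.3f}")
    print(variant)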
Summary
Machine-learning-based directed evolution can be used to engineer new proteins that have improved productivity or that catalyze new reactions. Starting from a known protein sequence, directed evolution proceeds by identifying candidate mutations, experimentally verifying them, and then using the information gained to search for better mutations. This iterative process is intended to resemble natural evolution. One way ML can help streamline this design loop is by optimizing the candidate identification step.
In this paper, we introduce a plug-and-play protein sampler for this step. The sampler mixes and matches unsupervised evolutionary sequence models (e.g., protein language models) with supervised sequence models that map sequence to function, without requiring any model re-training or fine-tuning. Sampling is performed in sequence space to keep things as "black-box" as possible. Plug-and-play samplers have been successful in other domains, such as controlled image and text generation.
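Concretely, composing experts means multiplying their (unnormalized) probabilities, which amounts to summing their weighted scores. Here is a minimal sketch of the composed score; the expert callables and weights are placeholders for illustration, not the library's API:

def product_of_experts_score(x_onehot, experts, weights):
    """Unnormalized log-probability of a one-hot sequence under a product of experts.

    experts: list of callables, each mapping the one-hot sequence tensor to a
             differentiable scalar score, e.g., a protein language model
             pseudo-log-likelihood or a supervised fitness prediction.
    weights: per-expert weights (temperatures) controlling their influence.
    """
    # Because the experts multiply, their log-scores add; gradients with respect
    # to x_onehot therefore favor mutations that all experts agree on.
    return sum(w * expert(x_onehot) for w, expert in zip(weights, experts))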
Searching efficiently through protein sequence space, which is high-dimensional and discrete, is notoriously difficult. Our MCMC-based sampler addresses this by using the gradient of the (differentiable) target function to bias a mutation proposal distribution towards favorable mutations. We show that this approach outperforms classic random search.
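To make this concrete, here is a minimal sketch (in PyTorch) of one gradient-guided mutation step in the spirit of the sampler: a first-order Taylor approximation of the score change for every single-site substitution defines the proposal, followed by a Metropolis-Hastings correction. This is an illustration, not the paper's exact implementation; expert_score is a placeholder for any differentiable score of a one-hot sequence, such as the product-of-experts score sketched above.

import torch
import torch.nn.functional as F

def gradient_guided_mutation_step(x, expert_score, proposal_temp=1.0):
    """One gradient-guided single-site mutation with an MH correction (sketch).

    x:            one-hot sequence tensor of shape (L, V).
    expert_score: callable mapping a one-hot tensor to a differentiable scalar
                  score (higher is better), e.g., a product-of-experts score.
    """
    x = x.detach().requires_grad_(True)
    score_x = expert_score(x)
    (grad,) = torch.autograd.grad(score_x, x)

    # First-order estimate of the score change for every possible substitution:
    # delta[i, v] ~= score(x with position i mutated to v) - score(x).
    delta = grad - (grad * x.detach()).sum(dim=-1, keepdim=True)

    # Forward proposal: sample a (position, amino acid) pair from a softmax
    # over the estimated score gains.
    q_fwd = F.softmax(delta.flatten() / proposal_temp, dim=0)
    idx = int(torch.multinomial(q_fwd, 1))
    i, v = divmod(idx, x.shape[-1])

    x_new = x.detach().clone()
    x_new[i] = 0.0
    x_new[i, v] = 1.0

    # Reverse proposal probability (needed for the MH acceptance ratio).
    x_new = x_new.requires_grad_(True)
    score_new = expert_score(x_new)
    (grad_new,) = torch.autograd.grad(score_new, x_new)
    delta_new = grad_new - (grad_new * x_new.detach()).sum(dim=-1, keepdim=True)
    q_rev = F.softmax(delta_new.flatten() / proposal_temp, dim=0)
    rev_idx = i * x.shape[-1] + int(x.detach()[i].argmax())

    # Accept or reject the proposed mutation.
    log_alpha = (score_new - score_x) + torch.log(q_rev[rev_idx]) - torch.log(q_fwd[idx])
    if torch.rand(()) < torch.exp(log_alpha).clamp(max=1.0):
        return x_new.detach(), float(score_new)
    return x.detach(), float(score_x)

Because the gradient is evaluated once per step rather than once per candidate mutation, all L × V substitutions are scored for roughly the cost of a single backward pass, which is what makes this proposal so much more sample-efficient than uniform random search.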
Impact: Improving the quality of candidate protein variants reduces the time spent conducting costly experiments in the wet lab and accelerates the engineering of new proteins.
Checking intuitions with a toy binary MNIST experiment
The task is to evolve the MNIST digit on the right, initially a 1, to maximize the sum of the two digits solely by flipping binary pixels. Since the digit on the left is a 9, the initial sum of the digits is 10. The maximum is 18 (9 + 9).
Here’s our sampler applied to a product-of-experts distribution that combines an unsupervised energy-based model (EBM) with a supervised ensemble of ConvNets trained to regress the sum.
If we remove the EBM from the product of experts, the search gets trapped in adversarial optima—white-noise images. This is because, in this case, gradient-based discrete MCMC is only using the gradient of the ConvNet ensemble to steer search. The EBM acts as a soft constraint that keeps search away from such optima.
If we replace gradient-based discrete MCMC with random-search-based simulated annealing, nearly every randomly sampled mutation is rejected because it fails to increase the predicted sum.
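For contrast, here is a minimal sketch of such a random-search baseline: uniform single-site substitutions with a simulated-annealing acceptance rule. The names and the flat amino-acid alphabet are illustrative, not the paper's implementation:

import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simulated_annealing_step(seq, score_fn, current_score, temperature):
    """Propose a uniformly random single-site substitution and accept/reject it.

    seq:         current protein sequence as a string.
    score_fn:    black-box callable returning a scalar score (higher is better).
    temperature: annealing temperature, typically decreased over the run.
    """
    i = random.randrange(len(seq))
    candidate = seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]
    cand_score = score_fn(candidate)
    # Always accept improvements; accept worse moves with Boltzmann probability.
    # Without gradient guidance, most random substitutions lower the score and
    # are rejected, which is why this baseline explores so slowly.
    accept = (cand_score >= current_score or
              random.random() < math.exp((cand_score - current_score) / temperature))
    return (candidate, cand_score) if accept else (seq, current_score)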
In-silico directed evolution experiments
We conduct in-silico directed evolution experiments on three proteins with partially characterized, wide fitness landscapes (between 2 and 15 mutations from the wild type): PABP_YEAST, UBE4B_MOUSE, and GFP.
Compared to random-walk-based black-box samplers (simulated annealing, Metropolis-adjusted Langevin Algorithm (MALA)), random sampling, and an evolutionary method (CMA-ES), our sampler explores sequence space most efficiently (higher $\log p(x)$ is better).
We show that we can plug in different protein language models: ESM2-35M, ESM2-150M, and ESM2-650M. We also combine ESM2-150M with a Potts model, which surprisingly achieved the best performance (red arrow).
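With the EvoProtGrad library, swapping or composing experts only changes the experts list passed to DirectedEvolution. A rough sketch of composing two experts follows; the expert names are assumptions rather than verified identifiers, and a Potts expert will typically need a pretrained model, so check the library docs for the exact get_expert names and arguments:

import evo_prot_grad

# Hypothetical expert names for illustration; consult the EvoProtGrad docs
# for the actual expert registry and any required model files.
esm_expert = evo_prot_grad.get_expert('esm', temperature = 1.0)
potts_expert = evo_prot_grad.get_expert('potts', temperature = 1.0)

variants, scores = evo_prot_grad.DirectedEvolution(
    wt_fasta = 'test/gfp.fasta',
    output = 'best',
    experts = [esm_expert, potts_expert],   # composed as a product of experts
    parallel_chains = 1,
    n_steps = 20,
    max_mutations = 10,
    verbose = True
)()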
See the paper and code for more details and results.