Rapid in silico directed evolution by a protein language model with EVOLVEpro 文献解读

标签：directed evolution Rapid labels iteration iter train test model

文章背景

蛋白质是生命活动的基石，其功能和序列之间的复杂关系长期以来吸引着科学家们的关注。尽管深度突变扫描等实验方法可以解析蛋白质突变的功能影响，但这些技术的应用范围局限于序列空间的一小部分。近年来，基于蛋白质语言模型（PLM）的计算方法如ESM2模型取得了一些突破。然而，这些模型在零样本预测中往往无法显著提高蛋白质活性。为了解决这一问题，作者提出了EVOLVEpro，一个结合PLM和回归模型的少样本主动学习框架，用于蛋白质的快速优化。

实验方法

EVOLVEpro通过以下策略实现高效的蛋白质定向进化：

模型架构
EVOLVEpro的核心包括：
- 使用ESM2模型将蛋白质序列嵌入到高维潜在空间中。
- 构建一个随机森林回归器以学习嵌入向量与功能活性之间的关系。
- 在迭代回合中通过少量突变的实验数据优化模型。
主动学习流程
每轮优化：
- 模型根据预测的活性值对未测试突变排序。
- 选择高活性的突变进行实验验证。
- 将实验结果反馈至模型，进一步改进预测性能。
优化与验证
作者基于12个不同的深度突变扫描数据集优化EVOLVEpro的参数，并验证了其在抗体优化、CRISPR核酸酶改良、RNA聚合酶进化等多种任务中的性能。

结果与讨论

性能评估
- EVOLVEpro在低样本设置中取得了显著的优化效果，相较于传统方法提高了最多100倍的功能表现。
- 模型在抗体优化、基因组编辑工具和RNA生产酶的开发中展示了卓越的多目标优化能力。
抗体优化
- 对SARS-CoV-2刺突蛋白抗体进行多目标优化，包括结合亲和力和表达水平。
- 通过迭代优化，产生了结合力提高10倍的突变体，同时兼顾了可开发性。
CRISPR核酸酶优化
- 成功优化了体积更小但活性较低的Cas12f核酸酶，突变体的靶点编辑效率提高了44倍。
RNA聚合酶进化
- 优化后的T7 RNA聚合酶生成的RNA在免疫原性降低515倍的同时，其翻译效率提升57倍。
其他应用
- 包括Bxb1整合酶的活性提升以及Prime编辑器的长片段插入能力优化。

总结与展望

EVOLVEpro展示了人工智能在蛋白质工程中的巨大潜力，其模块化设计可适配多种蛋白质特性优化任务。未来的发展方向可能包括：

结合生成式PLM进行端到端的设计与优化。
将物理化学模型整合到现有框架中，以进一步提高预测精度。
扩展到更多复杂多目标任务，推动生物技术的前沿探索。

模型详解

根据蛋白质序列对每个位点进行突变，形成突变文库

对于WT序列长度为72，那么突变体文库的大小为72 $\times$19 = 1368 ，加上WT一共有1369条序列

from evolvepro.src.process import generate_wt, generate_single_aa_mutants
generate_wt('MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR', output_file='output/kelsic_WT.fasta')
generate_single_aa_mutants('output/kelsic_WT.fasta', output_file='output/kelsic.fasta')

对突变体文库中的每一条序列使用PLM进行embedding

python evolvepro/plm/esm/extract.py esm1b_t33_650M_UR50S /output/kelsic.fasta /output/kelsic_esm1b_t33_650M_UR50S --toks_per_batch 512 --include mean --concatenate_dir /output

从突变体文库中随机选择num_mutants条序列作为第一轮定向进化的序列，如下表Variant列所示

通过实验测定这些突变体的酶活

from evolvepro.src.process import suggest_initial_mutants
suggest_initial_mutants('/content/output/kelsic.fasta', num_mutants=12, random_seed=42)

Variant	activity
3F	0.905
4M	1.008
4Y	1.017
7A	0.82275
10P	0.9205
23K	0.989
23N	0.903
31C	1.02
36D	0.2915
38E	0.507
58E	0.909
61W	0.908

使用实验测量的数据构建模型

通过构建的模型对突变体文库中未进行实验测量的序列进行预测

选取预测结果最优的number_of_variants条序列进行下一轮进化（少样本学习）

通过实验测定第i轮选取的候选序列的酶活，并将其补充到前i-1轮的实验测量数据集中（可以根据情况只选择其中k轮），这个过程称为主动学习（active learning）

from evolvepro.src.evolve import evolve_experimental

protein_name = 'kelsic'
embeddings_base_path = 'output'
embeddings_file_name = 'kelsic_esm1b_t33_650M_UR50S.csv'
round_base_path = 'colab/rounds_data'
wt_fasta_path = "output/kelsic_WT.fasta"
number_of_variants = 12
output_dir = 'output'
rename_WT = False

round_name = 'Round1'
round_file_names = ['kelsic_Round1.xlsx']

this_round_variants, df_test, df_sorted_all = evolve_experimental(
    protein_name,
    round_name,
    embeddings_base_path,
    embeddings_file_name,
    round_base_path,
    round_file_names,
    wt_fasta_path,
    rename_WT,
    number_of_variants,
    output_dir
)


round_name = 'Round2'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx']

this_round_variants, df_test, df_sorted_all = evolve_experimental(
    protein_name,
    round_name,
    embeddings_base_path,
    embeddings_file_name,
    round_base_path,
    round_file_names,
    wt_fasta_path,
    rename_WT,
    number_of_variants,
    output_dir
)
.......

对具体的预测模型进行解析

具体来说就是具有通过实验测量数据的序列进行建模，特征为序列的embedding，target为酶活。随后模型对未进行实验测量的序列进行预测。模型可以选择机器学习回归模型（随机森林回归等）。

def top_layer(iter_train, iter_test, embeddings_pd, labels_pd, measured_var, regression_type='randomforest', top_n=None, final_round=10, experimental=False):
    # save column 'iteration' in the labels dataframe
    iteration = labels_pd['iteration']

    # save labels
    labels = labels_pd

    # save mean embeddings as numpy array
    a = embeddings_pd

    # subset a, y to only include the rows where iteration = iter_train and iter_test
    idx_train = iteration[iteration.isin(iter_train)].index.to_numpy()
    if iter_test is not None:
        idx_test = iteration[iteration == iter_test].index.to_numpy()
    else:
        idx_test = iteration[iteration.isna()].index.to_numpy()

    # subset a to only include the rows where iteration = iter_train and iter_test
    X_train = a.loc[idx_train, :]
    X_test = a.loc[idx_test, :]
    
    y_train = labels[iteration.isin(iter_train)][measured_var]
    y_train_activity_scaled = labels[iteration.isin(iter_train)]['activity_scaled']
    y_train_activity_binary = labels[iteration.isin(iter_train)]['activity_binary']

    if iter_test is not None:
        y_test = labels[iteration.isin([iter_test])][measured_var]
        print(y_test.shape)
        y_test_activity_scaled = labels[iteration.isin([iter_test])]['activity_scaled']
        y_test_activity_binary = labels[iteration.isin([iter_test])]['activity_binary']
    else:
        y_test = labels[iteration.isna()][measured_var]
        print(y_test.shape)
        y_test_activity_scaled = labels[iteration.isna()]['activity_scaled']
        y_test_activity_binary = labels[iteration.isna()]['activity_binary']        


    # fit
    if regression_type == 'ridge':
        model = linear_model.RidgeCV()
    elif regression_type == 'lasso':
        model = linear_model.LassoCV(max_iter=100000,tol=1e-3)
    elif regression_type == 'elasticnet':
        model = linear_model.ElasticNetCV(max_iter=100000,tol=1e-3)
    elif regression_type == 'linear':
        model = linear_model.LinearRegression()
    elif regression_type == 'neuralnet':
        model = MLPRegressor(hidden_layer_sizes=(5), max_iter=1000, activation='relu', solver='adam', alpha=0.001,
                             batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5,
                             momentum=0.9, nesterovs_momentum=True, shuffle=True, random_state=1, tol=0.0001,
                             verbose=False, warm_start=False, early_stopping=False, validation_fraction=0.1, beta_1=0.9,
                             beta_2=0.999, epsilon=1e-08)
    elif regression_type == 'randomforest':
        model = RandomForestRegressor(n_estimators=100, criterion='friedman_mse', max_depth=None, min_samples_split=2,
                                      min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0,
                                      max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False,
                                      n_jobs=None, random_state=1, verbose=0, warm_start=False, ccp_alpha=0.0,
                                      max_samples=None)
    elif regression_type == 'gradientboosting':
        model = xgboost.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                                     max_depth=5, alpha=10, n_estimators=10)
    elif regression_type == 'knn':
        model = KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                                    metric='minkowski', metric_params=None, n_jobs=None)
    elif regression_type == 'gp':
        model = GaussianProcessRegressor(kernel=None, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0,
                                        normalize_y=False, copy_X_train=True, random_state=None)

    model.fit(X_train, y_train)

    # make predictions on train data
    y_pred_train = model.predict(X_train)

标签：directed,evolution,Rapid,labels,iteration,iter,train,test,model
From： https://www.cnblogs.com/crazypigf/p/18568388