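Is it possible to restrict a scikit-learn model to predicting only certain labels?

Tags: python scikit-learn

I have two models, each trained on a different set of labels, and I use them to predict a game's genre. I have noticed that, because of how the models were trained, the same input data can sometimes make the two models output very different genres.

I would like to restrict one model's predictions to what the other model suggests, but I don't know how. Example below: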

Model1_labels = ["JRPG", "Horror", "FPS", "Platformer"]
Model2_Labels = ["Mario", "War_shooter", "fantasy_rpg"]

training_data = Label1      Label2            Tags
               JRPG        fantasy_rpg       open_world, action, level-up, fantasy
               JRPG        fantasy_rpg       level-up, turn-based, fantasy
               FPS         War_shooter       open-world, 1st person, tanks, planes, shooter
               FPS         War_shooter       1st person, war, shooter, level-up
               JRPG        Mario             level-up, turn-based, shooter
               ...

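In the example, War_shooter can only be FPS, because a war shooter is by definition an FPS game set during a war.

But how do I restrict it?

Here is the code showing how I train and predict:

from typing import Any, Dict, Final

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

SDG_PARAMS_DICT: Final[Dict[str, Any]] = dict(alpha=1e-5, penalty="l2", max_iter=1000, loss="log_loss")
VECTORIZER_PARAMS_DICT: Final[Dict[str, Any]] = dict(ngram_range=(1, 4), min_df=5, max_df=0.8)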

def build_model(x_data, y_data) -> Pipeline:
    game_predict_pipeline = Pipeline(
        [
            # Unpack the dict so its entries are passed as keyword arguments
            ("vect", CountVectorizer(**VECTORIZER_PARAMS_DICT)),
            ("tfidf", TfidfTransformer()),
            ("clf", SelfTrainingClassifier(SGDClassifier(**SDG_PARAMS_DICT), verbose=True)),
        ]
    )
    X_train, X_test, y_train, y_test = train_test_split(x_data,
                                                        y_data,
                                                        train_size=0.3)
    game_predict_pipeline.fit(X_train, y_train)

    return game_predict_pipeline

game_data = pd.read_excel("c:/my_game_data.xlsx", keep_default_na=False)

model1 = build_model(game_data["Tag"], game_data["Label1"])
model2 = build_model(game_data["Tag"], game_data["Label2"])

test_tags = "level-up, open-world, shooter"

# predict expects an iterable of documents, so wrap the single string in a list
model1.predict([test_tags])
model2.predict([test_tags])

Result

model1 - correct
   FPS
model2 - incorrect
   Mario

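I considered running predict_proba and removing labels from the list, but this does not change the predicted probabilities, so many scores fail to reach the intended cutoff.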

comparison_dict: Dict = {"FPS":["War_shooter"]}


model1_prediction = model1.predict([test_tags])[0]
prediction2: np.ndarray = model2.predict_proba([test_tags])
classes: np.ndarray = model2.classes_
prediction_dict: Dict = {}
for idx, model_cls in enumerate(classes):
    # Fall back to an empty list when model 1's label has no rule
    if model_cls in comparison_dict.get(model1_prediction, []):
        if prediction2[0][idx] >= 0.6:  # cutoff
            prediction_dict[model_cls] = prediction2[0][idx]

Result

output predict_proba
    Mario      : 0.43
    War_shooter: 0.42
    fantasy_rpg: 0.15

output
    None

output without cutoff
    "War_shooter": 0.42

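The problem you are running into is that the two models operate independently and cannot use each other's predictions to refine their results. There are several ways to restrict a model's predictions and combine what the two models know:

1. Rule-based post-processing: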

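This is a variation on the solution you proposed. In the post-processing step, instead of using an arbitrary cutoff, implement explicit rules that filter one model's predictions based on the other model's prediction.
For example, create a rule dictionary that maps each Model 1 label to the Model 2 labels it permits, then modify the prediction logic as follows:

```python
restriction_rules = {
    "FPS": ["War_shooter"],
    "JRPG": ["fantasy_rpg"],
}

# predict expects an iterable of documents, so wrap the single string in a list
model1_prediction = model1.predict([test_tags])[0]
model2_prediction = model2.predict([test_tags])[0]

allowed_predictions = restriction_rules.get(model1_prediction, [])

if model2_prediction in allowed_predictions:
    final_prediction = model2_prediction
else:
    # Use a default value, ask the user for input, or skip the prediction
    final_prediction = "Unknown"
print(final_prediction)
```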
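Rather than writing the rule dictionary by hand, it could also be derived from the label pairs that actually occur in the training data. The following is a minimal sketch under that assumption; `build_restriction_rules` and `label_pairs` are hypothetical names, and the rows simply mirror the Label1/Label2 columns of the table in the question.

```python
from collections import defaultdict

# Hypothetical rows mirroring the (Label1, Label2) columns of the question's table.
label_pairs = [
    ("JRPG", "fantasy_rpg"),
    ("JRPG", "fantasy_rpg"),
    ("FPS", "War_shooter"),
    ("FPS", "War_shooter"),
    ("JRPG", "Mario"),
]


def build_restriction_rules(pairs):
    """Map each Label1 value to the sorted list of Label2 values seen with it."""
    seen = defaultdict(set)
    for label1, label2 in pairs:
        seen[label1].add(label2)
    return {label1: sorted(label2s) for label1, label2s in seen.items()}


restriction_rules = build_restriction_rules(label_pairs)
print(restriction_rules)
# {'JRPG': ['Mario', 'fantasy_rpg'], 'FPS': ['War_shooter']}
```

With the DataFrame from the question, `zip(game_data["Label1"], game_data["Label2"])` could be passed in place of `label_pairs`. Note that any mislabeled row becomes an allowed combination, so it may be worth reviewing the generated rules before relying on them.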

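2. Stacked classifiers:

In this approach, you train a third model (a "meta-classifier") that takes the predictions of the first two models as input.
The meta-classifier is trained to output the final prediction based on those initial predictions.
This approach can learn more complex rules than simple filtering.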

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Get predictions from the first two models
model1_predictions = model1.predict(game_data["Tag"])
model2_predictions = model2.predict(game_data["Tag"])

# Combine predictions into a new feature matrix
combined_predictions = np.column_stack((model1_predictions, model2_predictions))

# Train the meta-classifier; the predicted labels are strings, so one-hot encode them first
meta_classifier = make_pipeline(OneHotEncoder(handle_unknown="ignore"), RandomForestClassifier())
meta_classifier.fit(combined_predictions, game_data["Label2"])

# For prediction (predict expects an iterable of documents):
test_prediction1 = model1.predict([test_tags])
test_prediction2 = model2.predict([test_tags])
combined_test_prediction = np.column_stack((test_prediction1, test_prediction2))
final_prediction = meta_classifier.predict(combined_test_prediction)
```
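One caveat with stacking: if the meta-classifier is trained on predictions that the base models made for their own training rows, its inputs look more accurate than they will be on new data. A common remedy is to train it on out-of-fold predictions instead. The sketch below illustrates this with `cross_val_predict`; the toy data is invented for illustration, and `make_base_model` is a hypothetical stand-in (TF-IDF plus logistic regression) rather than the pipeline from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only (not the asker's data set).
tags = [
    "open-world action level-up fantasy",
    "level-up turn-based fantasy",
    "fantasy turn-based party magic",
    "level-up fantasy magic quest",
    "open-world 1st-person tanks planes shooter",
    "1st-person war shooter level-up",
    "war shooter tanks multiplayer",
    "1st-person shooter planes war",
]
label1 = ["JRPG"] * 4 + ["FPS"] * 4
label2 = ["fantasy_rpg"] * 4 + ["War_shooter"] * 4


def make_base_model():
    # Stand-in for the question's pipeline; any text classifier would do here.
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof1 = cross_val_predict(make_base_model(), tags, label1, cv=2)
oof2 = cross_val_predict(make_base_model(), tags, label2, cv=2)

# The meta-classifier learns from the out-of-fold predictions.
meta_classifier = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
meta_classifier.fit(np.column_stack((oof1, oof2)), label2)

# For new inputs, the base models are refit on all of the training data.
model1 = make_base_model().fit(tags, label1)
model2 = make_base_model().fit(tags, label2)

new_tags = ["level-up open-world shooter"]
stacked = np.column_stack((model1.predict(new_tags), model2.predict(new_tags)))
final_prediction = meta_classifier.predict(stacked)
print(final_prediction)
```

The number of folds (`cv=2`) is only chosen to suit the tiny toy data; with a real data set a larger value would be more usual.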

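3. Structured prediction:

If there are inherent dependencies between your labels (as in your example), you can explore structured classification methods such as conditional random fields (CRFs).
A CRF can take the dependencies between different labels into account and produce more consistent predictions.
However, CRFs are more complex to set up and tune than simple classifiers.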

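4. Adjusting the prediction probability threshold:

Instead of removing labels entirely, you can apply a probability threshold to the output of Model 2's predict_proba method.
This lets you control how confident the model must be before a label is assigned.

    threshold = 0.5
    # predict_proba expects an iterable of documents, so wrap the single string in a list
    probabilities = model2.predict_proba([test_tags])[0]
    filtered_predictions = [model2.classes_[i] for i, prob in enumerate(probabilities) if prob >= threshold]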
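The question notes that removing labels leaves the remaining probabilities unchanged, so they rarely reach the cutoff. One possible way around this is to rescale the probabilities of the allowed labels so that they sum to 1 before comparing them with a threshold. The sketch below assumes that approach; `restrict_probabilities` and `allowed_by_label1` are hypothetical names, and the numbers are the predict_proba output shown in the question.

```python
def restrict_probabilities(classes, probabilities, allowed):
    """Keep only the allowed classes and rescale their probabilities to sum to 1."""
    kept = {c: float(p) for c, p in zip(classes, probabilities) if c in allowed}
    total = sum(kept.values())
    if total == 0.0:
        return {}  # nothing allowed, or every allowed class has zero probability
    return {c: p / total for c, p in kept.items()}


# Values taken from the predict_proba output shown in the question.
classes = ["Mario", "War_shooter", "fantasy_rpg"]
probabilities = [0.43, 0.42, 0.15]

# Hypothetical rule table: Model 1 label -> Model 2 labels it permits.
allowed_by_label1 = {"FPS": ["War_shooter"], "JRPG": ["fantasy_rpg", "Mario"]}

restricted = restrict_probabilities(classes, probabilities, allowed_by_label1["FPS"])
best_label = max(restricted, key=restricted.get) if restricted else None
print(best_label, restricted)
```

With a fitted model, `model2.classes_` and `model2.predict_proba([test_tags])[0]` could be passed in place of the hard-coded lists. Keep in mind that the rescaled values only rank the allowed labels against each other: when a single label is allowed it always ends up at 1.0, so a cutoff applied afterwards says little about the model's absolute confidence.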

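Which approach to choose depends on the characteristics of your data and how much complexity you need. Rule-based post-processing is the easiest to implement, while stacked classifiers and structured prediction offer more flexibility but also require more data and tuning.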

From： 78786374
