From Pandas to Polars, Part 14: Fitting Linear Models in Polars


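Linear models are a cornerstone of data science and machine learning: they are easy to interpret and fast to fit. With the polars-ols plugin you can now fit linear models, including Lasso and Ridge regression, directly in Polars.
This work was done by Azmy Rajab in the plugin's GitHub repository.
If you want to follow along, the first step is to install the plugin with the following command.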

pip install polars-ols patsy

The patsy package is used to parse formulas, although we do not use formulas below.

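Registering the namespace
We use the polars-ols package to fit the models. Polars-ols is a Polars plugin. When we import a plugin, it registers its namespace with Polars. A namespace is a set of expressions grouped under a common name, and it works like the built-in namespaces such as dt for time-series expressions or str for string expressions.
We start by importing Polars and the plugin.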

import polars as pl
import polars_ols as pls

Once the plugin is imported, we can access the expressions in its namespace, which in this case is called least_squares.

Fitting a linear model
We create a DataFrame with a target column y, which we will regress on two predictor columns, x1 and x2.

df = pl.DataFrame(
    {
        "y": [1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47],
        "x1": [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27],
        "x2": [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75],
    }
)
df.head()
shape: (5, 3)
┌───────┬───────┬───────┐
│ y     ┆ x1    ┆ x2    │
│ ---   ┆ ---   ┆ ---   │
│ f64   ┆ f64   ┆ f64   │
╞═══════╪═══════╪═══════╡
│ 1.16  ┆ 0.72  ┆ 0.24  │
│ -2.16 ┆ -2.43 ┆ 0.18  │
│ -1.57 ┆ -0.63 ┆ -0.95 │
│ 0.21  ┆ 0.05  ┆ 0.23  │
│ 0.22  ┆ -0.07 ┆ 0.44  │
└───────┴───────┴───────┘

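We start by fitting an ordinary least squares model (that is, plain linear regression). We specify:
1. The target column, pl.col("y")
2. The ordinary least squares model, via the least_squares.ols expression
3. The predictors, as a list of expressions inside least_squares.ols
4. The name of the prediction output column, set with alias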

ols_expr = (
  pl.col("y")
  .least_squares.ols(
      pl.col("x1"), 
      pl.col("x2")
  )
  .alias("ols")
)

We can then add a column of predictions by passing the expression to with_columns.

(
  df
  .with_columns(
      ols_expr
  )
)       
shape: (10, 4)
┌───────┬───────┬───────┬───────────┐
│ y     ┆ x1    ┆ x2    ┆ ols       │
│ ---   ┆ ---   ┆ ---   ┆ ---       │
│ f64   ┆ f64   ┆ f64   ┆ f32       │
╞═══════╪═══════╪═══════╪═══════════╡
│ 1.16  ┆ 0.72  ┆ 0.24  ┆ 0.940459  │
│ -2.16 ┆ -2.43 ┆ 0.18  ┆ -2.196536 │
│ -1.57 ┆ -0.63 ┆ -0.95 ┆ -1.55357  │
│ 0.21  ┆ 0.05  ┆ 0.23  ┆ 0.275953  │
│ 0.22  ┆ -0.07 ┆ 0.44  ┆ 0.366057  │
│ 1.6   ┆ 0.65  ┆ 1.01  ┆ 1.632354  │
│ -2.11 ┆ -0.02 ┆ -2.08 ┆ -2.07331  │
│ -2.92 ┆ -1.64 ┆ -1.36 ┆ -2.945234 │
│ -0.86 ┆ -0.92 ┆ 0.01  ┆ -0.889025 │
│ 0.47  ┆ -0.27 ┆ 0.75  ┆ 0.476734  │
└───────┴───────┴───────┴───────────┘

Voilà! The ols column now holds the predictions of the linear model.
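As a sanity check on what the expression computes, the same fit can be sketched with NumPy's least-squares solver. This sketch assumes that the plugin's default omits the intercept, which is consistent with the output above but is not stated in the post; NumPy is used here only for illustration and is not how polars-ols works internally.

```python
import numpy as np

# Same data as the DataFrame above
y = np.array([1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47])
x1 = np.array([0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27])
x2 = np.array([0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75])

# Design matrix without a column of ones, i.e. no intercept (assumption)
X = np.column_stack([x1, x2])

# Solve min ||y - X b||^2 for the coefficient vector b
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# In-sample predictions
pred = X @ coef
print(coef)
print(pred[:3])
```

If the no-intercept assumption holds, the first predictions should land close to the first values of the ols column above.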

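Fitting regularized models
The library also supports regularized models such as Lasso and Ridge regression. We can fit a Lasso model by using least_squares.lasso instead of least_squares.ols. We also need to specify the alpha parameter, which sets the strength of the regularization.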

lasso_expr = (
    pl.col("y")
    .least_squares.lasso(
        pl.col("x1"), 
        pl.col("x2"), 
        alpha=0.0001, 
        add_intercept=True
    )
)

Note: I compared the results of the Elastic Net model with those from Scikit-learn and they are very close.
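To make the role of alpha concrete, here is a rough sketch of Lasso fitted by coordinate descent in NumPy. It assumes the objective (1 / (2n)) * ||y - Xb||^2 + alpha * ||b||_1 with no intercept, which is the convention Scikit-learn documents; whether polars-ols scales alpha in exactly the same way is an assumption, and the `lasso_cd` helper is illustrative rather than the plugin's algorithm.

```python
import numpy as np

y = np.array([1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47])
X = np.column_stack([
    [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27],
    [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75],
])


def soft_threshold(z, t):
    # Shrink z towards zero by t, clipping at zero
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, alpha, n_iter=200):
    # Hypothetical helper: cyclic coordinate descent for the Lasso objective
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta


for alpha in (0.0, 0.1, 10.0):
    print(alpha, lasso_cd(X, y, alpha))
```

With alpha set to zero the result coincides with ordinary least squares, and a large enough alpha drives every coefficient to exactly zero, which is the shrinkage behaviour the alpha parameter controls.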

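Training a model for use on a test set
In the examples above we made predictions on the same data we used to fit the model. In practice we usually want to fit the model on a training set and then apply it to new data. This is possible with polars-ols because it can output the regression coefficients instead of the predictions.
Going back to our original ordinary least squares model, we get the coefficients by setting mode="coefficients" in the ols method.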

ols_coef_expr = (
    pl.col("y")
    .least_squares.ols(
        pl.col("x1"), 
        pl.col("x2"), 
        add_intercept=True,
        mode="coefficients"
    )
    .alias("ols_intercept")
)

The coefficients are returned as a pl.Struct column, which we unnest to get each value as a separate column.

(
    df
    .select(
        ols_coef_expr
    )
    .unnest("ols_intercept")
)
shape: (1, 3)
┌──────────┬──────────┬──────────┐
│ x1       ┆ x2       ┆ const    │
│ ---      ┆ ---      ┆ ---      │
│ f32      ┆ f32      ┆ f32      │
╞══════════╪══════════╪══════════╡
│ 0.977375 ┆ 0.987413 ┆ 0.000757 │
└──────────┴──────────┴──────────┘

Now that we have the coefficients, we can use them to predict the target variable on a new dataset.
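To make this workflow smoother, we can follow the classic Scikit-learn approach of an object with fit and transform methods (you could equally call the latter predict). We do this by creating a linear regression class with these methods.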

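from typing import Sequence


class LinearRegressor:
    def __init__(
        self,
        target_col: str = "y",
        feature_cols: Sequence[str] = ("x1", "x2"),  # tuple avoids a mutable default
        model: str = "ols",
        add_intercept: bool = False,
    ):
        self.target_col = target_col
        # Store the features as Polars expressions
        self.feature_cols = [pl.col(feature) for feature in feature_cols]
        self.add_intercept = add_intercept
        if model == "ols":
            # Expression that returns the fitted coefficients as a struct
            self.model_expr = (
                pl.col(self.target_col)
                .least_squares.ols(
                    *self.feature_cols,
                    mode="coefficients",
                    add_intercept=self.add_intercept,
                )
                .alias("coef")
            )
        else:
            # Only OLS is wired up in this example
            raise ValueError(f"Unsupported model: {model!r}")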

    def fit(self, X):
        # Fit the model and save the coefficients in a DataFrame
        self.coefficients_df = (
            X
            .select(
                self.model_expr
            )
            .unnest("coef")
        )
        self.coef_ = (
            self.coefficients_df
            .select(self.feature_cols)
            .melt()
        )
        if self.add_intercept:
            self.intercept_ = self.coefficients_df["const"][0]
        else:
            self.intercept_ = 0.0
        return self

    def transform(self, X: pl.DataFrame):
        # Make predictions using the coefficients saved by fit
        return (
            X
            # Add a row index
            .with_row_index()
            .pipe(
                # Join the predictions back onto the original rows
                lambda X: X.join(
                    # Select the predictor columns
                    X.select("index", *self.feature_cols)
                    # Melt to long format (so we can join the coefficients);
                    # melt was renamed unpivot in Polars 1.0
                    .melt(id_vars="index", value_name="predictor")
                    .join(
                        # Use the saved coefficients rather than refitting on X
                        self.coef_.rename({"value": "coef"}),
                        on="variable",
                    )
                    # Multiply each predictor by its coefficient
                    .with_columns(pred=(pl.col("predictor") * pl.col("coef")))
                    # Gather back up into one row per observation
                    .group_by("index")
                    .agg(
                        pl.col("pred").sum() + self.intercept_
                    ),
                    on="index",
                )
            )
            .sort("index")
        )

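The basic idea is:
1. In fit, we create the DataFrame of variable names and coefficients that we saw earlier
2. In transform, we convert the feature matrix X to long format so that we can join the coefficients to the data by variable name. We then multiply the data by the coefficients and sum the products belonging to the same row to get the predictions.
We then use this class to make predictions on a test DataFrame as follows: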

df_train, df_test = df[:7], df[7:]

reg = LinearRegressor(target_col="y", feature_cols=["x1", "x2"], model="ols")

reg.fit(X=df_train)

reg.transform(X=df_test)
shape: (3, 5)
┌───────┬───────┬───────┬───────┬───────────┐
│ index ┆ y     ┆ x1    ┆ x2    ┆ pred      │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---       │
│ u32   ┆ f64   ┆ f64   ┆ f64   ┆ f64       │
╞═══════╪═══════╪═══════╪═══════╪═══════════╡
│ 0     ┆ -2.92 ┆ -1.64 ┆ -1.36 ┆ -2.945234 │
│ 1     ┆ -0.86 ┆ -0.92 ┆ 0.01  ┆ -0.889025 │
│ 2     ┆ 0.47  ┆ -0.27 ┆ 0.75  ┆ 0.476734  │
└───────┴───────┴───────┴───────┴───────────┘

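This is only the start of what you can do with the polars-ols plugin. Topics I have not covered here include fitting models to different subgroups, fitting models over rolling windows, and fitting models with formulas.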

From： https://blog.csdn.net/sosogod/article/details/140399659
