
Data Analysis (3): Implementing Linear Regression Models

Date: 2024-04-01 17:58:05

1. Overview of penalized linear regression models

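In practice, linear regression needs some modifications to ordinary least squares (OLS). OLS minimizes the error on the training data only, so it may not hold up on data it has not seen.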

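Penalized linear regression is a family of methods for overcoming the overfitting problem of ordinary least squares (OLS). Ridge regression is one special case: it avoids overfitting by penalizing the sum of the squared regression coefficients. Other penalized regression algorithms use different forms of the penalty term.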
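To see how a penalty on squared coefficients shrinks a fit, here is a minimal single-attribute sketch. It assumes the objective (1/2n) * sum((y_i - x_i * b)^2) + (lam/2) * b^2, which is one common scaling and not necessarily the one used by the formulas this post refers to. The function name `ridge_beta_1d` and the toy data are made up for illustration.

```python
def ridge_beta_1d(x, y, lam):
    """Closed-form minimizer of (1/2n)*sum((y - x*b)^2) + (lam/2)*b^2 for one attribute."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) / n
    sxx = sum(xi * xi for xi in x) / n
    return sxy / (sxx + lam)


# toy data (illustrative only): y is roughly 2 * x
x = [-1.5, -0.5, 0.0, 0.5, 1.5]
y = [-3.1, -0.9, 0.2, 1.1, 2.9]

# lam = 0 reproduces the OLS slope; larger lam pulls the slope toward zero
for lam in (0.0, 0.5, 2.0):
    print(lam, ridge_beta_1d(x, y, lam))
```

With `lam = 0` the result is the plain least-squares slope, and each increase in `lam` gives a smaller coefficient, which is the shrinkage effect the penalty is meant to produce.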

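The following characteristics make penalized linear regression methods very effective:

--Model training is fast enough.

--They provide information about variable importance.

--Prediction at deployment time is fast enough.

--Performance is reliable across a wide range of problems, especially when the attribute matrix does not have many more samples than attributes or is very sparse, and when a sparse solution is wanted (a parsimonious model that uses only some of the attributes for prediction).

--The problem may be well suited to a linear model.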

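Formula 4-6 can be described in words as follows: the vector β* and the constant β0* are the values that minimize the expected mean squared prediction error, that is, the average over all data rows (i = 1, ..., n) of the squared error between y_i and the predicted value of y_i.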
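The formula itself is not reproduced in this post. Written out from the verbal description, and assuming a linear predictor of the form β0 + x_i β (the symbols x_i, β0*, and β* are notation chosen here), it would take roughly this form:

```latex
\beta_0^{*},\ \beta^{*}
  = \operatorname*{arg\,min}_{\beta_0,\ \beta}\;
    \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (\beta_0 + x_i \beta) \bigr)^{2}
```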

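The ridge penalty is not the only useful penalty for penalized regression. Any measure of vector length will do, and using a different measure of length changes important properties of the solution. Ridge regression uses the Euclidean measure (the sum of the squares of β). Another useful algorithm is Lasso regression, whose penalty comes from taxicab geometry and is known as the Manhattan distance or L1 regularization (the sum of the absolute values of β). The ElasticNet penalty includes both the lasso penalty and the ridge penalty.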
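The three penalty terms can be written down in a few lines. This is a sketch only: the function names are made up here, and the ElasticNet mix `lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)` is the parameterization commonly associated with Glmnet, assumed here and not taken from the formulas this post refers to.

```python
def ridge_penalty(beta):
    """Squared Euclidean length: sum of squared coefficients."""
    return sum(b * b for b in beta)


def lasso_penalty(beta):
    """Manhattan (L1) length: sum of absolute coefficients."""
    return sum(abs(b) for b in beta)


def elastic_net_penalty(beta, lam, alpha):
    """Blend of both penalties (assumed scaling): alpha=1 is pure lasso, alpha=0 is pure ridge."""
    return lam * (alpha * lasso_penalty(beta) + (1.0 - alpha) / 2.0 * ridge_penalty(beta))


beta = [0.5, -2.0, 0.0]  # illustrative coefficient vector
print(ridge_penalty(beta))   # 4.25
print(lasso_penalty(beta))   # 2.5
print(elastic_net_penalty(beta, lam=0.1, alpha=0.5))
```

Note how the large coefficient dominates the ridge penalty far more than the lasso penalty; this difference in how the two measures weigh large and small coefficients is what changes the character of the solution.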

2. Solving the penalized linear regression problem

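Many general-purpose numerical optimization algorithms can solve the optimization problems corresponding to formulas 4-6, 4-8, and 4-11, but the importance of penalized linear regression has led researchers to develop specialized algorithms that produce solutions very quickly. This post introduces these algorithms and runs the related code, focusing on two of them: least angle regression (LARS) and Glmnet.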

The LARS algorithm can be understood as an improved form of forward stepwise regression.

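LARS is introduced here because it is very close to both the lasso and forward stepwise regression, and because it is easy to understand and relatively compact to implement. By studying the LARS code you will see the concrete steps involved in solving the more general ElasticNet regression and learn the details of how penalized regression is solved.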
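The core idea, nudging at each step the coefficient of whichever attribute is most correlated with the current residuals, fits in a short function. The sketch below is a simplified incremental variant in the spirit of the complete code in this post, not a full LARS implementation. `stagewise_fit` and the toy data are hypothetical, and the attributes are assumed to be centered already.

```python
def stagewise_fit(x, y, n_steps, step_size):
    """Incremental forward stagewise regression on lists of rows (pure Python)."""
    nrows, ncols = len(x), len(x[0])
    beta = [0.0] * ncols
    for _ in range(n_steps):
        # residuals of the current model
        residuals = [y[i] - sum(x[i][k] * beta[k] for k in range(ncols)) for i in range(nrows)]
        # correlation of each attribute column with the residuals
        corr = [sum(x[i][j] * residuals[i] for i in range(nrows)) / nrows for j in range(ncols)]
        # attribute with the largest absolute correlation
        j_star = max(range(ncols), key=lambda j: abs(corr[j]))
        if corr[j_star] == 0.0:
            break  # nothing left to explain
        # fixed-size step in the direction of the correlation sign
        beta[j_star] += step_size if corr[j_star] > 0 else -step_size
    return beta


# toy data (illustrative only): y depends on column 0 alone
x = [[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
y = [-2.0, 2.0, -2.0, 2.0]
beta = stagewise_fit(x, y, n_steps=10, step_size=0.1)
print(beta)
```

On this toy data only the first coefficient ever moves, because the second column is uncorrelated with the residuals at every step. The order in which coefficients leave zero is the variable importance information that the listing in this post records in `nzList`.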

3. Complete code

from math import sqrt
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm


def x_normalized(xList, xMeans, xSD):
    # standardize each column: subtract its mean and divide by its standard deviation
    nrows = len(xList)
    ncols = len(xList[0])
    xNormalized = []
    for i in range(nrows):
        rowNormalized = [(xList[i][j] - xMeans[j]) / xSD[j] for j in range(ncols)]
        xNormalized.append(rowNormalized)
    return xNormalized


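def data_normalized(wine):
    # standardize every column using the mean and std reported by describe()
    summary = wine.describe()
    nrows, ncols = wine.shape
    # astype(float) returns a copy, so the caller's DataFrame is not modified
    wineNormalized = wine.astype(float)
    for i in range(ncols):
        mean = summary.iloc[1, i]
        sd = summary.iloc[2, i]
        wineNormalized.iloc[:, i] = (wineNormalized.iloc[:, i] - mean) / sd
    return wineNormalized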


def calculate_betaMat(nSteps, stepSize, wineNormalized):
    nrows, ncols = wineNormalized.shape
    # plain nested lists are much faster to index in the loops below than .iloc
    data = wineNormalized.values.tolist()
    # initialize a vector of coefficients beta
    beta = [0.0] * (ncols - 1)
    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))
    # initialize residuals list
    residuals = [0.0] * nrows
    for i in tqdm(range(nSteps)):
        # calculate residuals: label (last column) minus the current prediction
        for j in range(nrows):
            residuals[j] = data[j][ncols - 1]
            for k in range(ncols - 1):
                residuals[j] -= data[j][k] * beta[k]

        # calculate correlation between attribute columns from normalized wine and residual
        corr = [0.0] * (ncols - 1)
        for j in range(ncols - 1):
            for k in range(nrows):
                corr[j] += data[k][j] * residuals[k] / nrows

        # pick the attribute with the largest absolute correlation
        iStar = 0
        corrStar = corr[0]
        for j in range(1, (ncols - 1)):
            if abs(corrStar) < abs(corr[j]):
                iStar = j
                corrStar = corr[j]
        # move that coefficient by a fixed step in the direction of the correlation sign
        beta[iStar] += stepSize * corrStar / abs(corrStar)
        betaMat.append(list(beta))
    return betaMat


def plot_betaMat1(betaMat):
    ncols = len(betaMat[0])
    for i in range(ncols):
        # plot the coefficient vector recorded at step i against the attribute index
        coefCurve = betaMat[i]
        plt.plot(coefCurve)

    plt.xlabel("Attribute Index")
    plt.ylabel("Coefficient Values")
    plt.show()


def plot_betaMat2(nSteps, betaMat):
    ncols = len(betaMat[0])
    for i in range(ncols):
        # plot range of beta values for each attribute
        coefCurve = [betaMat[k][i] for k in range(nSteps)]
        xaxis = range(nSteps)
        plt.plot(xaxis, coefCurve)

    plt.xlabel("Steps Taken")
    plt.ylabel(("Coefficient Values"))
    plt.show()


def S(z, gamma):
    if gamma >= abs(z):
        return 0.0
    return (z / abs(z)) * (abs(z) - gamma)


if __name__ == '__main__':
    target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    wine = pd.read_csv(target_url, header=0, sep=";")

    # normalize the wine data
    summary = wine.describe()
    print(summary)

    # standardize the data
    wineNormalized = data_normalized(wine)
    # number of steps to take
    nSteps = 100
    stepSize = 0.1
    betaMat = calculate_betaMat(nSteps, stepSize, wineNormalized)
    plot_betaMat1(betaMat)
    # ----------------------------larsWine---------------------------------------------------
    # read data into plain Python lists
    names = wine.columns
    # attributes: every column except the last, one list of floats per row
    xList = wine.iloc[:, :-1].astype(float).values.tolist()
    # labels: the last column (quality)
    labels = wine.iloc[:, -1].astype(float).tolist()
    # Normalize columns in x and labels
    nrows = len(xList)
    ncols = len(xList[0])
    # calculate means and standard deviations
    xMeans = []
    xSD = []
    for i in range(ncols):
        col = [xList[j][i] for j in range(nrows)]
        mean = sum(col) / nrows
        xMeans.append(mean)
        colDiff = [(xList[j][i] - mean) for j in range(nrows)]
        sumSq = sum([d * d for d in colDiff])
        stdDev = sqrt(sumSq / nrows)
        xSD.append(stdDev)

    # use calculated mean and standard deviation to normalize xList
    xNormalized = x_normalized(xList, xMeans, xSD)
    # normalize labels the same way as the attributes
    meanLabel = sum(labels) / nrows
    sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)]) / nrows)
    labelNormalized = [(labels[i] - meanLabel) / sdLabel for i in range(nrows)]

    # initialize a vector of coefficients beta
    beta = [0.0] * ncols
    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))
    # number of steps to take
    nSteps = 350
    stepSize = 0.004
    nzList = []
    for i in range(nSteps):
        # calculate residuals
        residuals = [0.0] * nrows
        for j in range(nrows):
            labelsHat = sum([xNormalized[j][k] * beta[k] for k in range(ncols)])
            residuals[j] = labelNormalized[j] - labelsHat  # residual for row j

        # calculate correlation between attribute columns from normalized wine and residual
        corr = [0.0] * ncols
        for j in range(ncols):
            corr[j] = sum([xNormalized[k][j] * residuals[k] for k in range(nrows)]) / nrows

        iStar = 0
        corrStar = corr[0]
        for j in range(1, (ncols)):  # find the attribute most correlated with the residuals
            if abs(corrStar) < abs(corr[j]):
                iStar = j
                corrStar = corr[j]
        # change that coefficient by a fixed step: positive if the correlation is positive, negative otherwise
        beta[iStar] += stepSize * corrStar / abs(corrStar)
        betaMat.append(list(beta))  # store the coefficients after this step

        nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
        for q in nzBeta:
            if q not in nzList:  # record each index the first time its coefficient becomes non-zero
                nzList.append(q)
    nameList = [names[nzList[i]] for i in range(len(nzList))]
    print(nameList)
    plot_betaMat2(nSteps, betaMat)  # plot the coefficient curves

    # -------------------------------larsWine 10-fold cross-validation------------------------------------------------
    # Build cross-validation loop to determine best coefficient values.
    # number of cross validation folds
    nxval = 10
    # number of steps and step size
    nSteps = 350
    stepSize = 0.004
    # initialize list for storing errors.
    errors = []  # one list of out-of-sample errors per step
    for i in range(nSteps):
        b = []
        errors.append(b)

    for ixval in range(nxval):  # 10-fold cross-validation
        # Define test and training index sets: fold ixval holds the rows whose index equals ixval modulo nxval
        idxTrain = [a for a in range(nrows) if a % nxval != ixval]
        idxTest = [a for a in range(nrows) if a % nxval == ixval]
        # Define test and training attribute and label sets
        xTrain = [xNormalized[r] for r in idxTrain]  # training set
        labelTrain = [labelNormalized[r] for r in idxTrain]
        xTest = [xNormalized[r] for r in idxTest]  # test set
        labelTest = [labelNormalized[r] for r in idxTest]

        # Train LARS regression on Training Data
        nrowsTrain = len(idxTrain)
        nrowsTest = len(idxTest)

        # initialize a vector of coefficients beta
        beta = [0.0] * ncols

        # initialize matrix of betas at each step
        betaMat = []
        betaMat.append(list(beta))
        for iStep in range(nSteps):
            # calculate residuals
            residuals = [0.0] * nrowsTrain
            for j in range(nrowsTrain):
                labelsHat = sum([xTrain[j][k] * beta[k] for k in range(ncols)])
                residuals[j] = labelTrain[j] - labelsHat
            # calculate correlation between attribute columns from normalized wine and residual
            corr = [0.0] * ncols
            for j in range(ncols):
                corr[j] = sum([xTrain[k][j] * residuals[k] for k in range(nrowsTrain)]) / nrowsTrain

            iStar = 0
            corrStar = corr[0]
            for j in range(1, (ncols)):
                if abs(corrStar) < abs(corr[j]):
                    iStar = j
                    corrStar = corr[j]
            beta[iStar] += stepSize * corrStar / abs(corrStar)
            betaMat.append(list(beta))

            # Use beta just calculated to predict and accumulate out of sample error - not being used in the calc of beta
            for j in range(nrowsTest):
                labelsHat = sum([xTest[j][k] * beta[k] for k in range(ncols)])
                err = labelTest[j] - labelsHat
                errors[iStep].append(err)
    cvCurve = []
    for errVect in errors:
        mse = sum([x * x for x in errVect]) / len(errVect)
        cvCurve.append(mse)
    minMse = min(cvCurve)
    minPt = [i for i in range(len(cvCurve)) if cvCurve[i] == minMse][0]
    print("Minimum Mean Square Error", minMse)
    print("Index of Minimum Mean Square Error", minPt)

    xaxis = range(len(cvCurve))
    plt.plot(xaxis, cvCurve)
    plt.xlabel("Steps Taken")
    plt.ylabel(("Mean Square Error"))
    plt.show()

    # -------------------------------glmnet larsWine2------------------------------------------------
    # select value for alpha parameter
    alpha = 1.0
    # make a pass through the data to determine value of lambda that
    # just suppresses all coefficients.
    # start with betas all equal to zero.
    xy = [0.0] * ncols
    for i in range(nrows):
        for j in range(ncols):
            xy[j] += xNormalized[i][j] * labelNormalized[i]

    maxXY = 0.0
    for i in range(ncols):
        val = abs(xy[i]) / nrows
        if val > maxXY:
            maxXY = val

    # calculate starting value for lambda
    lam = maxXY / alpha

    # this value of lambda corresponds to beta = list of 0's
    # initialize a vector of coefficients beta
    beta = [0.0] * ncols

    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))

    # begin iteration
    nSteps = 100
    lamMult = 0.93  # 100 steps gives reduction by factor of 1000 in
    # lambda (recommended by authors)
    nzList = []
    for iStep in range(nSteps):
        # make lambda smaller so that some coefficient becomes non-zero
        lam = lam * lamMult

        deltaBeta = 100.0
        eps = 0.01
        iterStep = 0
        betaInner = list(beta)
        while deltaBeta > eps:
            iterStep += 1
            if iterStep > 100:
                break
            # cycle through attributes and update one-at-a-time
            # record starting value for comparison
            betaStart = list(betaInner)
            for iCol in range(ncols):
                xyj = 0.0
                for i in range(nrows):
                    # calculate residual with current value of beta
                    labelHat = sum([xNormalized[i][k] * betaInner[k] for k in range(ncols)])
                    residual = labelNormalized[i] - labelHat

                    xyj += xNormalized[i][iCol] * residual

                uncBeta = xyj / nrows + betaInner[iCol]
                betaInner[iCol] = S(uncBeta, lam * alpha) / (1 + lam * (1 - alpha))

            sumDiff = sum([abs(betaInner[n] - betaStart[n]) for n in range(ncols)])
            sumBeta = sum([abs(betaInner[n]) for n in range(ncols)])
            deltaBeta = sumDiff / sumBeta
        print(iStep, iterStep)
        beta = betaInner
        # add newly determined beta to list
        betaMat.append(beta)
        # keep track of the order in which the betas become non-zero
        nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
        for q in nzBeta:
            if q not in nzList:
                nzList.append(q)
    # print out the ordered list of betas
    nameList = [names[nzList[i]] for i in range(len(nzList))]
    print(nameList)
    nPts = len(betaMat)
    plot_betaMat2(nPts, betaMat)  # plot the coefficient curves
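
The helper `S(z, gamma)` in the listing is the soft-thresholding operator: it pulls `z` toward zero by `gamma` and returns exactly zero once `|z|` is no larger than `gamma`, which is how the lasso part of the penalty sets coefficients to zero. A standalone sketch follows. `soft_threshold` is a name chosen here, it uses `math.copysign` in place of `z / abs(z)`, and the sample inputs are illustrative only.

```python
import math


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within gamma of zero become exactly 0.0."""
    if gamma >= abs(z):
        return 0.0
    return math.copysign(abs(z) - gamma, z)


# values inside [-0.2, 0.2] are zeroed, the rest move 0.2 closer to zero
for z in (-0.5, -0.1, 0.0, 0.1, 0.5):
    print(z, soft_threshold(z, 0.2))
```

In the coordinate-descent loop of the listing, this operator is applied to the unpenalized update `uncBeta` with threshold `lam * alpha`, so a smaller `lam` lets more coefficients move away from zero.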

From: https://blog.csdn.net/Trisyp/article/details/137241669
