
Approximation with Gradient Descent Method



Polynomial Approximation with Gradient Descent Method

Suppose we have the dataset hw10data1.txt, which is shown in Figure 1.1.

Figure 1.1 Plot of training data hw10data1.txt

hw10data1.txt is given by:

-1, 2.1
-0.7, 1.2
-0.4, 0.38
-0.1, 0.5
0.2, 0.48
0.48, 0.1
0.81, -0.4

From the plot, it seems that fitting a straight line would be too simple an approximation. Instead, we try a fifth-order polynomial. Our hypothesis will be

\[h_{\theta}(x)=\theta^{\mathrm{T}} x=\theta_{0}+\theta_{1} x+\theta_{2} x^{2}+\theta_{3} x^{3}+\theta_{4} x^{4}+\theta_{5} x^{5} \]

Since we are fitting a fifth-order polynomial to a dataset of only 7 points, overfitting is likely to occur.

To avoid this, we use regularized linear regression.

  1. Write the regularized cost function;

  2. Derive the gradients of the above cost function;

  3. Write the gradient descent algorithm for the regularized linear regression;

  4. Plot the fitted curve for each of the regularization parameters \(\lambda=0\), \(\lambda=1\), and \(\lambda=10\) using the gradient descent algorithm, and explain your results.

Regularized Cost Function:

The ordinary (unregularized) cost function can be written as

\[J(\theta) = \frac{1}{2m}\left(y - X\theta \right)^\top \left(y - X \theta \right) \]

where \(X\) is the Vandermonde matrix (\(n = 5\)), i.e.

\[\begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \\ \end{bmatrix} \]

and the regularized cost function is

\[J(\theta) = \frac{1}{2m} \left(y - X\theta \right)^\top \left(y - X \theta \right) +\frac{1}{2m} \lambda \theta ^\top \theta \]
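
A direct NumPy translation of this cost (a minimal sketch; regularized_cost is a helper name introduced here, and, matching the formula above, the intercept \(\theta_0\) is penalized along with the other coefficients):

import numpy as np

def regularized_cost(theta, X, y, lambd):
    # J(theta) = 1/(2m) * ||y - X theta||^2 + lambda/(2m) * ||theta||^2
    m = len(y)
    residual = y - X.dot(theta)
    return (residual.dot(residual) + lambd * theta.dot(theta)) / (2 * m)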

Gradients of the Cost Function:

To derive the gradient of \(J(\theta)\), we first calculate the derivative of \(J(\theta)\) without the regularization term:

\[\frac{\partial J(\theta)}{\partial \theta} = -\frac{1}{m}X^T(y - X\theta) \]

And then, we calculate the derivative of the regularization term:

\[\frac{\partial}{\partial \theta} \left(\frac{\lambda}{2m} \theta^T\theta\right) = \frac{\lambda}{m} \theta \]

The gradient of the regularized cost function \(J(\theta)\) is the sum of the gradients of the two parts:

\[\nabla J(\theta) = -\frac{1}{m}X^T(y - X\theta) + \frac{\lambda}{m} \theta \]
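
Before running gradient descent, the analytic gradient can be checked against central finite differences of the cost. This is a sketch for verification only; analytic_gradient and numerical_gradient are names introduced here, and regularized_cost refers to the helper sketched above.

def analytic_gradient(theta, X, y, lambd):
    # gradient derived above: -1/m * X^T (y - X theta) + lambda/m * theta
    m = len(y)
    return -X.T.dot(y - X.dot(theta)) / m + lambd / m * theta

def numerical_gradient(theta, X, y, lambd, eps=1e-6):
    # central difference of regularized_cost, one component at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (regularized_cost(theta + step, X, y, lambd)
                   - regularized_cost(theta - step, X, y, lambd)) / (2 * eps)
    return grad

For any \(\theta\), np.allclose(analytic_gradient(theta, X, y, 1.0), numerical_gradient(theta, X, y, 1.0)) should hold up to the finite-difference error.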

import numpy as np

def load_x():
    file_path = 'hw10data1.txt'
    x = []
    y = []
    with open(file_path, 'r') as file:
        lines = file.readlines()
        for line in lines:
            parts = line.strip().split(',')
            x.append(float(parts[0]))
            y.append(float(parts[1]))
    return np.array(x), np.array(y)

x, y = load_x()
print("x array:", x)
print("y array:", y)

x array: [-1.   -0.7  -0.4  -0.1   0.2   0.48  0.81]
y array: [ 2.1   1.2   0.38  0.5   0.48  0.1  -0.4 ]
def construct_vandermonde_matrix(x, n):
    # columns 1, x, x^2, ..., x^n: a degree-n polynomial needs n + 1 columns
    vandermonde_matrix = np.column_stack([x ** i for i in range(n + 1)])
    return vandermonde_matrix
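
For reference, NumPy can build the same matrix in one call (a sketch; increasing=True puts the constant column first, matching the layout above):

# equivalent construction with NumPy's built-in helper: columns 1, x, x^2, ..., x^5
X_alt = np.vander(x, 6, increasing=True)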

Gradient Descent Algorithm

def gradient_descent(X, y, theta, learning_rate, iterations, lambd):
    m = len(y)
    for _ in range(iterations):
        gradient = -1/m * X.T.dot(y - X.dot(theta)) + lambd/m * theta
        theta = theta - learning_rate * gradient
    return theta
x, y = load_x()
n = 5  # polynomial degree of the hypothesis
X = construct_vandermonde_matrix(x, n)

# Initialize theta, learning_rate, iterations, and lambd here
theta = np.zeros(n + 1)  # theta_0, ..., theta_n
learning_rate = 0.01
iterations = 1000
lambd = 0.1

# Apply the gradient descent function
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)

print("Optimized theta values:", theta)
Optimized theta values: [ 0.35327229 -0.6234696   0.18933476 -0.61360503  0.2504074 ]
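
Because this regularized least-squares cost is quadratic, it also has a closed-form minimizer that can serve as a sanity check for the gradient descent result. Setting the gradient above to zero gives \((X^\top X + \lambda I)\theta = X^\top y\). A sketch (theta_closed is a name introduced here, and, as in the cost above, the identity matrix also penalizes the intercept):

# closed-form minimizer of the same regularized cost, for comparison
theta_closed = np.linalg.solve(X.T.dot(X) + lambd * np.eye(X.shape[1]), X.T.dot(y))
print("Closed-form theta values:", theta_closed)

With a small enough learning rate and enough iterations, gradient descent should approach this solution.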
import matplotlib.pyplot as plt

def plot_fig():
    # Plot the training data and the fitted polynomial (uses the globals theta, n, and lambd)
    x_values = np.linspace(-1, 1, 100)
    X_test = construct_vandermonde_matrix(x_values, n)
    y_values = X_test.dot(theta)

    plt.scatter(x, y, color='blue', label="training data")
    plt.plot(x_values, y_values, color='red', label=f"{n}th order, lambda = {lambd}")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.show()
    

lambd = 0
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)
plot_fig()


Figure: training data and fitted polynomial, lambda = 0

lambd = 1
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)
plot_fig()


Figure: training data and fitted polynomial, lambda = 1

lambd = 10
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)
plot_fig()


Figure: training data and fitted polynomial, lambda = 10

Logistic Regression with Gradient Descent Method

Suppose you are a product manager at a factory and you have test results for some microchips on two different tests. From these two tests, you would like to determine whether a microchip should be accepted or rejected. To help you make the decision, you have a dataset hw10data2.txt of test results on past microchips, from which you can build a logistic regression model.

Figure 2.1 Plot of training data hw10data2.txt

Figure 2.1 shows that the dataset cannot be separated into positive and negative examples by a straight line. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression can only find a linear decision boundary. One way to fit the data better is to create more features from each data point: we map the features into all polynomial terms of \(x_{1}\) and \(x_{2}\) up to the sixth power. That means we will use the following hypothesis for the logistic regression classifier:

\[\begin{aligned} & h_{\theta}(x)=g\left(\theta^{\mathrm{T}} x\right), g(z)=\frac{1}{1+\mathrm{e}^{-z}}, \\ & \theta^{\mathrm{T}} x=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\theta_{3} x_{1}^{2}+\theta_{4} x_{1} x_{2}+\theta_{5} x_{2}^{2} \\ & +\theta_{6} x_{1}^{3}+\theta_{7} x_{1}^{2} x_{2}+\theta_{8} x_{1} x_{2}^{2}+\theta_{9} x_{2}^{3} \\ & +\theta_{10} x_{1}^{4}+\theta_{11} x_{1}^{3} x_{2}+\theta_{12} x_{1}^{2} x_{2}^{2}+\theta_{13} x_{1} x_{2}^{3}+\theta_{14} x_{2}^{4} \\ & +\theta_{15} x_{1}^{5}+\theta_{16} x_{1}^{4} x_{2}+\theta_{17} x_{1}^{3} x_{2}^{2}+\theta_{18} x_{1}^{2} x_{2}^{3}+\theta_{19} x_{1} x_{2}^{4}+\theta_{20} x_{2}^{5} \\ & +\theta_{21} x_{1}^{6}+\theta_{22} x_{1}^{5} x_{2}+\theta_{23} x_{1}^{4} x_{2}^{2}+\theta_{24} x_{1}^{3} x_{2}^{3}+\theta_{25} x_{1}^{2} x_{2}^{4}+\theta_{26} x_{1} x_{2}^{5}+\theta_{27} x_{2}^{6} \end{aligned} \]
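
As a sketch of this feature mapping applied to a single point (map_feature is a name introduced here for illustration; the implementation further below enumerates the same 28 monomials in a different but equally valid column order):

import numpy as np

def map_feature(x1, x2, degree=6):
    # all monomials x1^(d - k) * x2^k for total degree d = 0, ..., degree,
    # in the order written in the expansion above (28 terms for degree 6)
    terms = []
    for d in range(degree + 1):
        for k in range(d + 1):
            terms.append(x1 ** (d - k) * x2 ** k)
    return np.array(terms)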

We know that too many features may result in overfitting, so we implement regularized logistic regression to fit the data.

Regularized cost function in logistic regression

Write the regularized cost function in logistic regression;

Observe that the function \(g\) is the sigmoid function. The regularized cost function is given by:

\[\begin{aligned} & \min _\theta J(\theta), \\ & J(\theta)=-\frac{1}{m} \sum_{i=1}^m\left[y^{(i)} \log h_\theta\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_\theta\left(x^{(i)}\right)\right)\right]+\frac{\lambda}{2 m} \sum_{j=1}^n \theta_j^2\end{aligned} \]
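
A direct NumPy translation of this cost (a minimal sketch; logistic_cost is a name introduced here, and the penalty skips \(\theta_0\), matching the sum starting at \(j = 1\)):

import numpy as np

def logistic_cost(theta, X, y, lambd):
    m = len(y)
    h = 1 / (1 + np.exp(-X.dot(theta)))                   # h_theta(x^(i)) for every row
    cross_entropy = -(y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h))) / m
    penalty = lambd / (2 * m) * theta[1:].dot(theta[1:])  # theta_0 is not penalized
    return cross_entropy + penalty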

Gradients of the above cost function

Observe that the function \(h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}\) is the logistic (sigmoid) function.

We have some useful identities for the sigmoid function (function arguments are omitted where they are clear from context):

\[\sigma' = \sigma \cdot (1 - \sigma) \]

\[(\log \sigma(x))' = \frac{1}{e^{x}+1} = 1 - \sigma(x) \]

\[\sigma(x) = 1-\sigma(-x) \]

By applying the identities above, we have

\[\frac{\partial }{\partial x}\left((1-y^{(i)}) \log (1-\sigma(x)) + y^{(i)} \log \sigma(x)\right) = y^{(i)} - \sigma(x) \]
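
These identities are easy to spot-check numerically (a small sketch, not part of the derivation):

import numpy as np

z = np.linspace(-5, 5, 11)
sig = 1 / (1 + np.exp(-z))
eps = 1e-6
d_sig = (1 / (1 + np.exp(-(z + eps))) - 1 / (1 + np.exp(-(z - eps)))) / (2 * eps)
print(np.allclose(d_sig, sig * (1 - sig), atol=1e-6))   # sigma' = sigma * (1 - sigma)
print(np.allclose(sig, 1 - 1 / (1 + np.exp(z))))        # sigma(x) = 1 - sigma(-x)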

So the gradient of \(J(\theta)\) is given by:

\[\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)} \]

\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \text{ for } j = 1, \ldots, n \]

where \(x^{(i)}\) represents the feature vector of the \(i\)th training example, \(y^{(i)}\) is the corresponding label, and \(\lambda\) is the regularization parameter.
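
Stacked over all components, the two formulas above collapse into a single matrix expression. A sketch (logistic_gradient is a name introduced here); the loop-based gradient_descent further below computes the same quantities entry by entry:

import numpy as np

def logistic_gradient(theta, X, y, lambd):
    m = len(y)
    h = 1 / (1 + np.exp(-X.dot(theta)))   # h_theta(x^(i)) for every row of X
    grad = X.T.dot(h - y) / m
    grad[1:] += lambd / m * theta[1:]     # no penalty term for theta_0
    return grad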

Execute the gradient descent algorithm and produce the corresponding plots

import numpy as np

def load_x():
    file_path = 'hw10data2.txt'
    x1 = []
    x2 = []
    y = []
    with open(file_path, 'r') as file:
        lines = file.readlines()
        for line in lines:
            parts = line.strip().split(',')
            x1.append(float(parts[0]))
            x2.append(float(parts[1]))
            y.append(float(parts[2]))
    
    return np.array(np.transpose([x1, x2])), np.array(y)

x, y = load_x()
def construct_vandermonde_matrix(x, order):
    # one column per monomial x1^n1 * x2^n2 with n1 + n2 <= order,
    # i.e. (order + 1)(order + 2) / 2 columns (28 for order = 6)
    data_size = x.shape[0]
    vander = np.zeros((data_size, (order + 1) * (order + 2) // 2))
    for current_ind in range(data_size):
        current_x = x[current_ind, :]
        index = 0
        for n1 in range(order + 1):
            for n2 in range(order + 1 - n1):
                vander[current_ind, index] = current_x[0] ** n1 * current_x[1] ** n2
                index += 1
    return vander

X = construct_vandermonde_matrix(x, 6)
print(X.shape)

(118, 28)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def hypothesis_function(theta, x_row):
    return sigmoid(theta.dot(x_row))

def h_wrap(theta, X):
    # evaluate h_theta(x^(i)) for every row of X
    res = []
    for i in range(X.shape[0]):
        res.append(hypothesis_function(theta, X[i, :]))
    return np.array(res)
def gradient_descent(X, y, theta, learning_rate, iterations, lambd):
    m = len(y)
    n_features = X.shape[1]
    for _ in range(iterations):
        tmp = h_wrap(theta, X) - y              # h_theta(x^(i)) - y^(i)
        gradient = np.zeros(n_features)
        gradient[0] = tmp.dot(X[:, 0]) / m      # theta_0 is not regularized
        for j in range(1, n_features):
            gradient[j] = tmp.dot(X[:, j]) / m + lambd * theta[j] / m
        theta = theta - learning_rate * gradient
    return theta
max_order = 6
n = (max_order + 1) * (max_order + 2) // 2  # number of monomials up to total degree 6
print(f"width of the Vandermonde matrix: {n}")
# Initialize theta, learning_rate, iterations, and lambd here
theta = np.zeros(n)
learning_rate = 0.01
iterations = 10000
lambd = 0.1

# Apply the gradient descent function
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)

print("Optimized theta values:", theta)
width of the Vandermonde matrix: 28
Optimized theta values: [ 0.          1.24840339  0.08434246 -0.19608066 -0.9424621  -0.69907608
 -1.10322374  0.4983416  -0.97733676 -0.24856352 -0.26537115 -0.21632738
 -0.11254653 -0.79591403 -0.37502702 -0.47569472 -0.28680801 -0.31413115
  0.11819405 -0.1093652  -0.04540563 -0.00267605 -1.18512554 -0.2280036
 -0.26935639 -0.28587253 -0.01128536 -1.01289731]
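
As a quick sanity check before looking at the plots, the training accuracy can be computed from the fitted model (a sketch using the h_wrap helper above; the 0.5 threshold corresponds to the decision boundary drawn below):

# classify each training point at the 0.5 threshold and compare against the labels
predictions = (h_wrap(theta, X) >= 0.5).astype(float)
print("Training accuracy:", np.mean(predictions == y))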
import matplotlib.pyplot as plt

def plot_fig_con(lambd):
    # filled contour of the predicted probability h_theta over a grid (uses the global theta)
    x1_min, x1_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    x2_min, x2_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1),
                           np.arange(x2_min, x2_max, 0.1))
    
    grid = np.c_[xx1.ravel(), xx2.ravel()]
    X_grid = construct_vandermonde_matrix(grid, 6)

    probs = h_wrap(theta, X_grid)
    probs = probs.reshape(xx1.shape)
    
    plt.contourf(xx1, xx2, probs, alpha=0.8)
    plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title(f"lambda = {lambd}")
    plt.show()
import matplotlib.pyplot as plt

def plot_fig(lambd):
    plt.scatter(x[y == 0, 0], x[y == 0, 1], c='red', edgecolors='k', label='class 0: y = 0')
    plt.scatter(x[y == 1, 0], x[y == 1, 1], c='blue', edgecolors='k', label='class 1: y = 1')


    x1_min, x1_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    x2_min, x2_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1),
                           np.arange(x2_min, x2_max, 0.1))
    
    grid = np.c_[xx1.ravel(), xx2.ravel()]
    X_grid = construct_vandermonde_matrix(grid, 6)
    
    probs = h_wrap(theta, X_grid)
    probs = probs.reshape(xx1.shape)
    
    # plt.contour does not accept a `label` kwarg; the 0.5 level marks the decision boundary
    plt.contour(xx1, xx2, probs, levels=[0.5], colors='r', linestyles='dashed')
    
    plt.legend()
    
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title(f"lambda = {lambd}")
    
    plt.show()

lambd = 0 
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)
plot_fig_con(lambd)
plot_fig(lambd)


Figure: probability contours (plot_fig_con), lambda = 0


Figure: decision boundary (plot_fig), lambda = 0

lambd = 1
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)
plot_fig_con(lambd)
plot_fig(lambd)


Figure: probability contours (plot_fig_con), lambda = 1


Figure: decision boundary (plot_fig), lambda = 1

lambd = 100 
theta = gradient_descent(X, y, theta, learning_rate, iterations, lambd)
plot_fig_con(lambd)
plot_fig(lambd)


Figure: probability contours (plot_fig_con), lambda = 100


Figure: decision boundary (plot_fig), lambda = 100

