
Supervised Machine Learning : Regression and Classification

Time: 2023-12-28 20:11:08
Tags: function, frac, gradient, sum, Machine, Supervised, alpha, partial, Regression

The course is available on Coursera: Supervised Machine Learning: Regression and Classification, Week 1: Introduction to Machine Learning.

Regression Model

The cost measures how well our model predicts the target. The following formula gives the cost function of linear regression with one variable:

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})^2$$

Note that we divide by $m$ to obtain an average, and further divide by 2 to make later calculations neater: the 2 cancels when the squared term is differentiated.
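The cost formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference code: the function name `compute_cost` and the toy data are assumptions made for the example.

```python
def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) for linear regression with one variable."""
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        prediction = w * x_i + b          # f_{w,b}(x^(i))
        total += (prediction - y_i) ** 2  # squared error of one example
    return total / (2 * m)


# Toy data (made up for illustration): the targets follow y = 2x exactly.
x_train = [1.0, 2.0, 3.0]
y_train = [2.0, 4.0, 6.0]

print(compute_cost(x_train, y_train, w=2.0, b=0.0))  # perfect fit
print(compute_cost(x_train, y_train, w=1.0, b=0.0))  # worse fit, larger cost
```

With `w=1.0` the errors are -1, -2 and -3, so the cost is (1 + 4 + 9) / 6.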

Because the cost function squares the error, the 'error surface' of linear regression is convex, like a soup bowl. It always has a minimum that can be reached by following the gradient in all dimensions.

Gradient Descent

Repeat the following steps until convergence ($\partial$ denotes the partial derivative):

$$w = w-\alpha\frac{\partial}{\partial w}J(w, b)$$

$$b = b-\alpha\frac{\partial}{\partial b}J(w,b)$$

Here are some explanations:

  • Simultaneous update: calculate the partial derivatives for all the parameters before updating any of them.
  • Reaching a local minimum: at a local minimum the derivative is zero, so there is no further movement.
  • Appropriate learning rate: a small $\alpha$ may make progress slow, while a large $\alpha$ can overshoot the minimum.
  • Fixed learning rate: the length of a single step scales with the partial derivative, which shrinks as we approach the minimum. So we can reach a local minimum with a fixed learning rate.

Let's substitute the cost function into the formulas above and see what we get (using partial derivative rules):

$$w = w-\alpha\frac{1}{m}\sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}$$

$$b = b-\alpha\frac{1}{m}\sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})$$
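As a rough sketch, the two update rules can be written as follows. The names `compute_gradient` and `gradient_descent`, the toy data, and the choice of `alpha=0.05` with 5000 iterations are assumptions for this example. Note that both partial derivatives are computed from the old `w` and `b` before either is changed, which is the simultaneous update described earlier.

```python
def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) for one-variable linear regression."""
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i  # f_{w,b}(x^(i)) - y^(i)
        dj_dw += error * x_i
        dj_db += error
    return dj_dw / m, dj_db / m


def gradient_descent(x, y, w, b, alpha, num_iters):
    """Run a fixed number of gradient descent steps with a fixed learning rate."""
    for _ in range(num_iters):
        # Both derivatives use the current (old) w and b: simultaneous update.
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b


# Toy data (made up for illustration): the targets follow y = 2x + 1.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

w_fit, b_fit = gradient_descent(x_train, y_train, w=0.0, b=0.0,
                                alpha=0.05, num_iters=5000)
print(w_fit, b_fit)  # close to 2 and 1
```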

How do we run gradient descent with multiple variables? It works in much the same way as for linear regression with one variable:

$$w_j = w_j - \alpha \frac{\partial J(\text w,b)}{\partial w_j}$$

$$b = b - \alpha\frac{\partial J(\text w, b)}{\partial b}$$

where $n$ is the number of features, $j = 1, \dots, n$, the parameters $w_j$ and $b$ are updated simultaneously, and:

$$\frac{\partial J(\text w,b)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})\times x_j^{(i)}$$

$$\frac{\partial J(\text w,b)}{\partial b} = \frac{1}{m}\sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})$$
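The two partial derivatives translate almost directly into code. The following is a minimal sketch using nested lists rather than a numerical library; the names `predict` and `compute_gradient_multi` and the toy data are assumptions for this example.

```python
def predict(x_i, w, b):
    """Linear model f_{w,b}(x) = w . x + b for one example x_i."""
    return sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b


def compute_gradient_multi(X, y, w, b):
    """Return ([dJ/dw_1, ..., dJ/dw_n], dJ/db) for multi-feature linear regression."""
    m = len(X)
    n = len(w)
    dj_dw = [0.0] * n
    dj_db = 0.0
    for x_i, y_i in zip(X, y):
        error = predict(x_i, w, b) - y_i
        for j in range(n):
            dj_dw[j] += error * x_i[j]  # error times the j-th feature
        dj_db += error
    return [g / m for g in dj_dw], dj_db / m


# Toy data (made up for illustration): two examples with two features each.
X_train = [[1.0, 2.0], [3.0, 4.0]]
y_train = [5.0, 6.0]

grad_w, grad_b = compute_gradient_multi(X_train, y_train, w=[0.0, 0.0], b=0.0)
print(grad_w, grad_b)
```

At `w = [0, 0]`, `b = 0` the errors are -5 and -6, so the gradient is `[-11.5, -17.0]` for the weights and `-5.5` for `b`.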

The implementation of gradient descent is in the file gradient descent. You can check the code as needed.

Here is some advice for debugging the algorithm:

  • For sufficiently small $\alpha$, $J(w, b)$ should decrease on every iteration.

  • Convergence test: if $|\Delta J(w,b)|\leq 10^{-3}$ between two consecutive iterations, we declare that the model has converged.
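Both checks can be built into the training loop by recording the cost after every step. The sketch below is one possible way to do it for one-variable linear regression; the function names, the toy data and `alpha=0.05` are assumptions for this example, and `epsilon` plays the role of the $10^{-3}$ threshold.

```python
def cost(x, y, w, b):
    """Squared-error cost J(w, b)."""
    m = len(x)
    return sum((w * x_i + b - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)


def run_until_converged(x, y, alpha, epsilon=1e-3, max_iters=10000):
    """Gradient descent that stops once |delta J| <= epsilon."""
    w, b = 0.0, 0.0
    m = len(x)
    history = [cost(x, y, w, b)]  # cost before any update
    for _ in range(max_iters):
        errors = [w * x_i + b - y_i for x_i, y_i in zip(x, y)]
        dj_dw = sum(e * x_i for e, x_i in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
        history.append(cost(x, y, w, b))
        if abs(history[-2] - history[-1]) <= epsilon:
            break  # convergence test passed
    return w, b, history


# Toy data (made up for illustration): the targets follow y = 2x + 1.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

w_fit, b_fit, history = run_until_converged(x_train, y_train, alpha=0.05)
# Debugging check: with a small enough alpha the cost never goes up.
print(all(prev >= cur for prev, cur in zip(history, history[1:])))
print(len(history) - 1, "iterations, final cost", history[-1])
```

If the first printed value is `False`, the learning rate is probably too large (or the gradient code has a bug).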

Feature Scaling

Feature scaling refers to rescaling the dataset so the features have a similar range.

Assume $\mu_j$ is the mean of all the values of feature $j$ and $\sigma_j$ is its standard deviation. To implement z-score normalization, adjust your input values as shown in this formula:

$$x_j^{(i)} = \frac{x_j^{(i)} - \mu_j}{\sigma_j}$$

Implementation note: rescale the test data as well as the training data, using the $\mu_j$ and $\sigma_j$ computed on the training set.
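A minimal sketch of z-score normalization, assuming the data is stored as a list of rows. The names `fit_zscore` and `apply_zscore` are made up for this example, and the standard deviation is the population version (dividing by $m$), which is one common convention. The statistics are computed once on the training set and then reused for the test set.

```python
import math


def fit_zscore(X):
    """Compute the per-feature mean and standard deviation of the rows in X."""
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    # Population standard deviation; a constant feature would give sigma = 0
    # and must be handled separately before dividing.
    sigma = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in X) / m)
             for j in range(n)]
    return mu, sigma


def apply_zscore(X, mu, sigma):
    """Rescale every row with the given statistics."""
    return [[(row[j] - mu[j]) / sigma[j] for j in range(len(row))] for row in X]


# Toy data (made up for illustration): two features on very different scales.
X_train = [[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]]
X_test = [[2000.0, 3.0]]

mu, sigma = fit_zscore(X_train)               # statistics from training data only
X_train_scaled = apply_zscore(X_train, mu, sigma)
X_test_scaled = apply_zscore(X_test, mu, sigma)  # reuse the training statistics
print(X_train_scaled)
print(X_test_scaled)
```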

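There is another normalization approach, called min-max normalization, which maps each feature to the range $[0, 1]$:

$$x_j^{(i)} = \frac{x^{(i)}_j-\min_j}{\max_j-\min_j}$$

where $\min_j$ and $\max_j$ are the smallest and largest values of feature $j$.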
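For a single feature column, min-max normalization could be sketched like this. The function name `min_max_scale` and the sample values are assumptions for the example.

```python
def min_max_scale(column):
    """Map the values of one feature to the range [0, 1]."""
    lo, hi = min(column), max(column)
    # A constant feature (hi == lo) would divide by zero and needs special handling.
    return [(value - lo) / (hi - lo) for value in column]


scaled = min_max_scale([10.0, 20.0, 40.0])
print(scaled)  # the smallest value maps to 0.0, the largest to 1.0
```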

Feature Engineering

Feature engineering refers to using domain knowledge to design new features by transforming a feature or combining features.

We'll start with a simple quadratic: $y = 1 + x^2$.

A plain linear fit $y = wx + b$ does not match this data well. What we need is something like $y = wx^2 + b$, or a polynomial feature. To accomplish this, we can replace $x$ with $x^2$.

The result is a near-perfect fit! Although we know that an $x^2$ term was required here, it may not always be obvious which features are required. One could add a variety of potential features to try to find the most useful ones. For example, suppose we get the hypothesis:

$$f(x) = 0.08x + 0.54 x^2 + 0.03x^3 + 0.0106$$

Gradient descent picks the "correct" feature for us by emphasizing its associated parameter: the weight associated with the $x^2$ feature is much larger than the weights associated with the $x$ and $x^3$ features.

If the data set has features with significantly different scales, one should apply feature scaling to speed up gradient descent. Polynomial features such as $x$, $x^2$ and $x^3$ differ widely in range, so we apply feature scaling to them.
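To make the $y = 1 + x^2$ example concrete, here is a small sketch that fits a line first to the raw feature $x$ and then to the engineered feature $x^2$. The helper names, the four data points, and the settings `alpha=0.05` and `num_iters=2000` are assumptions chosen so that plain gradient descent converges on this tiny data set.

```python
def fit_line(feature, y, alpha=0.05, num_iters=2000):
    """Fit y ~ w * feature + b with batch gradient descent."""
    w, b = 0.0, 0.0
    m = len(y)
    for _ in range(num_iters):
        errors = [w * f + b - y_i for f, y_i in zip(feature, y)]
        dj_dw = sum(e * f for e, f in zip(errors, feature)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


def cost(feature, y, w, b):
    """Squared-error cost of the fitted line."""
    m = len(y)
    return sum((w * f + b - y_i) ** 2 for f, y_i in zip(feature, y)) / (2 * m)


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0 + x_i ** 2 for x_i in x]   # targets from y = 1 + x^2

# Raw feature: a straight line cannot follow the curve.
w_lin, b_lin = fit_line(x, y)
cost_linear = cost(x, y, w_lin, b_lin)

# Engineered feature: replace x with x^2, then fit the same linear model.
x_squared = [x_i ** 2 for x_i in x]
w_quad, b_quad = fit_line(x_squared, y)
cost_quadratic = cost(x_squared, y, w_quad, b_quad)

print(cost_linear, cost_quadratic)
print(w_quad, b_quad)  # both close to 1
```

The model is still linear in its parameters; only the input feature changed.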

Logistic Regression

Let's discuss a classic problem: binary classification.

It turns out that linear regression is not suitable for classification problems. Since the output variable $y$ is either 0 or 1, we would like the predictions of our classification model to lie between 0 and 1.

This can be accomplished by using a "sigmoid function" (or logistic function):

$$f(x^{(i)})=\frac{1}{1+e^{-(\vec w\cdot x^{(i)}+b)}}$$

When predicting, if $f(x^{(i)})\geq 0.5$, we predict the output is 1. Otherwise we predict 0.
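The model and the 0.5 threshold can be sketched as below. The function names and the example weights are assumptions for illustration; the two-branch form of `sigmoid` is a common way to avoid overflow in `math.exp` for large negative inputs and is mathematically the same function.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # z < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)


def predict_proba(x_i, w, b):
    """Model output f(x) = sigmoid(w . x + b) for one example."""
    z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
    return sigmoid(z)


def predict_label(x_i, w, b, threshold=0.5):
    """Predict 1 if f(x) >= threshold, otherwise 0."""
    return 1 if predict_proba(x_i, w, b) >= threshold else 0


# Hypothetical parameters: the decision boundary is x_1 + x_2 = 3.
w_example, b_example = [1.0, 1.0], -3.0
print(predict_label([1.0, 1.0], w_example, b_example))  # z = -1, predicts 0
print(predict_label([2.0, 2.0], w_example, b_example))  # z = 1, predicts 1
```

Since the sigmoid is at least 0.5 exactly when $z \geq 0$, thresholding the output at 0.5 is the same as checking the sign of $\vec w\cdot x+b$.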

Now let's dive into the loss function and the cost function. Note that loss is a measure of the difference of a single example to its target value while the cost is a measure of the losses over the training set.

If we apply the squared error cost function to logistic regression, the resulting cost function is non-convex, which makes it unsuitable for gradient descent. We need a new loss function:

$$loss(i)=\begin{cases}-\log(f(x^{(i)})) & y^{(i)}=1\\-\log(1-f(x^{(i)})) & y^{(i)}=0\end{cases}$$

The loss function above can be rewritten in a single expression that is easier to implement:

$$loss(i) = -y^{(i)}\log(f(x^{(i)}))-(1-y^{(i)})\log(1-f(x^{(i)}))$$

Averaging the loss over the training set gives the cost:

$$J(w,b)=-\frac{1}{m}\sum_{i=1}^m[y^{(i)}\log(f(x^{(i)}))+(1-y^{(i)})\log(1-f(x^{(i)}))]$$
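The single-expression form of the loss is what makes the cost easy to code. A minimal sketch, assuming nested lists for the data; the names `sigmoid` and `logistic_cost` and the toy data are made up for this example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_cost(X, y, w, b):
    """Average logistic loss over the training set."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        f = sigmoid(sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b)
        # Single-expression loss; math.log fails if f is exactly 0 or 1,
        # which this simple sketch does not guard against.
        total += -y_i * math.log(f) - (1 - y_i) * math.log(1 - f)
    return total / m


# Toy data (made up for illustration).
X_train = [[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]]
y_train = [0, 0, 1, 1]

# With all parameters at zero every prediction is 0.5, so the cost is log(2).
print(logistic_cost(X_train, y_train, w=[0.0, 0.0], b=0.0))
```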

Recall the gradient descent algorithm utilizes the gradient calculation:

$$w_j = w_j - \alpha\frac{\partial J(w,b)}{\partial w_j}$$

$$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$

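Let's work out $\frac{\partial J(w,b)}{\partial w_j}$ and see what we get:

$$\begin{align}\frac{\partial J(w,b)}{\partial w_j}&=-\frac{1}{m}\sum_{i=1}^m\left(\frac{y^{(i)}}{f(x^{(i)})}-\frac{1-y^{(i)}}{1-f(x^{(i)})}\right)\cdot \frac{\partial f(x^{(i)})}{\partial w_j}\\&=-\frac{1}{m}\sum_{i=1}^m\frac{y^{(i)}-f(x^{(i)})}{f(x^{(i)})(1-f(x^{(i)}))}\cdot \frac{\partial f(x^{(i)})}{\partial w_j}\end{align}$$

There is a quite interesting result. For $g(z) = \frac{1}{1+e^{-z}}$, we have $g'(z)=\frac{e^{-z}}{(1+e^{-z})^2} = g(z)(1-g(z))$. Since $f(x^{(i)}) = g(\vec w\cdot x^{(i)}+b)$, the chain rule gives $\frac{\partial f(x^{(i)})}{\partial w_j} = f(x^{(i)})(1-f(x^{(i)}))\,x_j^{(i)}$, so we can rewrite the expression in the following form: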

$$\begin{align}\frac{\partial J(w,b)}{\partial w_j}&=-\frac{1}{m}\sum_{i=1}^m \frac{y^{(i)}-f(x^{(i)})}{f(x^{(i)})(1-f(x^{(i)}))}\cdot f(x^{(i)})(1-f(x^{(i)}))\cdot x_j^{(i)}\\&=\frac{1}{m}\sum_{i=1}^m (f(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\end{align}$$
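The identity $g'(z) = g(z)(1-g(z))$ that this derivation relies on is easy to sanity-check numerically. The sketch below compares it with a central finite difference at a few arbitrarily chosen points; the helper name `numerical_derivative` and the step size `h` are assumptions for the example.

```python
import math


def g(z):
    """Sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))


def numerical_derivative(func, z, h=1e-5):
    """Central finite-difference approximation of func'(z)."""
    return (func(z + h) - func(z - h)) / (2 * h)


for z in (-2.0, 0.0, 0.7, 3.0):
    analytic = g(z) * (1 - g(z))          # claimed closed form of g'(z)
    numeric = numerical_derivative(g, z)  # approximation from the definition
    print(z, analytic, numeric)
```

The two columns should agree to many decimal places, with the largest slope at $z = 0$.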

Ultimately, the gradient descent update looks like this:

$$w_j = w_j-\alpha\frac{1}{m}\sum_{i=1}^m (f(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$$

$$b = b - \alpha\frac{1}{m}\sum_{i=1}^m (f(x^{(i)})-y^{(i)})$$

That's unbelievable! The update rule has exactly the same form as the one in linear regression. The only differences are the definition of $f$ (now the sigmoid of $\vec w\cdot x+b$) and the definition of the loss function. But it's still amazing!
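A compact sketch of logistic regression trained with these updates is shown below. The function names, the one-feature toy data, and the settings `alpha=0.1` and `num_iters=1000` are assumptions for illustration. Apart from the call to `sigmoid`, the gradient code is the same as it would be for linear regression.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_gradient(X, y, w, b):
    """Return ([dJ/dw_j], dJ/db) for the logistic cost."""
    m, n = len(X), len(w)
    dj_dw = [0.0] * n
    dj_db = 0.0
    for x_i, y_i in zip(X, y):
        f = sigmoid(sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b)
        error = f - y_i                  # same (f(x) - y) term as before
        for j in range(n):
            dj_dw[j] += error * x_i[j]
        dj_db += error
    return [g_j / m for g_j in dj_dw], dj_db / m


def fit_logistic(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent starting from all-zero parameters."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = logistic_gradient(X, y, w, b)
        w = [w_j - alpha * g_j for w_j, g_j in zip(w, dj_dw)]
        b = b - alpha * dj_db
    return w, b


# Toy data (made up for illustration): negative inputs are class 0, positive are class 1.
X_train = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 0, 1, 1, 1]

w_fit, b_fit = fit_logistic(X_train, y_train)
predictions = [1 if sigmoid(w_fit[0] * x_i[0] + b_fit) >= 0.5 else 0
               for x_i in X_train]
print(w_fit, b_fit)
print(predictions)
```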

Regularization

How can we address overfitting? Here is some useful advice:

  • Collect more data.
  • Select features (more specifically, choose a subset of them).
  • Reduce the size of the parameters: regularization.

Regularization discourages large weights for any particular feature. Since a simpler model is less likely to overfit, let's penalize all of the weights a bit and shrink them by modifying the cost function:

$$J(\vec w, b)= \frac{1}{2m}\sum_{i=1}^m(f_{\vec w,b}(x^{(i)})-y^{(i)})^2+\frac{\lambda}{2m}\sum_{j=1}^n w_j^2$$
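In code, the regularized cost is the usual squared-error term plus a penalty on the weights. This is a minimal sketch with made-up names and data; note that only the $w_j$ are penalized, not $b$.

```python
def regularized_cost(X, y, w, b, lambda_):
    """Squared-error cost plus (lambda / 2m) * sum of squared weights."""
    m = len(X)
    fit_term = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
        fit_term += (prediction - y_i) ** 2
    fit_term /= 2 * m
    penalty = lambda_ / (2 * m) * sum(w_j ** 2 for w_j in w)  # b is not included
    return fit_term + penalty


# Toy data (made up for illustration): w = [1], b = 0 fits it exactly.
X_train = [[1.0], [2.0]]
y_train = [1.0, 2.0]

print(regularized_cost(X_train, y_train, w=[1.0], b=0.0, lambda_=0.0))  # no penalty
print(regularized_cost(X_train, y_train, w=[1.0], b=0.0, lambda_=2.0))  # penalty only
```

With a perfect fit the first term is zero, so the second call returns just the penalty, 2 / (2 · 2) · 1² = 0.5.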

When implementing gradient descent, we just need to change the update for $w_j$ a bit (the update for $b$ is unchanged, since $b$ is not regularized):

$$w_j = w_j - \alpha\frac{\lambda}{m} w_j- \alpha\frac{1}{m}\sum_{i=1}^m(f(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$$

From a mathematical perspective, the effect of the extra term is that on every iteration of gradient descent we multiply $w_j$ by $(1-\alpha\frac{\lambda}{m})$, a number slightly less than 1.
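The shrinking effect is easiest to see when the update is written as "shrink, then take the usual gradient step". The sketch below does one such step for regularized linear regression; the function name, the data and the values of `alpha` and `lambda_` are assumptions for this example.

```python
def regularized_step(X, y, w, b, alpha, lambda_):
    """One gradient descent step for regularized linear regression."""
    m, n = len(X), len(w)
    errors = [sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b - y_i
              for x_i, y_i in zip(X, y)]
    shrink = 1 - alpha * lambda_ / m       # factor slightly less than 1
    new_w = []
    for j in range(n):
        data_grad = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        new_w.append(w[j] * shrink - alpha * data_grad)
    new_b = b - alpha * sum(errors) / m    # b is not regularized
    return new_w, new_b


# Toy data (made up for illustration): w = [1], b = 0 already fits it exactly,
# so the data gradient is zero and only the shrinking is visible.
X_train = [[1.0], [2.0]]
y_train = [1.0, 2.0]

w_new, b_new = regularized_step(X_train, y_train, w=[1.0], b=0.0,
                                alpha=0.1, lambda_=1.0)
print(w_new, b_new)  # w shrinks by the factor 1 - 0.1 * 1 / 2 = 0.95
```

Setting `lambda_=0.0` recovers the unregularized update.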

The above explains how regularization works in linear regression. It works in much the same way in logistic regression. Regularization promotes model simplicity and stability in order to avoid overfitting.

From: https://www.cnblogs.com/revue-starlight/p/17933462.html
