
Machine Learning 【note_01】


Declaration (2023/06/02):

This is the first note in a series of machine learning notes.

At present, the main learning resource is Andrew Ng's 2022 machine learning course from DeepLearning.ai, from which most of the knowledge and some of the pictures in these notes come.

For convenience, the video resource comes from 【啥都会一点的研究生】 on Bilibili, and the course materials are pulled from GitHub.

I would recommend watching the original videos.


Linear Regression


1 Basic knowledge

1.1 machine learning

  • supervised learning

    • used most in real-world applications

    • most rapid advancement and innovation

  • unsupervised learning

  • reinforcement learning

  • practical advice for applying learning algorithms

1.2 supervised learning

  • Input x and y pairs, get mapping relationship, calculate y for new x.
  • Learns from being given "right answers".
  • two kinds:
    1. regression:
      • predict a number
      • infinitely many possible outputs
    2. classification:
      • predict categories
      • small number of possible outputs

1.3 unsupervised learning

  • Find something interesting in unlabeled data.

  • The input is only x, without labels y; the algorithm finds structure in the data on its own.

  • three kinds:

    1. Clustering: group similar data points together.
      • Examples: Google news, DNA micro-arrays, grouping customers
    2. Anomaly detection: find unusual data points.
    3. Dimensionality reduction: compress data using fewer numbers while losing as little information as possible.

2 Linear regression

2.1 Definition

Linear regression is a supervised learning model: the training data comes with "right answers".

Regression model predicts numbers.

2.2 An example

image-20230531123012578

2.3 Terminology

  • Training set: Data used to train the model.
    • \(x\): "input" variable/feature
    • \(y\): "output/target" variable/feature
    • \(m\): number of training examples
    • \((x,y)\): single training example
    • \((x^{(i)},y^{(i)})\): \(i^{th}\) training example
  • Model
    • \(f\): model/function, takes a new input x and outputs the prediction for y
    • \(\hat{y}\): estimate/prediction for y
    • \(w\) and \(b\): parameters of the model, coefficients / weights and intercept
graph TD
  A(training set) --> B(learning algorithm) --> C(function)
  D(x) --> C --> E(y-hat)

image-20230531125937232

Linear regression with one variable is also called univariate linear regression.
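
As a minimal sketch (the function name and the example numbers below are my own, not from the course materials), the univariate model can be written directly in NumPy:

import numpy as np

def predict(x, w, b):
    """Univariate linear regression model: f_wb(x) = w * x + b."""
    return w * x + b

x_train = np.array([1.0, 2.0])   # hypothetical input features
w, b = 200.0, 100.0              # hypothetical parameter values
y_hat = predict(x_train, w, b)   # predictions (y-hat) for the training inputs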


3 Cost Function

3.1 Definition

Cost function: Squared error cost function

image-20230531233603163

By convention, the cost function \(J(w,b)\) includes a factor of \(1/2\), which makes the derivative cleaner:

\[J(w,b)=\frac{1}{2m}\sum_{i=1}^m{({\hat{y}}^{(i)}-{y}^{(i)})^2} \]

The purpose of defining the cost function: minimize \(J(w,b)\).
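
A minimal sketch of this cost function in NumPy (the function name and arguments are illustrative, not taken verbatim from the course labs):

import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) for univariate linear regression."""
    m = x.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb = w * x[i] + b            # prediction for example i
        cost += (f_wb - y[i]) ** 2     # squared error
    return cost / (2 * m)              # the 1/(2m) factor from the formula above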

3.2 Visualization

To build intuition for the cost function, the following visualizations help:

3.2.1 J(w)

Set \(b\) to zero, and see how \(J(w)\) is plotted.

image-20230601000159186

3.2.2 J(w,b)

See how \(J(w,b)\) is visualized.

image-20230601000242286 image-20230601001338536 image-20230601002325581

4 Gradient Descent

4.1 Conception

goal: minimize \(J(w_1,w_2,...,w_n,b)\)

outline:

  • start with some \(w\) and \(b\)
  • keep changing \(w\), \(b\) to reduce \(J(w_1,w_2,...,w_n,b)\)
  • until settle at or near a minimum

There may be more than one possible minimum.

gradient descent:

image-20230601003444122

local minima:

image-20230601003620237

4.2 Implementation

gradient descent algorithm:

repeat until convergence {

\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]

}

Here, the \(\alpha\) is called the learning rate. The \(\frac{\partial}{\partial{w}}J(w,b)\) is a partial derivative term.

Simultaneously update \(w\) and \(b\).
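
A sketch of one update step, assuming the partial derivatives dj_dw and dj_db have already been computed at the current \(w\), \(b\) (names are illustrative):

def gradient_descent_step(w, b, dj_dw, dj_db, alpha):
    """One gradient descent iteration with a simultaneous update."""
    tmp_w = w - alpha * dj_dw   # both temporaries use the old w and b
    tmp_b = b - alpha * dj_db
    return tmp_w, tmp_b         # overwrite w and b only after both are computed

Computing both temporary values before overwriting \(w\) and \(b\) is what makes the update simultaneous.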

image-20230601004914913

4.3 Understanding

When the gradient descent algorithm runs, \(w\) is updated step by step toward the minimum, as shown below:

image-20230601010122298

If \(\alpha\) is too small, gradient descent may be slow.

If \(\alpha\) is too large, gradient descent may:

  • overshoot and never reach the minimum
  • fail to converge, or even diverge
image-20230601010815367

Can reach local minimum with fixed learning rate:

image-20230601011323627

When \(w\) has already reached a local minimum, the derivative is zero, so gradient descent leaves \(w\) unchanged:

image-20230601011301848

4.4 For linear regression

Linear regression model: \(f_{w,b}(x)=wx+b\)

Cost function: \(J(w,b)=\frac{1}{2m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})^2}\)

gradient descent algorithm:

repeat until convergence {

\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]

}

Here,

\[\frac{\partial}{\partial{w}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})x^{(i)}} \\ \frac{\partial}{\partial{b}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})} \]
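
These derivatives translate directly into code; a sketch for the univariate case (names are illustrative):

import numpy as np

def compute_gradient(x, y, w, b):
    """Partial derivatives of J(w,b) for univariate linear regression."""
    m = x.shape[0]
    dj_dw, dj_db = 0.0, 0.0
    for i in range(m):
        err = (w * x[i] + b) - y[i]   # f_wb(x_i) - y_i
        dj_dw += err * x[i]
        dj_db += err
    return dj_dw / m, dj_db / m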

image-20230601124420645

Linear regression's squared error cost function is convex (bowl-shaped), so it always has a single global minimum:

image-20230601124605521

4.5 Running

See how gradient descent runs in an example:

image-20230601125031569

"Batch" gradient descent

  • "Batch": each step of gradient descent uses all the training examples (instead of just a subset of the training data).

5 Multiple Linear Regression

5.1 Conception

image-20230601132234905

multiple features (variables)

  • \(x_j\): \(j^{th}\) feature
  • \(n\): number of features
  • \({\vec{x}}^{(i)}\): features of \(i^{th}\) training example
  • \({x_j}^{(i)}\): value of feature \(j\) in \(i^{th}\) training example
image-20230601132509260

Model: \(f_{w,b}(x)=w_1x_1+w_2x_2+...+w_nx_n+b\)

  • \(\vec{w}=[w_1,w_2,...,w_n]\)
  • \(b\)
  • \(\vec{x}=[x_1,x_2,...,x_n]\)

So it can be written as:

\[f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b \]
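
In NumPy this dot-product form is a single line; a sketch with made-up numbers:

import numpy as np

w = np.array([1.0, 2.5, -3.3])    # hypothetical parameter vector (n = 3)
b = 4.0
x = np.array([10.0, 20.0, 30.0])  # one training example with 3 features

f_wb = np.dot(w, x) + b           # f_{w,b}(x) = w . x + b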

5.2 Vectorization

In Python, we use NumPy to vectorize the computation over parameters and features:

  • Runs faster
  • Shorter code
image-20230601133700029

Compare running with and without vectorization:

  • Vectorized code uses parallel hardware, which is more efficient for large-scale datasets.
image-20230601134205901

Same in gradient descent:

image-20230601134546341

Try in Python code:

import time
import numpy as np

def my_dot(a, b):
    """Dot product computed with an explicit Python loop, for comparison."""
    x = 0.0
    for i in range(a.shape[0]):
        x = x + a[i] * b[i]
    return x

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)   # vectorized dot product
toc = time.time()  # capture end time

print(f"np.dot(a, b) =  {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a, b)   # loop version
toc = time.time()  # capture end time

print(f"my_dot(a, b) =  {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del(a); del(b)  # remove these big arrays from memory

Sample output:

np.dot(a, b) =  2501072.5817
Vectorized version duration: 10.9415 ms 
my_dot(a, b) =  2501072.5817
loop version duration: 2575.1283 ms 

5.3 Gradient Descent

Model: \(f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b\)

Cost function: \(J(\vec{w},b)\)

image-20230601141551927

To calculate partial derivative term:

image-20230601141838070
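
A vectorized sketch of gradient descent for multiple features (the function names and the matrix layout, X of shape (m, n), are my own assumptions):

import numpy as np

def compute_cost(X, y, w, b):
    """Squared error cost J(w,b) with the m examples stored as rows of X."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m)

def compute_gradient(X, y, w, b):
    """Gradients of J(w,b) for multiple linear regression."""
    m = X.shape[0]
    err = X @ w + b - y        # prediction error for every example, shape (m,)
    dj_dw = X.T @ err / m      # one partial derivative per feature, shape (n,)
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw  # simultaneous, vectorized update
        b = b - alpha * dj_db
    return w, b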

There's an alternative to gradient descent, which is called Normal Equation:

  • Only for linear regression.

  • Solve for \(w\), \(b\) without iterations.

  • Disadvantages:

    • Does not generalize to other learning algorithms (like Logistic Regression or other subsequent machine learning methods).
    • Slow when number of features is large (>10,000).
  • It may be used in ML libraries that implement linear regression; however, gradient descent is the recommended method for finding parameters \(w\), \(b\).
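
For completeness, a sketch of the normal equation under the standard least-squares formulation (the helper name is mine; this is not code from the course):

import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution for w and b (no iterations)."""
    m = X.shape[0]
    A = np.hstack([X, np.ones((m, 1))])        # append a column of 1s for the intercept b
    theta = np.linalg.pinv(A.T @ A) @ A.T @ y  # theta = (A^T A)^+ A^T y
    return theta[:-1], theta[-1]               # (w, b)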

5.4 Important things

5.4.1 Feature Scaling

5.4.1.1 Why

An example:

image-20230601143413294

We can see that the scales of different variables vary greatly, which may cause problems in gradient descent:

image-20230601143432689

So we need to rescale the features:

image-20230601143637077
5.4.1.2 How

Simply divide by the maximum value:

image-20230601144009784

Mean normalization:

image-20230601144118845

Z-score normalization:

image-20230601144239742

Rescale a feature when its values are too large or too small:

image-20230601144456683
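
The three rescaling methods above, sketched in NumPy (function names are illustrative; each operates column-wise on an (m, n) feature matrix X):

import numpy as np

def scale_by_max(X):
    """Divide each feature by its maximum value."""
    return X / np.max(X, axis=0)

def mean_normalize(X):
    """(x - mean) / (max - min): roughly centers each feature around 0."""
    mu = np.mean(X, axis=0)
    return (X - mu) / (np.max(X, axis=0) - np.min(X, axis=0))

def zscore_normalize(X):
    """(x - mean) / standard deviation of each feature."""
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    return (X - mu) / sigma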

5.4.2 Convergence check

Check by the curve:

  • The cost should decrease after every iteration.
    • If the cost curve rises, it means the learning rate \(\alpha\) chosen is too big.
  • In the example below, the cost has likely converged by 400 iterations.
    • It is difficult to know in advance how many iterations are required.
image-20230601145155098

Or, we can set up an automatic convergence test (sketched in the code after this list).

  • Set the "epsilon" \(\epsilon\) to be \(10^{-3}\).
  • If cost decreases by \(\leq\epsilon\) in one iteration, declare convergence.
  • It is hard to choose a proper "epsilon", so looking at the learning curve (the first strategy) is usually better than relying on the automatic test.
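
A sketch of that test, reusing the hypothetical compute_cost and compute_gradient helpers from the section 5.3 sketch (all names are illustrative):

def run_until_converged(X, y, w, b, alpha, num_iters, epsilon=1e-3):
    """Gradient descent that stops once the cost decreases by <= epsilon."""
    J_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        J_history.append(compute_cost(X, y, w, b))
        if len(J_history) > 1 and J_history[-2] - J_history[-1] <= epsilon:
            break                     # cost barely changed: declare convergence
    return w, b, J_history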

5.4.3 Learning Rate Choose

Make sure that the cost decreases after every iteration.

image-20230601150210340

There's a strategy:

  • First choose a very small \(\alpha\) and check whether the cost decreases after every iteration.
  • Gradually increase \(\alpha\) and watch how the cost curve changes (a sketch of this search follows the figure below).
image-20230601150653130
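
A sketch of that search, reusing the hypothetical run_until_converged helper from section 5.4.2 (X, y are assumed to be the prepared training data; the candidate values are just an example grid):

import numpy as np

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    w0 = np.zeros(X.shape[1])                       # restart from the same point each time
    w, b, J_history = run_until_converged(X, y, w0, 0.0, alpha, num_iters=100)
    print(alpha, J_history[:5])                     # a rising cost means alpha is too big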

5.4.4 Feature Engineering

Use intuition to design new features by transforming or combining the original features.

image-20230601153559194
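
A small sketch of the idea with made-up numbers (the frontage/depth/area names and values are hypothetical, chosen only to illustrate combining features):

import numpy as np

frontage = np.array([30.0, 40.0, 25.0])        # made-up raw feature x1
depth    = np.array([60.0, 50.0, 80.0])        # made-up raw feature x2
area     = frontage * depth                    # engineered feature x3 = x1 * x2

X = np.column_stack([frontage, depth, area])   # feed all three features to the model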

6 Polynomial Regression

image-20230601153954882

Here, feature scaling is quite important.

How to choose different features and different models?

image-20230601154125065
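
A minimal sketch of building polynomial features and rescaling them (made-up data; z-score scaling as in section 5.4.1):

import numpy as np

x = np.arange(0, 20, 1, dtype=float)            # a single raw feature (made-up values)
y = 1 + x ** 2                                  # hypothetical target

X = np.column_stack([x, x ** 2, x ** 3])        # engineered features: x, x^2, x^3
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # x^3 spans a much larger range than x,
                                                # so feature scaling matters here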

From: https://www.cnblogs.com/QingMu-0722/p/17450982.html

    SpringStateMachine是一个状态机框架,在Spring框架项目中,开发者可以通过简单的配置就能获得一个业务状态机,而不需要自己去管理状态机的定义、初始化等过程。今天这篇文章,我们通过一个案例学习下SpringStateMachine框架的用法。 pom依赖中添加<parent><groupId>org.......