Declaration (2023/06/02):
This is the first note in a series of machine learning notes.
At present, the main learning resource is Andrew Ng's 2022 DeepLearning.AI Machine Learning Specialization, from which most of the content and some pictures in these notes come.
For convenience, the videos come from 【啥都会一点的研究生】 on Bilibili, and the course materials are pulled from GitHub.
I would still recommend watching the original videos.
- Course address: https://www.coursera.org/specializations/machine-learning-introduction
- Video address: https://www.bilibili.com/video/BV1Pa411X76s/?p=1
- GitHub resource: https://github.com/kaieye/2022-Machine-Learning-Specialization
Machine Learning 【note_01】
Linear Regression
1 Basic knowledge
1.1 machine learning
- supervised learning
  - used most in real-world applications
  - most rapid advancement and innovation
- unsupervised learning
- reinforcement learning
- practical advice for applying learning algorithms
1.2 supervised learning
- Learn a mapping from input \(x\) to output \(y\) from example \((x, y)\) pairs, then predict \(y\) for a new \(x\).
- Learns from being given "right answers".
- two kinds:
- regression:
- predict a number
- infinitely many possible outputs
- classification:
- predict categories
- small number of possible outputs
1.3 unsupervised learning
- Find something interesting in unlabeled data.
- The data comes with inputs \(x\) only, without output labels \(y\); the algorithm has to find structure in the data on its own.
- three kinds:
  - Clustering: group similar data points together.
    - Examples: Google News, DNA micro-arrays, grouping customers
  - Anomaly detection: find unusual data points.
  - Dimensionality reduction: compress data using fewer numbers while losing as little information as possible.
2 Linear regression
2.1 Definition
Linear regression is a supervised learning model: the training data comes with "right answers".
Regression model predicts numbers.
2.2 An example
2.3 Terminology
- Training set: Data used to train the model.
- \(x\): "input" variable/feature
- \(y\): "output/target" variable/feature
- \(m\): number of training examples
- \((x,y)\): single training example
- \((x^{(i)},y^{(i)})\): \(i^{th}\) training example
- Model
- \(f\): model/function, takes a new input \(x\) and outputs \(\hat{y}\), the prediction for \(y\)
- \(\hat{y}\): estimate/prediction for y
- \(w\) and \(b\): parameters of the model, coefficients / weights and intercept
Linear regression with one variable, \(f_{w,b}(x)=wx+b\), is also called univariate linear regression.
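As a quick illustration, the univariate model can be written as a small Python function; this is a minimal sketch with hypothetical names and numbers:

import numpy as np

def predict(x, w, b):
    # Univariate linear regression model: f_wb(x) = w * x + b
    return w * x + b

# Hypothetical example: house size (1000 sqft) -> price (1000s of dollars)
x_new = np.array([1.0, 2.0])
print(predict(x_new, w=200.0, b=100.0))  # [300. 500.]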
3 Cost Function
3.1 Definition
Cost function: Squared error cost function
By convention, the cost function \(J(w,b)\) includes an extra factor of \(1/2\), which makes the derivative a bit cleaner:
\[J(w,b)=\frac{1}{2m}\sum_{i=1}^m{({\hat{y}}^{(i)}-{y}^{(i)})^2} \]
The purpose of defining the cost function: minimize \(J(w,b)\).
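A minimal sketch of this cost function in NumPy (the function and variable names are my own):

import numpy as np

def compute_cost(x, y, w, b):
    # J(w, b) = (1 / 2m) * sum((f_wb(x_i) - y_i)^2)
    m = x.shape[0]
    f_wb = w * x + b                        # predictions for all m examples
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Hypothetical training set: the cost is 0 when the line fits the data exactly
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0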
3.2 Visualization
To make the cost function easier to understand, it can be visualized as follows:
3.2.1 J(w)
Set \(b\) to zero, and see how \(J(w)\) is plotted.
3.2.2 J(w,b)
See how \(J(w,b)\) is visualized.
4 Gradient Descent
4.1 Conception
goal: minimize \(J(w_1,w_2,...,w_n,b)\)
outline:
- start with some \(w\) and \(b\)
- keep changing \(w\), \(b\) to reduce \(J(w_1,w_2,...,w_n,b)\)
- until settle at or near a minimum
There may be more than one possible minimum.
gradient descent:
local minima:
4.2 Implementation
gradient descent algorithm
repeat until convergence {
\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]
}
Here, the \(\alpha\) is called the learning rate. The \(\frac{\partial}{\partial{w}}J(w,b)\) is a partial derivative term.
Simultaneously update \(w\) and \(b\).
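A sketch of one iteration with a simultaneous update: both new values are computed from the old \(w\) and \(b\) before either is assigned. The derivative values below are made-up placeholders.

def gradient_step(w, b, dJ_dw, dJ_db, alpha):
    # Compute both updates from the *old* w and b ...
    tmp_w = w - alpha * dJ_dw
    tmp_b = b - alpha * dJ_db
    # ... then assign them together (simultaneous update)
    return tmp_w, tmp_b

w, b = gradient_step(w=0.0, b=0.0, dJ_dw=-650.0, dJ_db=-400.0, alpha=0.01)
print(w, b)  # 6.5 4.0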
4.3 Understanding
When the gradient descent algorithm runs, \(w\) moves step by step toward the minimum, as shown below:
If \(\alpha\) is too small, gradient descent may be slow.
If \(\alpha\) is too large, gradient descent may:
- overshoot, never reach minimum
- fail to converge, diverge
Gradient descent can reach a local minimum with a fixed learning rate, because the derivative term shrinks as the minimum is approached, so the steps automatically get smaller:
When it has already reached a minimum, gradient descent leaves \(w\) unchanged:
4.4 For linear regression
Linear regression model: \(f_{w,b}(x)=wx+b\)
Cost function: \(J(w,b)=\frac{1}{2m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})^2}\)
gradient descent algorithm
repeat until convergence {
\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]
}
Here,
\[\frac{\partial}{\partial{w}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})x^{(i)}} \\ \frac{\partial}{\partial{b}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})} \]
The squared error cost function of linear regression is convex (bowl-shaped), so it always has exactly one global minimum:
4.5 Running
See how gradient descent running in an example:
"Batch" gradient descent
- "Batch": each step of gradient descent uses all the training examples (instead of just a subset of the training data).
5 Multiple Linear Regression
5.1 Conception
multiple features (variables)
- \(x_j\): \(j^{th}\) feature
- \(n\): number of features
- \({\vec{x}}^{(i)}\): features of \(i^{th}\) training example
- \({x_j}^{(i)}\): value of feature \(j\) in \(i^{th}\) training example
Model: \(f_{w,b}(x)=w_1x_1+w_2x_2+...+w_nx_n+b\)
- \(\vec{w}=[w_1,w_2,...,w_n]\)
- \(b\)
- \(\vec{x}=[x_1,x_2,...,x_n]\)
So, it can be written as:
\[f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b \]
5.2 Vectorization
In Python, we use NumPy to compute with vectorized parameters and features:
- Runs faster
- Shorter code
Compare running with and without vectorization:
- Parallel computation makes it more efficient for large-scale datasets.
Same in gradient descent:
Try in Python code:
import time
import numpy as np

def my_dot(a, b):
    # Dot product computed with an explicit Python loop (for comparison)
    x = 0.0
    for i in range(a.shape[0]):
        x = x + a[i] * b[i]
    return x

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)
toc = time.time()  # capture end time
print(f"np.dot(a, b) = {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a, b)
toc = time.time()  # capture end time
print(f"my_dot(a, b) = {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del(a); del(b)  # remove these big arrays from memory
Output:
np.dot(a, b) = 2501072.5817
Vectorized version duration: 10.9415 ms
my_dot(a, b) = 2501072.5817
loop version duration: 2575.1283 ms
5.3 Gradient Descent
Model: \(f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b\)
Cost function: \(J(\vec{w},b)\)
To calculate the partial derivative terms:
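Consistent with the univariate case above, for each \(j=1,\dots,n\):
\[\frac{\partial}{\partial{w_j}}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}({\vec{x}}^{(i)})-{y}^{(i)}){x_j}^{(i)}} \\ \frac{\partial}{\partial{b}}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}({\vec{x}}^{(i)})-{y}^{(i)})} \]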
There's an alternative to gradient descent, which is called Normal Equation:
- Only for linear regression.
- Solves for \(w\), \(b\) without iterations.
- Disadvantages:
  - Does not generalize to other learning algorithms (like logistic regression or other subsequent machine learning methods).
  - Slow when the number of features is large (> 10,000).
- It may be used in ML libraries that implement linear regression; however, gradient descent is the recommended method for finding the parameters \(w\), \(b\).
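As a rough sketch of what the normal equation computes (not necessarily how a library implements it), with a column of ones appended so the last parameter plays the role of \(b\); the data here is made up:

import numpy as np

# Hypothetical data: 3 examples, 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])

# Append a column of ones so the last parameter acts as the intercept b
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: theta = (Xb^T Xb)^(-1) Xb^T y, solved without an explicit inverse
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
w, b = theta[:-1], theta[-1]
print(w, b)  # roughly [1. 2.] 0.0 for this data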
5.4 Important things
5.4.1 Feature Scaling
5.4.1.1 Why
An example:
We can see that the scales of different variables vary greatly, which may cause problems in gradient descent:
So we need to rescale the features:
5.4.1.2 How
Simply divide by the maximum value:
Mean normalization:
Z-score normalization:
Rescale a feature when its values are too large or too small:
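A minimal sketch of z-score normalization per feature (the data and names are made up):

import numpy as np

def zscore_normalize(X):
    # Subtract the mean and divide by the standard deviation, column by column (per feature)
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical features with very different scales (e.g. size in sqft, number of bedrooms)
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [852.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)
print(X_norm)  # each column now has mean 0 and standard deviation 1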
5.4.2 Convergence check
Check with the learning curve (the plot of the cost \(J\) versus the number of iterations):
- The cost should decrease after every iteration.
- If the cost curve rises, it means the learning rate \(\alpha\) chosen is too big.
- The cost likely converged by 400 iterations.
- It is difficult to know in advance the number of iterations required.
Or, we can set the automatic convergence test.
- Set the "epsilon" \(\epsilon\) to be \(10^{-3}\).
- If cost decreases by \(\leq\epsilon\) in one iteration, declare convergence.
- It is hard to choose a proper \(\epsilon\), so looking at the learning curve is usually preferable to the automatic test (a small sketch of the test follows below).
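A sketch of the automatic convergence test applied to a recorded cost history (the numbers are made up):

def has_converged(cost_history, epsilon=1e-3):
    # Declare convergence when the cost decreases by at most epsilon in one iteration
    if len(cost_history) < 2:
        return False
    return (cost_history[-2] - cost_history[-1]) <= epsilon

# Hypothetical cost history recorded during gradient descent
costs = [1250.0, 310.0, 95.2, 94.9, 94.8995]
print(has_converged(costs))  # True: the last decrease is 0.0005 <= 1e-3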
5.4.3 Learning Rate Choice
We want to make sure that the cost decreases after every iteration.
There's a strategy:
- First choose a very small \(\alpha\) and check that the cost decreases after every iteration.
- Gradually increase \(\alpha\) and watch how the cost curve changes, as in the sketch below.
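A sketch of such a sweep on a tiny made-up dataset; the candidate values are only illustrative, and the largest one is expected to make the cost blow up:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

def run_gd(alpha, iters=100):
    # Run a fixed number of gradient descent iterations and return the final cost
    w, b = 0.0, 0.0
    for _ in range(iters):
        err = (w * x + b) - y
        w, b = w - alpha * np.dot(err, x) / x.size, b - alpha * np.sum(err) / x.size
    return np.sum(((w * x + b) - y) ** 2) / (2 * x.size)

# Candidate learning rates (illustrative values); a too-large alpha diverges
for alpha in [0.001, 0.01, 0.1, 1.0]:
    print(alpha, run_gd(alpha))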
5.4.4 Feature Engineering
Using intuition to design new features, by transforming or combining original features.
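For example (following the course's house-price illustration), a new feature can be built by combining two existing ones; the data and names here are my own:

import numpy as np

# Hypothetical original features: lot frontage and lot depth (per training example)
frontage = np.array([40.0, 30.0, 55.0])
depth = np.array([20.0, 35.0, 25.0])

# Engineered feature: lot area = frontage * depth, often more predictive of price
area = frontage * depth
X = np.column_stack([frontage, depth, area])
print(X)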
6 Polynomial Regression
Here, feature scaling is quite important.
How to choose different features and different models?
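As a sketch (the data is made up), polynomial features can be created from a single input and then z-score normalized, since \(x\), \(x^2\), and \(x^3\) take very different ranges:

import numpy as np

x = np.arange(1.0, 21.0)  # single original feature
# Engineered polynomial features: x, x^2, x^3
X = np.column_stack([x, x ** 2, x ** 3])

# Z-score normalize each column so gradient descent behaves well
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.ptp(X, axis=0))       # raw ranges differ by orders of magnitude
print(np.ptp(X_norm, axis=0))  # after scaling, the ranges are comparable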