
[CMU16-745] Lecture 3 Optimization Part 1

Posted: 2025-01-03

Pendulum System Diagram
Source: CMU 16-745 Study Notes, taught by Prof. Zac Manchester

Previous: Lecture 2 Dynamics Discretization and Stability

Content

Review

Discrete-time Dynamics/Simulations

  • Stability of Discrete-time Systems
  • Forward/Backward Euler Methods
  • RK4 Method
  • Zero/First Order Hold on Controls

Optimization Pt.1

Notation

  • Scalar function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$

$$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1} \cdots \frac{\partial f}{\partial x_n}\right] \in \mathbb{R}^{1 \times n}$$

is a row vector. This is because $\frac{\partial f}{\partial x}$ is the linear operator mapping $\Delta x$ into $\Delta f$:

$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$$

  • Vector function $g(y): \mathbb{R}^m \rightarrow \mathbb{R}^n$

$$\frac{\partial g}{\partial y} \in \mathbb{R}^{n \times m}$$

and similarly:

$$g(y + \Delta y) \approx g(y) + \frac{\partial g}{\partial y} \Delta y$$

  • Chain rule:

$$f(g(y + \Delta y)) \approx f(g(y)) + \left.\frac{\partial f}{\partial x}\right|_{g(y)} \left.\frac{\partial g}{\partial y}\right|_{y} \Delta y$$

For convenience, we define the gradient as the transpose of the row-vector derivative:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x} \right)^T \in \mathbb{R}^{n \times 1}$$

and the Hessian matrix as:

$$\nabla^2 f(x) = \frac{\partial}{\partial x} \left( \nabla f(x) \right) = \frac{\partial^2 f}{\partial x^2} \in \mathbb{R}^{n \times n}$$

Second-order approximation:

$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x + \frac{1}{2} \Delta x^T \frac{\partial^2 f}{\partial x^2} \Delta x$$

  • $\Delta x$ denotes a finite increment: the actual change in a variable from $x$ to $x + \Delta x$, often represented as a vector, as in the linear approximation above.
  • $\delta x$ denotes an infinitesimal variation: a vanishingly small change, used in differential operations and variational arguments, and often seen in second-order Taylor expansions.
In the second-order expansion above the two symbols refer to the same perturbation.
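As a quick numeric sanity check of the conventions above, the sketch below uses a hypothetical quadratic $f$ of my own choosing, with hand-derived gradient and Hessian, and compares the first- and second-order expansions against the true function value:

```python
import numpy as np

# Numeric check of the first- and second-order expansions. The quadratic
# test function and its hand-derived gradient/Hessian are my own choices.
def f(x):
    return x[0]**2 + 2*x[0]*x[1] + 3*x[1]**2

def grad_f(x):
    # ∇f(x) = (∂f/∂x)^T, here a length-2 vector of partials
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 6*x[1]])

H = np.array([[2.0, 2.0],
              [2.0, 6.0]])              # ∇²f, constant for a quadratic

x  = np.array([1.0, -0.5])
dx = np.array([1e-3, 2e-3])

first_order  = f(x) + grad_f(x) @ dx
second_order = first_order + 0.5 * dx @ H @ dx

print(abs(f(x + dx) - first_order))    # O(‖Δx‖²) error
print(abs(f(x + dx) - second_order))   # essentially zero: exact for a quadratic
```

For a quadratic function the second-order expansion is exact, so only rounding error remains.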

Root Finding

  • Given $f(x)$, find $x^*$ such that $f(x^*) = 0$.
    Example: equilibrium point of continuous-time dynamics.

  • Given $f(x)$, find $x^*$ such that $f(x^*) = x^*$ (fixed point).
    Example: equilibrium point of discrete-time dynamics.

Method 1: Fixed-Point Iteration

  • Simplest solution method
  • If the fixed point is stable, iterate the dynamics until convergence
  • Only works for systems with a single equilibrium and stable fixed points
  • Slow convergence
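Fixed-point iteration can be sketched in a few lines. The contraction map $x \mapsto \cos(x)$ below is my own illustrative choice; the lecture applies the same loop to discrete-time dynamics:

```python
import math

# Minimal fixed-point iteration: repeatedly apply the map until it stops
# moving. The map x ↦ cos(x) is an illustrative contraction, not the
# lecture's pendulum dynamics.
def fixed_point(g, x0, tol=1e-12, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

x_star = fixed_point(math.cos, 1.0)
print(x_star)  # ≈ 0.739085, where cos(x*) = x*
```

The error shrinks by a constant factor per step (linear convergence), which is why the method needs many iterations.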

Method 2: Newton’s Method

  1. Fit a linear approximation to $f(x)$:

$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$$

  2. Set the approximation to zero and solve for $\Delta x$:

$$f(x) + \frac{\partial f}{\partial x} \Delta x = 0 \Rightarrow \Delta x = - \left(\frac{\partial f}{\partial x}\right)^{-1} f(x)$$

  3. Apply the correction:

$$x \leftarrow x + \Delta x$$

  4. Repeat until convergence.

Iterating in this way drives the function being solved to zero. Applied to the gradient, the iteration converges to a critical point (possibly a saddle point), so the original function $f(x)$ may converge to a local minimum.
More fundamentally, Newton's method performs a second-order Taylor expansion of the function, approximating it locally by a quadratic. By repeatedly jumping to the extremum of that quadratic (not necessarily a minimum; it depends on whether the quadratic is convex or concave, as explained below), the iterates converge to a critical point. If the function being optimized is already close to quadratic, Newton's method converges very quickly.
P.S.: Strictly speaking, Newton's method is not an optimization technique but a root-finding method. It finds points where the local gradient is zero, and such points are not necessarily local minima.
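The four steps above can be sketched as follows. The test system (a circle intersected with a line) and the finite-difference Jacobian are my own illustrative choices; in practice the analytic Jacobian would be supplied:

```python
import numpy as np

# Sketch of Newton's method for root finding with a finite-difference
# Jacobian (illustrative; an analytic Jacobian is preferred in practice).
def newton_root(f, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        n = x.size
        J = np.zeros((n, n))
        h = 1e-7
        for j in range(n):               # finite-difference Jacobian ∂f/∂x
            e = np.zeros(n)
            e[j] = h
            J[:, j] = (f(x + e) - fx) / h
        x = x + np.linalg.solve(J, -fx)  # solve J Δx = -f(x)
    return x

# Example: intersect the circle x² + y² = 4 with the line y = x
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]])
x_star = newton_root(f, [1.0, 2.0])
print(x_star)  # ≈ [√2, √2]
```

Note the linear solve in place of an explicit matrix inverse; this is the standard way to apply $\left(\partial f/\partial x\right)^{-1}$.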

Example: Backward Euler¹


Both methods produce the damped motion, but Newton's method is much faster.

Plot the error during the iteration:

Very fast convergence with Newton’s method.
The convergence rate of fixed-point iteration is linear, while Newton’s method is quadratic.
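A minimal sketch of the backward-Euler-with-Newton idea from this example, assuming a frictionless pendulum with my own choice of parameters ($g = 9.81$, $l = 1$, $h = 0.05$), not the exact setup from the lecture code:

```python
import numpy as np

# One backward-Euler step solved with Newton's method, for a frictionless
# pendulum. Parameters are illustrative assumptions, not the lecture's.
g, l, h = 9.81, 1.0, 0.05

def dynamics(x):
    theta, omega = x
    return np.array([omega, -(g / l) * np.sin(theta)])

def dynamics_jac(x):
    theta, _ = x
    return np.array([[0.0,                      1.0],
                     [-(g / l) * np.cos(theta), 0.0]])

def backward_euler_step(x_n):
    # Implicit equation: r(x) = x - x_n - h f(x) = 0, solved for x = x_{n+1}
    x = x_n.copy()                       # warm-start Newton at current state
    for _ in range(20):
        r = x - x_n - h * dynamics(x)
        if np.linalg.norm(r) < 1e-12:
            break
        dr = np.eye(2) - h * dynamics_jac(x)   # ∂r/∂x
        x = x + np.linalg.solve(dr, -r)
    return x

x = np.array([np.pi / 2, 0.0])           # release from horizontal, at rest
for _ in range(100):
    x = backward_euler_step(x)

# Backward Euler is numerically dissipative: the swing decays even though
# the continuous pendulum conserves energy.
energy = 0.5 * x[1]**2 + (g / l) * (1.0 - np.cos(x[0]))
print(energy)
```

The decaying energy is exactly the artificial damping noted in Lecture 2: it comes from the integrator, not from physics.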

Takeaway Message

  • Quadratic convergence (Newton’s method)
  • Can get machine precision
  • Most expensive part: solving the linear system. Computing the Jacobian is not the dominant cost; factorizing/inverting an $n \times n$ Jacobian is $\mathcal{O}(n^3)$.
  • Can improve complexity by taking advantage of problem structure (e.g., sparse Jacobian in many cases).
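As a toy illustration of exploiting structure, a tridiagonal system can be solved in $\mathcal{O}(n)$ with the Thomas algorithm instead of $\mathcal{O}(n^3)$ dense factorization. The matrix below is an illustrative discrete Laplacian, not a Jacobian from the lecture:

```python
import numpy as np

# Thomas algorithm: O(n) solve for a tridiagonal system, versus O(n^3)
# for a dense factorization. Illustrative example only.
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d."""
    n = len(b)
    cp = np.zeros(n)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

n = 50
a = np.full(n, -1.0); a[0] = 0.0      # sub-diagonal (a[0] unused)
b = np.full(n, 2.0)                   # main diagonal
c = np.full(n, -1.0); c[-1] = 0.0     # super-diagonal (c[-1] unused)
d = np.ones(n)

x_fast = thomas_solve(a, b, c, d)

# Dense reference solve for comparison: O(n^3)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x_dense = np.linalg.solve(A, d)
print(np.max(np.abs(x_fast - x_dense)))  # agree to round-off
```

Sparse Jacobians in trajectory optimization have exactly this banded flavor, which is why structured solvers pay off so much.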

Minimization

Minimize $f(x)$, where $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$.

If $f$ is smooth, then at a local minimum:

$$\left.\frac{\partial f}{\partial x}\right|_{x^*} = 0$$

Thus, we can transform the minimization problem into a root-finding problem:

$$\nabla f(x) = 0$$

Apply Newton's method:

$$\nabla f(x + \Delta x) \approx \nabla f(x) + \frac{\partial}{\partial x} (\nabla f(x)) \Delta x = \nabla f(x) + \nabla^2 f(x) \Delta x = 0$$

$$\Delta x = - \left( \nabla^2 f(x) \right)^{-1} \nabla f(x)$$

Update $x \leftarrow x + \Delta x$, then repeat until convergence.

Intuition

  • Fit a quadratic approximation to $f(x)$
  • Exactly minimize the quadratic approximation
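The minimization loop can be sketched as below. The strongly convex test function (a quadratic plus a small quartic term) and its hand-derived gradient and Hessian are my own illustrative choices:

```python
import numpy as np

# Newton's method for minimization of an illustrative strongly convex
# function f(x) = x0² + x1² + x0·x1 + 0.1·x0⁴ (hand-derived derivatives).
def grad(x):
    return np.array([2*x[0] + x[1] + 0.4*x[0]**3,
                     2*x[1] + x[0]])

def hess(x):
    return np.array([[2 + 1.2*x[0]**2, 1.0],
                     [1.0,             2.0]])

x = np.array([2.0, -3.0])
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    # Δx = -(∇²f)^{-1} ∇f, computed with a linear solve, not an inverse
    x = x + np.linalg.solve(hess(x), -g)
print(x)  # converges to the unique minimum at (0, 0)
```

Because the Hessian here is positive definite everywhere, every step is a descent step; the next example shows what happens when that fails.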

Example²

Minimize $f(x) = x^4 + x^3 - x^2 - x$

  • Start at $x = 1.0$ ⇒ converges to the global minimum

  • Start at $x = -1.5$ ⇒ converges to a local minimum

  • Start at $x = 0.0$ ⇒ converges to a local maximum
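The three runs above can be reproduced by applying Newton's method to $f'(x) = 0$ (derivatives hand-computed from the given $f$):

```python
# Newton's method applied to f'(x) = 0 for f(x) = x^4 + x^3 - x^2 - x,
# reproducing the three starting points from the example above.
fp  = lambda x: 4*x**3 + 3*x**2 - 2*x - 1   # f'(x)
fpp = lambda x: 12*x**2 + 6*x - 2           # f''(x)

def newton(x, iters=100):
    for _ in range(iters):
        x = x - fp(x) / fpp(x)
    return x

print(newton(1.0))   # ≈ 0.6404: the global minimum
print(newton(-1.5))  # ≈ -1.0:   a local minimum
print(newton(0.0))   # ≈ -0.3904: a local maximum
```

The critical points are the roots of $f'(x) = (x+1)(4x^2 - x - 1)$, i.e. $x = -1$ and $x = (1 \pm \sqrt{17})/8$; which one Newton finds depends entirely on the initial guess.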

Takeaway Message

  • Newton's method is a local root-finding method: it converges to the critical point nearest the initial guess, which may be a minimum, a maximum, or a saddle point.

Sufficient Conditions

  • First-order necessary condition: $\nabla f(x) = 0$ (necessary but not sufficient for a minimum)

Scalar case:

$$\Delta x = - \left( \nabla^2 f(x) \right)^{-1} \nabla f(x)$$

The inverse of the Hessian $\nabla^2 f(x)$ acts like a step size (learning rate), so the Hessian should be positive definite for a descent step.

  1. $\nabla^2 f(x) > 0 \Rightarrow$ descent (minimization)
  2. $\nabla^2 f(x) < 0 \Rightarrow$ ascent (maximization)

Vector case ($x \in \mathbb{R}^n$):

$$\nabla^2 f(x) \succ 0 \quad \left(\text{i.e., } \nabla^2 f(x) \in \mathbb{S}^n_{++}\right) \Rightarrow \text{descent (minimization)}$$

If $\nabla^2 f(x) \succ 0$ everywhere, then $f(x)$ is strongly convex and Newton's method converges to the global minimum. However, this rarely holds for hard/nonlinear problems.

Regularization

Regularization is a practical fix that guarantees the update direction is always a descent direction: check whether the Hessian is positive definite and, if not, modify it.

$H \leftarrow \nabla^2 f(x)$
while $H \not\succ 0$:
$\quad H \leftarrow H + \beta I$ (with $\beta > 0$)
end
$\Delta x = - H^{-1} \nabla f(x)$
$x \leftarrow x + \Delta x$

This is also called damped Newton's method. Adding $\beta I$ guarantees that the update direction is a descent direction and shrinks the step, which is equivalent to adding a quadratic penalty on $\Delta x$.

Adding a fixed value to the diagonal of the Hessian makes the modified matrix positive definite, which guarantees descent. If $\beta$ is too large, the step becomes too small (the direction approaches the plain gradient-descent direction) and convergence slows; in that case a line search may be needed. In most cases, however, this method works well: it ensures a descent direction and also shrinks the step (Newton's method tends to take relatively large steps).
To test whether the Hessian is positive definite, avoid computing eigenvalues, which is expensive. Instead, attempt a Cholesky factorization: it succeeds if and only if the matrix is positive definite, so a factorization failure signals an indefinite Hessian. Other engineering fixes exist, such as an eigendecomposition with negative eigenvalues clipped to zero, but they are more time-consuming.
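A sketch of the regularization loop with the Cholesky-based positive-definiteness test; the indefinite example Hessian and $\beta = 1$ are my own illustrative choices:

```python
import numpy as np

# Regularization loop: bump the Hessian by βI until a Cholesky
# factorization succeeds (it fails iff the matrix is not positive definite).
def regularized_newton_step(grad, hess, beta=1.0):
    H = hess.copy()
    while True:
        try:
            L = np.linalg.cholesky(H)
            break
        except np.linalg.LinAlgError:   # not positive definite: regularize
            H = H + beta * np.eye(H.shape[0])
    # Solve H Δx = -∇f by reusing the Cholesky factors (H = L Lᵀ)
    return np.linalg.solve(L.T, np.linalg.solve(L, -grad))

# An indefinite Hessian: the raw Newton step would climb toward a saddle
g = np.array([1.0, -2.0])
H = np.array([[1.0,  0.0],
              [0.0, -3.0]])
dx = regularized_newton_step(g, H)
print(dx, g @ dx)  # g·Δx < 0, so Δx is a descent direction
```

Reusing the Cholesky factors for the solve means the positive-definiteness check costs nothing extra once the factorization has succeeded.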


Thanks for reading these study notes! They record what I have learned; if you spot any mistakes or inaccuracies, feel free to contact me so I can correct them, and we can make progress together!


  1. You can run the root-finding code directly in Colab by clicking here. ↩︎

  2. You can run the minimization code directly in Colab by clicking here. ↩︎

From: https://blog.csdn.net/weixin_61338898/article/details/144833680
