
[CMU16-745] Lecture 3 Optimization Part 1

Posted: 2025-01-03

Pendulum System Diagram
Source: CMU 16-745 Study Notes, taught by Prof. Zac Manchester

Previous: Lecture 2 Dynamics Discretization and Stability

Content

Review

Discrete-time Dynamics/Simulations

  • Stability of Discrete-time Systems
  • Forward/Backward Euler Methods
  • RK4 Method
  • Zero/First Order Hold on Controls

Optimization Pt.1

Notation

  • Scalar function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$

$$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1} \cdots \frac{\partial f}{\partial x_n}\right] \in \mathbb{R}^{1 \times n}$$

is a row vector. This is because $\frac{\partial f}{\partial x}$ is the linear operator mapping $\Delta x$ into $\Delta f$:

$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$$

  • Vector function $g(y): \mathbb{R}^m \rightarrow \mathbb{R}^n$

$$\frac{\partial g}{\partial y} \in \mathbb{R}^{n \times m}$$

and similarly:

$$g(y + \Delta y) \approx g(y) + \frac{\partial g}{\partial y} \Delta y$$

  • Chain rule:

$$f(g(y + \Delta y)) \approx f(g(y)) + \left.\frac{\partial f}{\partial x}\right|_{g(y)} \left.\frac{\partial g}{\partial y}\right|_{y} \Delta y$$

For convenience, we define the gradient as the transpose of the row-vector derivative:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x} \right)^T \in \mathbb{R}^{n \times 1}$$

and the Hessian matrix as:

$$\nabla^2 f(x) = \frac{\partial}{\partial x} \left( \nabla f(x) \right) = \frac{\partial^2 f}{\partial x^2} \in \mathbb{R}^{n \times n}$$

Second-order approximation:

$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x + \frac{1}{2} \Delta x^T \frac{\partial^2 f}{\partial x^2} \Delta x$$

  • $\Delta x$ denotes a finite increment: the actual change in a variable from $x$ to $x + \Delta x$, often represented as a vector, as in the linear approximation above.
  • $\delta x$ denotes an infinitesimal variation: a vanishingly small change, used in differential operations and variational arguments, and often seen in second-order Taylor expansions.
In the second-order expansion above the two symbols refer to the same perturbation.
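As a quick numeric sanity check of the conventions above, the sketch below uses a hypothetical quadratic $f$ of my own choosing, with hand-derived gradient and Hessian, and compares the first- and second-order expansions against the true function value:

```python
import numpy as np

# Numeric check of the first- and second-order expansions. The quadratic
# test function and its hand-derived gradient/Hessian are my own choices.
def f(x):
    return x[0]**2 + 2*x[0]*x[1] + 3*x[1]**2

def grad_f(x):
    # ∇f(x) = (∂f/∂x)^T, here a length-2 vector of partials
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 6*x[1]])

H = np.array([[2.0, 2.0],
              [2.0, 6.0]])              # ∇²f, constant for a quadratic

x  = np.array([1.0, -0.5])
dx = np.array([1e-3, 2e-3])

first_order  = f(x) + grad_f(x) @ dx
second_order = first_order + 0.5 * dx @ H @ dx

print(abs(f(x + dx) - first_order))    # O(‖Δx‖²) error
print(abs(f(x + dx) - second_order))   # essentially zero: exact for a quadratic
```

For a quadratic function the second-order expansion is exact, so only rounding error remains.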

Root Finding

  • Given $f(x)$, find $x^*$ such that $f(x^*) = 0$.
    Example: equilibrium point of continuous-time dynamics.

  • Given $f(x)$, find $x^*$ such that $f(x^*) = x^*$ (fixed point).
    Example: equilibrium point of discrete-time dynamics.

Method 1: Fixed-Point Iteration

  • Simplest solution method
  • If the fixed point is stable, iterate the dynamics until convergence
  • Only works for systems with a single equilibrium and stable fixed points
  • Slow convergence
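Fixed-point iteration can be sketched in a few lines. The contraction map $x \mapsto \cos(x)$ below is my own illustrative choice; the lecture applies the same loop to discrete-time dynamics:

```python
import math

# Minimal fixed-point iteration: repeatedly apply the map until it stops
# moving. The map x ↦ cos(x) is an illustrative contraction, not the
# lecture's pendulum dynamics.
def fixed_point(g, x0, tol=1e-12, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

x_star = fixed_point(math.cos, 1.0)
print(x_star)  # ≈ 0.739085, where cos(x*) = x*
```

The error shrinks by a constant factor per step (linear convergence), which is why the method needs many iterations.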

Method 2: Newton’s Method

  1. Fit a linear approximation to $f(x)$:

$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$$

  2. Set the approximation to zero and solve for $\Delta x$:

$$f(x) + \frac{\partial f}{\partial x} \Delta x = 0 \Rightarrow \Delta x = - \left(\frac{\partial f}{\partial x}\right)^{-1} f(x)$$

  3. Apply the correction:

$$x \leftarrow x + \Delta x$$

  4. Repeat until convergence.

Iterating in this way drives the function being solved to zero. Applied to the gradient, the iteration converges to a critical point (possibly a saddle point), so the original function $f(x)$ may converge to a local minimum.
More fundamentally, Newton's method performs a second-order Taylor expansion of the function, approximating it locally by a quadratic. By repeatedly jumping to the extremum of that quadratic (not necessarily a minimum; it depends on whether the quadratic is convex or concave, as explained below), the iterates converge to a critical point. If the function being optimized is already close to quadratic, Newton's method converges very quickly.
P.S.: Strictly speaking, Newton's method is not an optimization technique but a root-finding method. It finds points where the local gradient is zero, and such points are not necessarily local minima.
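The four steps above can be sketched as follows. The test system (a circle intersected with a line) and the finite-difference Jacobian are my own illustrative choices; in practice the analytic Jacobian would be supplied:

```python
import numpy as np

# Sketch of Newton's method for root finding with a finite-difference
# Jacobian (illustrative; an analytic Jacobian is preferred in practice).
def newton_root(f, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        n = x.size
        J = np.zeros((n, n))
        h = 1e-7
        for j in range(n):               # finite-difference Jacobian ∂f/∂x
            e = np.zeros(n)
            e[j] = h
            J[:, j] = (f(x + e) - fx) / h
        x = x + np.linalg.solve(J, -fx)  # solve J Δx = -f(x)
    return x

# Example: intersect the circle x² + y² = 4 with the line y = x
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]])
x_star = newton_root(f, [1.0, 2.0])
print(x_star)  # ≈ [√2, √2]
```

Note the linear solve in place of an explicit matrix inverse; this is the standard way to apply $\left(\partial f/\partial x\right)^{-1}$.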

Example: Backward Euler¹


Both methods produce the damped motion, but Newton's method is much faster.

Plot the error during the iteration:

Very fast convergence with Newton’s method.
The convergence rate of fixed-point iteration is linear, while Newton’s method is quadratic.
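A minimal sketch of the backward-Euler-with-Newton idea from this example, assuming a frictionless pendulum with my own choice of parameters ($g = 9.81$, $l = 1$, $h = 0.05$), not the exact setup from the lecture code:

```python
import numpy as np

# One backward-Euler step solved with Newton's method, for a frictionless
# pendulum. Parameters are illustrative assumptions, not the lecture's.
g, l, h = 9.81, 1.0, 0.05

def dynamics(x):
    theta, omega = x
    return np.array([omega, -(g / l) * np.sin(theta)])

def dynamics_jac(x):
    theta, _ = x
    return np.array([[0.0,                      1.0],
                     [-(g / l) * np.cos(theta), 0.0]])

def backward_euler_step(x_n):
    # Implicit equation: r(x) = x - x_n - h f(x) = 0, solved for x = x_{n+1}
    x = x_n.copy()                       # warm-start Newton at current state
    for _ in range(20):
        r = x - x_n - h * dynamics(x)
        if np.linalg.norm(r) < 1e-12:
            break
        dr = np.eye(2) - h * dynamics_jac(x)   # ∂r/∂x
        x = x + np.linalg.solve(dr, -r)
    return x

x = np.array([np.pi / 2, 0.0])           # release from horizontal, at rest
for _ in range(100):
    x = backward_euler_step(x)

# Backward Euler is numerically dissipative: the swing decays even though
# the continuous pendulum conserves energy.
energy = 0.5 * x[1]**2 + (g / l) * (1.0 - np.cos(x[0]))
print(energy)
```

The decaying energy is exactly the artificial damping noted in Lecture 2: it comes from the integrator, not from physics.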

Takeaway Message

  • Quadratic convergence (Newton’s method)
  • Can get machine precision
  • Most expensive part: solving the linear system. Computing the Jacobian is not the dominant cost; factorizing/inverting an $n \times n$ Jacobian is $\mathcal{O}(n^3)$.
  • Can improve complexity by taking advantage of problem structure (e.g., sparse Jacobian in many cases).
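As a toy illustration of exploiting structure, a tridiagonal system can be solved in $\mathcal{O}(n)$ with the Thomas algorithm instead of $\mathcal{O}(n^3)$ dense factorization. The matrix below is an illustrative discrete Laplacian, not a Jacobian from the lecture:

```python
import numpy as np

# Thomas algorithm: O(n) solve for a tridiagonal system, versus O(n^3)
# for a dense factorization. Illustrative example only.
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d."""
    n = len(b)
    cp = np.zeros(n)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

n = 50
a = np.full(n, -1.0); a[0] = 0.0      # sub-diagonal (a[0] unused)
b = np.full(n, 2.0)                   # main diagonal
c = np.full(n, -1.0); c[-1] = 0.0     # super-diagonal (c[-1] unused)
d = np.ones(n)

x_fast = thomas_solve(a, b, c, d)

# Dense reference solve for comparison: O(n^3)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x_dense = np.linalg.solve(A, d)
print(np.max(np.abs(x_fast - x_dense)))  # agree to round-off
```

Sparse Jacobians in trajectory optimization have exactly this banded flavor, which is why structured solvers pay off so much.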

Minimization

Minimize $f(x)$, where $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$.

If $f$ is smooth, then at a local minimum:

$$\left.\frac{\partial f}{\partial x}\right|_{x^*} = 0$$

Thus, we can transform the minimization problem into a root-finding problem:

$$\nabla f(x) = 0$$

Apply Newton's method:

$$\nabla f(x + \Delta x) \approx \nabla f(x) + \frac{\partial}{\partial x} (\nabla f(x)) \Delta x = \nabla f(x) + \nabla^2 f(x) \Delta x = 0$$

$$\Delta x = - \left( \nabla^2 f(x) \right)^{-1} \nabla f(x)$$

Update $x \leftarrow x + \Delta x$, then repeat until convergence.

Intuition

  • Fit a quadratic approximation to $f(x)$
  • Exactly minimize the quadratic approximation
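The minimization loop can be sketched as below. The strongly convex test function (a quadratic plus a small quartic term) and its hand-derived gradient and Hessian are my own illustrative choices:

```python
import numpy as np

# Newton's method for minimization of an illustrative strongly convex
# function f(x) = x0² + x1² + x0·x1 + 0.1·x0⁴ (hand-derived derivatives).
def grad(x):
    return np.array([2*x[0] + x[1] + 0.4*x[0]**3,
                     2*x[1] + x[0]])

def hess(x):
    return np.array([[2 + 1.2*x[0]**2, 1.0],
                     [1.0,             2.0]])

x = np.array([2.0, -3.0])
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    # Δx = -(∇²f)^{-1} ∇f, computed with a linear solve, not an inverse
    x = x + np.linalg.solve(hess(x), -g)
print(x)  # converges to the unique minimum at (0, 0)
```

Because the Hessian here is positive definite everywhere, every step is a descent step; the next example shows what happens when that fails.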

Example²

Minimize $f(x) = x^4 + x^3 - x^2 - x$

  • Start at $x = 1.0$ ⇒ converges to the global minimum

  • Start at $x = -1.5$ ⇒ converges to a local minimum

  • Start at $x = 0.0$ ⇒ converges to a local maximum
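The three runs above can be reproduced by applying Newton's method to $f'(x) = 0$ (derivatives hand-computed from the given $f$):

```python
# Newton's method applied to f'(x) = 0 for f(x) = x^4 + x^3 - x^2 - x,
# reproducing the three starting points from the example above.
fp  = lambda x: 4*x**3 + 3*x**2 - 2*x - 1   # f'(x)
fpp = lambda x: 12*x**2 + 6*x - 2           # f''(x)

def newton(x, iters=100):
    for _ in range(iters):
        x = x - fp(x) / fpp(x)
    return x

print(newton(1.0))   # ≈ 0.6404: the global minimum
print(newton(-1.5))  # ≈ -1.0:   a local minimum
print(newton(0.0))   # ≈ -0.3904: a local maximum
```

The critical points are the roots of $f'(x) = (x+1)(4x^2 - x - 1)$, i.e. $x = -1$ and $x = (1 \pm \sqrt{17})/8$; which one Newton finds depends entirely on the initial guess.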

Takeaway Message

  • Newton's method is a local root-finding method: it converges to the critical point nearest the initial guess, which may be a minimum, a maximum, or a saddle point.

Sufficient Conditions

  • First-order necessary condition: $\nabla f(x) = 0$ (necessary but not sufficient for a minimum)

Scalar case:

$$\Delta x = - \left( \nabla^2 f(x) \right)^{-1} \nabla f(x)$$

The inverse of the Hessian $\nabla^2 f(x)$ acts like a step size (learning rate), so the Hessian should be positive definite for a descent step.

  1. $\nabla^2 f(x) > 0 \Rightarrow$ descent (minimization)
  2. $\nabla^2 f(x) < 0 \Rightarrow$ ascent (maximization)

Vector case ($x \in \mathbb{R}^n$):

$$\nabla^2 f(x) \succ 0 \quad \left(\text{i.e., } \nabla^2 f(x) \in \mathbb{S}^n_{++}\right) \Rightarrow \text{descent (minimization)}$$

If $\nabla^2 f(x) \succ 0$ everywhere, then $f(x)$ is strongly convex and Newton's method converges to the global minimum. However, this rarely holds for hard/nonlinear problems.

Regularization

Regularization is a practical fix that guarantees the update direction is always a descent direction: check whether the Hessian is positive definite and, if not, modify it.

$H \leftarrow \nabla^2 f(x)$
while $H \not\succ 0$:
$\quad H \leftarrow H + \beta I$ (with $\beta > 0$)
end
$\Delta x = - H^{-1} \nabla f(x)$
$x \leftarrow x + \Delta x$

This is also called damped Newton's method. Adding $\beta I$ guarantees that the update direction is a descent direction and shrinks the step, which is equivalent to adding a quadratic penalty on $\Delta x$.

Adding a fixed value to the diagonal of the Hessian makes the modified matrix positive definite, which guarantees descent. If $\beta$ is too large, the step becomes too small (the direction approaches the plain gradient-descent direction) and convergence slows; in that case a line search may be needed. In most cases, however, this method works well: it ensures a descent direction and also shrinks the step (Newton's method tends to take relatively large steps).
To test whether the Hessian is positive definite, avoid computing eigenvalues, which is expensive. Instead, attempt a Cholesky factorization: it succeeds if and only if the matrix is positive definite, so a factorization failure signals an indefinite Hessian. Other engineering fixes exist, such as an eigendecomposition with negative eigenvalues clipped to zero, but they are more time-consuming.
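A sketch of the regularization loop with the Cholesky-based positive-definiteness test; the indefinite example Hessian and $\beta = 1$ are my own illustrative choices:

```python
import numpy as np

# Regularization loop: bump the Hessian by βI until a Cholesky
# factorization succeeds (it fails iff the matrix is not positive definite).
def regularized_newton_step(grad, hess, beta=1.0):
    H = hess.copy()
    while True:
        try:
            L = np.linalg.cholesky(H)
            break
        except np.linalg.LinAlgError:   # not positive definite: regularize
            H = H + beta * np.eye(H.shape[0])
    # Solve H Δx = -∇f by reusing the Cholesky factors (H = L Lᵀ)
    return np.linalg.solve(L.T, np.linalg.solve(L, -grad))

# An indefinite Hessian: the raw Newton step would climb toward a saddle
g = np.array([1.0, -2.0])
H = np.array([[1.0,  0.0],
              [0.0, -3.0]])
dx = regularized_newton_step(g, H)
print(dx, g @ dx)  # g·Δx < 0, so Δx is a descent direction
```

Reusing the Cholesky factors for the solve means the positive-definiteness check costs nothing extra once the factorization has succeeded.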


Thanks for reading these study notes! They record what I have learned; if you spot any mistakes or inaccuracies, feel free to contact me so I can correct them, and we can make progress together!


  1. You can run the root-finding code directly in Colab by clicking here. ↩︎

  2. You can run the minimization code directly in Colab by clicking here. ↩︎

From: https://blog.csdn.net/weixin_61338898/article/details/144833680
