
Comments (3)

HeSchatz commented on May 23, 2024

Hey there. I gave implementing the zoom line search algorithm a try, as I find it to be a generally reliable method, at least for some of the problems I have tackled in the past.

The main obstacle (apart from getting used to the JAX control flow for nested if-expressions) is how to retrieve the descent information needed to evaluate the decrease and curvature conditions and to interpolate the step sizes. In the current implementation of the search and descent interfaces there seems to be no general way to retrieve the descent direction after the descent step has been made. Also, the descent methods currently have no common naming scheme for the descent direction, if it is even calculated explicitly (I get that this may not always be necessary), which further complicates things. My understanding is that one would have to divide the current step by the step size to (re-)evaluate the descent direction, which is generally a bad idea for a quantity that is supposed to shrink as the solver converges. Or am I missing something obvious?

If not, my proposal would be to modify the interface of the descent methods so that the current descent direction can be retrieved in a unified way. I also think the abstract line search step method should take an additional argument for the descent direction, or even the descent state, as complex line search methods will most likely need this in one way or another.

On a side note, I'm also working on implementing the L-BFGS algorithm (kind of finished, at least from a methodological point of view), which also relies on common interfaces for descents.

Looking forward to discussing this topic further!


packquickly commented on May 23, 2024

Hello! Thanks for the interest in implementing some new line searches.

Just to be clear, for a minimisation problem of a function $f\colon \mathbb{R}^n \to \mathbb{R}$, an Optimistix descent $\delta\colon \mathbb{R}_+ \to \mathbb{R}^n$ is a curve in $\mathbb{R}^n$ parameterised by the step-size. Line searches handle the inner optimisation loop of choosing $\alpha$ such that $f(y_k + \delta(\alpha))$ is approximately minimised with respect to $\alpha$. We only do the classical line-search algorithm $f(y_k + \alpha p_k)$ for a fixed descent direction $p_k \in \mathbb{R}^n$ when using certain descents, such as NewtonDescent and SteepestDescent. We don't have a common name/interface for the descent direction $p_k$ because most of the descents don't compute a quantity like $p_k$ at all, so it's intentionally not supported. It also shouldn't be needed for the Wolfe conditions (or related conditions).
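As a rough sketch of the distinction (plain JAX, not the actual optimistix interfaces): the classical line search fixes a direction $p_k$ and scales it, which is just the simplest possible descent curve.

```python
import jax
import jax.numpy as jnp

# Toy objective, used only for illustration.
def f(y):
    return jnp.sum((y - 1.0) ** 2) + 0.1 * jnp.sum(y**4)

y_k = jnp.array([2.0, -1.0, 0.5])
p_k = -jax.grad(f)(y_k)  # a fixed descent direction, as in a classical line search

def delta(alpha):
    # The simplest descent curve: a straight line through y_k, delta(alpha) = alpha * p_k.
    # A general descent may trace any curve in alpha, not just a line.
    return alpha * p_k

def phi(alpha):
    # The scalar function the line search approximately minimises over alpha.
    return f(y_k + delta(alpha))

print([float(phi(a)) for a in (0.01, 0.1, 0.5)])
```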

$\delta(\alpha)$ is computed with the step method on any descent (see the abstract base class here). The way this gets passed to the line search is via the y_eval argument in the line search step method: y_eval is the proposed next step for a given step-size $\alpha_k$. For a step-size $\alpha_k$ the actual diff $\delta(\alpha_k)$ can then be computed by subtracting y from y_eval; see the BacktrackingArmijo line search for an example.
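In sketch form (the exact optimistix signatures differ, so treat this only as the pattern), the line search reconstructs the diff from the two pytrees it is handed:

```python
import jax.numpy as jnp
import jax.tree_util as jtu

# `y` is the current iterate and `y_eval` the proposed iterate y + delta(alpha_k);
# both are arbitrary pytrees with the same structure.
y = {"a": jnp.array([1.0, 2.0]), "b": jnp.array(0.5)}
y_eval = {"a": jnp.array([0.8, 1.7]), "b": jnp.array(0.4)}

# Recover the diff delta(alpha_k) = y_eval - y leaf by leaf.
diff = jtu.tree_map(lambda new, old: new - old, y_eval, y)
```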

As for technical details of implementing the Wolfe conditions, there should be no issues here. The classic Wolfe conditions for current iterate $y_k \in \mathbb{R}^n$, proposed step-size $\alpha_k \in \mathbb{R}$ and descent direction $p_k \in \mathbb{R}^n$ are the sufficient decrease condition
$$f(y_k + \alpha_k p_k) < f(y_k) + c_1 \alpha_k p_k^T \nabla f(y_k)$$
and the curvature condition
$$\nabla f(y_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(y_k)^T p_k$$
for some constants $c_1, c_2$. We can replace the traditional line-search step $\alpha_k p_k$ with the generalised descent $\delta(\alpha_k)$, giving the sufficient decrease condition
$$f(y_k + \delta(\alpha_k)) < f(y_k) + c_1 \delta(\alpha_k)^T \nabla f(y_k)$$
and the curvature condition
$$\nabla f(y_k + \delta(\alpha_k))^T \delta(\alpha_k) \geq c_2 \nabla f(y_k)^T \delta(\alpha_k)$$
without changing the meaning of either of the conditions. When $\delta(\alpha) = \alpha p_k$ these are equivalent to the classical Wolfe conditions. Different relaxations of the Wolfe conditions should be generalisable in roughly the same way.
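As a sketch of how the generalised conditions could be checked (plain arrays and a hypothetical helper, not the optimistix implementation):

```python
import jax
import jax.numpy as jnp

def wolfe_conditions(f, y, diff, c1=1e-4, c2=0.9):
    """Check the generalised Wolfe conditions for a proposed step y -> y + diff.

    `diff` plays the role of delta(alpha_k); when delta(alpha) = alpha * p_k this
    reduces to the classical Wolfe conditions.
    """
    grad_y = jax.grad(f)(y)
    grad_y_eval = jax.grad(f)(y + diff)
    slope = jnp.dot(diff, grad_y)                      # delta(alpha_k)^T grad f(y_k)
    sufficient_decrease = f(y + diff) < f(y) + c1 * slope
    curvature = jnp.dot(grad_y_eval, diff) >= c2 * slope
    return sufficient_decrease, curvature
```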

Does this clear things up?


HeSchatz commented on May 23, 2024

Thank you for taking the time to write down the general line search problem and the current implementation in such detail. This helps a lot in clarifying the problem I'm currently facing.

First, you are totally right about the Wolfe conditions. I understand now that the descent information is computed by subtracting y from the proposed iterate y_eval, which gives

$\delta(\alpha_k) = \alpha_k p_k $

Evaluating the sufficient decrease condition with $\delta(\alpha_k)$ is straightforward, but I had not noticed that multiplying the gradient by $p_k$ or by $\delta(\alpha_k)$ on both sides of the curvature condition gives equivalent conditions.
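Writing that out for my own benefit: since $\delta(\alpha_k) = \alpha_k p_k$ with $\alpha_k > 0$, the two forms of the curvature condition differ only by the positive factor $\alpha_k$,

$$\nabla f(y_k + \delta(\alpha_k))^T \delta(\alpha_k) \geq c_2 \nabla f(y_k)^T \delta(\alpha_k) \;\iff\; \alpha_k \left[ \nabla f(y_k + \alpha_k p_k)^T p_k \right] \geq \alpha_k \left[ c_2 \nabla f(y_k)^T p_k \right],$$

and dividing by $\alpha_k$ recovers the classical curvature condition.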

The remaining problem is the computation of new trial step sizes $\alpha_{k+1}$ in the zoom phase of the line search algorithm suggested in Nocedal & Wright (2006), Numerical Optimization, 2nd ed. (Algorithms 3.6 and 3.5, respectively).

They propose computing new trial step sizes $\alpha_{k+1}$ by minimizing the cubic function interpolating $\phi(\alpha_{k-1})$, $\phi'(\alpha_{k-1})$, $\phi(\alpha_k)$, and $\phi'(\alpha_k)$. The new step size can then be computed as:

$\alpha_{k+1} = \alpha_k - (\alpha_k - \alpha_{k-1}) \left[ \frac{\phi'(\alpha_k) + d_2 - d_1}{\phi'(\alpha_k) - \phi'(\alpha_{k - 1}) + 2 d_2} \right]$

with

$d_1 = \phi'(\alpha_{k-1}) + \phi'(\alpha_k) - 3 \frac{\phi(\alpha_{k-1}) - \phi(\alpha_k)}{\alpha_{k-1} - \alpha_k}$

$d_2 = \left[ d_1^2 - \phi'(\alpha_{k-1}) \phi'(\alpha_k) \right]^{1/2}$

This requires explicit computation of the derivative

$\phi'(\alpha) = \nabla f(y_k + \alpha p_k)^T p_k = \nabla f(y_k + \delta(\alpha))^T \delta'(\alpha)$

of

$\phi(\alpha) = f(y_k + \alpha p_k) = f(y_k + \delta(\alpha))$

which requires computation of the descent direction $p_k$. At least I could not come up with a way to evaluate the minimizer without doing so.
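For concreteness, here is a rough sketch of that interpolation update (plain JAX, my own function name; the $\phi$ and $\phi'$ values are assumed to have been computed already):

```python
import jax.numpy as jnp

def cubic_interp_step(a_prev, a_curr, phi_prev, dphi_prev, phi_curr, dphi_curr):
    # One cubic-interpolation update following the formula quoted above:
    # a_prev, a_curr are alpha_{k-1}, alpha_k, and phi_* / dphi_* the
    # corresponding values of phi and phi'.
    d1 = dphi_prev + dphi_curr - 3.0 * (phi_prev - phi_curr) / (a_prev - a_curr)
    # Guard against a slightly negative radicand caused by round-off; the
    # textbook formula assumes it is non-negative.
    d2 = jnp.sqrt(jnp.maximum(d1**2 - dphi_prev * dphi_curr, 0.0))
    return a_curr - (a_curr - a_prev) * (dphi_curr + d2 - d1) / (
        dphi_curr - dphi_prev + 2.0 * d2
    )
```

The sticking point is exactly the `dphi_prev` and `dphi_curr` inputs, i.e. $\phi'(\alpha_{k-1})$ and $\phi'(\alpha_k)$, which is where the descent direction $p_k$ (or, more generally, $\delta'(\alpha)$) enters.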

Please let me know if you have any suggestions. Maybe I'm once again misled by the usual notation.

Edit: fixed some typos in equations

