
pathfinder.jl's Introduction

Pathfinder.jl: Parallel quasi-Newton variational inference


An implementation of Pathfinder,¹ a variational method for initializing Markov chain Monte Carlo (MCMC) methods.

Footnotes

  1. Lu Zhang, Bob Carpenter, Andrew Gelman, Aki Vehtari (2021). Pathfinder: Parallel quasi-Newton variational inference. arXiv: 2108.03782 [stat.ML]. Code

pathfinder.jl's People

Contributors

burtonjosh, dkarrasch, github-actions[bot], sethaxen

pathfinder.jl's Issues

More-Thuente line search fails for posterior

This is more of an Optim issue than a Pathfinder one, but I'm saving this example here for future reference.

Nocedal and Wright recommend using the More-Thuente line search method with L-BFGS. However, for this particular log-density, More-Thuente fails while Hager-Zhang succeeds:

using ForwardDiff, Pathfinder, Random
using Pathfinder.Optim
using Pathfinder.Optim.LineSearches
Random.seed!(42)

z = [-0.32, -1.3, -0.79, 0.88, 0.13, -0.72, -0.31, -0.56, 0.28, 0.7, -0.43, -1.68, -0.91, 1.16, 1.04, -1.35, -0.5, 0.21, -0.08, -1.26]
y = [8.96, 11.48, 10.9, 7.11, 10.84, 12.91, 9.55, 11.04, 8.84, 8.91, 10.44, 12.75, 10.39, 8.27, 9.3, 10.55, 10.74, 10.14, 12.13, 9.77]

# α ~ Flat(), β ~ Normal(0, 1), σ ~ HalfNormal(1), y ~ MvNormal(z + α + β, σ*I)
function logp(x)
    α, β, logσ = x
    σ = exp(logσ)
    return (β^2 + σ^2 + sum(abs2, (z .- y .+ α .+ β) ./ σ)) / -2 + logσ
end
∇logp(x) = ForwardDiff.gradient(logp, x)
julia> θ₀ = rand(3) .* 4 .- 2
3-element Vector{Float64}:
  0.13273206417544525
 -0.18388345765143033
 -1.9292526931401426

julia> q, θ = pathfinder(logp, ∇logp, θ₀, 10; optimizer=LBFGS(m=5, linesearch=MoreThuente()));
[ Info: Optimized for 3 iterations. Maximum ELBO of -Inf reached at iteration 1.

julia> q.μ
3-element Vector{Float64}:
 17.292063592972053
  0.0
 -1.6625205206909156e162

The culprit is the gradient, which explodes during L-BFGS computation:

julia> θs, logpθs, ∇logpθs = Pathfinder.maximize_with_trace(logp, ∇logp, θ₀, LBFGS(m=5, linesearch=MoreThuente()));

julia> ∇logpθs
4-element Vector{Vector{Float64}}:
 [10040.660277421557, 10040.844160879209, 110412.25151284959]
 [-2.8542596163821086e-160, -16.975762325101126, -1.6625205206909156e162]
 [10040.660277421557, 10040.844160879209, 110412.25151284956]
 [10040.660277421557, 10040.844160879209, 110412.25151284956]

However, if we use the HagerZhang line search method, then we get reasonable results.

julia> q, θ = pathfinder(logp, ∇logp, θ₀, 10; optimizer=LBFGS(m=5, linesearch=HagerZhang()));
[ Info: Optimized for 17 iterations. Maximum ELBO of -7.35 reached at iteration 15.

julia> q.μ
3-element Vector{Float64}:
 10.5415000553547
 -4.1972214879016894e-8
  1.1392597435574199

The More-Thuente line search by itself doesn't seem to be the issue though. If we use it with ConjugateGradient, then we likewise get a reasonable answer:

julia> q, θ = pathfinder(logp, ∇logp, θ₀, 10; optimizer=ConjugateGradient(linesearch=MoreThuente()));
[ Info: Optimized for 27 iterations. Maximum ELBO of -7.44 reached at iteration 11.

julia> q.μ
3-element Vector{Float64}:
 10.491138291213138
  0.026184553956604018
  1.1552195552341622

NaNs introduced with 3 posteriordb models

Nikolas Siccha on Slack shared:

I've tried [Pathfinder.jl] for a few (3) models from posteriordb, and pathfinder seems to run into nans relatively often, without being able to recover.

It was the dogs-dogs, kilpisjarvi_mod-kilpisjarvi and diamonds-diamonds posteriors from posteriordb, brought to Julia using bridgestan.

Options for approximation?

Since the MAP step seems to run to convergence anyway, it seems like it would be straightforward to also construct a Laplace approximation. In some settings, it could be useful to have this instead of or in addition to the variational fit. Maybe this could be a configuration option?
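For illustration, one way such an option could work: take the final optimizer state as the MAP estimate and build the normal approximation from the Hessian of the log density there. A minimal sketch, assuming a twice-differentiable logp and a MAP estimate x_map; the helper name is hypothetical, not an existing Pathfinder option:

using Distributions, ForwardDiff, LinearAlgebra

# Hypothetical helper: Laplace approximation centered at a MAP estimate x_map,
# with covariance equal to the inverse of the negative Hessian of logp there.
function laplace_approximation(logp, x_map)
    H = ForwardDiff.hessian(logp, x_map)
    Σ = Symmetric(inv(-H))  # assumes -H is positive definite at the mode
    return MvNormal(x_map, Σ)
end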

Load time is doubled on the latest Julia beta

Julia 1.7.2

julia> @time using Pathfinder
  2.264230 seconds (7.80 M allocations: 538.022 MiB, 5.56% gc time, 33.56% compilation time)

Julia 1.8.0-beta3

julia> @time using Pathfinder
  4.412077 seconds (11.26 M allocations: 765.483 MiB, 7.98% gc time, 55.01% compilation time)

Nothing to do yet, just tracking this for future reference.

Making Optimization (formerly GalacticOptim) optional

On the 1.8 beta, loading GalacticOptim and GalacticOptimJL together makes up over half the load time of Pathfinder.

julia> @time_imports using Pathfinder
    ...
   1172.3 ms    ┌ GalacticOptim
   1237.2 ms  ┌ GalacticOptimJL
    ...
   4552.7 ms  Pathfinder

This is not ideal, since we use these to support experimentation with other optimizers, not for normal usage. Currently we build everything around GalacticOptim, but perhaps we should make this conditionally loaded and default to doing everything with Optim types.

Relates #36
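
For reference, a rough sketch of what conditional loading with Requires.jl could look like; the UUID below is a placeholder and the included file is hypothetical:

using Requires

function __init__()
    # Placeholder UUID; a real entry would use GalacticOptimJL's registered UUID.
    @require GalacticOptimJL = "00000000-0000-0000-0000-000000000000" begin
        include("integrations/galacticoptimjl.jl")  # hypothetical glue code
    end
end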

Benchmark with posteriordb models

We should benchmark the method similarly to the Pathfinder paper by running Pathfinder on many posteriordb models using code adapted from #94 (comment). We shouldn't do this in CI though, because the benchmarks would probably take a long time to run, but it would be nice to show the results in the docs and periodically re-run them.

We should look at how other packages like some in the SciML ecosystem do this.

Using Optimization.MaxSense

Currently we need to negate the user-provided function, gradients, and Hessian to maximize the log-density with Optimization.jl, which by default assumes the optimization problem should be minimized. However, OptimizationProblem can take a sense=Optimization.MaxSense keyword to treat the optimization problem as a maximization one.

This is only worth doing if we can make sure that whether we use Optim.jl or Optimization.jl, the downstream code can be the same.
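
Concretely, using the keyword would let us pass the log density without negating it; a minimal sketch with a stand-in objective (not Pathfinder's current code):

using Optimization, OptimizationOptimJL

logp(x, p) = -sum(abs2, x) / 2  # stand-in log density to be maximized
f = OptimizationFunction(logp, Optimization.AutoForwardDiff())
prob = OptimizationProblem(f, randn(3); sense=Optimization.MaxSense)
sol = solve(prob, LBFGS())  # maximizes logp instead of minimizing it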

Return all intermediates in a custom struct

Currently pathfinder only returns the ELBO-maximizing distribution as well as the requested samples, but for debugging and visualization, it would be better to return a PathfinderResult object that also contains the trace, gradients, and all Laplace approximations.

multipathfinder could return a MultiPathfinderResult object that contains the vector of single-path results, all draws used for importance sampling, and the Pareto k parameter returned by PSIS.
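
A rough sketch of what such a container might hold; the field names are illustrative, not a committed interface:

# Hypothetical result container for a single Pathfinder run.
struct PathfinderResult{D,T<:Real}
    fit_distribution::D             # ELBO-maximizing normal approximation
    draws::Matrix{T}                # requested draws from fit_distribution
    optim_trace::Vector{Vector{T}}  # iterates of the L-BFGS optimization
    gradients::Vector{Vector{T}}    # gradient of the log density at each iterate
    fit_distributions::Vector{D}    # normal approximation at each iterate
    elbo_estimates::Vector{T}       # ELBO estimate for each approximation
end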

Turing integration

We should include the following methods:

  • pathfinder(::DynamicPPL.Model, ndraws; init=DynamicPPL.SampleFromUniform(), kwargs...) -> MCMCChains.Chains
  • multipathfinder(::DynamicPPL.Model, nruns, ndraws; init=DynamicPPL.SampleFromUniform(), kwargs...) -> MCMCChains.Chains

Here the convenience method takes care of

  1. getting an initial point
  2. getting the log-density for the unconstrained form of the model
  3. running pathfinder and getting draws
  4. transforming the draws back to the constrained space
  5. packing the draws into a familiar Chains object, whose info contains the Pathfinder results

This would make it more straightforward for Turing users to incorporate Pathfinder into their workflows.
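
If these methods existed, a Turing user's workflow might look roughly like the following sketch; the model is a stand-in, and the calls mirror the proposed signatures above rather than an existing API:

using Pathfinder, Turing

@model function toy_model(y)
    μ ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1), 0, Inf)
    for i in eachindex(y)
        y[i] ~ Normal(μ, σ)
    end
end

model = toy_model(randn(100))
chains = pathfinder(model, 200)                # hypothetical single-path method
chains_multi = multipathfinder(model, 8, 200)  # hypothetical multi-path method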

Fixing type-noninferrability from Optimization.solve

Optimization v3.13.1 implemented SciMLBase.init for OptimizationProblem, which caused a check for type-inferrability in our test suite to begin failing (see first time it failed here).

It's not yet clear to me which change broke type-inference, or exactly how broken type-inference is, but perhaps we can fix it by using SciMLBase.init ourselves, followed by SciMLBase.solve!.
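
The suggested workaround would look roughly like the following sketch, which uses only the generic SciMLBase problem interface; the objective here is a stand-in:

using Optimization, OptimizationOptimJL, SciMLBase

f = OptimizationFunction((x, p) -> sum(abs2, x), Optimization.AutoForwardDiff())
prob = OptimizationProblem(f, randn(3))
cache = SciMLBase.init(prob, LBFGS())  # build the solver cache explicitly
sol = SciMLBase.solve!(cache)          # then run the optimization on it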

Choosing AD backend in Turing integration

Does Pathfinder interact with the AD backend chosen in Turing, i.e., Turing.setadbackend? As far as I can tell from the source, there is no way to pass a specific AD backend when using the Turing integration - it seems like ForwardDiff is used as the default.

P.S.
Thanks for developing Pathfinder.jl ! Great to have Pathfinder in the Julia ecosystem.

multipathfinder - no method matching iterate

Dear Seth,

Cool package! I have some trouble running multi-path Pathfinder with the most recent release. The actual computation works fine, but it seems like an iterator is not working when collecting the draws (line 196 in multipath.jl). MWE:

using Distributions #Distributions v0.25.59
using Pathfinder #Pathfinder v0.4.5
using ForwardDiff


function targetclosure(data)
    function target(θ)
        μ = sum(θ)
        return sum( logpdf.(Normal(μ, 1.0), data) )
    end
end

target = targetclosure(randn(1000))
θ = randn(2)
∇logp(θ) = ForwardDiff.gradient(target, θ)

#using Pathfinder works -> posterior mean of θ == 0
result = pathfinder(target, ∇logp; init = θ)

#using multipathfinder
Nruns = 10
θ_initial = [zeros(2) for _ in Base.OneTo(Nruns)]
for iter in eachindex(θ_initial)
    θ_initial[iter] = randn(2)
end

q, ϕ, component_ids = multipathfinder(
    target,
    ∇logp,
    Nruns;
    init = θ_initial
) #MethodError: no method matching iterate(::MultiPathfinderResult{Tuple{var"#target#8"{Vector{Float64}}, typeof(∇logp)}

Best regards,
Patrick

Support alternative ways of choosing normal approximations

Given an optimization trace, Pathfinder proposes the multivariate normal approximation constructed from the trace that maximizes the ELBO. It does this by approximating the ELBO at each point.

The discussion notes that instead of exhaustively approximating the ELBO at each point, Bayesian optimization could be used to optimize over (or even between) the points. More generally, we could allow objective functions other than the ELBO and allow any discrete optimizer to be provided. While we could interpolate means between points, we'd need to think a bit about how to interpolate covariances.
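
For instance, the selection step could be written generically so that the ELBO estimator is just the default scoring function; a sketch with hypothetical names:

# Hypothetical sketch: pick the best candidate fit by an arbitrary objective.
# `candidate_dists` would be the normal approximation at each trace point, and
# `objective` any function returning a score to maximize (e.g. an ELBO estimate).
function best_fit(objective, candidate_dists)
    scores = map(objective, candidate_dists)
    best_idx = argmax(scores)
    return candidate_dists[best_idx], scores[best_idx], best_idx
end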

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Making subpackages for compatibility with HMC packages

Currently we use Requires to conditionally load DynamicHMC or AdvancedHMC and implement functionality for compatibility. We should consider instead making subpackages PathfinderDynamicHMC and PathfinderAdvancedHMC for better dependency management.

Switching to Hager-Zhang line search

Optim.LBFGS defaults to LineSearches.HagerZhang for its line search, while we currently use LineSearches.MoreThuente in keeping with the Pathfinder paper.

However, in informal tests, I tend to encounter more numerical issues with MoreThuente, and Pathfinder needs to retry much more often. Moreover, note the different performance between the two methods on a multivariate normal log density:

julia> using Pathfinder, LinearAlgebra, Optim

julia> Σ = [
           2.71   0.5    0.19   0.07   1.04
           0.5    1.11  -0.08  -0.17  -0.08
           0.19  -0.08   0.26   0.07  -0.7
           0.07  -0.17   0.07   0.11  -0.21
           1.04  -0.08  -0.7   -0.21   8.65
       ];

julia> μ = [-0.55, 0.49, -0.76, 0.25, 0.94];

julia> P = inv(Symmetric(Σ));

julia> function logp_mvnormal(x)
           z = x - μ
           return -dot(z, P, z) / 2
       end;

julia> result = pathfinder(logp_mvnormal; dim=5, init_scale=4)
Single-path Pathfinder result
  tries: 1
  draws: 5
  fit iteration: 25 (total: 27)
  fit ELBO: 3.98 ± 0.18
  fit distribution: MvNormal{Float64, Pathfinder.WoodburyPDMat{Float64, Diagonal{Float64, Vector{Float64}}, Matrix{Float64}, Matrix{Float64}, Diagonal{Float64, Vector{Float64}}, LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}}, UpperTriangular{Float64, Matrix{Float64}}}, Vector{Float64}}(
dim: 5
μ: [-0.5499999997063411, 0.4899999984855008, -0.7599999997859143, 0.2500000006537195, 0.9399999998847138]
Σ: [3.033268135838942 0.4958177031900541 … 0.2454020867450207 1.1825514484977466; 0.4958177031900541 1.074411231505785 … -0.16076485670008017 -0.0911342669944741; … ; 0.2454020867450207 -0.16076485670008017 … 0.20144627493353065 -0.12991587816970984; 1.1825514484977466 -0.0911342669944741 … -0.12991587816970984 8.704692432462565]
)


julia> Σ - result.fit_distribution.Σ
5×5 Matrix{Float64}:
 -0.323268    0.0041823    0.0499336   -0.175402    -0.142551
  0.0041823   0.0355888   -0.00722606  -0.00923514   0.0111343
  0.0499336  -0.00722606  -0.00640033   0.0291832    0.019554
 -0.175402   -0.00923514   0.0291832   -0.0914463   -0.0800841
 -0.142551    0.0111343    0.019554    -0.0800841   -0.0546924

julia> result2 = pathfinder(logp_mvnormal; dim=5, init_scale=4, optimizer=LBFGS(m=6))
Single-path Pathfinder result
  tries: 1
  draws: 5
  fit iteration: 5 (total: 5)
  fit ELBO: 3.84 ± 0.0
  fit distribution: MvNormal{Float64, Pathfinder.WoodburyPDMat{Float64, Diagonal{Float64, Vector{Float64}}, Matrix{Float64}, Matrix{Float64}, Diagonal{Float64, Vector{Float64}}, LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}}, UpperTriangular{Float64, Matrix{Float64}}}, Vector{Float64}}(
dim: 5
μ: [-0.55, 0.49, -0.76, 0.25, 0.94]
Σ: [2.709999999999999 0.49999999999999983 … 0.07000000000000031 1.0399999999999991; 0.49999999999999983 1.1099999999999994 … -0.17000000000000087 -0.08000000000000052; … ; 0.07000000000000031 -0.17000000000000087 … 0.11000000000000121 -0.21000000000000002; 1.0399999999999991 -0.08000000000000052 … -0.21000000000000002 8.65]
)

julia> Σ - result2.fit_distribution.Σ
5×5 Matrix{Float64}:
  8.88178e-16   5.55112e-16  -1.94289e-16  -1.52656e-16  -2.22045e-16
 -3.33067e-16  -2.22045e-16   8.18789e-16   3.05311e-16   6.245e-16
 -8.32667e-17   8.46545e-16   2.22045e-16   7.21645e-16  -2.22045e-16
 -2.22045e-16   6.93889e-16   5.13478e-16  -1.20737e-15   1.38778e-16
  6.66134e-16   5.13478e-16  -2.22045e-16   0.0          -1.77636e-15

So Hager-Zhang exactly fits the target distribution, while Moré-Thuente only gets close. This result is reproducible. It's not clear whether this is due to some instability/bug in MoreThuente or if this is expected.

If this trend continues for more models, we should consider switching to HagerZhang as the default linesearch.

Multi-threaded multi-path Pathfinder broken with recent Transducers versions

Transducers v0.4.76 has caused our tests to fail (see first failure here).

Some not-necessarily-relevant notes:

┌ Warning: `reduce(rf, itr::Foldable; kw...)` is deprecated, use `foldxt(rf, itr; kw...)` instead.
│   caller = multipathfinder(optim_fun::OptimizationFunction{true, SciMLBase.NoAD, Pathfinder.var"#f#5"{LogDensityFunctionWithGradHess{var"#logp#44"{FullNormal}, var"#∇f#40"{var"#logp#44"{FullNormal}}, var"#Hf#41"{var"#logp#44"{FullNormal}}}}, Pathfinder.var"#grad#6"{LogDensityFunctionWithGradHess{var"#logp#44"{FullNormal}, var"#∇f#40"{var"#logp#44"{FullNormal}}, var"#Hf#41"{var"#logp#44"{FullNormal}}}}, Pathfinder.var"#hess#7"{LogDensityFunctionWithGradHess{var"#logp#44"{FullNormal}, var"#∇f#40"{var"#logp#44"{FullNormal}}, var"#Hf#41"{var"#logp#44"{FullNormal}}}}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED_NO_TIME), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, ndraws::Int64; init::Nothing, input::LogDensityFunctionWithGradHess{var"#logp#44"{FullNormal}, var"#∇f#40"{var"#logp#44"{FullNormal}}, var"#Hf#41"{var"#logp#44"{FullNormal}}}, nruns::Int64, ndraws_elbo::Int64, ndraws_per_run::Int64, rng::MersenneTwister, history_length::Int64, optimizer::LBFGS{Nothing, LineSearches.InitialStatic{Float64}, LineSearches.MoreThuente{Float64}, Optim.var"#20#22"}, executor::SequentialEx{NamedTuple{(), Tuple{}}}, executor_per_run::SequentialEx{NamedTuple{(), Tuple{}}}, importance::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:dim,), Tuple{Int64}}}) at multipath.jl:182
└ @ Pathfinder ~/work/Pathfinder.jl/Pathfinder.jl/src/multipath.jl:182
rng = TaskLocalRNG(): Error During Test at /home/runner/work/Pathfinder.jl/Pathfinder.jl/test/multipath.jl:31

Supporting inputs implementing the LogDensityProblems interface

Both DynamicHMC and now also AdvancedHMC (TuringLang/AdvancedHMC.jl#301) accept objects that implement the LogDensityProblems interface as inputs. It makes sense for Pathfinder to accept such objects as well. We might even consider switching to using LogDensityProblems internally.

I don't think we should switch to using LogDensityProblems internally, mainly because currently the LogDensityProblems API only supports computing the log density and its gradient, while we want to also support optimizers that compute the Hessian.

Turing is starting to use LogDensityProblems under the hood, but to keep support for Hessians, we should probably continue to use its OptimizationFunction.

It might be worth pushing to have Hessian abilities in the LogDensityProblems API.
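
Accepting such inputs would amount to a thin adapter around the documented LogDensityProblems accessors, roughly like this sketch; the toy problem and the adapter are illustrative only:

using LogDensityProblems

# Toy object implementing the LogDensityProblems interface.
struct StdNormalProblem end
LogDensityProblems.logdensity(::StdNormalProblem, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::StdNormalProblem) = 3
LogDensityProblems.capabilities(::Type{StdNormalProblem}) =
    LogDensityProblems.LogDensityOrder{0}()

ℓ = StdNormalProblem()
# Hypothetical adapter to the callable form pathfinder currently accepts:
logp(x) = LogDensityProblems.logdensity(ℓ, x)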

Move WoodburyPDMat code to PDMatsExtras

The WoodburyPDMat code here is more general than that in PDMatsExtras.jl. To support the more general case, we use a different set of decompositions. If our version is not much less efficient than PDMatsExtras on the same matrices it supports, then this code should be moved out to PDMatsExtras.
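
For context, the structure in question is (as I understand it) the positive definite Woodbury form W = A + B D B'; a dense sanity check of that form with plain LinearAlgebra, independent of either package's WoodburyPDMat type:

using LinearAlgebra

# Dense check of the Woodbury form W = A + B * D * B' for a small example.
n, k = 5, 2
A = Diagonal(rand(n) .+ 1)  # positive definite diagonal term
B = randn(n, k)             # low-rank factor
D = Diagonal(rand(k) .+ 1)  # positive definite core
W = Symmetric(Matrix(A) + B * Matrix(D) * B')
@assert isposdef(W)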

Add more examples to docs

We should add at least the following examples to the docs:

  • banana distribution (#68); a sketch of one possible log density appears after this list
  • a multi-modal distribution
  • donut
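
As a starting point for the first item, one possible banana-shaped target is a Rosenbrock-like log density; a minimal sketch (parameterization chosen for illustration):

using Pathfinder

# Rosenbrock-like banana-shaped log density.
function logp_banana(x)
    a, b = 1.0, 10.0
    return -((a - x[1])^2 + b * (x[2] - x[1]^2)^2)
end

result = pathfinder(logp_banana; dim=2, init_scale=4)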
