Scientific Machine Learning on Julia
Here is what I’ve learned from the WorkShop Doing Scientific Machine Learning (SciML) With Julia from Chris Rackauckas. There is also an MIT course and Workshop exercises (with solutions) by the same author about this subject that I’ve been checking out and strongly recomend.
Table of Contents
- Modeling with Differential Equations
- Differential Equations
- Stochastic Differential Equations
- Delayed Differential Equations
- Callbacks
- Automated model discovery via universal differential equations
- Parameter inference on differential equations
- Local and global optimization
- Bayesian optimization
- Neural Ordinary Differential Equations with
sciml_train
- Solving for the Lokta - Volterra model with few data.
- Universal ODEs learn and extrapolate other dynamical behaviors
- Transforming a neural network fit into equations in sparsified from using SInDy
- Parameter inference on differential equations
- Solving differential equations with neural networks (physics-informed neural networks)
- Toy example
- Solving a 100 Dimensional Hamilton-Jacobi-Bellman Equation
Modeling with Differential Equations
Differential Equations
First we will see how to define a differential Equation on Julia, for this we will use the Latka Volterra equation that modelates a population of rabbits and wolves.
$$ \dfrac{dx}{dt} = \alpha x - \beta xy$$ $$ \dfrac{dy}{dt} = \delta xy - \gamma y$$
Something that may be silly but I find really nice is that you can write special caracters like 🐰, 🐺, α, β, γ and δ.
We just need to charge the package DifferentialEquations.jl
and write our differential equation as a function.
using DifferentialEquations
function lotka_volterra!(du,u,p,t)
🐰,🐺 = u
α,β,γ,δ = p
du[1] = d🐰 = α*🐰 - β*🐰*🐺
du[2] = d🐺 = γ*🐰*🐺 - δ*🐺
end
u₀ = [1.0,1.0]
tspan = (0.0, 10.0)
p = [1.5,1.0,3.0,1.0]
prob = ODEProblem(lotka_volterra!,u₀,tspan,p)
sol = solve(prob)
Easy optimizations can be made, we can choose a better solver for the problem, stop saving everystep, etc.
using Sundials # Charge CVODE_BDF() solver
sol = solve(prob, CVODE_BDF(), save_everystep=false, abstol=1e-8, reltol=1e-8)
We can also change the parameters by using the remake function.
remake(prob, p =[1.2,0.8,2.5,0.8])
Stochastic Differential Equation
Now we have a multiplicative noise, given by the terms $\sigma_i x_i dW_i$ where $dW_i$ is a random number whos standard deviation is $dt$. $$ dx = (\alpha x - \beta xy)dt + \sigma_1 x dW_1 $$ $$ dy = (\delta xy - \gamma y)dt + \sigma_2 y dW_2$$
In julia we just need create the multiplicative noise function and added to the previous problem by using SDEProblem
instead of ODEProblem
.
function multiplicative_noise!(du,u,p,t)
🐰,🐺 = u
du[1] = 0.3*🐰
du[2] = 0.3*🐺
end
prob = SDEProblem(lotka_volterra!,multiplicative_noise!,u₀,tspan,p)
sol = solve(prob)
Solving only once would not be the best, since we have a randomness, thus we made use of another set of functions already implemented in DifferentialEquations.jl
called Ensemble
. Firstly we ensemble the problem, secondly we solve for a given number of trajectories (aditonal parameters like EnsembleThreads can be written in order to parelalize the problem and get aditional performance) and finally we do a summary of what happened.
ensembleprob = EnsembleProblem(prob)
sol = solve(ensembleprob, SOSRI(), EnsembleThreads(), trajectories=1000)
summ = EnsembleSummary(sol)
Click here to see more about Ensemble
Delayed Differential Equations
The amount of growth happening at time $t$ is not due to the amount of rabbits at time $t$
Population Control
Example, whenever the amount of wolves is equal to $3$ then we are allow to kill one. The key feature to do this Callbacks
. So whenever a condition happens, then it affects the dynamics.
🔥🐺_condition(u,t,integrator) = u[2] - 3
🔥🐺_affect!(integrator) = integrator.u[2] -=1
🔥🐺_cb = ContinuousCallback(🔥🐺_condition,🔥🐺_affect!)
sol = solve(prob, callback = 🔥🐺_cb)
Of course this is not the most realistic model, since we don’t instanstly kill a wolf each time.
Automated model discovery via universal differential equations
Parameter inference on differential equations
Our goal will be to find parameters that make the Lotka-Volterra solution the one we had on the first part, so we define our loss as the squared distance from our the real solution dataset = Array(sol)
with parameters $p$ given by $\alpha = 1.5$, $\beta = 1.0$, $\gamma = 3.0$ and $\delta = 1.0$. Note that when using sciml_train
, the first return is the loss value, and the other returns are sent to the callback for monitoring convergence.
function loss(p)
tmp_prob = remake(prob, p = p)
tmp_sol = solve(tmp_prob, saveat = 0.1)
sum(abs2, Array(tmp_sol) - dataset), tmp_sol
end
Lastly, we use the sciml_train function to train the parameters using BFGS to arrive at parameters which optimize for our goal.
using DiffEqFlux
using Optim
pinit = [1.2,0.8,2.5,0.8]
res = DiffEqFlux.sciml_train(loss, pinit, BFGS(), maxiters = 100)
p_final = res.minimizer
sciml_train
allows defining a callback that will be called at each step of our training loop. It takes in the current parameter vector and the returns of the last call to the loss function. We will display the current loss and make a plot of the current situation.
using Flux
function plot_callback(p,l,tmp_sol)
tmp_prob = remake(prob, p = p)
tmp_sol = solve(tmp_prob, saveat = 0.1)
fig = plot(tmp_sol)
scatter!(fig, sol.t,dataset')
display(fig)
false
end
Let’s optimize the model and get a nice animation of what is happening in each iteration:
res = DiffEqFlux.sciml_train(loss, pinit, BFGS(), cb = plot_callback, maxiters=300)
p_final = res.minimizer
In just $1.745$ seconds and $8658251$ allocations: 361.20 MiB (counting plots) we get a loss function of $2.401364 \times 10^{-23}$ and parameters: $$\alpha = 1.5000000000009583 \approx 1.5$$ $$\beta = 1.0000000000008002 \approx 1.0$$ $$\gamma = 3.0000000000005405 \approx 3.0$$ $$\delta = 0.9999999999995174 \approx 1.0$$
Click here to see more
Notice that the election of BFGS
makes us converge quickier than using ADAM
(try this yourself). Usually ADAM
its pretty good for the first iterations to get local optima but then its better to change to BFGS
to do the final steps. Otherwise we can use BlackBoxOptim
to get global optima algorithms.
using BlackBoxOptim
res = DiffEqFlux.sciml_train(loss, pinit,
DiffEqFlux.BBO(),
cb = plot_callback,
lower_bounds= 0.5ones(4),
upper_bounds=4.0ones(4) )
After $48690$ steps we get best candidate found: $[1.5, 1.0, 3.0, 1.0]$
Bayesian Inference
In this section we will use Turing.jl
together with the documentation
Click here to see more.
Most of the scientific community deals with the basic problem of trying to mathematically model the reality around them and this often involves dynamical systems. The general trend to model these complex dynamical systems is through the use of differential equations. Differential equation models often have non-measurable parameters. The popular “forward-problem” of simulation consists of solving the differential equations for a given set of parameters, the “inverse problem” to simulation, known as parameter estimation, is the process of utilizing data to determine these model parameters. Bayesian inference provides a robust approach to parameter estimation with quantified uncertainty.
First we set up all the packages that will be use together with a fix seed for reproducibility of the results.
using Turing, Distributions, DataFrames, DifferentialEquations, DiffEqSensitivity
# Import MCMCChain, Plots, and StatsPlots for visualizations and diagnostics.
using MCMCChains, Plots, StatsPlots
# Set a seed for reproducibility.
using Random
Random.seed!(12);
We will keep using the same Lotka-Volerra equation and we’ll generate the data to use for the parameter estimation from simulation.
odedata = Array(solve(prob,Tsit5(),saveat=0.1))
Turing and DifferentialEquations are completely composable and you can write of the differential equation inside a Turing @model
and it will just work.
We can rewrite the Lotka Volterra parameter estimation problem with a Turing @model
interface as below
Turing.setadbackend(:forward_diff) #Small Model
@model function fitlv(data)
σ ~ InverseGamma(2, 3)
α ~ truncated(Normal(1.5,0.5),0.5,2.5)
β ~ truncated(Normal(1.2,0.5),0,2)
γ ~ truncated(Normal(3.0,0.5),1,4)
δ ~ truncated(Normal(1.0,0.5),0,2)
p = [α,β,γ,δ]
prob = ODEProblem(lokta_volterra!,u₀,tspan,p)
predicted = solve(prob,Tsit5(),saveat=0.1)
for i = 1:length(predicted)
data[:,i] ~ MvNormal(predicted[i], σ) #Maximum Likehood Estimation
end
end
model = fitlv(odedata)
chain = sample(model, NUTS(.65),10000)
We just give our prior distribution and solve the dynamics to calcule our predictions and then compare it with the data in a maximum likehood estimation.
Neural Ordinary Differential Equations with sciml_train
First, lets generate the data
using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, Plots
u0 = Float32[2.0; 0.0]
datasize = 30
tspan = (0.0f0, 1.5f0)
tsteps = range(tspan[1], tspan[2], length = datasize)
function trueODEfunc(du, u, p, t)
true_A = [-0.1 2.0; -2.0 -0.1]
du .= ((u.^3)'true_A)'
end
prob_trueode = ODEProblem(trueODEfunc, u0, tspan)
ode_data = Array(solve(prob_trueode, Tsit5(), saveat = tsteps))
Second, we create a neural network that represents some idea we know about the system.
dudt2 = FastChain((x, p) -> x.^3,
FastDense(2, 50, tanh),
FastDense(50, 2))
What the neural network is just a mathematical function with the right parameters, what the code is doing is just writting:
$ Wx^3 \rightarrow Wx^3 + b \rightarrow tanh(Wx^3+b)$ $\rightarrow W_2 tanh(Wx^3+b) \rightarrow W_2 tanh(Wx^3+b) + b_2 $
Note: For large neural networks its recommended to use Flux
instead of DiffEqFlux
.
Now we can write the ODEProblem
and solve it.
neural_ode_f(u,p,t) = dudt2(u,p)
pinit = initial_params(dudt2)
prob = ODEProblem(neural_ode_f,u0,(0.0f0,1.5f0),pinit)
sol = solve(prob, saveat = tsteps)
As we see the initial guess is not good, since we just try it to approximate by a random ODE
. Then what we need to do its find the right parameters that describe the neural network such that matches the ODE well enough. Therefore we can do it as before:
function loss(p)
tmp_prob = remake(prob,p=p)
tmp_sol = solve(tmp_prob,Tsit5(), saveat = tsteps)
sum(abs2, Array(tmp_sol) - ode_data)
end
function neuralode_callback(p,l)
tmp_prob = remake(prob,p=p)
tmp_sol = solve(tmp_prob,Tsit5(), saveat = tsteps)
fig = plot(tmp_sol)
scatter!(fig,tsteps,ode_data')
display(fig)
false
end
DiffEqFlux.sciml_train(loss, pinit, ADAM(0.05),
cb = neuralode_callback,
maxiters = 500)
We get a loss value of $0.0678$. This can be optimize by using ADAM
and then BFGS
res = DiffEqFlux.sciml_train(loss, pinit, ADAM(0.05),
cb = neuralode_callback,
maxiters = 100)
DiffEqFlux.sciml_train(loss, res.minimizer,
BFGS(initial_stepnorm=0.01),
maxiters = 100,
cb = neuralode_callback)
Now we get a loss value of $ 1.591519 \times 10^{-3}$.
We can see that most of the computational time is on the gradients. For instance Zygote.jl
and Turing.jl
take control over which algorithm is used in order to optimize performance, anyways we can always choose which one we want
see DifferentialEquations.jl documentation on Sensitivity Algorithms
for this matter.
Click here to check more about Neural ODE on the SciML ecosystem.
Universal ODEs learn and extrapolate other dynamical behaviors
Truth equation:
$$ \dot{x} = \alpha x - \beta xy$$ $$ \dot{y} = \gamma xy - \delta y$$
Partially-known neural embedded equations
$$ \dot{x} = \alpha x - U_1(x,y)$$ $$ \dot{y} = -\delta y + U_2(x,y)$$
Automatically recover the long-term behaviour from less than half of a period in a cyclick time series!
Turn neural networks back intro equations with SInDy
.
Let’s define the experimental parameter for the Lokta - Volterra equation.
tspan = (0.0f0,3.0f0)
u0 = Float32[0.44249296,4.6280594]
p_ = Float32[1.3, 0.9, 0.8, 1.8]
prob = ODEProblem(lotka, u0,tspan, p_)
solution = solve(prob, Vern7(), abstol=1e-12, reltol=1e-12, saveat = 0.1)
Then we add noise to the data so that we do not overfit.
X = Array(solution)
Xₙ = X + Float32(1e-3)*randn(eltype(X), size(X))
Define the neueral network which learns $L(x, y, y(t-\tau))$. Actually, we do not care about overfitting right now, since we want to extract the derivative information without numerical differentiation.
L = FastChain(FastDense(2, 32, tanh),FastDense(32, 32, tanh), FastDense(32, 2))
p = initial_params(L)
Let’s define now the neural network given by:
$$ \dot{x} = \alpha x - U_1(x,y)$$ $$ \dot{y} = -\delta y + U_2(x,y)$$
function dudt_(u, p,t)
x, y = u
z = L(u,p)
[p_[1]*x + z[1],
-p_[4]*y + z[2]]
end
So then when we solve
prob_nn = ODEProblem(dudt_,u0, tspan, p)
sol_nn = concrete_solve(prob_nn, Tsit5(), u0, p, saveat = solution.t)
The thick curves represent the real solution, as we see, we get a decent predictor for only the first second, afterwards the prediction for $u_1(t)$ its pretty bad, while the prediction for $u_2(t)$ it’s ok.
Let’s improve now. For this we will do as before, we perfom a prediction and then compute a loss
function on the prediction to check how well are we fitting the data. Finally we create a Callback
that saves the losses each $50$ iterations.
function predict(θ)
Array(concrete_solve(prob_nn, Vern7(), u0, θ, saveat = solution.t,
abstol=1e-6, reltol=1e-6,
sensealg = InterpolatingAdjoint(autojacvec=ReverseDiffVJP())))
end
function loss(θ)
pred = predict(θ)
sum(abs2, Xₙ .- pred), pred
end
const losses = []
callback(θ,l,pred) = begin
push!(losses, l)
if length(losses)%50==0
println("Current loss after $(length(losses)) iterations: $(losses[end])")
end
false
end
First train with ADAM
for better convergence adn then train with BFGS
res1 = DiffEqFlux.sciml_train(loss, p, ADAM(0.01), cb=callback, maxiters = 200)
res2 = DiffEqFlux.sciml_train(loss, res1.minimizer, BFGS(initial_stepnorm=0.01),
cb=callback,
maxiters = 10000)
Final training loss after $482$ iterations is $2.74 \times 10^{-4}$ and the approximation fits the real solution really well.
Notice that we purposely made the real solution curve thicker so that its easier to see, otherwise both curves are superposed.
We can also check the derivatives.
DX = Array(solution(solution.t, Val{1}))
prob_nn2 = ODEProblem(dudt_,u0, tspan, res2.minimizer)
_sol = solve(prob_nn2, Tsit5())
DX_ = Array(_sol(solution.t, Val{1}))
Finally, we know that the real functions are $\beta xy$ and $\gamma xy$. Lets check the error plot.
# Ideal data
L̄ = [-p_[2]*(X[1,:].*X[2,:])';p_[3]*(X[1,:].*X[2,:])']
# Neural network guess
L̂ = L(Xₙ,res2.minimizer)
Transforming a neural network fit into equations in sparsified from using SInDy
Now we want to use this nice fit and transformed back into equations. For this porpuse we’ll use ModelingToolkit.jl
.
We will let the model know that we have $2$ variables and then create a basis that can approximate this functions by linear combinations of $sin(u_1)$ , $cos(u_1)$, $sin(u_2)$, $cos(u_2)$, $constant$, $u_1(t)^k$, $u_2(t)^k$ and $u_1(t)^k * u_2(t)^{5-k}$ with $k = 1…5$
@variables u[1:2]
# Lots of polynomials
polys = Operation[1]
for i ∈ 1:5
push!(polys, u[1]^i)
push!(polys, u[2]^i)
for j ∈ i:5
if i != j
push!(polys, (u[1]^i)*(u[2]^j))
push!(polys, u[2]^i*u[1]^i)
end
end
end
# And some other stuff
h = [cos.(u)...; sin.(u)...; polys...]
basis = Basis(h, u)
Now we will create an optimizer for the SINDY problem Check more about Sparse Identification of Nonlinear Dynamics
# Create an optimizer for the SINDY problem
opt = SR3()
# Create the thresholds which should be used in the search process
λ = exp10.(-7:0.1:3)
# Target function to choose the results from; x = L0 of coefficients and L2-Error of the model
f_target(x, w) = iszero(x[1]) ? Inf : norm(w.*x, 2)
Let’s see what happens if we want to use pure SINDY, meaning we have no pior information, only the data generated by our neural network,i.e, we took the values of the differential equation through the time series, we run it on the neural network giving us the output $\hat{L}$ and the $X$ is the input of the neural network, then we make SINDY transform this data into a system of equations using the basis.
Ψ = SInDy(Xₙ[:, :], DX[:, :], basis, λ, opt = opt, maxiter = 10000,
f_target = f_target)
As a result we get : $$du_1 = \beta sin(u_1) + \alpha cos(u_2) + \gamma u_1^2$$ $$du_2 = \delta u_2 $$
i.e, we failed, but remember that we didn’t use any pior information.
Test on ideal derivative data for unknown function (not available).
Ψ = SInDy(Xₙ[:, 5:end], L̄[:, 5:end], basis, λ, opt = opt, maxiter = 10000,
f_target = f_target)
This time we succeded as we recovered the missing terms of each equation.
$$du_1 = \beta u_1u_2 $$ $$du_2 = \gamma u_1u_2 $$
And we also succeded using derivative data.
# Test on uode derivative data
println("SINDY on learned, partial, available data")
Ψ = SInDy(Xₙ[:, 2:end], L̂[:, 2:end], basis, λ, opt = opt, maxiter = 10000,
normalize = true,
denoise = true,
f_target = f_target)
Now we can extract the parameters.
p̂ = parameters(Ψ)
println("First parameter guess : $(p̂)")
unknown_sys = ODESystem(Ψ)
unknown_eq = ODEFunction(unknown_sys)
The equations are recovered, but the parameters may not be the best, we can start another sindy run to get closer to the ground truth.
# Just the equations
b = Basis((u, p, t)->unknown_eq(u, [1.; 1.], t), u)
# Retune for better parameters -> we could also use DiffEqFlux or other parameter estimation tools here.
Ψf = SInDy(Xₙ[:, 2:end], L̂[:, 2:end], b, opt = STRRidge(0.01), maxiter = 100, convergence_error = 1e-18) # Succeed
println(Ψf)
p̂ = parameters(Ψf)
println("Second parameter guess : $(p̂)")
# Create function
recovered_sys = ODESystem(Ψf)
recovered_eq = ODEFunction(recovered_sys)
# Build a ODE for the estimated system
function dudt(du, u, p, t)
# Add SInDy Term
α, δ, β, γ = p
z = recovered_eq(u, [β; γ], t)
du[1] = α*u[1] + z[1]
du[2] = -δ*u[2] + z[2]
end
# Create the approximated problem and solution
ps = [p_[[1,4]]; p̂]
approximate_prob = ODEProblem(dudt, u0, tspan, ps)
approximate_solution = solve(approximate_prob, Tsit5(), saveat = 0.01)
# Look at long term prediction
t_long = (0.0, 50.0)
approximate_prob = ODEProblem(dudt, u0, t_long, ps)
approximate_solution_long = solve(approximate_prob, Tsit5(), saveat = 0.1) # Using higher tolerances here results in exit of julia
true_prob = ODEProblem(lotka, u0, t_long, p_)
true_solution_long = solve(true_prob, Tsit5(), saveat = approximate_solution_long.t)
Why are neural networks so good? because for large enough neural network, local optima are global optima.
Click here to see more about Universal differential equations
Physics-Informed Neural Networks
Solve and ODE with a neural network
- Let $u' = f(u,t)$ with $u(0) = u_0$. We want to build a neural network $NN(t)$ that is the solution to this differential equation.
- By definition then, we must have that $NN'(t) = f(NN(t),t)$ and $NN(0) = u_0$.
- Define $\mathcal{C}(\theta) = \displaystyle \sum_{t} \lVert NN'(t) - f(NN(t),t) \rVert$ for $\theta$ the parameters of the ODE.
- Then this cost is zero when $NN(t)$ is the solution to the ODE
- Therefore we aim to minimize this loss to get the solution.
- Extra trick: We can use $g(t) = tNN(t) + u_0$ as a test function instead of $NN(t)$. Notice that it is an approximator that always satisfies the boundary condition.
Example
Let’s solve the ODE: $$u'(t) = cos(2\pi t) = f(u,t)$$ with initial condition $$u(0) = 1$$
Following the steps above, we create a neural network, a function $g(t)$ which we derivate and compute the $l_2$ norm of the difference between $g'(t)$ and $u'(t)$. In order to do this, we define $\epsilon$ as small as the precision of a Float32 allow us.
using Flux
NNODE = Chain( x -> [x],
Dense(1,32,tanh),
Dense(32,1),
first)
NNODE(1.0)
g(t) = t*NNODE(t) + 1f0
using Statistics
ϵ = sqrt(eps(Float32))
loss() = mean(abs2( ( ( g(t+ϵ) - g(t) )/ϵ ) - cos(2π*t) ) for t in 0f0:0.01f0:1f0)
We create a callback and train the neural network using a descent algorithm.
iter = 0
cb = function()
global iter += 1
if iter % 500 == 0
display( loss() )
end
end
opt = Flux.Descent(0.01)
data = Iterators.repeated( (), 5000 )
Flux.train!(loss, Flux.params(NNODE), data, opt; cb= cb)
Now we can compare to the real solution:
$$ u(t) = \int u'(t) dt = \int cos(2\pi t)dt = \dfrac{sin(2\pi t)}{2\pi} + constant$$
but $u(0) = 1$ therefore we get $u(t) = 1 + \dfrac{sin(2\pi t)}{2\pi}$
Why Physics- Informed Neural Networks?
- $\mathcal{C}(\theta) = C_{pde}(\theta) + C_{boundary} + C_{data}(\theta)$ can nudge a model towards data.
- Equivalent to regularizing the neural network by a scientific equation
- Can train fast continuous surrogates by making the neural network parameter dependent.
Example: Solving a 100 Dimensional Hamilton-Jacobi-Bellman Equation
For this problem we will use NeuralPDE.jl
(
Check more here). Following this steps:
- Write the function and equation.
- Make $\sigma^\top \nabla u (t,X)$ a neural network.
- Solve the resulting SDEs and learn $\sigma^\top \nabla u$ via:
using NeuralPDE
using Flux
using DifferentialEquations
using LinearAlgebra
d = 100 # number of dimensions
X0 = fill(0.0f0, d) # initial value of stochastic control process
tspan = (0.0f0, 1.0f0)
λ = 1.0f0
g(X) = log(0.5f0 + 0.5f0 * sum(X.^2))
f(X,u,σᵀ∇u,p,t) = -λ * sum(σᵀ∇u.^2)
μ_f(X,p,t) = zero(X) # Vector d x 1 λ
σ_f(X,p,t) = Diagonal(sqrt(2.0f0) * ones(Float32, d)) # Matrix d x d
prob = TerminalPDEProblem(g, f, μ_f, σ_f, X0, tspan)
hls = 10 + d # hidden layer size
opt = Flux.ADAM(0.01) # optimizer
# sub-neural network approximating solutions at the desired point
u0 = Flux.Chain(Dense(d, hls, relu),
Dense(hls, hls, relu),
Dense(hls, 1))
# sub-neural network approximating the spatial gradients at time point
σᵀ∇u = Flux.Chain(Dense(d + 1, hls, relu),
Dense(hls, hls, relu),
Dense(hls, hls, relu),
Dense(hls, d))
pdealg = NNPDENS(u0, σᵀ∇u, opt=opt)
@time ans = solve(prob, pdealg, verbose=true, maxiters=100, trajectories=100,
alg=EM(), dt=1.2, pabstol=1f-2)
We can solve this high dimensional problem in only $22.623969$ seconds $523.95 M$ allocations: $36.683 GiB$, $17.95%$ gc time.