## Publications

### Preprints

*Julia Grosse, Cheng Zhang, Philipp Hennig*

**Probabilistic DAG Search**

2021

arxiv:2106.08717 | pdf | BibTex

Exciting contemporary machine learning problems have recently been phrased in the classic formal- ism of tree search—most famously, the game of Go. Interestingly, the state-space underlying these sequential decision-making problems often posses a more general latent structure than can be captured by a tree. In this work, we develop a probabilis- tic framework to exploit a search space’s latent structure and thereby share information across the search tree. The method is based on a combina- tion of approximate inference in jointly Gaussian models for the explored part of the problem, and an abstraction for the unexplored part that imposes a reduction of complexity ad hoc. We empirically find our algorithm to compare favorably to existing non-probabilistic alternatives in Tic-Tac-Toe and a feature selection application.

*Felix Dangel, Lukas Tatzel, Philipp Hennig*

**ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure**

2021

arxiv: 2106.02624 | pdf | BibTeX

Curvature in form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model for the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN's low-rank structure without further approximations. It allows for efficient computation of eigenvalues, eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. As examples for ViViT's usefulness, we investigate the directional gradients and curvatures during training, and how noise information can be used to improve the stability of second-order methods.

*Filip de Roos, Carl Jidling, Adrian Wills, Thomas Schön, Philipp Hennig*

**A Probabilistically Motivated Learning Rate Adaptation for Stochastic Optimization**

2021

arxiv:2102.10880 | pdf | BibTeX

Machine learning practitioners invest significant manual and computational resources in finding suitable learning rates for optimization algorithms. We provide a probabilistic motivation, in terms of Gaussian inference, for popular stochastic first-order methods. As an important special case, it recovers the Polyak step with a general metric. The inference allows us to relate the learning rate to a dimensionless quantity that can be automatically adapted during training by a control algorithm. The resulting meta-algorithm is shown to adapt learning rates in a robust manner across a large range of initial values when applied to deep learning benchmark problems.

*Frank Schneider, Felix Dangel, Philipp Hennig*

**Cockpit: A Practical Debugging Tool for Training Deep Neural Networks**

2021

arXiv:2102.06604 | pdf | BibTeX | Code

When engineers train deep learning models, they are very much "flying blind". Commonly used approaches for real-time training diagnostics, such as monitoring the train/test loss, are limited. Assessing a network's training process solely through these performance indicators is akin to debugging software without access to internal states through a debugger. To address this, we present Cockpit, a collection of instruments that enable a closer look into the inner workings of a learning machine, and a more informative and meaningful status report for practitioners. It facilitates the identification of learning phases and failure modes, like ill-chosen hyperparameters. These instruments leverage novel higher-order information about the gradient distribution and curvature, which has only recently become efficiently accessible. We believe that such a debugging tool, which we open-source for PyTorch, represents an important step to improve troubleshooting the training process, reveal new insights, and help develop novel methods and heuristics.

*Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic and Bharath K. Sriperumbudur*

**Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences**

2018

arXiv: 1807.02582 | pdf | BibTeX

This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.

**2021**

*Katharina Ott, Prateek Katiyar, Philipp Hennig, Michael Tiemann*

**ResNet After All: Neural ODEs and Their Numerical Solution**

International Conference on Learning Representations (ICLR), 2021

www | pdf | BibTeX | Code | Video

A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained model is supposed to be a flow generated from an ODE, it should be possible to choose another numerical solver with equal or smaller numerical error without loss of performance. We observe that if training relies on a solver with overly coarse discretization, then testing with another solver of equal or smaller numerical error results in a sharp drop in accuracy. In such cases, the combination of vector field and numerical method cannot be interpreted as a flow generated from an ODE, which arguably poses a fatal breakdown of the Neural ODE concept. We observe, however, that there exists a critical step size beyond which the training yields a valid ODE vector field. We propose a method that monitors the behavior of the ODE solver during training to adapt its step size, aiming to ensure a valid ODE without unnecessarily increasing computational cost. We verify this adaption algorithm on a common bench mark dataset as well as a synthetic dataset.

*Christian Fröhlich, Alexandra Gessner, Philipp Hennig, Bernhard Schölkopf, Georgios Arvanitidis*

**Bayesian Quadrature on Riemannian Data Manifolds**

PMLR 139:3459-3468, 2021, M. Meila und T. Zhong eds.

Riemannian manifolds provide a principled way to model nonlinear geometric structure inherent in data. A Riemannian metric on said manifolds determines geometry-aware shortest paths and provides the means to define statistical models accordingly. However, these operations are typically computationally demanding. To ease this computational burden, we advocate probabilistic numerical methods for Riemannian statistics. In particular, we focus on Bayesian quadrature (BQ) to numerically compute integrals over normal laws on Riemannian manifolds learned from data. In this task, each function evaluation relies on the solution of an expensive initial value problem. We show that by leveraging both prior knowledge and an active exploration scheme, BQ significantly reduces the number of required evaluations and thus outperforms Monte Carlo methods on a wide range of integration problems. As a concrete application, we highlight the merits of adopting Riemannian geometry with our proposed framework on a nonlinear dataset from molecular dynamics.

*Robin M. Schmidt, Frank Schneider, Philipp Hennig*

**Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers**

PMLR 139:9367-9376, 2021, M. Meila und T. Zhong eds.

www | pdf | Project Page | BibTeX

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a signifcantly reduced subset of specifc optimizers and parameter choices that generally lead to competi- tive results in our experiments: ADAM remains a strong contender, with newer methods failing to signifcantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

*Filip de Roos, Alexandra Gessner, Philipp Hennig*

**High-Dimensional Gaussian Process Inference with Derivatives**

PMLR 139:2535-2545, 2021, M. Meila and T. Zhanf, eds.

Although it is widely known that Gaussian processes can be conditioned on observations of the gradient, this functionality is of limited use due to the prohibitive computational cost of O(N^3D^3) in data points N and dimension D. The dilemma of gradient observations is that a single one of them comes at the same cost as D independent function evaluations, so the latter are often preferred. Careful scrutiny reveals, however, that derivative observations give rise to highly structured kernel Gram matrices for very general classes of kernels (inter alia, stationary kernels). We show that in the low-data regime N<D, the Gram matrix can be decomposed in a manner that reduces the cost of inference to O(N^2D+(N^2)^3) (i.e., linear in the number of dimensions) and, in special cases, to O(N^2D+N^3). This reduction in complexity opens up new use-cases for inference with gradients especially in the high-dimensional regime, where the information-to-cost ratio of gradient observations significantly increases. We demonstrate this potential in a variety of tasks relevant for machine learning, such as optimization and Hamiltonian Monte Carlo with predictive gradients.

*Simon Bartels *

**Probabilistic Linear Algebra**

2021

Linear algebra operations are at the core of many computational tasks. For example, evaluating the density of a multivariate normal distribution requires the solution of a linear equation system and the determinant of a square matrix. Frequently, and in particular in machine learning, the size of the involved matrices is too large to compute exact solutions, and necessitate approximation. Building upon recent work (Hennig and Kiefel 2012; Hennig 2015) this thesis considers numerical linear algebra from a probabilistic perspective. Part iii establishes connections between approximate linear solvers and Gaussian inference, with a focus on projection methods. One result is the observation that solution-based inference (Cockayne, Oates, Ipsen, and Girolami 2018) is subsumed in the matrix-based inference perspective (Hennig and Kiefel 2012). Part iv shows how the probabilistic viewpoint leads to a novel algorithm for kernel least-squares problems. A Gaussian model over kernel functions is proposed that uses matrix-multiplications computed by conjugate gradients to obtain a low-rank approximation of the kernel. The derived algorithm kernel machine conjugate gradients provides empirically better approximations than conjugate gradients and, when used for Gaussian process regression, additionally provides estimates for posterior variance and log-marginal likelihood, without the need to rerun. Part v is concerned with the approximation of kernel matrix determinants. Assuming the inputs to the kernel are independent and identically distributed, a stopping condition for the Cholesky decomposition is presented that provides probably approximately correct (PAC) estimates of the log-determinant with only little overhead.

*Hans Kersting*

**Uncertainty-Aware Numerical Solutions of ODEs by Bayesian Filtering**

2021

Numerical analysis is the branch of mathematics that studies algorithms that compute approximations of well-defined, but analytically-unknown mathematical quantities. Statistical inference, on the other hand, studies which judgments can be made on unknown parameters in a statistical model. By interpreting the unknown quantity of interest as a parameter and providing a statistical model that relates it to the available numerical information (the `data'), we can thus recast any problem of numerical approximation as statistical inference. In this way, the field of probabilistic numerics introduces new 'uncertainty-aware' numerical algorithms that capture all relevant sources of uncertainty (including all numerical approximation errors) by probability distributions. While such recasts have been a decades-long success story for global optimization and quadrature (under the names of Bayesian optimization and Bayesian quadrature), the equally important numerical task of solving ordinary differential equations (ODEs) has been, until recently, largely ignored. With this dissertation, we aim to further shed light on this area of previous ignorance in three ways: Firstly, we present a first rigorous Bayesian model for initial value problems (IVPs) as statistical inference, namely as a stochastic filtering problem, which unlocks the employment of all Bayesian filters (and smoothers) to IVPs. Secondly, we theoretically analyze the properties of these new ODE filters, with a special emphasis on the convergence rates of Gaussian (Kalman) ODE filters with integrated Brownian motion prior, and explore their potential for (active) uncertainty quantification. And, thirdly, we demonstrate how employing these ODE filters as a forward simulator engenders new ODE inverse problem solvers that outperform classical 'uncertainty-unaware' ('likelihood-free') approaches. This core content is presented in Chapter 2. It is preceded by a concise introduction in Chapter 1 which conveys the necessary concepts and locates our work in the research environment of probabilistic numerics. The final Chapter 3 concludes with an in-depth discussion of our results and their implications.

*Nathanael Bosch, Philipp Hennig, Filip Tronarp*

**Calibrated Adaptive Probabilistic ODE Solvers**

AISTATS, 2021

www | pdf | BibTex | Code | Video

Probabilistic solvers for ordinary differential equations (ODEs) assign a posterior measure to the solution of an initial value problem. The joint covariance of this distribution provides an estimate of the (global) approximation error. The contraction rate of this error estimate as a function of the solver's step size identifies it as a well-calibrated worst-case error. But its explicit numerical value for a certain step size, which depends on certain parameters of this class of solvers, is not automatically a good estimate of the explicit error. Addressing this issue, we introduce, discuss, and assess several probabilistically motivated ways to calibrate the uncertainty estimate. Numerical experiments demonstrate that these calibration methods interact efficiently with adaptive step-size selection, resulting in descriptive, and efficiently computable posteriors. We demonstrate the efficiency of the methodology by benchmarking against the classic, widely used Dormand-Prince 4/5 Runge-Kutta method.

*Alonso Marco Valle, Dominik Baumann, Majid Khadiv, Philipp Hennig, Ludovic Righetti, Sebastian Trimpe*

**Robot Learning with Crash Constraints**

IEEE Robotics and Automation Letters, 2021

www | pdf | BibTex | Code | DOI: 10.1109/LRA.2021.3057055

In the past decade, numerous machine learning algorithms have been shown to successfully learn optimal policies to control real robotic systems. However, it is common to encounter failing behaviors as the learning loop progresses. Specifically, in robot applications where failing is undesired but not catastrophic, many algorithms struggle with leveraging data obtained from failures. This is usually caused by (i) the failed experiment ending prematurely, or (ii) the acquired data being scarce or corrupted. Both complicate the design of proper reward functions to penalize failures. In this paper, we propose a framework that addresses those issues. We consider failing behaviors as those that violate a constraint and address the problem of learning with crash constraints, where no data is obtained upon constraint violation. The no-data case is addressed by a novel GP model (GPCR) for the constraint that combines discrete events (failure/success) with continuous observations (only obtained upon success). We demonstrate the effectiveness of our framework on simulated benchmarks and on a real jumping quadruped, where the constraint threshold is unknown a priori. Experimental data is collected, by means of constrained Bayesian optimization, directly on the real robot. Our results outperform manual tuning and GPCR proves useful on estimating the constraint threshold.

*Filip Tronarp, Simo Sarkka, Philipp Hennig*

**Bayesian ODE Solvers: The Maximum A Posteriori Estimate**

Statistics and Computing 31(3): Article Nr. 23, 2021

There is a growing interest in probabilistic numerical solutions to ordinary differential equations. In this paper, the maximum a posteriori estimate is studied under the class of ν times differentiable linear time invariant Gauss–Markov priors, which can be computed with an iterated extended Kalman smoother. The maximum a posteriori estimate corresponds to an optimal interpolant in the reproducing kernel Hilbert space associated with the prior, which in the present case is equivalent to a Sobolev space of smoothness ν + 1. Subject to mild conditions on the vector field, convergence rates of the maximum a posteriori estmate are then obtained via methods from nonlinear analysis and scattered data approximation. These results closely resemble classical convergence results in the sense that a ν times differentiable prior process obtains a global order of ν, which is demonstrated in numerical examples.

## 2020

*Hans Kersting, T.J. Sullivan, Philipp Hennig*

**Convergence Rates of Gaussian ODE Filters**

Statistics and Computing, 30(6):1791–1816 , 2020

www | pdf | BibTeX | Project Page

A recently-introduced class of probabilistic (uncertainty-aware) solvers for ordinary differential equations (ODEs) applies Gaussian (Kalman) filtering to initial value problems. These methods model the true solution *x* and its first *q* derivatives a priori as a Gauss-Markov process **X**, which is then iteratively conditioned on information about x'. We prove worst-case local convergence rates of order *h ^{q+1}* for a wide range of versions of this Gaussian ODE filter, as well as global convergence rates of order

*h*in the case of

^{q}*q=1*and an integrated Brownian motion prior, and analyze how inaccurate information on

*x'*coming from approximate evaluations of

*f*affects these rates. Moreover, we present explicit formulas for the steady states and show that the posterior confidence intervals are well calibrated in all considered cases that exhibit global convergence—in the sense that they globally contract at the same rate as the truncation error.

*Jonathan Wenger, Philipp Hennig*

**Probabilistic Linear Solvers for Machine Learning**

Advances in Neural Information Processing Systems (NeurIPS) 2020

www | pdf | BibTex | Code | Video

Linear systems are the bedrock of virtually all numerical computation. Machine learning poses specific challenges for the solution of such systems due to their scale, characteristic structure, stochasticity and the central role of uncertainty in the field. Unifying earlier work we propose a class of probabilistic linear solvers which jointly infer the matrix, its inverse and the solution from matrix-vector product observations. This class emerges from a fundamental set of desiderata which constrains the space of possible algorithms and recovers the method of conjugate gradients under certain conditions. We demonstrate how to incorporate prior spectral information in order to calibrate uncertainty and experimentally showcase the potential of such solvers for machine learning.

*Niklas Wahl, Philipp Hennig, Hans‐Peter Wieser, Mark Bangert*

**Analytical probabilistic modeling of dose‐volume histograms**

Medical Physics, 2020, Volume: 47, Issue: 10, pages: 5260-5273

Purpose

Radiotherapy, especially with charged particles, is sensitive to executional and preparational uncertainties that propagate to uncertainty in dose and plan quality indicators, e. g., dose‐volume histograms (DVHs). Current approaches to quantify and mitigate such uncertainties rely on explicitly computed error scenarios and are thus subject to statistical uncertainty and limitations regarding the underlying uncertainty model. Here we present an alternative, analytical method to approximate moments, in particular expectation value and (co)variance, of the probability distribution of DVH‐points, and evaluate its accuracy on patient data.

Methods

We use Analytical Probabilistic Modeling (APM) to derive moments of the probability distribution over individual DVH‐points based on the probability distribution over dose. By using the computed moments to parameterize distinct probability distributions over DVH‐points (here normal or beta distributions), not only the moments but also percentiles, i. e., *α*‐DVHs, are computed. The model is subsequently evaluated on three patient cases (intracranial, paraspinal, prostate) in 30‐ and singlefraction scenarios by assuming the dose to follow a multivariate normal distribution, whose moments are computed in closed‐form with APM. The results are compared to a benchmark based on discrete random sampling.

Results

The evaluation of the new probabilistic model on the three patient cases against a sampling benchmark proves its correctness under perfect assumptions as well as good agreement in realistic conditions. More precisely, ca. 90% of all computed expected DVH‐points and their standard deviations agree within 1% volume with their empirical counterpart from sampling computations, for both fractionated and single fraction treatments. When computing *α*‐DVHs, the assumption of a beta distribution achieved better agreement with empirical percentiles than the assumption of a normal distribution: While in both cases probabilities locally showed large deviations (up to ±0.2), the respective *α *‐DVHs for *α *= {0:05; 0:5; 0:95} only showed small deviations in respective volume (up to ±5% volume for a normal distribution, and up to 2% for a beta distribution). A previously published model from literature, which was included for comparison, exhibited substantially larger deviations.

Conclusions

With APM we could derive a mathematically exact description of moments of probability distributions over DVH‐points given a probability distribution over dose. The model generalizes previous attempts and performs well for both choices of probability distributions, i. e., normal or beta distributions, over DVH‐points.

*Hans Kersting, Maren Mahsereci*

**A Fourier State Space Model for Bayesian ODE Filters**

www | pdf | BibTeX | Project Page | Video

ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2020

Gaussian ODE filtering is a probabilistic numerical method to solve ordinary differential equations (ODEs). It computes a Bayesian posterior over the solution from evaluations of the vector field defining the ODE. Its most popular version, which employs an integrated Brownian motion prior, uses Taylor expansions of the mean to extrapolate forward and has the same convergence rates as classical numerical methods. As the solution of many important ODEs are periodic functions (oscillators), we raise the question whether Fourier expansions can also be brought to bear within the framework of Gaussian ODE filtering. To this end, we construct a Fourier state space model for ODEs and a 'hybrid' model that combines a Taylor (Brownian motion) and Fourier state space model. We show by experiments how the hybrid model might become useful in cheaply predicting until the end of the time domain.

*Augustinus Kristiadi, Matthias Hein, Philipp Hennig *

**Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks**

International Conference on Machine Learning (ICML) 2020

www | pdf | BibTeX | Code | Video

The point estimates of ReLU classification networks---arguably the most widely used neural network architecture---have been shown to yield arbitrarily high confidence far away from the training data. This architecture, in conjunction with a maximum a posteriori estimation scheme, is thus not calibrated nor robust. Approximate Bayesian inference has been empirically demonstrated to improve predictive uncertainty in neural networks, although the theoretical analysis of such Bayesian approximations is limited. We theoretically analyze approximate Gaussian distributions on the weights of ReLU networks and show that they fix the overconfidence problem. Furthermore, we show that even a simplistic, thus cheap, Bayesian approximation, also fixes these issues. This indicates that a sufficient condition for a calibrated uncertainty on a ReLU network is ``to be a bit Bayesian''. These theoretical results validate the usage of last-layer Bayesian approximation and motivate a range of a fidelity-cost trade-off. We further validate these findings empirically via various standard experiments using common deep ReLU networks and Laplace approximations.

*Hans Kersting, Nicholas Krämer, Martin Schiegg, Christian Daniel, Michael Tiemann, Philipp Hennig*

**Differentiable Likelihoods for Fast Inversion of 'Likelihood-Free' Dynamical Systems**

International Conference on Machine Learning (ICML), 2020

Likelihood-free (a.k.a. simulation-based) inference problems are inverse problems with expensive, or intractable, forward models. ODE inverse problems are commonly treated as likelihood-free, as their forward map has to be numerically approximated by an ODE solver. This, however, is not a fundamental constraint but just a lack of functionality in classic ODE solvers, which do not return a likelihood but a point estimate. To address this shortcoming, we employ Gaussian ODE filtering (a probabilistic numerical method for ODEs) to construct a local Gaussian approximation to the likelihood. This approximation yields tractable estimators for the gradient and Hessian of the (log-)likelihood. Insertion of these estimators into existing gradient-based optimization and sampling methods engenders new solvers for ODE inverse problems. We demonstrate that these methods outperform standard likelihood-free approaches on three benchmark-systems.

*Simon Bartels, Philipp Hennig*

**Conjugate Gradients for Kernel Machines**

Journal of Machine Learning Research (2020) 21: 1- 42

Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require to resort to approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning.

*Alexandra Gessner, Oindrila Kanjilal, Philipp Hennig*

**Integrals over Gaussians under Linear Domain Constraints**

Advances in Artificial Intelligence and Statistics (AISTATS), 2020

Integrals of linearly constrained multivariate Gaussian densities are a frequent problem in machine learning and statistics, arising in tasks like generalized linear models and Bayesian optimization. Yet they are notoriously hard to compute, and to further complicate matters, the numerical values of such integrals may be very small. We present an efficient black-box algorithm that exploits geometry for the estimation of integrals over a small, truncated Gaussian volume, and to simulate therefrom. Our algorithm uses the Holmes-Diaconis-Ross (HDR) method combined with an analytic version of elliptical slice sampling (ESS). Adapted to the linear setting, ESS allows for rejection-free sampling, because intersections of ellipses and domain boundaries have closed-form solutions. The key idea of HDR is to decompose the integral into easier-to-compute conditional probabilities by using a sequence of nested domains. Remarkably, it allows for direct computation of the logarithm of the integral value and thus enables the computation of extremely small probability masses. We demonstrate the effectiveness of our tailored combination of HDR and ESS on high-dimensional integrals and on entropy search for Bayesian optimization.

*Felix Dangel, Stefan Harmeling*,* Philipp Hennig *

**Modular Block-diagonal Curvature Approximations for Feedforward Architectures**

Advances in Artificial Intelligence and Statistics (AISTATS)

PMLR 108:799-808, 2020

www | pdf | BibTeX | Code | Video

We propose a modular extension of backpropagation for computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian). The approach reduces the otherwise tedious manual derivation of these matrices into local modules, and is easy to integrate into existing machine learning libraries. Moreover, we develop a compact notation derived from matrix differential calculus. We outline different strategies applicable to our method. They subsume recently-proposed block-diagonal approximations as special cases, and we extend the concepts presented therein to convolutional neural networks.

*Felix Dangel, Frederik Kunstner, Philipp Hennig*

**BackPACK: Packing more into Backprop**

International Conference on Learning Representations (ICLR), 2020

www | pdf | BibTeX | Project Page | Code | Video

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-learning software does not support their automatic calculation. Manually implementing them is burdensome, inefficient if done naively, and the resulting code is rarely shared. This hampers progress in deep learning, and unnecessarily narrows research to focus on gradient descent and its variants; it also complicates replication studies and comparisons between newly developed methods that require those quantities, to the point of impossibility. To address this problem, we introduce BackPACK, an efficient framework built on top of PyTorch, that extends the backpropagation algorithm to extract additional information from first- and second-order derivatives. Its capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and an example application by testing several recent curvature approximations for optimization.

## 2019

*Frederik Kunstner, Lukas Balles, Philipp Hennig *

**Limitations of the Empirical Fisher Approximation for natural gradient descent**

in H. Wallach, H. Larochelle, A. Beygelzimmer, F. d'Alche-Buc, E. Fox and R. Garnett, eds.

*Advances in Neural Information Processing Systems (NeurIPS) 32 (2019)*

www | pdf | supplements (zip) | | bibtex | Code | Project Page | Image | video

Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argument by showing that the empirical Fisher—unlike the Fisher—does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects.

*Motonobu Kanagawa, Philipp Hennig*

**Convergence Guarantees for Adaptive Bayesian Quadrature Methods**

in H. Wallach, H. Larochelle, A. Beygelzimmer, F. d'Alche-Buc, E. Fox and R. Garnett, eds.

*Advances in Neural Information Processing Systems (NeurIPS) 32 (2019)*

www | pdf | supplements (zip) | bibtex | slides |

Adaptive Bayesian quadrature (ABQ) is a powerful approach to numerical integration that empirically compares favorably with Monte Carlo integration on problems of medium dimensionality (where non- adaptive quadrature is not competitive). Its key ingredient is an acquisition function that changes as a function of previously collected values of the integrand. While this adaptivity appears to be empirically powerful, it complicates analysis. Consequently, there are no theoretical guarantees so far for this class of methods. In this work, for a broad class of adaptive Bayesian quadrature methods, we prove consistency, deriving non-tight but informative convergence rates. To do so we introduce a new concept we call weak adaptivity. In guaranteeing consistency of ABQ, weak adaptivity is notionally similar to the ideas ofdetailed balance and ergodicity in Markov Chain Monte Carlo methods, which allow sufficient conditions for consistency of MCMC. Likewise, our results identify a large and flexible class of adaptive Bayesian quadrature rules as consistent, within which practitioners can develop empirically efficient methods.

*Allesandro Motta, Manuel Berning, Kevin M. Boergens, Benedikt Staffler, Marcel Beining, Sahil Loomba, Philipp Hennig, Heiko Wissler, Moritz Helmstaedter*

**Dense Connectomic Reconstruction in Layer 4 of the Somatosensory Cortex**

Science, 29. November 2019; Vol. 366, Issue 6469 eaay3134

www | pdf | BibTeX | video | press release

The dense circuit structure of mammalian cerebral cortex is still unknown. With developments in three- dimensional electron microscopy, the imaging of sizeable volumes of neuropil has become possible, but dense reconstruction of connectomes is the limiting step. We reconstructed a volume of ~500,000 μm3 from layer 4 of mouse barrel cortex, ~300 times larger than previous dense reconstructions from the mammalian cerebral cortex. The connectomic data allowed the extraction of inhibitory and excitatory neuron subtypes that were not predictable from geometric information. We quantified connectomic imprints consistent with Hebbian synaptic weight adaptation, which yielded upper bounds for the fraction of the circuit consistent with saturated long-term potentiation. These data establish an approach for connectomic phenotyping of local dense neuronal circuitry in the mammalian cortex.

*Alexandra Gessner, Javier Gonzalez, Maren Mahsereci*

**Active Multi-Information Source Bayesian Quadrature **

In Proceedings Conference on Uncertainty in Artificial Intelligence (UAI) 2019, Tel Aviv

PMLR 115:712-721

Bayesian quadrature (BQ) is a sample-efficient probabilistic numerical method to solve integrals of expensive-to-evaluate black-box functions, yet so far,active BQ learning schemes focus merely on the integrand itself as information source, and do not allow for information transfer from cheaper, related functions. Here, we set the scene for active learning in BQ when multiple related information sources of variable cost (in input and source) are accessible. This setting arises for example when evaluating the integrand requires a complex simulation to be run that can be approximated by simulating at lower levels of sophistication and at lesser expense. We construct meaningful cost-sensitive multi-source acquisition rates as an extension to common utility functions from vanilla BQ (VBQ),and discuss pitfalls that arise from blindly generalizing. Furthermore, we show that the VBQ acquisition policy is a corner-case of all considered cost-sensitive acquisition schemes, which collapse onto one single de-generate policy in the case of one source and constant cost. In proof-of-concept experiments we scrutinize the behavior of our generalized acquisition functions. On an epidemiological model, we demonstrate that active multi-source BQ (AMS-BQ) allocates budget more efficiently than VBQ for learning the integral to a good accuracy.

*Toni Karvonen, Motonobu Kanagawa, Simo Särkkä*

**On the positivity and magnitudes of Bayesian quadrature weights**

Statistics and Computing (2019) 29:1317–1333

This article reviews and studies the properties of Bayesian quadrature weights, which strongly affect stability and robustness of the quadrature rule. Specifically, we investigate conditions that are needed to guarantee that the weights are positive or to bound their magnitudes. First, it is shown that the weights are positive in the univariate case if the design points locally minimise the posterior integral variance and the covariance kernel is totally positive (e.g. Gaussian and Hardy kernels). This suggests that gradient-based optimisation of design points may be effective in constructing stable and robust Bayesian quadrature rules. Secondly, we show that magnitudes of the weights admit an upper bound in terms of the fill distance and separation radius if the RKHS of the kernel is a Sobolev space (e.g. Matérn kernels), suggesting that quasi-uniform points should be used. A number of numerical examples demonstrate that significant generalisations and improvements appear to be possible, manifesting the need for further research.

*Yu Nishiyama, Motonobu Kanagawa, Arthur Gretton, Kenji Fukumizu*

**Model-based Kernel Sum Rule: Kernel Bayesian Inference with Probabilistic Models**

Machine Learning, volume 109, pages 939–972

Kernel Bayesian inference is a powerful nonparametric approach to performing Bayesian inference in reproducing kernel Hilbert spaces or feature spaces. In this approach, kernel means are estimated instead of probability distributions, and these estimates can be used for subsequent probabilistic operations (as for inference in graphical models) or in computing the expectations of smooth functions, for instance. Various algorithms for kernel Bayesian inference have been obtained by combining basic rules such as the kernel sum rule (KSR), kernel chain rule, kernel product rule and kernel Bayes' rule. However, the current framework only deals with fully nonparametric inference (i.e., all conditional relations are learned nonparametrically), and it does not allow for flexible combinations of nonparametric and parametric inference, which are practically important. Our contribution is in providing a novel technique to realize such combinations. We introduce a new KSR referred to as the model-based KSR (Mb-KSR), which employs the sum rule in feature spaces under a parametric setting. Incorporating the Mb-KSR into existing kernel Bayesian framework provides a richer framework for hybrid (nonparametric and parametric) kernel Bayesian inference. As a practical application, we propose a novel filtering algorithm for state space models based on the Mb-KSR, which combines the nonparametric learning of an observation process using kernel mean embedding and the additive Gaussian noise model for a state transition process. While we focus on additive Gaussian noise models in this study, the idea can be extended to other noise models, such as the Cauchy and alpha-stable noise models.

*Simon Bartels, Jon Cockayne, Ilse C. F. Ipsen, Philipp Hennig*

**Probabilistic linear solvers: a unifying view**

Statistics and Computing, November 2019, Volume 29, Issue 6, pp 1249–1263, Eds.:M. Girolami, I.C.F. Ipsen, C. J.Oates, A. B. Owen, T. J.Sullivan

Several recent works have developed a new, probabilistic interpretation for numerical algorithms solving linear systems in which the solution is inferred in a Bayesian framework, either directly or by inferring the unknown action of the matrix inverse. These approaches have typically focused on replicating the behaviour of the conjugate gradient method as a prototypical iterative method. In this work, surprisingly general conditions for equivalence of these disparate methods are presented. We also describe connections between probabilistic linear solvers and projection methods for linear systems, providing a probabilistic interpretation of a far more general class of iterative methods. In particular, this provides such an interpretation of the generalised minimum residual method. A probabilistic view of preconditioning is also introduced. These developments unify the literature on probabilistic linear solvers and provide foundational connections to the literature on iterative solvers for linear systems.

*Filip Tronarp, Hans Kersting, Simo Särkkä, Philipp Hennig*

**Probabilistic Solutions To Ordinary Differential Equations As Non-Linear Bayesian Filtering: A New Perspective**

2019, Statistics and Computing (STCO), Volume 29, Issue 6, pp 1297-1315, Eds.: M. Girolami, I. C. F. Ipsen, C. J. Oates, A. B. Owen, T. J. Sullivan

www | pdf | Project Page | BibTeX

We formulate probabilistic numerical approximations to solutions of ordinary differential equations (ODEs) as problems in Gaussian process (GP) regression with non-linear measurement functions. This is achieved by defining the measurement sequence to consists of the observations of the difference between the derivative of the GP and the vector field evaluated at the GP---which are all identically zero at the solution of the ODE. When the GP has a state-space representation, the problem can be reduced to a Bayesian state estimation problem and all widely-used approximations to the Bayesian filtering and smoothing problems become applicable. Furthermore, all previous GP-based ODE solvers, which were formulated in terms of generating synthetic measurements of the vector field, come out as specific approximations. We derive novel solvers, both Gaussian and non-Gaussian, from the Bayesian state estimation problem posed in this paper and compare them with other probabilistic solvers in illustrative experiments.

*Filip de Roos, Philipp Hennig*

**Active Probabilistic Inference on Matrices for Pre-Conditioning in Stochastic Optimization**

Advances in Artificial Intelligence and Statistics (AISTATS), 2019

Pre-conditioning is a well-known concept that can significantly improve the convergence of optimization algorithms. For noise-free problems, where good pre-conditioners are not known a priori, iterative linear algebra methods offer one way to efficiently construct them. For the stochastic optimization problems that dominate contemporary machine learning, however, this approach is not readily available. We propose an iterative algorithm inspired by classic iterative linear solvers that uses a probabilistic model to actively infer a pre-conditioner in situations where Hessian-projections can only be constructed with strong Gaussian noise. The algorithm is empirically demonstrated to efficiently construct effective pre-conditioners for stochastic gradient descent and its variants. Experiments on problems of comparably low dimensionality show improved convergence. In very high-dimensional problems, such as those encountered in deep learning, the pre-conditioner effectively becomes an automatic learning-rate adaptation scheme, which we also show to empirically work well.

Georgios Arvanitidis, Søren Hauberg, Philipp Hennig, Michael Schober

**Fast and Robust Shortest Paths on Manifolds Learned from Data**

Advances in Artificial Intelligence and Statistics (AISTATS)

PMLR 89:1506-1515, 2019

www | pdf | supplements | BibTeX

We propose a fast, simple and robust algorithm for computing shortest paths and distances on Riemannian manifolds learned from data. This amounts to solving a system of ordinary differential equations (ODEs) subject to boundary conditions. Here standard solvers perform poorly because they require well-behaved Jacobians of the ODE, and usually, manifolds learned from data imply unstable and ill-conditioned Jacobians. Instead, we propose a fixed-point iteration scheme for solving the ODE that avoids Jacobians. This enhances the stability of the solver, while reduces the computational cost. In experiments involving both Riemannian metric learning and deep generative models we demonstrate significant improvements in speed and stability over both general-purpose state-of-the-art solvers as well as over specialized solvers.

*Frank Schneider, Lukas Balles and Philipp Hennig*

**DeepOBS: A Deep Learning Optimizer Benchmark Suite**

International Conference on Learning Representations (ICLR), 2019

www | pdf | BibTeX | Codes | arXiv:1903.05499

Because the choice and tuning of the optimizer affects the speed, and ultimately the performance of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison to the competition when benchmarking new optimizers, and without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It is written in TensorFlow and available open source.

*Motonobu Kanagawa, Bharath K. Sriperumbudur, Kenji Fukumizu*

**Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings**

Foundations of Computational Mathematics, 2019

This paper presents convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand.

### 2018

*Takafumi Kajihara, Motonobu Kanagawa, Keisuke Yamazaki, Kenji Fukumizu*

**Kernel Recursive ABC: Point Estimation with Intractable Likelihood**

Proceedings of the 35th International Conference on Machine Learning, pp.: 2405 – 2414, PMLR, July 2018

We propose a novel approach to parameter estimation for simulator-based statistical models with intractable likelihood. Our proposed method involves recursive application of kernel ABC and kernel herding to the same observed data. We provide a theoretical explanation regarding why the approach works, showing (for the population setting) that, under a certain assumption, point estimates obtained with this method converge to the true parameter, as recursion proceeds. We have conducted a variety of numerical experiments, including parameter estimation for a real-world pedestrian flow simulator, and show that in most cases our method outperforms existing approaches.

*Lukas Balles & Philipp Hennig*

**Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients**

Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80: pp. 404-413, 2018

www | pdf | BibTeX

The ADAM optimizer is exceedingly popular in the deep learning community. Often it works very well, sometimes it doesn't. Why? We interpret ADAM as a combination of two aspects: for each weight, the update direction is determined by the sign of stochastic gradients, whereas the update magnitude is determined by an estimate of their relative variance. We disentangle these two aspects and analyze them in isolation, gaining insight into the mechanisms underlying ADAM. This analysis also extends recent results on adverse effects of ADAM on generalization, isolating the sign aspect as the problematic one. Transferring the variance adaptation to SGD gives rise to a novel method, completing the practitioner's toolbox for problems where ADAM fails.

*Michael Schober, Simo Särkkä, Philipp Hennig*

**A probabilistic model for the numerical solution of initial value problems**

Statistics and Computing, 2018

www | pdf | BibTeX

We study connections between ordinary differential equation (ODE) solvers and probabilistic regression methods in statistics. We provide a new view of probabilistic ODE solvers as active inference agents operating on stochastic differential equation models that estimate the unknown initial value problem (IVP) solution from approximate observations of the solution deriva- tive, as provided by the ODE dynamics. Adding to this picture, we show that several multistep methods of Nordsieck form can be recasted as Kalman filtering on q-times integrated Wiener processes. Doing so provides a family of IVP solvers that return a Gaussian posterior measure, rather than a point estimate. We show that some such methods have low computational overhead, nontrivial convergence order, and that the posterior has a calibrated concentration rate. Additionally, we suggest a step size adaptation algorithm which completes the proposed method to a practically useful implementation, which we experimentally evaluate using a representative set of standard codes in the DETEST benchmark set.

*Niklas Wahl, Philipp Hennig, Hans-Peter Wieser, Mark Bangert*

**Analytical incorporation of fractionation effects in probabilistic treatment planning for intensity‐modulated proton therapy**

Med. Phys. 45: 1317–1328

www | pdf | BibTeX

We show that it is possible to explicitly incorporate fractionation effects into closed‐form probabilistic treatment plan analysis and optimization for intensity‐modulated proton therapy with analytical probabilistic modeling (APM). We study the impact of different fractionation schemes on the dosimetric uncertainty induced by random and systematic sources of range and setup uncertainty for treatment plans that were optimized with and without consideration of the number of treatment fractions.

### 2017

*Maren Mahsereci & Philipp Hennig*

**Probabilistic Line Searches for Stochastic Optimization**

Journal of Machine Learning Research (JMLR) 18(119): 1–59, 2017

www | pdf | BibTeX

In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user- controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.

*Lukas Balles, Javier Romero &Philipp Hennig*

**Coupling Adaptive Batch Sizes with Learning Rates**

In Proceedings Conference on Uncertainty in Artificial Intelligence (UAI) 2017, pp.: 410 – 419, Association for Uncertainty in Artificial Intelligence (AUAI), Conference on Uncertainty in Artificial Intelligence (UAI), Gal Elidan and Kristian Kersting, eds, August 2017

www | pdf | BibTeX | Code | Project Page

Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On three image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.

*Michael Schober, Amit Adam, Omer Yair, Shai Mazor, Sebastian Nowozin*

**Dynamic Time-of-Flight**

IEEE Conference in Computer Vison and Pattern Recognition (CVPR), pp.: 170 – 179, 2017

Time-of-flight (TOF) depth cameras provide robust depth inference at low power requirements in a wide variety of consumer and industrial applications. These cameras reconstruct a single depth frame from a given set of infrared (IR) frames captured over a very short exposure period. Operating in this mode the camera essentially forgets all information previously captured - and performs depth inference from scratch for every frame. We challenge this practice and propose using previously captured information when inferring depth. An inherent problem we have to address is camera motion over this longer period of collecting observations. We derive a probabilistic framework combining a simple but robust model of camera and object motion, together with an observation model. This combination allows us to integrate information over multiple frames while remaining robust to rapid changes. Operating the camera in this manner has implications in terms of both computational efficiency and how information should be captured. We address these two issues and demonstrate a realtime TOF system with robust temporal integration that improves depth accuracy over strong baseline methods including adaptive spatio-temporal filters.

*Alonso Marco Valle, Felix Berkenkamp, Philipp Hennig, Angela P. Schoellig, Andreas Krause, Stefan Schaal, Sebastian Trimpe*

**Virtual vs. Real: Trading Off Simulations and Physical Experiments in Reinforcement Learning with Bayesian Optimization**

Proceedings 2017 IEEE International Conference on Robotics and Automation (ICRA), pp.: 1557 – 1563, 2017

www | pdf | BibTeX | Video 1 | Video 2 | Project Page | arXiv

In practice, the parameters of control policies are often tuned manually. This is time-consuming and frustrating. Reinforcement learning is a promising alternative that aims to automate this process, yet often requires too many experiments to be practical. In this paper, we propose a solution to this problem by exploiting prior knowledge from simulations, which are readily available for most robotic platforms. Specifically, we extend Entropy Search, a Bayesian optimization algorithm that maximizes information gain from each experiment, to the case of multiple information sources. The result is a principled way to automatically combine cheap, but inaccurate information from simulations with expensive and accurate physical experiments in a cost-effective manner. We apply the resulting method to a cart-pole system, which confirms that the algorithm can find good control policies with fewer experiments than standard Bayesian optimization on the physical system only.

*Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, Frank Hutter *

**Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets**

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), 54, pages: 528-536, in Proceedings of Machine Learning Research, eds: Aarti Sign and Jerry Zhu, PMLR, 2017

Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. Despite its success, for large datasets, training and validating a single configuration often takes hours, days, or even weeks, which limits the achievable performance. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods or the recently proposed bandit strategy Hyperband.

*Motonobu Kanagawa, Bharath K. Sriperumbudur, Kenji Fukumizu *

**Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings**

Arxiv e-prints, arXiv:1709.00147v1 [math.NA], 2017

This paper presents convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between distance design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand.

*Edgar Klenske*

**Nonparametric Disturbance Correction and Nonlinear Dual Control**

ETH Zürich, 2017

Automatic control is an important aspect of modern technology, and many devices we use on a daily basis are using automatic control for actuation and decision-making. However, many advanced auto- matic control methods need a model of the system to control—a mathematical representation of the system’s behavior. These models are not always easy to come by because of the underlying complexity of the system or the required measurement precision. Therefore, often a big portion of time is used for identification and tuning of these models.

Machine learning methods with the available regression and inference frameworks offer a new potential in the combination with model-based control methods to speed up, or even entirely automate, the process of model-creation, identification and tuning. This potential similarly extends to disturbance prediction: Methods from time-series forecasting can be used to infer a model of the environmental disturbances, which then can be used in predictive control methods.

*Paul Rubenstein, Ilya Tolstikhin, Philipp Hennig, Bernhard Schölkopf*

**Probabilistic Active Learning of Functions in Structural Causal Models**

We consider the problem of learning the functions computing children from parents in a Structural Causal Model once the underlying causal graph has been identified. This is in some sense the second step after causal discovery. Taking a probabilistic approach to estimating these functions, we derive a natural myopic active learning scheme that identifies the intervention which is optimally informative about all of the unknown functions jointly, given previously observed data. We test the derived algorithms on simple examples, to demonstrate that they produce a structured exploration policy that significantly improves on unstructured base-lines.

*Arthur Gretton, Philipp Hennig, Carl Edward Rasmussen, Bernhard Schölkopf*

**New Directions for Learning with Kernels and Gaussian Processes (Dagstuhl Seminar 16481)**

Dagstuhl Reports, 6(11):142-167, 2017

The Dagstuhl Seminar on 16481 "New Directions for Learning with Kernels and Gaussian Processes" brought together two principal theoretical camps of the machine learning community at a crucial time for the field. Kernel methods and Gaussian process models together form a significant part of the discipline's foundations, but their prominence is waning while more elaborate but poorly understood hierarchical models are ascendant. In a lively, amiable seminar, the participants re-discovered common conceptual ground (and some continued points of disagreement) and productively discussed how theoretical rigour can stay relevant during a hectic phase for the subject.

*Wahl, Hennig, Wieser, Bangert*

**Efficiency of analytical and sampling-based uncertainty propagation in intensity-modulated proton therapy**

Physics in Medicine & Biology, 62(14):5790-5807, 2017

www | pdf | BibTeX | Project Page

The sensitivity of intensity-modulated proton therapy (IMPT) treatment plans to uncertainties can be quantified and mitigated with robust/min-max and stochastic/probabilistic treatment analysis and optimization techniques. Those methods usually rely on sparse random, importance, or worst-case sampling. Inevitably, this imposes a trade-off between computational speed and accuracy of the uncertainty propagation. Here, we investigate analytical probabilistic modeling (APM) as an alternative for uncertainty propagation and minimization in IMPT that does not rely on scenario sampling. APM propagates probability distributions over range and setup uncertainties via a Gaussian pencil-beam approximation into moments of the probability distributions over the resulting dose in closed form. It supports arbitrary correlation models and allows for efficient incorporation of fractionation effects regarding random and systematic errors. We evaluate the trade-off between run-time and accuracy of APM uncertainty computations on three patient datasets. Results are compared against reference computations facilitating importance and random sampling. Two approximation techniques to accelerate uncertainty propagation and minimization based on probabilistic treatment plan optimization are presented. Runtimes are measured on CPU and GPU platforms, dosimetric accuracy is quantified in comparison to a sampling-based benchmark (5000 random samples). APM accurately propagates range and setup uncertainties into dose uncertainties at competitive run-times (GPU ##IMG## [http://ej.iop.org/images/0031-9155/62/14/5790/pmbaa6ec5ieqn001.gif] {⩽5} min). The resulting standard deviation (expectation value) of dose show average global ##IMG## [http://ej.iop.org/images/0031-9155/62/14/5790/pmbaa6ec5ieqn002.gif] {γ3%/3 mm} pass rates between 94.2% and 99.9% (98.4% and 100.0%). All investigated importance sampling strategies provided less accuracy at higher run-times considering only a single fraction. Considering fractionation, APM uncertainty propagation and treatment plan optimization was proven to be possible at constant time complexity, while run-times of sampling-based computations are linear in the number of fractions. Using sum sampling within APM, uncertainty propagation can only be accelerated at the cost of reduced accuracy in variance calculations. For probabilistic plan optimization, we were able to approximate the necessary pre-computations within seconds, yielding treatment plans of similar quality as gained from exact uncertainty propagation. APM is suited to enhance the trade-off between speed and accuracy in uncertainty propagation and probabilistic treatment plan optimization, especially in the context of fractionation. This brings fully-fledged APM computations within reach of clinical application.

*Hans-Peter Wieser, Philipp Hennig, Niklas Wahl, Mark Bangert*

**Analytical probabilistic modeling of RBE-weighted dose for ion therapy**

Physics in Medicine and Biology (PMB), 62(23):8959-8982, 2017

www | pdf | BibTeX | Project Page

Particle therapy is especially prone to uncertainties. This issue is usually addressed with uncertainty quantification and minimization techniques based on scenario sampling. For proton therapy, however, it was recently shown that it is also possible to use closed-form computations based on analytical probabilistic modeling (APM) for this purpose. APM yields unique features compared to sampling-based approaches, motivating further research in this context.

## 2016

*Edgar Klenske, Philipp Hennig, Bernard Schölkopf, Melanie Zeilinger*

**Approximate dual control maintaining the value of information with an application to building control**

European Control Conference (ECC), pp.: 800 – 806**, **2016

www | pdf | BibTeX | Project Page

Dual control, the simultaneous identification and control of dynamic systems, is an idea that has been around for several decades without being widely used in applications, due to the fundamental intractability of the optimal solution. Available algorithms are either complex and computationally demanding, or reduce to a simple change of the cost function, which can lead to poor average performance. In addition, classic dual control schemes do not deal with constraints and economic cost structures present in many applications. In this paper, we aim at facilitating the use of dual control algorithms based on a series expansion of the cost function for such systems. In particular, this is realized by employing reference tracking of the optimal mean trajectory together with soft constraints. A key feature of the proposed formulation is that it maintains the value of information in the cost, making dual control tractable with all dual features. Evaluation on a simulated building control problem exhibits an advantage of using the proposed dual controller over simplified solutions.

*Hans Kersting, Philipp Hennig*

**Active Uncertainty Calibration in Bayesian ODE Solvers**

Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI 2016), pp.: 309 – 318, 2016, eds: Ihler, A. and Janzing, D., AUAI Press, June 2016 (conference)

www | pdf | BibTeX | Project Page

There is resurging interest, in statistics and ma- chine learning, in solvers for ordinary differential equations (ODEs) that return probability mea- sures instead of point estimates. Recently, Con- rad et al. introduced a sampling-based class of methods that are ‘well-calibrated’ in a specific sense. But the computational cost of these meth- ods is significantly above that of classic meth- ods. On the other hand, Schober et al. pointed out a precise connection between classic Runge- Kutta ODE solvers and Gaussian filters, which gives only a rough probabilistic calibration, but at negligible cost overhead. By formulating the solution of ODEs as approximate inference in linear Gaussian SDEs, we investigate a range of probabilistic ODE solvers, that bridge the trade- off between computational cost and probabilistic calibration, and identify the inaccurate gradient measurement as the crucial source of uncertainty. We propose the novel filtering-based method Bayesian Quadrature filtering (BQF) which uses Bayesian quadrature to actively learn the impre- cision in the gradient measurement by collecting multiple gradient evaluations.

*Alonso Marco Valle, Philipp Hennig, Jeanette Bohg, Stefan Schaal, Sebastian Trimpe*

**Automatic LQR Tuning Based on Gaussian Process Global Optimization**

Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp.: 270 – 277, 2016

www | pdf | BibTeX | Video | Project Page | Project Page II

This paper proposes an automatic controller tuning framework based on linear optimal control combined with Bayesian optimization. With this framework, an initial set of controller gains is automatically improved according to a pre-defined performance objective evaluated from experimental data. The underlying Bayesian optimization algorithm is Entropy Search, which represents the latent objective as a Gaussian process and constructs an explicit belief over the location of the objective minimum. This is used to maximize the information gain from each experimental evaluation. Thus, this framework shall yield improved controllers with fewer evaluations compared to alternative approaches. A seven-degree-of-freedom robot arm balancing an inverted pole is used as the experimental demonstrator. Results of two- and four-dimensional tuning problems highlight the method's potential for automatic controller tuning on robotic platforms.

*Edgar Klenske, Melanie Zeilinger, Bernhard Schölkopf, Philipp Hennig*

**Gaussian Process-Based Predictive Control for Periodic Error Correction****

IEEE Transactions on Control Systems Technology, 24(1):110-121, 2016

www | pdf | BibTeX | Project Page

Many controlled systems suffer from unmodeled nonlinear effects that recur periodically over time. Model-free controllers generally cannot compensate these effects, and good physical models for such periodic dynamics are challenging to construct. We investigate nonparametric system identification for periodically recurring nonlinear effects. Within a Gaussian process (GP) regression framework, we use a locally periodic covariance function to shape the hypothesis space, which allows for a structured extrapolation that is not possible with more widely used covariance functions. We show that hyperparameter estimation can be performed online using the maximum a posteriori point estimate, which provides an accuracy comparable with sampling methods as soon as enough data to cover the periodic structure has been collected. It is also shown how the periodic structure can be exploited in the hyperparameter optimization. The predictions obtained from the GP model are then used in a model predictive control framework to correct the external effect. The availability of good continuous predictions allows control at a higher rate than that of the measurements. We show that the proposed approach is particularly beneficial for sampling times that are smaller than, but of the same order of magnitude as, the period length of the external effect. In experiments on a physical system, an electrically actuated telescope mount, this approach achieves a reduction of about 20% in root mean square tracking error.

*Edgar Klenske, Philipp Hennig*

**Dual Control for Approximate Bayesian Reinforcement Learning**

Journal of Machine Learning Research (JMLR), 17(127):1-30, 2016

www | pdf | BibTeX | Project Page

Control of non-episodic, finite-horizon dynamical systems with uncertain dynamics poses a tough and elementary case of the exploration-exploitation trade-off. Bayesian reinforcement learning, reasoning about the effect of actions and future observations, offers a principled solution, but is intractable. We review, then extend an old approximate approach from control theory---where the problem is known as dual control---in the context of modern regression methods, specifically generalized linear regression. Experiments on simulated systems show that this framework offers a useful approximation to the intractable aspects of Bayesian RL, producing structured exploration strategies that differ from standard RL approaches. We provide simple examples for the use of this framework in (approximate) Gaussian process regression and feedforward neural networks for the control of exploration.

Javie González, Zhenwen Dai, Philipp Hennig, Neil Lawrence

**Batch Bayesian Optimization via Local Penalization**

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics *(*AISTATS 2016*)*, 51, pp.: 648 – 657, JMLR Workshop and Conference Proceedings, Eds.: A. Gretton and C.C. Robert, 2016

The popularity of Bayesian optimization methods for efficient exploration of param- eter spaces has lead to a series of papers ap- plying Gaussian processes as surrogates in the optimization of functions. However, most proposed approaches only allow the explo- ration of the parameter space to occur se- quentially. Often, it is desirable to simulta- neously propose batches of parameter values to explore. This is particularly the case when large parallel processing facilities are avail- able. These could either be computational or physical facets of the process being opti- mized. Batch methods, however, require the modeling of the interaction between the dif- ferent evaluations in the batch, which can be expensive in complex scenarios. We investi- gate this issue and propose a highly effective heuristic based on an estimate of the func- tion’s Lipschitz constant that captures the most important aspect of this interaction— local repulsion—at negligible computational overhead. A penalized acquisition functionis used to collect batches of points minimiz- ing the non-parallelizable computational ef- fort. The resulting algorithm compares very well, in run-time, with much more elaborate alternatives.

*Simon Bartels & Philipp Hennig*

**Probabilistic Approximate Least-Squares**

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), 51, pp.: 676 – 684, JMLR Workshop and Conference Proceedings, Eds.: A. Gretton, C.C. Robert, 2016

www | pdf | BibTeX | Project Page

Least-squares and kernel-ridge / Gaussian process regression are among the foundational algorithms of statistics and machine learning. Famously, the worst-case cost of exact nonparametric regression grows cubically with the data-set size; but a growing number of approximations have been developed that estimate good solutions at lower cost. These algorithms typically return point estimators, without measures of uncertainty. Leveraging recent results casting elementary linear algebra operations as probabilistic inference, we propose a new approximate method for nonparametric least-squares that affords a probabilistic uncertainty estimate over the error between the approximate and exact least-squares solution (this is not the same as the posterior variance of the associated Gaussian process regressor). This allows estimating the error of the least-squares solution on a subset of the data relative to the full-data solution. The uncertainty can be used to control the computational effort invested in the approximation. Our algorithm has linear cost in the data-set size, and a simple formal form, so that it can be implemented with a few lines of code in programming languages with linear algebra functionality.

## 2015

*Alonso Marco Valle, Philipp Hennig, Jeannette Bohg, Stefan Schaal, Sebastian Trimpe*

**Automatic LQR Tuning Based on Gaussian Process Optimization: Early Experimental Results**

Machine Learning in Planning and Control of Robot Motion Workshop at the IEEE/RSJ International Conference on Intelligent Robots and Systems (iROS), 2015

www | pdf | BibTeX | Project Page I | Project Page II

This paper proposes an automatic controller tun- ing framework based on linear optimal control combined with Bayesian optimization. With this framework, an initial set of controller gains is automatically improved according to a pre-defined performance objective evaluated from experimen- tal data. The underlying Bayesian optimization algorithm is Entropy Search, which represents the latent objective as a Gaussian process and constructs an explicit belief over the location of the objective minimum. This is used to maximize the information gain from each experimental evaluation. Thus, this framework shall yield improved controllers with fewer eva- luations compared to alternative approaches. A seven-degree- of-freedom robot arm balancing an inverted pole is used as the experimental demonstrator. Preliminary results of a low- dimensional tuning problem highlight the method’s potential for automatic controller tuning on robotic platforms.

*Philipp Hennig*

**Probabilistic Interpretation of Linear Solvers**

SIAM Journal on Optimization, 25(1):234-260, 2015

www | pdf | BibTeX | Project Page I | Project Page II | Project Page III | DOI

This paper proposes a probabilistic framework for algorithms that iteratively solve unconstrained linear problems $Bx = b$ with positive definite $B$ for $x$. The goal is to replace the point estimates returned by existing methods with a Gaussian posterior belief over the elements of the inverse of $B$, which can be used to estimate errors. Recent probabilistic interpretations of the secant family of quasi-Newton optimization algorithms are extended. Combined with properties of the conjugate gradient algorithm, this leads to uncertainty-calibrated methods with very limited cost overhead over conjugate gradients, a self-contained novel interpretation of the quasi-Newton and conjugate gradient algorithms, and a foundation for new nonlinear optimization methods.

*Elena Sgouritsa, Dominik Janzig, Philipp Hennig, Bernhard Schölkopf *

**Inference of Cause and Effect with Unsupervised Inverse Regression**

In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 38, pp.: 847 – 855, JMLR Workshop and Conference Proceedings, Eds.: G. Lebanon, G. and S.V.N. Vishwanathan, JMLR.org, AISTATS, 2015

www | pdf | BibTeX | Project Page

We address the problem of causal discovery in the two-variable case given a sample from their joint distribution. The proposed method is based on a known assumption that, if X -> Y (X causes Y), the marginal distribution of the cause, P(X), contains no information about the conditional distribution P(Y|X). Consequently, estimating P(Y|X) from P(X) should not be possible. However, estimating P(X|Y) based on P(Y) may be possible. This paper employs this asymmetry to propose CURE, a causal discovery method which decides upon the causal direction by comparing the accuracy of the estimations of P(Y|X) and P(X|Y). To this end, we propose a method for estimating a conditional from samples of the corresponding marginal, which we call unsupervised inverse GP regression. We evaluate CURE on synthetic and real data. On the latter, our method outperforms existing causal inference methods.

*Maren Mahsereci, Philipp Hennig*

**Probabilistic Line Searches for Stochastic Optimization**

In Advances in Neural Information Processing Systems 28, pp.: 181-189, Eds.: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama and R. Garnett, Curran Associates, Inc., 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015

www | pdf | BibTeX | Project Page I | Project Page II | Matlab Research Code

In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.

*Søren Haubergtal, Michael Schober, Matthew Liptrot, Philipp Hennig, Aasa Feragen*

**A Random Riemannian Metric for Probabilistic Shortest-Path Tractography**

In 18th International Conference on Medical Image Computing and Computer Assisted Intervention, 9349, pp.: 597-604, Lecture Notes in Computer Science, MICCAI, 2015

www | pdf | BibTeX | Project Page

Shortest-path tractography (SPT) algorithms solve global optimization problems defined from local distance functions. As diffusion MRI data is inherently noisy, so are the voxelwise tensors from which local distances are derived. We extend Riemannian SPT by modeling the stochasticity of the diffusion tensor as a “random Riemannian metric”, where a geodesic is a distribution over tracts. We approximate this distribution with a Gaussian process and present a probabilistic numerics algorithm for computing the geodesic distribution. We demonstrate SPT improvements on data from the Human Connectome Project.

Philipp Hennig, M.A. Osborne, M. Girolami

**Probabilistic numerics and uncertainty in computations**

Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179), 2015

www | pdf | BibTeX | Project Page

We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations. Such uncertainties, arising from the loss of precision induced by numerical calculation with limited time or hardware, are important for much contemporary science and industry. Within applications such as climate science and astrophysics, the need to make decisions on the basis of computations with large and complex data have led to a renewed focus on the management of numerical uncertainty. We describe how several seminal classic numerical methods can be interpreted naturally as probabilistic inference. We then show that the probabilistic view suggests new algorithms that can flexibly be adapted to suit application specifics, while delivering improved empirical performance. We provide concrete illustrations of the benefits of probabilistic numeric algorithms on real scientific problems from astrometry and astronomical imaging, while highlighting open problems with these new algorithms. Finally, we describe how probabilistic numerical methods provide a coherent framework for identifying the uncertainty in calculations performed with a combination of numerical algorithms (e.g. both numerical optimizers and differential equation solvers), potentially allowing the diagnosis (and control) of error sources in computations.

## 2014

*Martin Kiefel, Christian Schuler, Philipp Hennig*

**Probabilistic Progress Bars**

In Conference on Pattern Recognition* (*GCPR*)*, 8753, pp.: 331 – 341, Lecture Notes in Computer Science, Eds.: X.Jiang, J.Hornegger, R.Koch), Springer, GCPR, September 2014

www | pdf | BibTeX | Project Page | Website + Code

Predicting the time at which the integral over a stochastic process reaches a target level is a value of interest in many applications. Often, such computations have to be made at low cost, in real time. As an intuitive example that captures many features of this problem class, we choose progress bars, a ubiquitous element of computer user interfaces. These predictors are usually based on simple point estimators, with no error modelling. This leads to fluctuating behaviour confusing to the user. It also does not provide a distribution prediction (risk values), which are crucial for many other application areas. We construct and empirically evaluate a fast, constant cost algorithm using a Gauss-Markov process model which provides more information to the user.

*Philipp Hennig und Soren Hauberg*

**Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics**

In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 33, pp.: 347 – 355, JMLR: Workshop and Conference Proceedings, Eds.: S. Kaski,J. Corander, Microtome Publishing, Brookline, MA, AISTATS, April 2014

www | pdf | BibTeX | Project Page I | Project Page II | Video | Supplements

We study a probabilistic numerical method for the solution of both boundary and initial value problems that returns a joint Gaussian process posterior over the solution. Such methods have concrete value in the statistics on Riemannian manifolds, where non-analytic ordinary differential equations are involved in virtually all computations. The probabilistic formulation permits marginalising the uncertainty of the numerical solution such that statistics are less sensitive to inaccuracies. This leads to new Riemannian algorithms for mean value computations and principal geodesic analysis. Marginalisation also means results can be less precise than point estimates, enabling a noticeable speed-up over the state of the art. Our approach is an argument for a wider point that uncertainty caused by numerical calculations should be tracked throughout the pipeline of machine learning algorithms.

*Michael Schober, David Duvenau, Philipp Hennig*

**Probabilistic ODE Solvers with Runge-Kutta Means**

In Advances in Neural Information Processing Systems 27, pp.: 739 – 747, Eds.: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger, Curran Associates, Inc., 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014

www | pdf | BibTeX | Project Page I | Project Page II

Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state of the art. Like most numerical meth- ods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribu- tion over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus in- heriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.

*Roman Garnett, Michael A. Osborne, Philipp Hennig*

**Active Learning of Linear Embeddings for Gaussian Processes**

In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp.: 230 – 239, Eds.: N.L. Zhang and J. Tian, AUAI Press , Corvallis, Oregon, UAI2014, 2014

www | pdf | BibTeX | Project Page

We propose an active learning method for discovering low-dimensional structure in high- dimensional Gaussian process (GP) tasks. Such problems are increasingly frequent and impor- tant, but have hitherto presented severe practical difficulties. We further introduce a novel tech- nique for approximately marginalizing GP hyper- parameters, yielding marginal predictions robust to hyperparameter misspecification. Our method offers an efficient means of performing GP re- gression, quadrature, or Bayesian optimization in high-dimensional spaces.

*Michael Schober, Niklas Kasenburg, Aasa Feragen, Philipp Hennig, Soren Hauberg*

**Probabilistic Shortest Path Tractography in DTI Using Gaussian Process ODE Solvers**

In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014, Lecture Notes in Computer Science, 8675 (2014), pp: 265 – 272, eds.: P. Golland, N. Hata, C. Barillot, J. Hornegger and R. Howe, Springer, Heidelberg, MICCAI, 2014

www | pdf | BibTeX | Project Page I | Project Page II

Tractography in diffusion tensor imaging estimates connectivity in the brain through observations of local diffusivity. These observations are noisy and of low resolution and, as a consequence, connections cannot be found with high precision. We use probabilistic numerics to estimate connectivity between regions of interest and contribute a Gaussian Process tractography algorithm which allows for both quantification and visualization of its posterior uncertainty. We use the uncertainty both in visualization of individual tracts as well as in heat maps of tract locations. Finally, we provide a quantitative evaluation of different metrics and algorithms showing that the adjoint metric [8] combined with our algorithm produces paths which agree most often with experts.

*Tom Gunter, Michael A. Osborne, Roman Garnett, Philipp Hennig, Stephen J. Roberts*

**Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature**

In Advances in Neural Information Processing Systems 27, pp.: 2789 – 2797, eds.: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger, Curran Associates, Inc., 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014

www | pdf | BibTeX | Project Page

We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in proba- bilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute the marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive com- putation costs. We introduce a warped model for probabilistic integrands (like- lihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.

*Franziska Meier, Philipp Hennig, Stefan Schaal*

**Incremental Local Gaussian Regression**

In Advances in Neural Information Processing Systems 27, pp.: 972-980, eds.: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger, 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014, clmc

www | pdf | BibTeX | Project Page I | Project Page II

Locally weighted regression (LWR) was created as a nonparametric method that can approximate a wide range of functions, is computationally efficient, and can learn continually from very large amounts of incrementally collected data. As an interesting feature, LWR can regress on non-stationary functions, a beneficial property, for instance, in control problems. However, it does not provide a proper generative model for function values, and existing algorithms have a variety of manual tuning parameters that strongly influence bias, variance and learning speed of the results. Gaussian (process) regression, on the other hand, does provide a generative model with rather black-box automatic parameter tuning, but it has higher computational cost, especially for big data sets and if a non-stationary model is required. In this paper, we suggest a path from Gaussian (process) regression to locally weighted regression, where we retain the best of both approaches. Using a localizing function basis and approximate inference techniques, we build a Gaussian (process) regression algorithm of increasingly local nature and similar computational complexity to LWR. Empirical evaluations are performed on several synthetic and real robot datasets of increasing complexity and (big) data scale, and demonstrate that we consistently achieve on par or superior performance compared to current state-of-the-art methods while retaining a principled approach to fast incremental regression with minimal manual tuning parameters.

*Franziska Meier, Philipp Hennig, Stefan Schaal *

**Efficient Bayesian Local Model Learning for Control**

In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp.: 2244 - 2249, IROS, 2014, clmc

www | pdf | BibTeX | Project Page I | Project Page II

Model-based control is essential for compliant controland force control in many modern complex robots, like humanoidor disaster robots. Due to many unknown and hard tomodel nonlinearities, analytical models of such robots are oftenonly very rough approximations. However, modern optimizationcontrollers frequently depend on reasonably accurate models,and degrade greatly in robustness and performance if modelerrors are too large. For a long time, machine learning hasbeen expected to provide automatic empirical model synthesis,yet so far, research has only generated feasibility studies butno learning algorithms that run reliably on complex robots.In this paper, we combine two promising worlds of regressiontechniques to generate a more powerful regression learningsystem. On the one hand, locally weighted regression techniquesare computationally efficient, but hard to tune due to avariety of data dependent meta-parameters. On the other hand,Bayesian regression has rather automatic and robust methods toset learning parameters, but becomes quickly computationallyinfeasible for big and high-dimensional data sets. By reducingthe complexity of Bayesian regression in the spirit of local modellearning through variational approximations, we arrive at anovel algorithm that is computationally efficient and easy toinitialize for robust learning. Evaluations on several datasetsdemonstrate very good learning performance and the potentialfor a general regression learning tool for robotics.

## 2013

*Michael Schober *

**Camera-specific Image Denoising**

Eberhard Karls Universität Tübingen, Germany, October 2013 (diplomathesis)

Images captured with digital cameras are corrupted by noise due to the properties of the physical measurement process. Traditional image denoising algorithms have been based either solely on image statistics or solely on noise statistics, but not both at the same time. Secondly, the connections between general purpose image denoising algorithms and camera-specific denoising algorithms have not been investigated. In this work, image denoising with realistic camera noise will be examined, specifically for application in astro-photography. It will be shown that high-quality images can be obtained through careful calibration and residual noise is significantly less than most commonly assumed. Additionally, a new machine learning based approach to camera-specific image denoising will be presented.

*Philipp Hennig & Martin Kiefel*

**Quasi-Newton Methods: A New Direction**

Journal of Machine Learning Research, 14(1):843-865, March 2013

Website + Code | pdf | BibTeX | Project Page I | Project Page II

Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors.

*David Lopez - Paz, Philipp Hennig, Bernhard Schölkopf*

**The Randomized Dependence Coefficient**

In Advances in Neural Information Processing Systems 26, pp.: 1 – 9, eds.: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013

We introduce the Randomized Dependence Coefficient (RDC), a measure of non- linear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Re ́nyi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.

*Philipp Hennig*

**Fast Probabilistic Optimization from Noisy Gradients**

In Proceedings of The 30th International Conference on Machine Learning*, *JMLR W&CP 28(1), pp.: 62–70, eds.: S. Dasgupta and D. McAllester, ICML, 2013

www | pdf | BibTeX | Project Page I | Project Page II

Stochastic gradient descent remains popular in large-scale machine learning, on account of its very low computational cost and robust- ness to noise. However, gradient descent is only linearly efficient and not transformation invariant. Scaling by a local measure can sub- stantially improve its performance. One natu- ral choice of such a scale is the Hessian of the objective function: Were it available, it would turn linearly efficient gradient descent into the quadratically efficient Newton-Raphson optimization. Existing covariant methods, though, are either super-linearly expensive or do not address noise. Generalising recent results, this paper constructs a nonparametric Bayesian quasi-Newton algorithm that learns gradient and Hessian from noisy evaluations of the gradient. Importantly, the resulting al- gorithm, like stochastic gradient descent, has cost linear in the number of input dimensions.

*Edgar Klenske, Melanie Zeilinger, Bernhard Schölkopf, Philipp Hennig*

**Nonparametric dynamics estimation for time periodic systems**

In Proceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing, pp.: 486 – 493 , 2013

www | pdf | BibTeX | Project Page I | Project Page II

Screws and gears are a source of periodically recurring nonlinear effects in mechanical dynamical systems. Unless the state sampling frequency is much higher than the periodic effect, model-free controllers cannot always compensate these effects, and good physical models for such periodic dynamics are challenging to construct. We investigate nonparametric system identification with an explicit focus on periodically recurring nonlinear effects. Within a Gaussian process regression framework, we design a locally periodic covariance function to shape the hypothesis space, which allows for a structured extrapolation that is not possible with more widely used covariance functions. These predictions are then used in model predictive control to construct a control signal correcting for the predicted external effect. We show that this approach is beneficial for state sampling times that are smaller than, but comparable to, the period length of the external effect. In experiments on a physical system, an electrically actuated telescope mount, this approach achieves a reduction of about 20% in root mean square error.

*Mark Bangert, Philipp Hennig, Uwe Oelfke*

**Analytical probabilistic modeling for radiation therapy treatment planning**

Physics in Medicine and Biology, 58(16):5401-5419, 2013

www | pdf | BibTeX | Project Page I | Project Page II

This paper introduces the concept of analytical probabilistic modeling (APM) to quantify uncertainties in quality indicators of radiation therapy treatment plans. Assuming Gaussian probability densities over the input parameters of the treatment plan quality indicators, APM enables the calculation of the moments of the induced probability density over the treatment plan quality indicators by analytical integration. This paper focuses on analytical probabilistic dose calculation algorithms and the implications of APM regarding treatment planning. We derive closed-form expressions for the expectation value and the (co)variance of (1) intensity-modulated photon and proton dose distributions based on a pencil beam algorithm and (2) the standard quadratic objective function used in inverse planning. Complex correlation models of high dimensional uncertain input parameters and the different nature of random and systematic uncertainties in fractionated radiation therapy are explicitly incorporated into APM. APM variance calculations on phantom data sets show that the correlation assumptions and the difference of random and systematic uncertainties of the input parameters have a crucial impact on the uncertainty of the resulting dose. The derivations regarding the quadratic objective function show that APM has the potential to enable robust planning at almost the same computational cost like conventional inverse planning after a single probabilistic dose calculation. Beneficial applications of APM in the context of radiation therapy treatment planning are feasible.

*Mark Bangert, Philipp Henig, Uwe Oelfke*

**Analytical probabilistic proton dose calculation and range uncertainties**

In 17th International Conference on the Use of Computers in Radiation Therapy, pp.: 6 – 11, eds.: A. Haworth and T. Kron, ICCR, 2013

BibTeX | Project Page I | Project Page II

*Philipp Hennig*

**Animating Samples from Gaussian Distributions**

Max Planck Institute for Intelligent Systems, Tübingen, Germany, 2013

## 2012

*Philipp Hennig & Martin Kiefel*

**Quasi-Newton Methods: A New Direction**

In Proceedings of the 29th International Conference on Machine Learning, pp.: 25 – 32, ICML ’12, eds.: John Langford and Joelle Pineau, Omnipress, NY, USA, ICML, July 2012

www | pdf | BibTeX | Project Page I | Project Page II

Four decades after their invention, quasi- Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors.

*Philipp Hennig & Christian J. Schuler*

**Entropy Search for Information-Efficient Global Optimization**

Journal of Machine Learning Research, 13, pp.: 1809 – 1837, June 2012

www | pdf | BibTeX | Project Page I | Project Page II

Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly adresses the decision problem of maximizing information gain from each evaluation.

*Botond Bocsi, Philipp Hennig, Lehel Csato, Jan Peters*

**Learning Tracking Control with Forward Models**

IEEE International Conference on Robotics and Automation (ICRA), pp.: 259 – 264, May 2012

Performing task-space tracking control on redundant robot manipulators is a difficult problem. When the physical model of the robot is too complex or not available, standard methods fail and machine learning algorithms can have advantages. We propose an adaptive learning algorithm for tracking control of underactuated or non-rigid robots where the physical model of the robot is unavailable. The control method is based on the fact that forward models are relatively straightforward to learn and local inversions can be obtained via local optimization. We use sparse online Gaussian process inference to obtain a flexible probabilistic forward model and second order optimization to find the inverse mapping. Physical experiments indicate that this approach can outperform state-of-the-art tracking control algorithms in this context.

*John P. Cunningham, Philipp Hennig, Simon Lacoste-Julien*

**Approximate Gaussian Integration using Expectation Propagation**

pp.: 1-11, January 2012

While Gaussian probability densities are omnipresent in applied mathematics, Gaussian cumulative probabilities are hard to calculate in any but the univariate case. We offer here an empirical study of the utility of Expectation Propagation (EP) as an approximate integration method for this problem. For rectangular integration regions, the approximation is highly accurate. We also extend the derivations to the more general case of polyhedral integration regions. However, we find that in this polyhedral case, EP's answer, though often accurate, can be almost arbitrarily wrong. These unexpected results elucidate an interesting and non-obvious feature of EP not yet studied in detail, both for the problem of Gaussian probabilities and for EP more generally.

*Philipp Hennig, David Stern, Ralf Herbrich, Thore Graepel*

**Kernel Topic Models**

In Fifteenth International Conference on Artificial Intelligence and Statistics, 22, pp.: 511 – 519, JMLR Proceedings, eds.: N.D. Lawrence, N. D. and M. Girolami, JMLR.org, AISTATS , 2012

www | pdf | BibTeX | Project Page

Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents' mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.

*Edgar Klenske*

**Nonparametric System Identification and Control for Periodic Error Correction in Telescopes**

University of Stuttgart, 2012 (diplomathesis)

High quality photographs of astronomical objects require a photographic device to remain steady relative to the sky. Because Earth rotates, the telescope has to follow a circular trajectory as precisely as possible. The machinery built for this purpose is never absolutely perfect: Imprecisions in the worm drives of standard telescope mounts cause a periodic error in the pointing direction, resulting in image blur. Because the dynamics that lead to this error are often not known in advance, they cannot be modelled explicitly. Therefore, this problem is normally addressed with a model-free controller. In this work, Gaussian process regression with a periodic kernel is employed to identify the unknown dynamical system. As a nonparametric method, this regressor can learn even complicated nonlinearities and does not require the user to provide a parametric model. Experiments in simulation and hardware show improvements over standard approaches to autoguiding and periodic error correction. Beyond the application to telescopes, this study serves as an example for the utility of Gaussian processes in providing a flexible modelling class for various types of hardware.

## 2011

*Philipp Hennig*

**Optimal Reinforcement Learning for Gaussian System**s

In Advances in Neural Information Processing Systems 24, pp.: 325 – 333, eds.: J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F. Pereira and K.Q. Weinberger, Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS), 2011

www | pdf | BibTeX | Project Page I | Project Page II

The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finitedimensional projection gives an impression for how this result may be helpful.

## 2010

*Mark Bangert, Philipp Hennig, Uwe Oelfke*

**Using an Infinite Von Mises-Fisher Mixture Model to Cluster Treatment Beam Directions in External Radiation Therapy**

pp.: 74 – 751 , eds.: S. Draghici, T.M. Khoshgoftaar, V. Palade, W. Pedrycz, M.A. Wani and X. Zhu, IEEE, Piscataway, NJ, USA, Ninth International Conference on Machine Learning and Applications (ICMLA), December 2010

We present a method for fully automated selection of treatment beam ensembles for external radiation therapy. We reformulate the beam angle selection problem as a clustering problem of locally ideal beam orientations distributed on the unit sphere. For this purpose we construct an infinite mixture of von Mises-Fisher distributions, which is suited in general for density estimation from data on the D-dimensional sphere. Using a nonparametric Dirichlet process prior, our model infers probability distributions over both the number of clusters and their parameter values. We describe an efficient Markov chain Monte Carlo inference algorithm for posterior inference from experimental data in this model. The performance of the suggested beam angle selection framework is illustrated for one intra-cranial, pancreas, and prostate case each. The infinite von Mises-Fisher mixture model (iMFMM) creates between 18 and 32 clusters, depending on the patient anatomy. This suggests to use the iMFMM directly for beam ensemble selection in robotic radio surgery, or to generate low-dimensional input for both subsequent optimization of trajectories for arc therapy and beam ensemble selection for conventional radiation therapy.

*Philipp Hennig*

**Approximate Inference in Graphical Model**s

University of Cambridge, November 2010

Probability theory provides a mathematically rigorous yet conceptually flexible calculus of uncertainty, allowing the construction of complex hierarchical models for real-world inference tasks. Unfortunately, exact inference in probabilistic mod- els is often computationally expensive or even intractable. A close inspection in such situations often reveals that computational bottlenecks are confined to cer- tain aspects of the model, which can be circumvented by approximations without having to sacrifice the model’s interesting aspects. The conceptual framework of graphical models provides an elegant means of representing probabilistic models and deriving both exact and approximate inference algorithms in terms of local computations. This makes graphical models an ideal aid in the development of generalizable approximations. This thesis contains a brief introduction to approx- imate inference in graphical models (Chapter 2), followed by three extensive case studies in which approximate inference algorithms are developed for challenging applied inference problems. Chapter 3 derives the first probabilistic game tree search algorithm. Chapter 4 provides a novel expressive model for inference in psychometric questionnaires. Chapter 5 develops a model for the topics of large corpora of text documents, conditional on document metadata, with a focus on computational speed. In each case, graphical models help in two important ways: They first provide important structural insight into the problem; and then suggest practical approximations to the exact probabilistic solution.

*Philipp Hennig, David Stern, Thore Graepel*

**Coherent Inference on Optimal Play in Game Trees**

In JMLR Workshop and Conference Proceedings Vol. 9: AISTATS 2010, pp.: 326 – 333, eds.: Y.W. Teh and M. Titterington, JMLR, Cambridge, MA, USA, Thirteenth International Conference on Artificial Intelligence and Statistics, May 2010

Round-based games are an instance of discrete planning problems. Some of the best contemporary game tree search algorithms use random roll-outs as data. Relying on a good policy, they learn on-policy values by propagating information upwards in the tree, but not between sibling nodes. Here, we present a generative model and a corresponding approximate message passing scheme for inference on the optimal, off-policy value of nodes in smooth AND/OR trees, given random roll-outs. The crucial insight is that the distribution of values in game trees is not completely arbitrary. We define a generative model of the on-policy values using a latent score for each state, representing the value under the random roll-out policy. Inference on the values under the optimal policy separates into an inductive, pre-data step and a deductive, post-data part. Both can be solved approximately with Expectation Propagation, allowing off-policy value inference for any node in the (exponentially big) tree in linear time.

## 2009

*Philipp Hennig, David Stern, Thore Graepel*

**Bayesian Quadratic Reinforcement Learning**

NIPS Workshop on Probabilistic Approaches for Robotics and Control, December 2009

The idea of using “optimistic” evaluations to guide exploration has been around for a long time, but has received renewed interest recently thanks to the development of bound-based guarantees [Kolter and Ng, 2009]. Asmuth et al. [2009] have recently presented a way to construct optimistic evaluations by merging sampled MDPs. Their BOSS algorithm is polynomial in the size of the state-action-space of the merged MDP, and thus a multiple of the size of the actual state space.

The success of BOSS raises the question whether it might be possible to avoid the sampling step and instead use an approximate parametrization of the belief over values. To test this idea, we construct an approximate Gaussian (i.e. log quadratic) joint posterior over the values of states under a given policy. Our derivation ignores the effect of future expected experience (in contrast to a fully Bayesian treatment as in the BEETLE algorithm [Poupart et al., 2006]).

*Philipp Hennig*

**Expectation Propagation on the Maximum of Correlated Normal Variables**

Cavendish Laboratory: University of Cambridge, July 2009

Many inference problems involving questions of optimality ask for the maximum or the minimum of a finite set of unknown quantities. This technical report derives the first two posterior moments of the maximum of two correlated Gaussian variables and the first two posterior moments of the two generating variables (corresponding to Gaussian approximations minimizing relative entropy). It is shown how this can be used to build a heuristic approximation to the maximum relationship over a finite set of Gaussian variables, allowing approximate inference by Expectation Propagation on such quantities.

## 2007

*Philipp Hennig & Winfried Denk*

**Point-spread functions for backscattered imaging in the scanning electron microscope**

Journal of Applied Physics* *, 102(12):1-8, December 2007

One knows the imaging system's properties are central to the correct interpretation of any image. In a scanning electron microscope regions of different composition generally interact in a highly nonlinear way during signal generation. Using Monte Carlo simulations we found that in resin-embedded, heavy metal-stained biological specimens staining is sufficiently dilute to allow an approximately linear treatment. We then mapped point-spread functions for backscattered-electron contrast, for primary energies of 3 and 7 keV and for different detector specifications. The point-spread functions are surprisingly well confined (both laterally and in depth) compared even to the distribution of only those scattered electrons that leave the sample again.

# Tech reports

*Krikamol Muandet, Motonobu Kanagawa, Sorawit Saengkyongam, Sanparith Marukatat*

**Counterfactual Mean Embedding: A Kernel Method for Nonparametric Causal Inference**

2018

arXiv: 1805.08845v1 | pdf | BibTeX

This paper introduces a novel Hilbert space representation of a counterfactual distribution---called counterfactual mean embedding (CME)---with applications in nonparametric causal inference. Counterfactual prediction has become an ubiquitous tool in machine learning applications, such as online advertisement, recommendation systems, and medical diagnosis, whose performance relies on certain interventions. To infer the outcomes of such interventions, we propose to embed the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel. Under appropriate assumptions, the CME allows us to perform causal inference over the entire landscape of the counterfactual distribution. The CME can be estimated consistently from observational data without requiring any parametric assumption about the underlying distributions. We also derive a rate of convergence which depends on the smoothness of the conditional mean and the Radon-Nikodym derivative of the underlying marginal distributions. Our framework can deal with not only real-valued outcome, but potentially also more complex and structured outcomes such as images, sequences, and graphs. Lastly, our experimental results on off-policy evaluation tasks demonstrate the advantages of the proposed estimator.

*Damien Garreau, Wittawat Jitkrittum, Motonobu Kanagawa*

**Large sample analysis of the median heuristic**

arXiv:1707.07269 [math.ST] | pdf

2018

In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performances make it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing on the setting of kernel two-sample test. We collect new findings that may be of interest for both theoreticians and practitioners. In theory, we provide a convergence analysis that shows the asymptotic normality of the bandwidth chosen by the median heuristic in the setting of kernel two-sample test. Systematic empirical investigations are also conducted in simple settings, comparing the performances based on the bandwidths chosen by the median heuristic and those by the maximization of test power.

*Takafumi Kajihara, Motonobu Kanagawa, Yuuki Nakaguchi, Kanishka Khandelwal, Kenji Fukumiziu*

**Model Selection for Simulator-based Statistical Models: A Kernel Approach**

2019

We propose a novel approach to model selection for simulator-based statistical models. The proposed approach defines a mixture of candidate models, and then iteratively updates the weight coefficients for those models as well as the parameters in each model simultaneously; this is done by recursively applying Bayes' rule, using the recently proposed kernel recursive ABC algorithm. The practical advantage of the method is that it can be used even when a modeler lacks appropriate prior knowledge about the parameters in each model. We demonstrate the effectiveness of the proposed approach with a number of experiments, including model selection for dynamical systems in ecology and epidemiology.

*Alonso Marco, Dominik Baumann, Philipp Hennig, Sebastian Trimpe*

**Classified Regression for Bayesian Optimization: Robot Learning with Unknown Penalties**

2019

Learning robot controllers by minimizing a black-box objective cost using Bayesian optimization (BO) can be time-consuming and challenging. It is very often the case that some roll-outs result in failure behaviors, causing premature experiment detention. In such cases, the designer is forced to decide on heuristic cost penalties because the acquired data is often scarce, or not comparable with that of the stable policies. To overcome this, we propose a Bayesian model that captures exactly what we know about the cost of unstable controllers prior to data collection: Nothing, except that it should be a somewhat large number. The resulting Bayesian model, approximated with a Gaussian process, predicts high cost values in regions where failures are likely to occur. In this way, the model guides the BO exploration toward regions of stability. We demonstrate the benefits of the proposed model in several illustrative and statistical synthetic benchmarks, and also in experiments on a real robotic platform. In addition, we propose and experimentally validate a new BO method to account for unknown constraints. Such method is an extension of Max-Value Entropy Search, a recent information-theoretic method, to solve unconstrained global optimization problems.

*Maren Mahsereci, Lukas Balles, Christoph Lassner, Philipp Hennig*

**Early Stopping Without a Validation Set (PREPRINT)**

2017

arXiv preprint: 1703.09580 | pdf | BibTeX

Early stopping is a widely used technique to prevent poor generalization performance when training an over-expressive model by means of gradient-based optimization. To find a good point to halt the optimizer, a common practice is to split the dataset into a training and a smaller validation set to obtain an ongoing estimate of the generalization performance. In this paper we propose a novel early stopping criterion which is based on fast-to-compute, local statistics of the computed gradients and entirely removes the need for a held-out validation set. Our experiments show that this is a viable approach in the setting of least-squares and logistic regression as well as neural networks.

*Filip de Roos and Philipp Hennig*

**Krylov Subspace Recycling for Fast Iterative Least-Squares in Machine Learning**

2017

arXiv preprint: 1706.00241 | pdf | BibTeX

Solving symmetric positive definite linear problems is a fundamental computational task in machine learning. The exact solution, famously, is cubicly expensive in the size of the matrix. To alleviate this problem, several linear-time approximations, such as spectral and inducing-point methods, have been suggested and are now in wide use. These are low-rank approximations that choose the low-rank space a priori and do not refine it over time. While this allows linear cost in the data-set size, it also causes a finite, uncorrected approximation error. Authors from numerical linear algebra have explored ways to iteratively refine such low-rank approximations, at a cost of a small number of matrix-vector multiplications. This idea is particularly interesting in the many situations in machine learning where one has to solve a sequence of related symmetric positive definite linear problems. From the machine learning perspective, such deflation methods can be interpreted as transfer learning of a low-rank approximation across a time-series of numerical tasks. We study the use of such methods for our field. Our empirical results show that, on regression and classification problems of intermediate size, this approach can interpolate between low computational cost and numerical precision.