- This seminar includes a total of 9 talks by machine-learning experts from all around the world.
- Each talk is 30 minutes. All talks will be held in the open area at the RIKEN-AIP Center in Nihombashi, Tokyo.
- The event is open to public, but in case too many people join, AIP people are given priority. Register here.
- There will be one coffee break. We will provide coffee, but no snacks/food.

- 08:30-8:55 Registration/setting-up
**Session I**(Chair: Emti)-
- 09:00 Hung Bui (VinAI Research),
*Amortized Inference Regularization*[Abstract] - 09:30 Emtiyaz Khan (RIKEN-AIP)
*Learning-Algorithms from Bayesian Principle*[Abstract] - 10:00 Arno Solin (U Aalto, Finland),
*Scaling GPs for Real-Time Inference: Infinite-Horizon Gaussian Processes*[Abstract] - 10:30 James Foulds (U Maryland),
*Bayesian Modeling of Intersectional Fairness: The Variance of Bias*[Abstract]

- 09:00 Hung Bui (VinAI Research),
- 11:00-11:15 Coffee/Tea
**Session II**(Chair: TBD)-
- 11:15 Zoltan Szabo (École Polytechnique, France),
*Towards Outlier-Robust Statistical Inference on Kernel-Enriched Domains*[Abstract] - 11:45 Arthur Gretton (Gatsby Unit, UK),
*Advances in kernel exponential families*[Abstract] - 12:15 Mark Schmidt (UBC, Vancouver),
*“Active-set Complexity” of Proximal Gradient: How Long Does It Take to Find the Sparsity Pattern?*[Abstract] - 12:45 Matt Staib (MIT),
*Stochastic and Robust Submodular Optimization*[Abstract] - 13:15 Ling Chen (UT Sydney),
*Graph Embedding: From Learning to Hashing*[Abstract]

- 11:15 Zoltan Szabo (École Polytechnique, France),
- 13:45 Concluding Remarks

Title: Scaling GPs for Real-Time Inference: Infinite-Horizon Gaussian Processes

Abstract: Gaussian processes provide a flexible framework for forecasting, removing noise, and interpreting long temporal datasets. State space modelling (Kalman filtering) enables these non-parametric models to be deployed on long datasets by reducing the complexity to linear in the number of data points. The complexity is still cubic in the state dimension m which is an impediment to practical application. In certain special cases (Gaussian likelihood, regular spacing) the GP posterior will reach a steady posterior state when the data are very long. By leveraging this, we can formulate an inference scheme for GPs with general likelihoods, where inference is based on single-sweep EP (assumed density filtering). The infinite-horizon model tackles the cubic cost in the state dimensionality and reduces the cost in the state dimension m to O(m^2) per data point. This model can be applied to large finite-length modelling problems, or run in real-time on a smartphone on a continuous data stream updated at hundreds of Hz.

Title: Advances in Kernel Exponential Family

Abstract: We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite-dimensional. The model is learned by fitting the derivative of the log density, the score, thus avoiding the need to compute a normalization constant. Our approach improves the computational efficiency of an earlier solution by using a low-rank, Nystrom-like solution. The new solution retains the consistency and convergence rates of the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. We evaluate the method in experiments on density estimation and in the construction of an adaptive Hamiltonian Monte Carlo sampler. Compared to an existing score learning approach using a denoising autoencoder, our estimator is empirically more data-efficient when estimating the score, runs faster, and has fewer parameters (which can be tuned in a principled and interpretable way), in addition to providing statistical guarantees.

Title: Learning-Algorithms from Bayesian Principle

Abstract: In machine learning, new learning algorithms are designed by borrowing ideas from optimization and statistics followed by an extensive empirical efforts to make them practical. However, there is a lack of underlying principles to guide this process. I will present a stochastic learning algorithm derived from Bayesian principle. Using this algorithm, we can obtain a range of existing algorithms: from classical methods such as least-squares, Newton's method, and Kalman filter to new deep-learning algorithms such as RMSprop and Adam. Surprisingly, using the same principles, new algorithms can be naturally obtained even for the challenging learning tasks such as online learning, continual learning, and reinforcement learning. This talk will summarize recent works and outline future directions on how this principle can be used to make algorithms that mimic the learning behaviour of living beings.

Title: Amortized Inference Regularization

Abstract: The variational autoencoder (VAE) is a popular model for density estimation and representation learning. Canonically, the variational principle suggests to prefer an expressive inference model so that the variational approximation is accurate. However, it is often overlooked that an overly-expressive inference model can be detrimental to the test set performance of both the amortized posterior approximator and, more importantly, the generative density estimator. In this paper, we leverage the fact that VAEs rely on amortized inference and propose techniques for amortized inference regularization (AIR) that control the smoothness of the inference model. We demonstrate that, by applying AIR, it is possible to improve VAE generalization on both inference and generative performance. Our paper challenges the belief that amortized inference is simply a mechanism for approximating maximum likelihood training and illustrates that regularization of the amortization family provides a new direction for understanding and improving generalization in VAEs. (available at https://arxiv.org/abs/1805.08913)

Title: Bayesian Modeling of Intersectional Fairness: The Variance of Bias

Abstract: With the rising influence of machine learning algorithms on many important aspects of our daily lives, there are growing concerns that biases inherent in data can lead the behavior of these algorithms to discriminate against certain populations. Informed by the framework of intersectionality from the Humanities literature, we propose mathematical definitions of AI fairness that aim to ensure protection along overlapping dimensions including gender, race, sexual orientation, class, and disability. We prove that our fairness criteria behave sensibly for any subset of the set of protected attributes, and we illustrate links to differential privacy. Finally, we present a Bayesian probabilistic modeling approach for the reliable, data-efficient estimation of fairness with multi-dimensional protected attributes. Experimental results on criminal justice, census, and synthetic data demonstrate the utility of our methodology, and show that Bayesian methods are valuable for the modeling and measurement of fairness in an intersectional context.

Title: Graph Embedding: From Learning to Hashing

Abstract: The surge of real-world graph data, such as chemical compounds, networks and social communities, has led to the rise of graph mining research. However, due to the arbitrary structure of graph data, it is not easy to compute graph similarities. The essential challenge of graph mining is to represent graph data in a vector space that facilitates downstream data mining and machine learning tasks. Therefore, graph representation learning has received a great amount of attention recently. While most of the existing graph representation research focuses on learning the vector representation for graph data, this talk introduces our recent works that use randomised hashing to map graphs to vectors of a fixed number of dimensions. We have developed efficient hashing algorithms for both node embedding and graph embedding. Our experimental results demonstrate that our hashing algorithms serve as the most efficient graph feature extraction method. The generated hashing codes lead to no observable performance loss in applications such as graph classification.

Title: “Active-set Complexity” of Proximal Gradient: How Long Does It Take to Find the Sparsity Pattern?

Abstract: Proximal gradient methods have been found to be highly effective for solving minimization problems with non-negative constraints or $ell_1$-regularization. Under suitable nondegeneracy conditions, it is known that these algorithms identify the optimal sparsity pattern for these types of problems in a finite number of iterations. However, it is not known how many iterations this may take. We introduce the notion of the "active-set complexity", which in these cases is the number of iterations before an algorithm is guaranteed to have identified the final sparsity pattern. We further give a bound on the active-set complexity of proximal gradient methods in the common case of minimizing the sum of a strongly-convex smooth function and a separable convex non-smooth function.

Title: Stochastic and Robust Submodular Optimization

Abstract: Submodular functions have applications throughout machine learning, but in many settings, we do not have direct access to the underlying function f. We focus on stochastic functions that are given as an expectation of functions over a distribution P. In practice, we often have only a limited set of samples f_i from P. The standard approach indirectly optimizes f by maximizing the sum of f_i. However, this ignores generalization to the true (unknown) distribution. Instead, we achieve better performance on the actual underlying function f by directly optimizing a combination of bias and variance. Algorithmically, we accomplish this by showing how to carry out distributionally robust optimization (DRO) for submodular functions, providing efficient algorithms backed by theoretical guarantees which leverage novel contributions to the general theory of DRO. We also show compelling empirical evidence that DRO improves generalization to the unknown stochastic submodular function.

Title: Towards Outlier-Robust Statistical Inference on Kernel-Enriched Domains

Abstract: Mean embedding and maximum mean discrepancy (MMD) form the basis of modern statistical inference techniques on kernel-endowed domains with a wide variety of successful applications. In several examples the underlying kernel used is unbounded (examples include polynomial, exponential, string or graph kernels) in which case even a single outlier can completely ruin the available mean embedding and MMD estimators. In this talk, I will (i) focus on how to design systematically outlier-robust mean embedding and MMD estimators relying on the recently emerged principle of median-of-means, (ii) detail the consistency and excessive outlier-robustness of the constructed estimators, and (iii) illustrate their efficiency in discrimination of DNA subsequences. [joint work with Matthieu Lerasle, Timothee Mathieu and Guillaume Lecue]

Preprint: https://arxiv.org/abs/1802.04784