UMass CICS Machine Learning Retrospective (2021)

University of Massachusetts

Manning College of Information and Computer Sciences

Machine Learning Retrospective, 2021

With 2021 drawing to a close, we would like to take a moment to recognize the wealth of machine learning research produced by the UMass Manning College of Information and Computer Sciences (CICS). This retrospective provides a brief summary of many (not all) of the machine learning papers published by students and/or faculty in CICS. You can browse papers by their name in the index below, or can just scroll through to get a sense for all of the work that we are doing!

Index

[AISTATS 2021] RealMVP: A Change of Variables Method For Rectangular Matrix-Vector Products.

[ICML 2021] High Confidence Generalization for Reinforcement Learning.

[ICML 2021] On the Difficulty of Unbiased Alpha Divergence Minimization.

[ICML 2021] Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods.

[ICML 2021] Towards Practical Mean Bounds for Small Samples.

[ICML 2021] How and Why to Use Experimental Data to Evaluate Methods for Observational Causal Inference.

[ICML 2021] DeepWalking Backwards: From Node Embeddings Back to Graphs.

[ICML 2021] Faster Kernel Matrix Algebra via Density Estimation.

[NeurIPS 2021] Structural Credit Assignment in Neural Networks using Reinforcement Learning.

[NeurIPS 2021] Universal Off-Policy Evaluation.

[NeurIPS 2021] SOPE: Spectrum of Off-Policy Estimators.

[NeurIPS 2021] MCMC Variational Inference via Uncorrected Hamiltonian Annealing.

[NeurIPS 2021] Amortized Variational Inference for Simple Hierarchical Models.

[NeurIPS 2021] Relaxed Marginal Consistency for Differentially Private Query Answering.

[NeurIPS 2021] Pareto-Optimal Learning-Augmented Algorithms for Online Conversion Problems.

[NeurIPS 2021] Cooperative Stochastic Bandits with Asynchronous Agents and Constrained Feedback.

[NeurIPS 2021] Turing Completeness of Bounded-Precision Recurrent Neural Networks.

[NeurIPS 2021] MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents.

[NeurIPS 2021] Coresets for Classification – Simplified and Strengthened.

RealMVP: A Change of Variables Method For Rectangular Matrix-Vector Products

By Edmond Cunningham and Madalina Fiterau

Rectangular matrix-vector products are used extensively throughout machine learning and are fundamental to neural networks such as multi-layer perceptrons, but are notably absent as normalizing flow layers. This paper identifies this methodological gap and plugs it with a tall and wide MVP change of variables formula. Our theory builds up to a practical algorithm that envelops existing dimensionality increasing flow methods such as augmented flows. We show that tall MVPs are closely related to the stochastic inverse of wide MVPs and empirically demonstrate that they improve density estimation over existing dimension changing methods.

Paper

Github

High Confidence Generalization for Reinforcement Learning

By James Kostas, Yash Chandak, Scott Jordan, Georgios Theocharous, Philip Thomas

We present several classes of reinforcement learning algorithms that safely generalize to Markov decision processes (MDPs) not seen during training. Specifically, we study the setting in which some set of MDPs is accessible for training. For various definitions of safety, our algorithms give probabilistic guarantees that agents can safely generalize to MDPs that are sampled from the same distribution but are not necessarily in the training set. These algorithms are a type of Seldonian algorithm (Thomas et al., 2019), which is a class of machine learning algorithms that return models with probabilistic safety guarantees for user-specified definitions of safety.

Paper

On the Difficulty of Unbiased Alpha Divergence Minimization

By Tomas Geffner and Justin Domke

Variational inference approximates a target distribution with a simpler one. While traditional inference minimizes the “exclusive” KL-divergence, several algorithms have recently been proposed to minimize other divergences. Experimentally, however, these algorithms often seem to fail to converge. In this paper we analyze the variance of the underlying estimators for these algorithms. Our results are very pessimistic: For any divergence except the traditional one, the signal-to-noise ratio of the gradient estimator decays exponentially in the dimensionality.

Paper

Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods

By Chris Nota, Bruno C. da Silva, Philip S. Thomas

Hindsight allows reinforcement learning agents to leverage new observations to make inferences about earlier states and transitions. In this paper, we exploit the idea of hindsight and introduce posterior value functions. Posterior value functions are computed by inferring the posterior distribution over hidden components of the state in previous timesteps and can be used to construct novel unbiased baselines for policy gradient methods. Importantly, we prove that these baselines reduce (and never increase) the variance of policy gradient estimators compared to traditional state value functions. While the posterior value function is motivated by partial observability, we extend these results to arbitrary stochastic MDPs by showing that hindsight-capable agents can model stochasticity in the environment as a special case of partial observability. Finally, we introduce a pair of methods for learning posterior value functions and prove their convergence.

Paper

Towards Practical Mean Bounds for Small Samples

By My Phan, Philip S. Thomas, Erik Learned-Miller

Historically, to bound the mean for small sample sizes, practitioners have had to choose between using methods with unrealistic assumptions about the unknown distribution (e.g., Gaussianity) and methods like Hoeffding's inequality that use weaker assumptions but produce much looser (wider) intervals. In 1969, Anderson proposed a mean confidence interval strictly better than or equal to Hoeffding's whose only assumption is that the distribution's support is contained in an interval [a, b]. For the first time since then, we present a new family of bounds that compares favorably to Anderson's. We prove that each bound in the family has guaranteed coverage, i.e., it holds with probability at least 1−α for all distributions on an interval [a, b]. Furthermore, one of the bounds is tighter than or equal to Anderson's for all samples. In simulations, we show that for many distributions, the gain over Anderson's bound is substantial.
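To make the Hoeffding baseline mentioned above concrete, here is a minimal sketch of a one-sided upper confidence bound on the mean. It shows only the classical Hoeffding bound, not Anderson's bound or the new family from the paper; the function name and argument names are our own, and the samples are assumed to be i.i.d. with support in [a, b].

```python
import math


def hoeffding_upper_bound(samples, a, b, alpha):
    """Upper bound on the mean that holds with probability >= 1 - alpha.

    Assumes the samples are i.i.d. draws from a distribution supported on [a, b].
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if any(x < a or x > b for x in samples):
        raise ValueError("samples must lie in [a, b]")
    sample_mean = sum(samples) / n
    # Hoeffding: P(mean - sample_mean >= t) <= exp(-2 n t^2 / (b - a)^2).
    # Setting the right-hand side to alpha and solving for t gives the width.
    width = (b - a) * math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    # The mean can never exceed b, so clip the bound there.
    return min(b, sample_mean + width)
```

For 100 samples on [0, 1] with a sample mean of 0.5 and alpha = 0.05, the bound is roughly 0.62; with only 4 samples the width exceeds the whole interval, which illustrates why looser bounds are a problem for small samples.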

Paper

How and Why to Use Experimental Data to Evaluate Methods for Observational Causal Inference

By Amanda M Gentzel, Purva Pruthi, David Jensen

Methods that infer causal dependence from observational data are central to many areas of science, including medicine, economics, and the social sciences. We describe and analyze observational sampling from randomized controlled trials (OSRCT). This method is used to create observational data sets with corresponding unbiased estimates of treatment effect, increasing the number of data sets available for evaluating causal inference methods. We show that OSRCT creates data sets that are equivalent to those produced by randomly sampling from empirical data sets in which all potential outcomes are available. We then perform a large-scale evaluation and find notable performance differences when comparing across data from different sources, demonstrating the importance of using data from a variety of sources when evaluating any causal inference method.

Paper

DeepWalking Backwards: From Node Embeddings Back to Graphs

By Sudhanshu Chanpuriya, Cameron Musco, Konstantinos Sotiropoulos, Charalampos E. Tsourakakis

We investigate whether node embeddings, which are vector representations of graph nodes, can be inverted to approximately recover the graph used to generate them. We present algorithms that invert embeddings from the popular DeepWalk method. In experiments on real-world networks, we find that significant information about the original graph, such as specific edges, is often lost through the process of embedding and inversion; however, community structure is often preserved or even enhanced. Our findings are a step towards a more rigorous understanding of what information embeddings encode about the input graph, and why this information is useful for learning tasks.

Paper

arXiv

Faster Kernel Matrix Algebra via Density Estimation

By Arturs Backurs, Piotr Indyk, Cameron Musco, and Tal Wagner

Consider an n x n Gaussian kernel matrix corresponding to n input points in d dimensions. We show that one can compute a relative error approximation to the sum of entries in this matrix in just O(dn^{2/3}) time. This is significantly sublinear in the number of entries in the matrix – which is n^2. Our algorithm combines a novel analysis of entrywise sampling with fast kernel density estimation methods based on locality sensitive hashing. We extend our results to other popular kernels and build on this basic result to give other fast linear algebraic primitives for kernel matrices. For example, we give the first subquadratic time algorithms for approximating the top eigenvector and top eigenvalue of a Gaussian kernel matrix.
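For a sense of the quantity being approximated, the sketch below computes the kernel matrix sum exactly in O(dn^2) time and, for comparison, estimates it by naive uniform entrywise sampling. This is not the paper's algorithm: the uniform sampler carries no relative-error guarantee, and the kernel form exp(-||x - y||^2) with unit bandwidth, along with all function names, is an assumption made for illustration.

```python
import math
import random


def gaussian_kernel(x, y):
    """Gaussian kernel exp(-||x - y||^2), with bandwidth fixed to 1 (an assumption)."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def exact_kernel_sum(points):
    """Sum of all n^2 kernel matrix entries; costs O(d n^2) time."""
    return sum(gaussian_kernel(x, y) for x in points for y in points)


def sampled_kernel_sum(points, num_samples, rng):
    """Unbiased estimate of the sum from uniformly sampled entries.

    Naive baseline only: its variance can be large when a few entries dominate.
    """
    n = len(points)
    total = 0.0
    for _ in range(num_samples):
        i = rng.randrange(n)
        j = rng.randrange(n)
        total += gaussian_kernel(points[i], points[j])
    # Rescale the sample average by the number of matrix entries.
    return n * n * total / num_samples
```

The exact routine touches every entry, which is the quadratic cost the paper's method avoids.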

Paper

Structural Credit Assignment in Neural Networks using Reinforcement Learning

By Dhawal Gupta, Gabor Mihucz, Matthew Schlegel, James Kostas, Philip Thomas, Martha White

In this work, we revisit REINFORCE and investigate whether we can leverage other reinforcement learning approaches to improve learning. We formalize training a neural network as a finite-horizon reinforcement learning problem and discuss how this facilitates using ideas from reinforcement learning like off-policy learning. We show that the standard on-policy REINFORCE algorithm, even with variance reduction approaches, learns sub-optimal solutions. We introduce an off-policy approach to facilitate reasoning about the greedy action for other agents and help overcome stochasticity in other agents. We conclude by showing that these networks of agents can be more robust to correlated samples when learning online.

Paper

Universal Off-Policy Evaluation

By Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip Thomas

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO) -- one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
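One way to picture estimating the whole return distribution off-policy is an importance-weighted empirical CDF, from which other parameters can then be read off. The sketch below is only an illustration under assumptions: the per-trajectory importance weights are taken as given, the function name is hypothetical, and the high-confidence bounds discussed in the paper are not shown.

```python
def weighted_return_cdf(returns, weights, v):
    """Importance-weighted estimate of P(return <= v) under the new policy.

    returns[i] is the return observed on trajectory i under the old policy;
    weights[i] is that trajectory's importance weight (assumed precomputed).
    """
    if len(returns) != len(weights):
        raise ValueError("returns and weights must have the same length")
    if not returns:
        raise ValueError("need at least one trajectory")
    # Average of weight * indicator(return <= v) over all trajectories.
    return sum(w for g, w in zip(returns, weights) if g <= v) / len(returns)
```

When every weight equals 1 (the new policy matches the old one), this reduces to the ordinary empirical CDF of the observed returns.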

Paper

SOPE: Spectrum of Off-Policy Estimators

By Christina Yuan, Yash Chandak, Stephen Giguere, Philip Thomas, Scott Niekum

Off-policy evaluation (OPE) of a new policy using historical data is used in many high-stakes applications. Importance sampling (IS), one of the most common OPE methods, provides unbiased estimates but has high variance. IS methods based on stationary distributions (SIS) have recently been adopted, which often provide lower variance estimates, but can be biased. In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS. We then show that estimators in this spectrum can achieve lower mean-squared error than both IS and SIS.
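The IS endpoint of the spectrum can be sketched in a few lines. This is the ordinary trajectory-wise importance sampling estimator, not SIS or the interpolated estimators from the paper. The interface is an assumption for illustration: trajectories are lists of (state, action, reward) tuples, and each policy is a function returning the probability of an action in a state.

```python
def importance_sampling_estimate(trajectories, target_policy, behavior_policy, gamma=1.0):
    """Ordinary importance sampling estimate of the target policy's expected return."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0    # product of per-step probability ratios
        ret = 0.0       # discounted return of this trajectory
        discount = 1.0
        for state, action, reward in trajectory:
            weight *= target_policy(state, action) / behavior_policy(state, action)
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)
```

Because the weight is a product over every step of a trajectory, it can become very large or very small on long horizons, which is the source of the high variance noted above.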

Paper

MCMC Variational Inference via Uncorrected Hamiltonian Annealing

By Tomas Geffner, Justin Domke

Annealed Importance Sampling (AIS) with Hamiltonian MCMC can be used to get tight lower bounds on a distribution's (log) normalization constant. Its main drawback is that it uses non-differentiable transition kernels, which makes tuning its many parameters hard. We propose a framework to use an AIS-like procedure with Uncorrected Hamiltonian MCMC, called Uncorrected Hamiltonian Annealing. Our method leads to tight and differentiable bounds. Additionally, we observe empirically that the ability to tune all of our method's parameters using unbiased reparameterization gradients leads to significant gains in performance.

Paper

Amortized Variational Inference for Simple Hierarchical Models

By Abhinav Agrawal, Justin Domke

It is difficult to use subsampling with variational inference in hierarchical models since the number of local latent variables scales with the dataset. Thus, inference in hierarchical models remains a challenge at large scale. It is helpful to use a variational family with structure matching the posterior, but optimization is still slow due to the huge number of local distributions. Instead, this paper suggests an amortized approach where shared parameters simultaneously represent all local distributions and the encoder network only requires the local observations as input. This approach is about as accurate as using a given joint distribution (e.g., a full-rank Gaussian) but is feasible on datasets that are several orders of magnitude larger. It is also dramatically faster than using a structured variational distribution.

Paper

Relaxed Marginal Consistency for Differentially Private Query Answering

By Ryan McKenna, Siddhant Pradhan, Daniel Sheldon, Gerome Miklau

Differentially private algorithms for answering database queries often involve reconstruction of a discrete distribution from noisy measurements. PRIVATE-PGM is a recent exact inference based technique that scales well for sparse measurements and provides consistent and accurate answers. However, it fails to run in high dimensions with dense measurements. This work overcomes the scalability limitation of PRIVATE-PGM on dense data by relaxing consistency constraints. Our new approach works with many existing private query answering algorithms and improves scalability or accuracy with no privacy cost.

Paper

Github

Pareto-Optimal Learning-Augmented Algorithms for Online Conversion Problems

By Bo Sun, Russell Lee, Mohammad Hajiesmaili, Adam Wierman, Danny Tsang

In this work, we leverage machine-learned predictions to design competitive algorithms for online conversion problems with the goal of improving the competitive ratio when predictions are accurate (i.e., consistency), while also guaranteeing a worst-case competitive ratio regardless of the prediction quality (i.e., robustness). We unify the algorithmic design of both integral and fractional conversion problems into a class of online threshold-based algorithms (OTA). By incorporating predictions into the design of OTA, we achieve the Pareto-optimal trade-off of consistency and robustness, i.e., no online algorithm can achieve a better consistency guarantee for a given robustness guarantee. We demonstrate the performance of OTA using numerical experiments on Bitcoin conversion.
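To illustrate what a threshold-based rule looks like in the simplest integral case, the sketch below sells a single item at the first price that reaches a fixed reservation price. It is the classical prediction-free rule for prices known to lie in [low, high], not the learning-augmented OTA from the paper; the function name and the forced sale at the final price are assumptions of this sketch.

```python
import math


def threshold_sell(prices, low, high):
    """Sell one item at the first price >= sqrt(low * high).

    Returns (index, price) of the sale. If no price reaches the threshold,
    the item is sold at the last price, since the sale must happen by then.
    """
    if not prices:
        raise ValueError("need at least one price")
    if not 0.0 < low <= high:
        raise ValueError("need 0 < low <= high")
    threshold = math.sqrt(low * high)
    for t, price in enumerate(prices):
        if price >= threshold:
            return t, price
    return len(prices) - 1, prices[-1]
```

With prices guaranteed to stay in [low, high], this reservation price is known to give a worst-case competitive ratio of sqrt(high / low). A learning-augmented design would shift the threshold using a prediction, trading some of that worst-case guarantee for better performance when the prediction is accurate.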

Paper

Poster

Slides

Cooperative Stochastic Bandits with Asynchronous Agents and Constrained Feedback

By Lin Yang, Yu-Zhen Janice Chen, Stephen Pasteris, Mohammad Hajiesmaili, John Lui, Don Towsley

This paper studies a cooperative multi-armed bandit problem with M agents cooperating together to solve the same instance of a K-armed stochastic bandit problem. The agents are heterogeneous in their limited access to a local subset of arms and in their decision-making rounds. The goal is to find the global optimal arm. Agents are able to pull any arm; however, they observe the reward only when the selected arm is local. The challenge is a tradeoff for agents between pulling a local arm with the possibility of observing the feedback, or relying on the observations of other agents that might occur at different rates. We propose a two-stage learning algorithm, whose regret matches the regret lower bound up to a K factor.

Paper

Poster

Slides

Turing Completeness of Bounded-Precision Recurrent Neural Networks

By Stephen Chung, Hava Siegelmann

Previous works have proved that recurrent neural networks (RNNs) are Turing-complete. In the proofs, the RNNs allow for neurons with unbounded precision, which is neither practical in implementation nor biologically plausible. To remove this assumption, we propose a dynamically growing memory module made of neurons of fixed precision. We prove that a 54-neuron bounded-precision RNN with growing memory modules can simulate a Universal Turing Machine, with time complexity linear in the simulated machine’s time and independent of the memory size. The result is extendable to other stack-augmented RNNs. Furthermore, we analyze the Turing completeness of both unbounded-precision and bounded-precision RNNs.

Paper

MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents

By Stephen Chung

Most deep learning algorithms rely on error backpropagation, which is generally regarded as biologically implausible. An alternative way of training an artificial neural network is through treating each unit in the network as a reinforcement learning agent. As such, all units can be trained by REINFORCE. However, this learning method suffers from high variance and thus slow learning. We propose a novel algorithm called MAP propagation to reduce this variance significantly while retaining the local property of the learning rule. Experiments demonstrated that MAP propagation could solve common reinforcement learning tasks at a similar speed to backpropagation.
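As a small illustration of the starting point described above, the sketch below trains a single stochastic unit with the REINFORCE rule for a Bernoulli-logistic unit. It does not implement MAP propagation. The toy task (fire exactly when the input bit is 1), the reward of 1 for a correct action, and all names are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_unit(steps, learning_rate, rng):
    """Train one Bernoulli-logistic unit with REINFORCE on a toy task."""
    weights = [0.0, 0.0]  # [weight on the input bit, bias]
    for _ in range(steps):
        bit = rng.randrange(2)
        features = [float(bit), 1.0]
        p = sigmoid(sum(w * f for w, f in zip(weights, features)))
        action = 1 if rng.random() < p else 0      # the unit fires stochastically
        reward = 1.0 if action == bit else 0.0     # toy reward signal
        # REINFORCE: reward times the gradient of log P(action), which for a
        # Bernoulli-logistic unit is (action - p) * features. The rule is local:
        # it uses only the unit's own input, output, and the scalar reward.
        for k in range(2):
            weights[k] += learning_rate * reward * (action - p) * features[k]
    return weights


def fire_probability(weights, bit):
    """Probability that the trained unit fires for the given input bit."""
    return sigmoid(weights[0] * bit + weights[1])
```

A call such as train_unit(5000, 0.5, random.Random(0)) should leave the unit firing with high probability on input 1 and low probability on input 0. In a network of many such units, every unit receives the same noisy reward, which is where the high variance comes from.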

Paper

Blog

Coresets for Classification – Simplified and Strengthened

By Tung Mai, Anup B. Rao, and Cameron Musco

We show how to sample a small subset of points from a larger dataset, such that if we solve logistic regression, hinge loss regression (i.e., soft margin SVM), or a number of other problems used to train linear classifiers on the sampled dataset, then we obtain a near optimal solution for the full dataset. This 'coreset' guarantee requires sampling the subset of points according to a carefully chosen distribution, which reflects each point's importance. We use a distribution based on the l_1 Lewis weights, which are closely related to the statistical leverage scores. This allows us to significantly improve the state-of-the-art for the problem.
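The sampling-and-reweighting step behind a guarantee of this kind can be sketched generically. The code below draws points according to given probabilities and reweights each draw so that the weighted loss on the sample is an unbiased estimate of the loss on the full dataset. Computing the l_1 Lewis weights is not shown: the probabilities are treated as an input, and the function names are hypothetical.

```python
import random


def sample_weighted_subset(probabilities, subset_size, rng):
    """Draw indices with the given probabilities; return (index, weight) pairs.

    Each draw of point i gets weight 1 / (subset_size * probabilities[i]),
    which makes weighted sums over the subset unbiased for full-data sums.
    """
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    if any(p <= 0.0 for p in probabilities):
        raise ValueError("every point needs a positive probability")
    indices = rng.choices(range(len(probabilities)), weights=probabilities, k=subset_size)
    return [(i, 1.0 / (subset_size * probabilities[i])) for i in indices]


def weighted_loss(subset, per_point_loss):
    """Weighted sum of per-point losses over the sampled subset."""
    return sum(weight * per_point_loss(i) for i, weight in subset)
```

Unbiasedness alone does not give a coreset guarantee. The choice of probabilities controls how concentrated the estimate is across all candidate classifiers, which is the role the Lewis weights play in the paper.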

Paper