We derive information-theoretic lower bounds on the Bayes risk and generalization error of realizable machine learning models. In particular, we employ an analysis in which the rate-distortion function of the model parameters bounds the required mutual information between the training samples and the model parameters in order to learn a model up to a Bayes risk constraint. For realizable models, we show that both the rate distortion function and mutual information admit expressions that are convenient for analysis. For models that are (roughly) lower Lipschitz in their parameters, we bound the rate distortion function from below, whereas for VC classes, the mutual information is bounded above by $d_\mathrm{vc}\log(n)$. When these conditions match, the Bayes risk with respect to the zero-one loss scales no faster than $\Omega(d_\mathrm{vc}/n)$, which matches known outer bounds and minimax lower bounds up to logarithmic factors. We also consider the impact of label noise, providing lower bounds when training and/or test samples are corrupted.
We propose a theoretical framework for the problem of learning a real-valued function which meets fairness requirements. This framework is built upon the notion of $\alpha$-relative (fairness) improvement of the regression function which we introduce using the theory of optimal transport. Setting $\alpha = 0$ corresponds to the regression problem under the Demographic Parity constraint, while $\alpha = 1$ corresponds to the classical regression problem without any constraints. For $\alpha \in (0, 1)$ the proposed framework allows to continuously interpolate between these two extreme cases and to study partially fair predictors. Within this framework we precisely quantify the cost in risk induced by the introduction of the fairness constraint. We put forward a statistical minimax setup and derive a general problem-dependent lower bound on the risk of any estimator satisfying $\alpha$-relative improvement constraint. We illustrate our framework on a model of linear regression with Gaussian design and systematic group-dependent bias, deriving matching (up to absolute constants) upper and lower bounds on the minimax risk under the introduced constraint. We provide a general post-processing strategy which enjoys fairness, risk guarantees and can be applied on top of any black-box algorithm. Finally, we perform a simulation study of the linear model and numerical experiments of benchmark data, validating our theoretical contributions.
We consider a platform's problem of collecting data from privacy sensitive users to estimate an underlying parameter of interest. We formulate this question as a Bayesian-optimal mechanism design problem, in which an individual can share her (verifiable) data in exchange for a monetary reward or services, but at the same time has a (private) heterogeneous privacy cost which we quantify using differential privacy. We consider two popular differential privacy settings for providing privacy guarantees for the users: central and local. In both settings, we establish minimax lower bounds for the estimation error and derive (near) optimal estimators for given heterogeneous privacy loss levels for users. Building on this characterization, we pose the mechanism design problem as the optimal selection of an estimator and payments that will elicit truthful reporting of users' privacy sensitivities. Under a regularity condition on the distribution of privacy sensitivities we develop efficient algorithmic mechanisms to solve this problem in both privacy settings. Our mechanism in the central setting can be implemented in time $\mathcal{O}(n \log n)$ where $n$ is the number of users and our mechanism in the local setting admits a Polynomial Time Approximation Scheme (PTAS).
We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.
This paper focuses on stochastic saddle point problems with decision-dependent distributions in both the static and time-varying settings. These are problems whose objective is the expected value of a stochastic payoff function, where random variables are drawn from a distribution induced by a distributional map. For general distributional maps, the problem of finding saddle points is in general computationally burdensome, even if the distribution is known. To enable a tractable solution approach, we introduce the notion of equilibrium points -- which are saddle points for the stationary stochastic minimax problem that they induce -- and provide conditions for their existence and uniqueness. We demonstrate that the distance between the two classes of solutions is bounded provided that the objective has a strongly-convex-strongly-concave payoff and Lipschitz continuous distributional map. We develop deterministic and stochastic primal-dual algorithms and demonstrate their convergence to the equilibrium point. In particular, by modeling errors emerging from a stochastic gradient estimator as sub-Weibull random variables, we provide error bounds in expectation and in high probability that hold for each iteration; moreover, we show convergence to a neighborhood in expectation and almost surely. Finally, we investigate a condition on the distributional map -- which we call opposing mixture dominance -- that ensures the objective is strongly-convex-strongly-concave. Under this assumption, we show that primal-dual algorithms converge to the saddle points in a similar fashion.
We in this paper theoretically go over a rate-distortion based sparse dictionary learning problem. We show that the Degrees-of-Freedom (DoF) interested to be calculated $-$ satnding for the minimal set that guarantees our rate-distortion trade-off $-$ are basically accessible through a Langevin equation. We indeed explore that the relative time evolution of DoF, i.e., the transition jumps is the essential issue for a relaxation over the relative optimisation problem. We subsequently prove the aforementioned relaxation through the \textit{Graphon} principle w.r.t. a stochastic Chordal Schramm-Loewner evolution etc via a minimisation over a distortion between the relative realisation times of two given graphs $\mathscr{G}_1$ and $\mathscr{G}_2$ as $ \mathop{{\rm \mathbb{M}in}}\limits_{ \mathscr{G}_1, \mathscr{G}_2} {\rm \; } \mathcal{D} \Big( t\big( \mathscr{G}_1 , \mathscr{G} \big) , t\big( \mathscr{G}_2 , \mathscr{G} \big) \Big)$. We also extend our scenario to the eavesdropping case. We finally prove the efficiency of our proposed scheme via simulations.
This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.
Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and enable to characterize scaling rates thereof that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are near minimax rate-optimal, achieving the parametric rate up to logarithmic factors.
Many-user MAC is an important model for understanding energy efficiency of massive random access in 5G and beyond. Introduced in Polyanskiy'2017 for the AWGN channel, subsequent works have provided improved bounds on the asymptotic minimum energy-per-bit required to achieve a target per-user error at a given user density and payload, going beyond the AWGN setting. The best known rigorous bounds use spatially coupled codes along with the optimal AMP algorithm. But these bounds are infeasible to compute beyond a few (around 10) bits of payload. In this paper, we provide new achievability bounds for the many-user AWGN and quasi-static Rayleigh fading MACs using the spatially coupled codebook design along with a scalar AMP algorithm. The obtained bounds are computable even up to 100 bits and outperform the previous ones at this payload.
Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic ``expansion'' assumption, which states that a low-probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.