We consider maximization of stochastic monotone continuous submodular functions (CSF) with a diminishing return property. Existing algorithms only guarantee the performance \textit{in expectation}, and do not bound the probability of getting a bad solution. This implies that for a particular run of the algorithms, the solution may be much worse than the provided guarantee in expectation. In this paper, we first empirically verify that this is indeed the case. Then, we provide the first \textit{high-probability} analysis of the existing methods for stochastic CSF maximization, namely PGA, boosted PGA, SCG, and SCG++. Finally, we provide an improved high-probability bound for SCG, under slightly stronger assumptions, with a better convergence rate than that of the expected solution. Through extensive experiments on non-concave quadratic programming (NQP) and optimal budget allocation, we confirm the validity of our bounds and show that even in the worst case, PGA converges to $OPT/2$, and boosted PGA, SCG, and SCG++ converge to $(1 - 1/e)OPT$, but at a slower rate than that of the expected solution.
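The gap between an in-expectation guarantee and the outcome of a single run can be seen on a toy problem. The following is a minimal sketch, not the paper's experimental setup: the 2-dimensional quadratic objective, the Gaussian gradient noise, the step size, and the function names `f`, `grad`, and `pga` are all illustrative assumptions. The matrix `H` has non-positive entries, which makes the objective DR-submodular, and `h` is chosen so that the gradient is non-negative on the box $[0,1]^2$, which makes it monotone there.

```python
import random

H = [[-1.0, -0.5], [-0.5, -1.0]]   # entrywise non-positive Hessian (assumed toy instance)
h = [2.0, 2.0]                      # keeps the gradient non-negative on [0, 1]^2


def f(x):
    """Quadratic objective 0.5 * x^T H x + h^T x."""
    quad = sum(H[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
    return 0.5 * quad + sum(h[i] * x[i] for i in range(2))


def grad(x):
    """Exact gradient H x + h."""
    return [sum(H[i][j] * x[j] for j in range(2)) + h[i] for i in range(2)]


def pga(steps=200, lr=0.05, noise=0.0, seed=0):
    """Projected gradient ascent on the box [0, 1]^2 with noisy gradients."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(steps):
        # Stochastic gradient: exact gradient plus zero-mean Gaussian noise.
        g = [gi + rng.gauss(0.0, noise) for gi in grad(x)]
        # Ascent step followed by projection (clipping) onto the box.
        x = [min(1.0, max(0.0, xi + lr * gi)) for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    values = [f(pga(noise=1.0, seed=s)) for s in range(20)]
    print("mean over runs :", sum(values) / len(values))
    print("worst single run:", min(values))
```

Comparing the mean with the worst run over several seeds is the kind of check that an in-expectation bound alone does not cover.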
Several kernel-based testing procedures are proposed to solve the problem of model selection in the presence of parameter estimation in a family of candidate models. Extending the two-sample test of Gretton et al. (2006), we first provide a way of testing whether some data is drawn from a given parametric model (model specification). Second, we provide a test statistic to decide whether two parametric models are equally valid to describe some data (model comparison), in the spirit of Vuong (1989). All our tests are asymptotically standard normal under the null, even when the true underlying distribution belongs to the competing parametric families. Some simulations illustrate the performance of our tests in terms of power and level.
We propose an online learning algorithm for a class of machine learning models under a separable stochastic approximation framework. The essence of our idea lies in the observation that certain parameters in the models are easier to optimize than others. In this paper, we focus on models where some parameters have a linear nature, which is common in machine learning. In one routine of the proposed algorithm, the linear parameters are updated by the recursive least squares (RLS) algorithm, which is equivalent to a stochastic Newton method; then, based on the updated linear parameters, the nonlinear parameters are updated by the stochastic gradient method (SGD). The proposed algorithm can be understood as a stochastic approximation version of the block coordinate gradient descent approach in which one part of the parameters is updated by a second-order SGD method while the other part is updated by a first-order SGD. Global convergence of the proposed online algorithm for non-convex cases is established in terms of the expected violation of a first-order optimality condition. Numerical experiments show that the proposed method accelerates convergence significantly and produces more robust training and test performance when compared to other popular learning algorithms. Moreover, our algorithm is less sensitive to the learning rate and outperforms the recently proposed slimTrain algorithm. The code has been uploaded to GitHub for validation.
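The alternation described above can be sketched on a small separable model. This is a hedged illustration, not the authors' released code: the model $y \approx w_0 + w_1 e^{-b x}$, the large initial covariance `1e6`, the optional forgetting factor `forget`, and the names `rls_step`, `features`, and `fit_online` are assumptions made for the example. The weights `w` enter linearly and get an RLS update; the scalar `b` enters nonlinearly and gets a plain SGD step on the squared residual.

```python
import math


def rls_step(w, P, phi, y, forget=1.0):
    """One recursive-least-squares update of the linear weights w (in place)."""
    n = len(w)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = forget + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    err = y - sum(w[i] * phi[i] for i in range(n))   # prediction error before the update
    for i in range(n):
        w[i] += gain[i] * err
    # Covariance update; uses the symmetry of P, so phi^T P equals (P phi)^T.
    for i in range(n):
        for j in range(n):
            P[i][j] = (P[i][j] - gain[i] * P_phi[j]) / forget
    return w, P


def features(x, b):
    """Features of the toy model; only the second one depends on b."""
    return [1.0, math.exp(-b * x)]


def fit_online(stream, b0, lr=0.05, forget=1.0):
    """Process (x, y) pairs one at a time: RLS on w, then SGD on b."""
    w = [0.0, 0.0]
    P = [[1e6, 0.0], [0.0, 1e6]]   # large initial covariance = weak prior on w
    b = b0
    for x, y in stream:
        phi = features(x, b)
        rls_step(w, P, phi, y, forget)               # second-order step, linear part
        resid = w[0] * phi[0] + w[1] * phi[1] - y    # residual after the RLS update
        grad_b = resid * w[1] * (-x) * phi[1]        # d/db of 0.5 * resid**2
        b -= lr * grad_b                             # first-order step, nonlinear part
    return w, b
```

The SGD step uses the residual computed with the freshly updated linear weights, which mirrors the ordering "linear parameters first, nonlinear parameters second" in the description.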
Physical models with uncertain inputs are commonly represented as parametric partial differential equations (PDEs), that is, PDEs with inputs that are expressed as functions of parameters with an associated probability distribution. Developing efficient and accurate solution strategies that account for errors on the space, time and parameter domains simultaneously is highly challenging. Indeed, it is well known that standard polynomial-based approximations on the parameter domain can incur errors that grow in time. In this work, we focus on advection-diffusion problems with parameter-dependent wind fields. A novel adaptive solution strategy is proposed that allows users to combine stochastic collocation on the parameter domain with off-the-shelf adaptive timestepping algorithms with local error control. This is a non-intrusive strategy that builds a polynomial-based surrogate that is adapted sequentially in time. The algorithm is driven by a so-called hierarchical estimator for the parametric error and balances this against an estimate for the global timestepping error which is derived from a scaling argument.
Inverse problems are in many cases solved with optimization techniques. When the underlying model is linear, first-order gradient methods are usually sufficient. With nonlinear models, due to nonconvexity, one must often resort to second-order methods that are computationally more expensive. In this work we aim to approximate a nonlinear model with a linear one and correct the resulting approximation error. We develop a sequential method that iteratively solves a linear inverse problem and updates the approximation error by evaluating it at the new solution. This treatment convexifies the problem and allows us to benefit from established convex optimization methods. We separately consider cases where the approximation is fixed over iterations and where the approximation is adaptive. In the fixed case we show theoretically under what assumptions the sequence converges. In the adaptive case, particularly considering the special case of approximation by first-order Taylor expansion, we show that under certain assumptions the sequence converges to a critical point of the original nonconvex functional. Furthermore, we show that with quadratic objective functions the sequence corresponds to the Gauss-Newton method. Finally, we showcase numerical results superior to the conventional model correction method. We also show that a fixed approximation can provide competitive results with considerable computational speed-up.
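A scalar toy problem illustrates the sequential scheme. This is a simplified sketch under stated assumptions, not the paper's method in full: there is no regularization, the unknown is a single number, and the forward model $F(x) = x^3 + x$, the data value, and the function name `sequential_linearization` are chosen only for illustration. Each iteration replaces $F$ by a linear model $A x$, evaluates the approximation error $\varepsilon = F(x_k) - A x_k$ at the current iterate, and solves the resulting linear least-squares problem exactly. With $A = F'(x_k)$ (first-order Taylor expansion) the update reduces to $x_{k+1} = x_k + (y - F(x_k)) / F'(x_k)$, which is the Gauss-Newton step for this quadratic objective.

```python
def sequential_linearization(F, dF, y, x0, n_iter=50, fixed_slope=None):
    """Solve F(x) = y by repeatedly solving a corrected linear problem.

    fixed_slope=None  -> adaptive approximation A = dF(x_k) (Taylor expansion)
    fixed_slope=value -> the same linear approximation A is reused every iteration
    """
    x = x0
    for _ in range(n_iter):
        A = fixed_slope if fixed_slope is not None else dF(x)
        eps = F(x) - A * x      # approximation error of the linear model at x
        x = (y - eps) / A       # exact minimizer of 0.5 * (A*x + eps - y)**2
    return x


def F(x):
    return x ** 3 + x


def dF(x):
    return 3 * x ** 2 + 1


if __name__ == "__main__":
    print(sequential_linearization(F, dF, y=10.0, x0=1.0))
    print(sequential_linearization(F, dF, y=10.0, x0=1.0, n_iter=200, fixed_slope=20.0))
```

In this toy case the fixed variant needs more iterations but never re-evaluates the derivative, which is the trade-off between the fixed and adaptive approximations.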
We consider the problem of authenticated communication over a discrete arbitrarily varying channel where the legitimate parties are unaware of whether or not an adversary is present. When there is no adversary, the channel state always takes a default value $s_0$. When the adversary is present, they may choose the channel state sequence based on a non-causal noisy view of the transmitted codewords and the encoding and decoding scheme. We require that the decoder output the correct message with a high probability when there is no adversary, and either output the correct message or reject the transmission when the adversary is present. Further, we allow the transmitter to employ private randomness during encoding that is known neither to the receiver nor the adversary. Our first result proves a dichotomy property for the capacity for this problem -- the capacity either equals zero or it equals the non-adversarial capacity of the channel. Next, we give a sufficient condition for the capacity for this problem to be positive even when the non-adversarial channel to the receiver is stochastically degraded with respect to the channel to the adversary. Our proofs rely on a connection to a standalone authentication problem, where the goal is to accept or reject a candidate message that is already available to the decoder. Finally, we give examples and compare our sufficient condition with other related conditions known in the literature.
We demonstrate the relevance of an algorithm called generalized iterative scaling (GIS) or simultaneous multiplicative algebraic reconstruction technique (SMART) and its rescaled block-iterative version (RBI-SMART) in the field of optimal transport (OT). Many OT problems can be tackled through the use of entropic regularization by solving the Schr\"odinger problem, which is an information projection problem, that is, a projection with respect to the Kullback--Leibler divergence. Here we consider problems that have several affine constraints. It is well known that cyclic information projections onto the individual affine sets converge to the solution. In practice, however, even these individual projections are not explicitly available in general. In this paper, we exchange them for one GIS iteration. If this is done for every affine set, we obtain RBI-SMART. We provide a convergence proof using an interpretation of these iterations as two-step affine projections in an equivalent problem. This is done in a slightly more general setting than RBI-SMART, since we use a mix of explicitly known information projections and GIS iterations. We proceed to specialize this algorithm to several OT applications. First, we find the measure that minimizes the regularized OT divergence to a given measure under moment constraints. Second and third, the proposed framework yields an algorithm for solving a regularized martingale OT problem, as well as a relaxed version of the barycentric weak OT problem. Finally, we show an approach from the literature for unbalanced OT problems.
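For orientation, the classical case in which both information projections are available in closed form is the Sinkhorn iteration for entropic OT with two marginal constraints: each half-step is the Kullback--Leibler projection onto one affine set. The sketch below shows only this baseline, not GIS or RBI-SMART. The cost matrix, the regularization strength `reg`, and the function name `sinkhorn` are illustrative assumptions.

```python
import math


def sinkhorn(cost, mu, nu, reg=0.5, n_iter=500):
    """Cyclic KL projections onto the two marginal constraints (Sinkhorn)."""
    n, m = len(mu), len(nu)
    # Gibbs kernel of the entropic problem.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # KL projection onto {plans whose row sums equal mu}: rescale rows.
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # KL projection onto {plans whose column sums equal nu}: rescale columns.
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.3, 0.7])
    for row in plan:
        print(row)
```

When a constraint set has no closed-form projection, as with the moment or martingale constraints mentioned above, the explicit rescaling step is what the GIS iteration replaces.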
We study a class of nonconvex nonsmooth optimization problems in which the objective is a sum of two functions: one function is the average of a large number of differentiable functions, while the other function is proper, lower semicontinuous and has a surrogate function that satisfies standard assumptions. Such problems arise in machine learning and regularized empirical risk minimization applications. However, nonconvexity and the large-sum structure are challenging for the design of new algorithms. Consequently, effective algorithms for such scenarios are scarce. We introduce and study three stochastic variance-reduced majorization-minimization (MM) algorithms, combining the general MM principle with new variance-reduced techniques. We provide almost sure subsequential convergence of the generated sequence to a stationary point. We further show that our algorithms possess the best-known complexity bounds in terms of gradient evaluations. We demonstrate the effectiveness of our algorithms on sparse binary classification problems, sparse multi-class logistic regressions, and neural networks by employing several widely-used and publicly available data sets.
In causal inference, sensitivity analysis is important to assess the robustness of study conclusions to key assumptions. We perform sensitivity analysis of the assumption that missing outcomes are missing completely at random. We follow a Bayesian approach, which is nonparametric for the outcome distribution and can be combined with an informative prior on the sensitivity parameter. We give insight into the posterior and provide theoretical guarantees in the form of Bernstein-von Mises theorems for estimating the mean outcome. We study different parametrisations of the model involving Dirichlet process priors on the distribution of the outcome and on the distribution of the outcome conditional on the subject being treated. We show that these parametrisations incorporate a prior on the sensitivity parameter in different ways and discuss the relative merits. We also present a simulation study, showing the performance of the methods in finite sample scenarios.
This work studies the generalization error of gradient methods. More specifically, we focus on how training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems. We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD) under the general non-realizable smooth SCO setting, suggesting that existing stability analyses are tight in step-size and iteration dependence, and that overfitting provably happens. Next, we study the case when the loss is realizable, i.e. an optimal solution minimizes all the data points. Recent works show better rates can be attained but the improvement is reduced when training time is long. Our paper examines this observation by providing excess risk lower bounds for GD and SGD in two realizable settings: (1) $\eta T = \bigO{n}$, and (2) $\eta T = \bigOmega{n}$, where $n$ is the size of the dataset. In the first case $\eta T = \bigO{n}$, our lower bounds tightly match and certify the respective upper bounds. However, for the case $\eta T = \bigOmega{n}$, our analysis indicates a gap between the lower and upper bounds. A conjecture is proposed that the gap can be closed by improving upper bounds, supported by analyses in two special scenarios.
The Invertible Bloom Lookup Table (IBLT) is a probabilistic data structure for set representation, with applications in network and traffic monitoring. It is known for its ability to list its elements, an operation that succeeds with high probability for a sufficiently large table. However, listing can fail even for relatively small sets. This paper extends recent work on the worst-case analysis of IBLT, which guarantees successful listing for all sets of a certain size, by introducing more general IBLT schemes. These schemes allow for greater freedom in the implementation of the insert, delete, and listing operations and demonstrate that the IBLT memory can be reduced while still maintaining successful listing guarantees. The paper also explores the time-memory trade-off of these schemes, some of which are based on linear codes and \(B_h\)-sequences over finite fields.
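For readers unfamiliar with the data structure, a standard probabilistic IBLT can be sketched as follows. This is the conventional hash-based construction, not the worst-case schemes studied in the paper. The class name `IBLT`, the use of SHA-256 for both cell selection and checksums, the partitioned layout, and the restriction to non-negative integer keys are assumptions made for the example. Listing works by peeling: any cell holding exactly one key reveals that key, which is then removed from its other cells, possibly exposing further single-key cells.

```python
import hashlib


class IBLT:
    """Minimal invertible Bloom lookup table for non-negative integer keys."""

    def __init__(self, m, k=3):
        # m cells split into k equal partitions, so each key maps to k distinct cells.
        assert m % k == 0
        self.m, self.k = m, k
        self.count = [0] * m
        self.key_sum = [0] * m   # XOR of the keys stored in each cell
        self.chk_sum = [0] * m   # XOR of the key checksums stored in each cell

    @staticmethod
    def _h(key, salt):
        digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _cells(self, key):
        width = self.m // self.k
        return [i * width + self._h(key, i) % width for i in range(self.k)]

    def _update(self, key, delta):
        for c in self._cells(key):
            self.count[c] += delta
            self.key_sum[c] ^= key
            self.chk_sum[c] ^= self._h(key, "chk")

    def insert(self, key):
        self._update(key, +1)

    def delete(self, key):
        self._update(key, -1)

    def list_entries(self):
        """Peel single-key cells. Returns (sorted keys, success); empties the table."""
        found = []
        progress = True
        while progress:
            progress = False
            for c in range(self.m):
                pure = (self.count[c] == 1
                        and self.chk_sum[c] == self._h(self.key_sum[c], "chk"))
                if pure:
                    key = self.key_sum[c]
                    found.append(key)
                    self._update(key, -1)   # remove the key from all of its cells
                    progress = True
        return sorted(found), not any(self.count)
```

Listing fails when peeling stalls with non-empty cells left, which in this construction depends on how the hash functions happen to place the keys. The worst-case schemes mentioned above replace that dependence on hashing with placements that guarantee success for every set up to a given size.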