We study nonlinear optimization problems with stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming algorithm, using a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of augmented Lagrangian and performs stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the "liminf" of the KKT residuals converges to zero almost surely. Our algorithm and analysis further develop the prior work \cite{Na2021Adaptive} by allowing nonlinear inequality constraints. We demonstrate the performance of the algorithm on a subset of nonlinear problems collected in the CUTEst test set.
The unlabeled sensing problem is to solve a noisy linear system of equations under unknown permutation of the measurements. We study a particular case of the problem where the permutations are restricted to be r-local, i.e. the permutation matrix is block diagonal with r x r blocks. Assuming a Gaussian measurement matrix, we argue that the r-local permutation model is more challenging compared to a recent sparse permutation model. We propose a proximal alternating minimization algorithm for the general unlabeled sensing problem that provably converges to a first order stationary point. Applied to the r-local model, we show that the resulting algorithm is efficient. We validate the algorithm on synthetic and real datasets. We also formulate the 1-d unassigned distance geometry problem as an unlabeled sensing problem with a structured measurement matrix.
The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. However, several important questions regarding the convergence properties of SEG are still open, including the sampling of stochastic gradients, mini-batching, convergence guarantees for the monotone finite-sum variational inequalities with possibly non-monotone terms, and others. To address these questions, in this paper, we develop a novel theoretical framework that allows us to analyze several variants of SEG in a unified manner. Besides standard setups, like Same-Sample SEG under Lipschitzness and monotonicity or Independent-Samples SEG under uniformly bounded variance, our approach allows us to analyze variants of SEG that were never explicitly considered in the literature before. Notably, we analyze SEG with arbitrary sampling which includes importance sampling and various mini-batching strategies as special cases. Our rates for the new variants of SEG outperform the current state-of-the-art convergence guarantees and rely on less restrictive assumptions.
In the statistical literature, sparse modeling is the standard approach to achieve improvements in prediction tasks and interpretability. Alternatively, in the seminal paper "Statistical Modeling: The Two Cultures," Breiman (2001) advocated for the adoption of algorithmic approaches to generate ensembles to achieve superior prediction accuracy than single-model methods at the cost of loss of interpretability. In a recent important and critical paper, Rudin (2019) argued that blackbox algorithmic approaches should be avoided for high-stakes decisions and that the tradeoff between accuracy and interpretability is a myth. In response to this recent change in philosophy, we generalize best subset selection (BSS) to best split selection (BSpS), a data-driven approach aimed at finding the optimal split of predictor variables among the models of an ensemble. The proposed methodology results in an ensemble of sparse and diverse models that provide possible mechanisms that explain the relationship between the predictors and the response. The high computational cost of BSpS motivates the need for computational tractable ways to approximate the exhaustive search, and we benchmark one such recent proposal by Christidis et al. (2020) based on a multi-convex relaxation. Our objective with this article is to motivate research in this new exciting field with great potential for data analysis tasks for high-dimensional data.
The aim of this study is the weak convergence rate of a temporal and spatial discretization scheme for stochastic Cahn-Hilliard equation with additive noise, where the spectral Galerkin method is used in space and the backward Euler scheme is used in time. The presence of the unbounded operator in front of the nonlinear term and the lack of the associated Kolmogorov equations make the error analysis much more challenging and demanding. To overcome these difficulties, we further exploit a novel approach proposed in [7] and combine it with Malliavin calculus to obtain an improved weak rate of convergence, in comparison with the corresponding strong convergence rates. The techniques used here are quite general and hence have the potential to be applied to other non-Markovian equations. As a byproduct the rate of the strong error can also be easily obtained.
We study flow scheduling under node capacity constraints. We are given capacitated nodes and an online sequence of jobs, each with a release time and a demand to be routed between two nodes. A schedule specifies which jobs are routed in each step, guaranteeing that the total demand on a node in any step is at most its capacity. A key metric in this scenario is response time: the time between a job's release and its completion. Prior work shows no un-augmented algorithm is competitive for average response time, and that a constant factor competitive ratio is achievable with augmentation exceeding 2 (Dinitz-Moseley Infocom 2020). For maximum response time, the best known result is a 2-competitive algorithm with a augmentation 4 (Jahanjou et al SPAA 2020). We improve these bounds under various response time objectives. We show that, without resource augmentation, the best competitive ratio for maximum response time is $\Omega(n)$, where $n$ is the number of nodes. Our Proportional Allocation algorithm uses $(1+\varepsilon)$ resource augmentation to achieve a $(1/\varepsilon)$-competitive ratio in the setting with general demands and capacities, and splittable jobs. Our Batch Decomposition algorithm is $2$-competitive (resp., optimal) for maximum response time using resource augmentation 2 (resp., 4) in the setting with unit demands and capacities, and unsplittable jobs. We also derive bounds for the simultaneous approximation of average and maximum response time metrics.
We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide explicit first-order approximation for square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least squares regression. We consider robustness of call option pricing and deduce a new Black-Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.
This paper investigates the problem of online statistical inference of model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution for the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the function-value query complexity and the distribution of search directions. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choices of random search directions to minimize the asymptotic covariance matrix, and conclude that the optimal search direction depends on the optimality criteria with respect to different summary statistics of the Fisher information matrix. Based on the asymptotic distribution result, we conduct online statistical inference by providing two construction procedures of valid confidence intervals. We provide numerical experiments verifying our theoretical results with the practical effectiveness of the procedures.
Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in the Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve convergence rate and reduce sensitivity to hyper-parameter over existing methods. We show that the algorithm is competitive against state-ofthe-art first and second order methods. Our work opens up new avenues for principled algorithmic design built upon the optimal control theory.
Generative adversarial nets (GANs) have generated a lot of excitement. Despite their popularity, they exhibit a number of well-documented issues in practice, which apparently contradict theoretical guarantees. A number of enlightening papers have pointed out that these issues arise from unjustified assumptions that are commonly made, but the message seems to have been lost amid the optimism of recent years. We believe the identified problems deserve more attention, and highlight the implications on both the properties of GANs and the trajectory of research on probabilistic models. We recently proposed an alternative method that sidesteps these problems.
Owing to the recent advances in "Big Data" modeling and prediction tasks, variational Bayesian estimation has gained popularity due to their ability to provide exact solutions to approximate posteriors. One key technique for approximate inference is stochastic variational inference (SVI). SVI poses variational inference as a stochastic optimization problem and solves it iteratively using noisy gradient estimates. It aims to handle massive data for predictive and classification tasks by applying complex Bayesian models that have observed as well as latent variables. This paper aims to decentralize it allowing parallel computation, secure learning and robustness benefits. We use Alternating Direction Method of Multipliers in a top-down setting to develop a distributed SVI algorithm such that independent learners running inference algorithms only require sharing the estimated model parameters instead of their private datasets. Our work extends the distributed SVI-ADMM algorithm that we first propose, to an ADMM-based networked SVI algorithm in which not only are the learners working distributively but they share information according to rules of a graph by which they form a network. This kind of work lies under the umbrella of `deep learning over networks' and we verify our algorithm for a topic-modeling problem for corpus of Wikipedia articles. We illustrate the results on latent Dirichlet allocation (LDA) topic model in large document classification, compare performance with the centralized algorithm, and use numerical experiments to corroborate the analytical results.