We consider the fixed-budget best arm identification problem, where the goal is to find the arm with the largest mean using a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, the rate (exponent) of this probability has received only limited characterization. In this paper, we characterize the optimal rate as the result of a global optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the misidentification probability, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To deal with this issue, we introduce the second rate, $R^{\mathrm{go}}_\infty$, and show that it is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
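To make the fixed-budget setting concrete, the sketch below implements the classical successive-rejects baseline (Audibert et al.), not the $R^{\mathrm{go}}$-tracking or DOT algorithms proposed here; the Gaussian reward model, arm means, and budget are illustrative assumptions.

```python
import numpy as np

def successive_rejects(means, budget, seed=None):
    """Classical fixed-budget baseline: sample all active arms in phases
    and eliminate the empirically worst arm at the end of each phase."""
    rng = np.random.default_rng(seed)
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    counts = np.zeros(K)
    sums = np.zeros(K)
    n_prev = 0
    for phase in range(1, K):
        n_k = int(np.ceil((budget - K) / (log_bar * (K + 1 - phase))))
        for arm in active:
            for _ in range(n_k - n_prev):
                sums[arm] += rng.normal(means[arm], 1.0)  # Gaussian rewards (assumption)
                counts[arm] += 1
        n_prev = n_k
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0]  # recommended best arm

print(successive_rejects([0.1, 0.3, 0.5, 0.9], budget=2000, seed=0))
```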
In its simplest form, the chemostat consists of microorganisms or cells that grow continually in a specific phase of growth while competing for a single limiting nutrient. Under certain conditions on the cells' growth rate, substrate concentration, and dilution rate, the theory predicts, and numerical experiments confirm, that a periodically operated chemostat exhibits an "over-yielding" state in which the performance becomes higher than that at steady-state operation. In this paper we show that an optimal control policy for maximizing the chemostat performance can be derived numerically, accurately and efficiently, using a novel class of integral-pseudospectral methods and adaptive h-integral-pseudospectral methods composed through a predictor-corrector algorithm. Some new formulas for the construction of Fourier pseudospectral integration matrices and barycentric shifted Gegenbauer quadratures are derived. A rigorous study of the errors and convergence rates of shifted Gegenbauer quadratures, as well as of the truncated Fourier series, interpolation operators, and integration operators for nonsmooth and generally $T$-periodic functions, is presented. We also introduce a novel adaptive scheme for detecting jump discontinuities and reconstructing a discontinuous function from the pseudospectral data. An extensive set of numerical simulations is presented to support the derived theoretical foundations.
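As background for the Fourier pseudospectral machinery (the textbook construction only, not the paper's new integration matrices or shifted Gegenbauer quadratures), the following sketch computes a spectral antiderivative of a smooth $2\pi$-periodic function via the FFT.

```python
import numpy as np

def fourier_antiderivative(f_vals):
    """Spectral antiderivative of a zero-mean, smooth 2*pi-periodic
    function sampled at N equispaced points: divide each Fourier
    mode by i*k (the k = 0 mode is set to zero)."""
    N = len(f_vals)
    k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers
    F = np.fft.fft(f_vals)
    with np.errstate(divide="ignore", invalid="ignore"):
        G = np.where(k != 0, F / (1j * k), 0.0)
    return np.fft.ifft(G).real

x = np.linspace(0.0, 2 * np.pi, 128, endpoint=False)
approx = fourier_antiderivative(np.cos(x))      # antiderivative of cos is sin
print(np.max(np.abs(approx - np.sin(x))))       # ~ machine precision
```

For nonsmooth periodic functions of the kind analyzed in the paper, this naive construction suffers from the Gibbs phenomenon, which is precisely what motivates an adaptive jump-detection and reconstruction scheme.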
The Stochastic Primal-Dual Hybrid Gradient (SPDHG) algorithm, proposed by Chambolle et al., efficiently solves a wide class of nonsmooth large-scale optimization problems. In this paper we contribute to its theoretical foundations and prove its almost sure convergence for functionals that are convex but neither necessarily strongly convex nor smooth, defined on Hilbert spaces of arbitrary dimension. We also prove its convergence for arbitrary samplings, and for some specific samplings we propose theoretically optimal step-size parameters that yield faster convergence. In addition, we propose using SPDHG for parallel Magnetic Resonance Imaging reconstruction, where data from different coils are randomly selected at each iteration. We apply SPDHG using a wide range of random sampling methods and compare its performance across a range of settings, including mini-batch size, step-size parameters, and both convex and strongly convex objective functionals. We show that the sampling can significantly affect the convergence speed of SPDHG and conclude that in many cases an optimal sampling method can be identified.
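As a minimal illustration of the algorithm (a sketch on a toy LASSO-type problem rather than the MRI setting; the uniform sampling and step-size choices are assumptions consistent with the standard condition $\tau \sigma_i \|A_i\|^2 \le p_i$):

```python
import numpy as np

def spdhg_lasso(A_blocks, b_blocks, lam, iters=5000, seed=0):
    """SPDHG sketch for min_x sum_i 0.5*||A_i x - b_i||^2 + lam*||x||_1,
    sampling one dual block uniformly at random per iteration."""
    rng = np.random.default_rng(seed)
    n = len(A_blocks)
    d = A_blocks[0].shape[1]
    p = 1.0 / n                                 # uniform sampling probability
    norms = [np.linalg.norm(Ai, 2) for Ai in A_blocks]
    sigmas = [0.99 / L for L in norms]
    tau = 0.99 * min(p / (s * L**2) for s, L in zip(sigmas, norms))
    x = np.zeros(d)
    y = [np.zeros(Ai.shape[0]) for Ai in A_blocks]
    z = np.zeros(d)                             # z = sum_i A_i^T y_i
    w = z.copy()                                # extrapolated dual aggregate
    for _ in range(iters):
        # primal step: prox of tau * lam * ||.||_1 (soft-thresholding)
        v = x - tau * w
        x = np.sign(v) * np.maximum(np.abs(v) - tau * lam, 0.0)
        i = rng.integers(n)
        # dual step: prox of sigma_i * f_i^* for f_i(u) = 0.5*||u - b_i||^2
        y_new = (y[i] + sigmas[i] * (A_blocks[i] @ x - b_blocks[i])) / (1.0 + sigmas[i])
        delta = A_blocks[i].T @ (y_new - y[i])
        y[i] = y_new
        z = z + delta
        w = z + delta / p
    return x
```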
We establish optimal convergence rates, up to a log factor, for a class of deep neural networks in a classification setting under a restriction sometimes referred to as the Tsybakov noise condition. We construct classifiers in a general setting where the decision boundary of the Bayes rule can be approximated well by neural networks. Corresponding rates of convergence are proven with respect to the misclassification error. We then show that these rates are optimal in the minimax sense if the boundary satisfies a smoothness condition. Suboptimal convergence rates were already known for this setting; our main contribution lies in improving the existing rates and showing optimality, which was an open problem. Furthermore, we show almost optimal rates under some additional restrictions that circumvent the curse of dimensionality. Our analysis requires a condition that gives new insight into the restriction used: in a sense, it acts as a requirement for the "correct noise exponent" for a class of functions.
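For reference, the standard statement of the Tsybakov noise condition, in terms of the regression function $\eta(x) = P(Y = 1 \mid X = x)$, a constant $C > 0$, and a noise exponent $q \ge 0$ (the paper's exact variant may differ in detail):
\[
  P\bigl(\, |\eta(X) - \tfrac{1}{2}| \le t \,\bigr) \;\le\; C\, t^{q}
  \qquad \text{for all } t > 0 .
\]
Larger $q$ means the data concentrate less mass near the decision boundary, which is what allows faster rates of convergence.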
We consider the problem of estimating a dose-response curve, both globally and locally at a point. Continuous treatments often arise in practice, e.g., in the form of time spent on an operation, distance traveled to a location, or the dosage of a drug. Letting $A$ denote a continuous treatment variable, the target of inference is the expected outcome if everyone in the population takes treatment level $A=a$. Under standard assumptions, the dose-response function takes the form of a partial mean. Building on the recent literature on nonparametric regression with estimated outcomes, we study three different estimators. As a global method, we construct an empirical-risk-minimization-based estimator with an explicit characterization of second-order remainder terms. As a local method, we develop a two-stage, doubly robust (DR) learner. Finally, we construct an $m$th-order estimator based on the theory of higher-order influence functions. Under certain conditions, this higher-order estimator achieves the fastest rate of convergence that we are aware of for this problem. The other two approaches, however, are easier to implement using off-the-shelf software, since they are formulated as two-stage regression tasks. For each estimator, we provide an upper bound on the mean-squared error and investigate its finite-sample performance in a simulation. Finally, we describe a flexible, nonparametric method for performing sensitivity analysis with respect to the no-unmeasured-confounding assumption when the treatment is continuous.
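A compact sketch of a two-stage DR-learner for this problem, in the spirit of the pseudo-outcome construction of Kennedy et al. (2017); the Gaussian treatment-density model, the random-forest regressors, and the absence of cross-fitting are simplifying assumptions, and in practice the second stage would typically be a local (kernel) regression:

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def dr_learner_dose_response(X, A, Y, a_grid):
    """Two-stage DR-learner sketch for the dose-response curve E[Y(a)]."""
    n = len(Y)
    # Nuisance 1: outcome model mu(x, a).
    mu = RandomForestRegressor(random_state=0).fit(np.column_stack([X, A]), Y)
    # Nuisance 2: treatment model A | X ~ N(g(X), s^2) (Gaussian assumption).
    g = RandomForestRegressor(random_state=0).fit(X, A)
    gX = g.predict(X)
    s = np.std(A - gX)
    pi_hat = norm.pdf(A, loc=gX, scale=s)                        # pi(A_i | X_i)
    m_hat = np.array([norm.pdf(a, loc=gX, scale=s).mean() for a in A])  # marginal density at A_i
    mu_hat = mu.predict(np.column_stack([X, A]))
    mu_bar = np.array([mu.predict(np.column_stack([X, np.full(n, a)])).mean() for a in A])
    # Doubly robust pseudo-outcome, then a second-stage regression on A.
    xi = (Y - mu_hat) * m_hat / pi_hat + mu_bar
    second = RandomForestRegressor(random_state=0).fit(A.reshape(-1, 1), xi)
    return second.predict(np.array(a_grid).reshape(-1, 1))
```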
We study \textit{rescaled gradient dynamical systems} in a Hilbert space $\mathcal{H}$, where implicit discretization in a finite-dimensional Euclidean space leads to high-order methods for solving monotone equations (MEs). Our framework can be interpreted as a natural generalization of the celebrated dual extrapolation method~\citep{Nesterov-2007-Dual} from first order to high order via appeal to the regularization toolbox of optimization theory~\citep{Nesterov-2021-Implementable, Nesterov-2021-Inexact}. More specifically, we establish the existence and uniqueness of a global solution and analyze the convergence properties of solution trajectories. We also present discrete-time counterparts of our high-order continuous-time methods, and we show that the $p^{th}$-order method achieves an ergodic rate of $O(k^{-(p+1)/2})$ in terms of a restricted merit function and a pointwise rate of $O(k^{-p/2})$ in terms of a residue function. Under regularity conditions, the restarted version of the $p^{th}$-order method achieves local convergence of order $p$ for $p \geq 2$. Notably, our methods are \textit{optimal}: they match the lower bound established for solving monotone equation problems under a standard linear span assumption~\citep{Lin-2022-Perseus}.
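In the notation of the abstract, the underlying problem and the two rate statements read as follows, with a common form of the restricted merit function over a bounded set $\mathcal{B}$ (the paper's precise definitions may differ in detail):
\[
  \text{find } x \in \mathcal{H} \text{ with } F(x) = 0,
  \qquad \langle F(x) - F(y),\, x - y \rangle \ge 0 \;\; \forall\, x, y \in \mathcal{H},
\]
\[
  \sup_{x \in \mathcal{B}} \,\langle F(x),\, \bar{x}_k - x \rangle = O\!\bigl(k^{-(p+1)/2}\bigr),
  \qquad \|F(x_k)\| = O\!\bigl(k^{-p/2}\bigr),
\]
where $\bar{x}_k$ denotes the ergodic average of the iterates.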
We show that 11-channel sorting networks have at least 35 comparators and that 12-channel sorting networks have at least 39 comparators. This positively settles the optimality of the corresponding sorting networks given in The Art of Computer Programming vol. 3 and closes the two smallest open instances of the Bose-Nelson sorting problem. We obtain these bounds by generalizing a result of Van Voorhis from sorting networks to a larger class of comparator networks. From this we derive a dynamic programming algorithm that computes the optimal size of a sorting network for a given number of channels. From an execution of this algorithm we construct a certificate containing a derivation of the corresponding lower size bound, which we check using a program formally verified with the Isabelle/HOL proof assistant.
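The verification task underlying such certificates can be illustrated with the zero-one principle, which reduces checking that a comparator network sorts all $n!$ permutations to checking all $2^n$ binary inputs; a minimal sketch (the 4-channel network shown is the well-known optimal one, not one of the paper's instances):

```python
from itertools import product

def sorts_all_inputs(network, channels):
    """Zero-one principle: a comparator network sorts every input
    iff it sorts every 0/1 input (2^n checks instead of n!)."""
    for bits in product((0, 1), repeat=channels):
        v = list(bits)
        for i, j in network:            # comparator (i, j) with i < j
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
        if any(v[k] > v[k + 1] for k in range(channels - 1)):
            return False
    return True

# The optimal 5-comparator network on 4 channels.
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
print(sorts_all_inputs(net4, 4))        # True
```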
A two-sided matching system is considered, in which servers arrive at a fixed rate while the arrival rate of customers is modulated via a price-control mechanism. We analyse a loss model, wherein customers who are not served immediately upon arrival are blocked, as well as a queueing model, wherein customers wait in a queue until they receive service. The objective is to maximize the platform profit generated from matching servers and customers, subject to quality-of-service constraints such as the expected wait time of servers in the loss model and the stability of the customer queue in the queueing model. For the loss system, subject to a certain relaxation, we show that the optimal policy has a bang-bang structure. We also derive approximation guarantees for simple pricing policies. For the queueing system, we propose a simple bi-modal matching strategy and show that it achieves near-optimal profit.
LU and Cholesky matrix factorization algorithms are core subroutines used to solve the systems of linear equations (SLEs) encountered while solving an optimization problem. Standard factorization algorithms are highly efficient but remain susceptible to the accumulation of roundoff errors, which can lead solvers to return feasibility and optimality claims that are actually invalid. This paper introduces a novel approach for solving sequences of closely related SLEs encountered in nonlinear programming efficiently and without roundoff errors. Specifically, it introduces rank-one update algorithms for the roundoff-error-free (REF) factorization framework, a toolset built on integer-preserving arithmetic that has led to the development and implementation of fail-proof SLE solution subroutines for linear programming. We establish formal guarantees for the proposed algorithms and support their advantages with computational experiments, which demonstrate improvements of upwards of 75x over exact factorization run-times on fully dense matrices with over one million entries. A significant advantage of the methodology is that the length of any coefficient calculated via the proposed algorithms is bounded polynomially in the size of the inputs, without resorting to the greatest common divisor operations that are required by, and hinder the efficient implementation of, exact rational arithmetic approaches.
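The integer-preserving arithmetic at the heart of the REF framework can be illustrated with fraction-free (Bareiss-style) elimination, in which every intermediate division is exact over the integers; this is background for the approach, not the paper's rank-one update algorithms:

```python
def bareiss_determinant(M):
    """Fraction-free (Bareiss) elimination on an integer matrix: every
    division below is exact, so there is no roundoff and no need for
    rational arithmetic. Returns the exact integer determinant."""
    A = [row[:] for row in M]
    n = len(A)
    prev = 1
    for k in range(n - 1):
        if A[k][k] == 0:
            for r in range(k + 1, n):       # simple pivot search
                if A[r][k] != 0:
                    A[k], A[r] = A[r], A[k]
                    A[r] = [-v for v in A[r]]  # negate to preserve the determinant
                    break
            else:
                return 0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] = (A[k][k] * A[i][j] - A[i][k] * A[k][j]) // prev  # exact
        prev = A[k][k]
    return A[n - 1][n - 1]

print(bareiss_determinant([[2, 3, 1], [4, 1, -3], [-1, 2, 2]]))  # 10, exactly
```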
When learning disconnected distributions, generative adversarial networks (GANs) are known to face model misspecification. Indeed, a continuous mapping from a unimodal latent distribution to a disconnected one is impossible, so GANs necessarily generate samples outside the support of the target distribution. This raises a fundamental question: what latent space partition minimizes the measure of these off-support areas? Building on a recent result from geometric measure theory, we prove that an optimal GAN must structure its latent space as a 'simplicial cluster' - a Voronoi partition whose cells are convex cones - when the dimension of the latent space is larger than the number of modes. In this configuration, each Voronoi cell maps to a distinct mode of the data. We derive both an upper and a lower bound on the optimal precision of GANs learning disconnected manifolds. Interestingly, these two bounds have the same order of decrease: $\sqrt{\log m}$, where $m$ is the number of modes. Finally, we perform several experiments to exhibit the geometry of the latent space and show experimentally that GANs learn a latent geometry with properties similar to the theoretical one.
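A toy illustration of such a conic Voronoi partition, with each cell routed to a distinct mode; the orthonormal directions, mode locations, and Gaussian latent distribution are illustrative assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 3                                       # latent dimension > number of modes
U = np.linalg.qr(rng.normal(size=(d, m)))[0].T    # m orthonormal directions (assumption)
modes = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

z = rng.normal(size=(10_000, d))                  # unimodal Gaussian latent samples
cell = np.argmax(z @ U.T, axis=1)                 # conic Voronoi cell: cells are convex cones
x = modes[cell] + 0.1 * rng.normal(size=(len(z), 2))  # each cell feeds one data mode

print(np.bincount(cell) / len(z))                 # latent mass assigned to each cone
```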
Since deep neural networks were developed, they have made substantial contributions to everyday life. In many aspects of daily life, machine learning now provides advice that would be difficult for humans to produce unaided. Despite this achievement, however, the design and training of neural networks remain challenging and unpredictable procedures. To lower the technical barrier for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academia and industry. This paper provides a review of the most essential topics in HPO. The first section introduces the key hyper-parameters related to model training and structure and discusses their importance and methods for defining their value ranges. Then, the review focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy, especially for deep learning networks. The study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art search algorithms, their compatibility with major deep learning frameworks, and their extensibility for new modules designed by users. The paper concludes with the problems that arise when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation under limited computational resources.
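As a concrete example of the simplest search algorithm covered by such reviews, here is a minimal random-search sketch; the search space and the objective (a stand-in for an actual train-and-validate loop) are illustrative:

```python
import random

# Minimal random-search HPO sketch; each entry of the search space
# is a sampler for one hyper-parameter.
space = {
    "lr":         lambda: 10 ** random.uniform(-5, -1),   # log-uniform learning rate
    "batch_size": lambda: random.choice([32, 64, 128, 256]),
    "dropout":    lambda: random.uniform(0.0, 0.5),
}

def validation_loss(cfg):
    # Placeholder for training a model with cfg and returning
    # its validation loss (illustrative assumption).
    return (cfg["lr"] - 1e-3) ** 2 + cfg["dropout"] * 0.1

random.seed(0)
best = min(
    ({name: draw() for name, draw in space.items()} for _ in range(100)),
    key=validation_loss,
)
print(best)
```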