We consider box-constrained integer programs with objective $g(Wx) + c^T x$, where $g$ is a "complicated" function with an $m$-dimensional domain. Here we assume that we have $n \gg m$ variables and that $W \in \mathbb Z^{m \times n}$ is an integer matrix with coefficients of absolute value at most $\Delta$. We design an algorithm for this problem using only the mild assumption that the objective can be optimized efficiently when all but $m$ variables are fixed, yielding a running time of $n^m(m \Delta)^{O(m^2)}$. Moreover, we can avoid the term $n^m$ in several special cases, in particular when $c = 0$. Our approach can be applied in a variety of settings, generalizing several recent results. An important application is convex objectives of low domain dimension, where we recover a recent result by Hunkenschr\"oder et al. [SIOPT'22] for the 0-1-hypercube and sharp or separable convex $g$, assuming $W$ is given explicitly. By avoiding the direct use of proximity results, which only hold when $g$ is separable or sharp, we match their running time and generalize it to arbitrary convex functions. In the case where the objective is only accessible by an oracle and $W$ is unknown, we further show that their proximity framework can be implemented in time $n (m \Delta)^{O(m^2)}$ instead of $n (m \Delta)^{O(m^3)}$. Lastly, we extend the result by Eisenbrand and Weismantel [SODA'17, TALG'20] for integer programs with few constraints to a mixed-integer linear programming setting where integer variables appear in only a small number of different constraints.
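Concretely, the problem has the form (the bound vectors $\ell, u$ below are our notation for the box constraints):
\[
  \min \; g(Wx) + c^\top x
  \quad \text{s.t.} \quad \ell \le x \le u,\ x \in \mathbb Z^n,
\]
where $g \colon \mathbb Z^m \to \mathbb R$ is the "complicated" part of the objective, $W \in \mathbb Z^{m \times n}$ with all entries of absolute value at most $\Delta$, and $c$ is a linear cost vector.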
We consider polyregular functions, which are certain string-to-string functions that have polynomial output size. We prove that a polyregular function has output size $\mathcal O(n^k)$ if and only if it can be defined by an MSO interpretation of dimension $k$, i.e. a string-to-string transformation where every output position is interpreted, using monadic second-order logic MSO, in some $k$-tuple of input positions. We also show that this characterization does not extend to pebble transducers, another model for describing polyregular functions: we show that for every $k \in \{1,2,\ldots\}$ there is a polyregular function of quadratic output size which needs at least $k$ pebbles to be computed.
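A canonical example from the polyregular literature (not specific to this paper) is the squaring function $w \mapsto \underbrace{w\,w\cdots w}_{|w|\text{ times}}$, whose output has length $n^2$ on inputs of length $n$; it is definable by an MSO interpretation of dimension $2$, with each output position interpreted in a pair $(i,j)$ of input positions.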
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
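A rough sketch of the attention-gated greedy selection loop, written in PyTorch under our own assumptions: the model is a toy MLP, the names (`AttentionGate`, `sequential_attention`) are hypothetical, and, unlike the paper's one-pass algorithm, each round here retrains from scratch.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Scales unselected candidate features by softmax attention; selected features pass through."""
    def __init__(self, d):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(d))

    def forward(self, x, selected_mask):
        probs = torch.softmax(self.logits.masked_fill(selected_mask, float("-inf")), dim=0)
        gate = torch.where(selected_mask, torch.ones_like(probs), probs)
        return x * gate

def sequential_attention(X, y, k, epochs=50, lr=1e-2):
    """Greedy forward selection of k features; X: (n, d) float tensor, y: (n,) float tensor."""
    n, d = X.shape
    selected = torch.zeros(d, dtype=torch.bool)
    for _ in range(k):
        gate = AttentionGate(d)
        model = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
        opt = torch.optim.Adam(list(gate.parameters()) + list(model.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            pred = model(gate(X, selected)).squeeze(-1)
            nn.functional.mse_loss(pred, y).backward()
            opt.step()
        # add the unselected feature with the largest attention logit
        scores = gate.logits.detach().masked_fill(selected, float("-inf"))
        selected[scores.argmax()] = True
    return selected.nonzero(as_tuple=True)[0]
```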
Quantitative notions of bisimulation are well-known tools for the minimization of dynamical models such as Markov chains and ordinary differential equations (ODEs). In \emph{forward bisimulations}, each state in the quotient model represents an equivalence class, and its dynamical evolution gives the overall sum of its members in the original model. Here we introduce generalized forward bisimulation (GFB) for dynamical systems over commutative monoids and develop a partition refinement algorithm to compute the coarsest one. When the monoid is $(\mathbb{R}, +)$, we recover probabilistic bisimulation for Markov chains and more recent forward bisimulations for systems of nonlinear ODEs. When the monoid is $(\mathbb{R}, \cdot)$, we obtain nonlinear reductions for discrete-time dynamical systems and ODEs where each variable in the quotient model represents the product of the original variables in its equivalence class. When the domain is a finite set such as the Booleans $\mathbb{B}$, we can apply GFB to Boolean networks (BNs), a widely used dynamical model in computational biology. Using a prototype implementation of our minimization algorithm for GFB, we find disjunction- and conjunction-preserving reductions on 60 BNs from two well-known repositories and demonstrate the resulting analysis speed-ups. We also provide the biological interpretation of the reductions obtained for two selected BNs, and we show how GFB enables the analysis of a large BN that could not be analyzed otherwise. Using a randomized version of our algorithm, we find product-preserving (and therefore nonlinear) reductions on 21 dynamical weighted networks from the literature that could not be handled by the exact algorithm.
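As a toy illustration of a product-preserving (hence nonlinear) reduction over $(\mathbb{R}, \cdot)$ (our own example, not taken from the paper): the ODE system $\dot x_1 = a\,x_1$, $\dot x_2 = b\,x_2$ admits the quotient variable $y = x_1 x_2$, since
\[
  \dot y \;=\; \dot x_1 x_2 + x_1 \dot x_2 \;=\; (a+b)\,y,
\]
so the two variables can be collapsed into a single quotient variable whose value is their product and whose dynamics are self-contained.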
Obtaining guarantees on the convergence of the minimizers of empirical risks to those of the true risk is a fundamental matter in statistical learning. Instead of deriving guarantees on the usual estimation error, the goal of this paper is to provide concentration inequalities on the distance between the sets of minimizers of the risks for a broad spectrum of estimation problems. In particular, the risks are defined on metric spaces through probability measures that are also supported on metric spaces. Particular attention is therefore given to including unbounded spaces and non-convex cost functions that may themselves be unbounded. This work identifies a set of assumptions describing a regime that seems to govern the concentration in many estimation problems, in which the empirical minimizers are stable. This stability can then be leveraged to prove parametric concentration rates in probability and in expectation. The assumptions are verified, and the bounds showcased, on a selection of estimation problems such as barycenters on metric spaces of positive or negative curvature, subspaces of covariance matrices, regression problems, and entropic-Wasserstein barycenters.
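As a schematic illustration of what a parametric rate means here (our reading of the abstract; the paper's exact set distance, constants, and conditions may differ), a bound in expectation would take the form
\[
  \mathbb{E}\big[d_H\big(\arg\min R_n,\ \arg\min R\big)\big] \;\le\; \frac{C}{\sqrt{n}},
\]
where $R_n$ and $R$ denote the empirical and true risks, $d_H$ is the Hausdorff distance between the two sets of minimizers, and $C$ depends on the problem but not on the sample size $n$.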
The idea of embedding optimization problems into deep neural networks as optimization layers to encode constraints and inductive priors has taken hold in recent years. Most existing methods focus on implicitly differentiating Karush-Kuhn-Tucker (KKT) conditions in a way that requires expensive computations on the Jacobian matrix, which can be slow and memory-intensive. In this paper, we develop a new framework, named Alternating Differentiation (Alt-Diff), that differentiates optimization problems (here, specifically convex optimization problems with polyhedral constraints) in a fast and recursive way. Alt-Diff decouples the differentiation procedure into a primal update and a dual update in an alternating way. Accordingly, Alt-Diff substantially decreases the dimension of the Jacobian matrix, especially for optimization problems with large-scale constraints, and thus increases the computational speed of implicit differentiation. We show that the gradients obtained by Alt-Diff are consistent with those obtained by differentiating the KKT conditions. In addition, we propose to truncate Alt-Diff to further accelerate computation. Under some standard assumptions, we show that the truncation error of the gradients is bounded above by a term of the same order as the estimation error of the variables. Therefore, Alt-Diff can be truncated to further increase computational speed without sacrificing much accuracy. A series of comprehensive experiments validates the superiority of Alt-Diff.
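A minimal sketch of the underlying idea in PyTorch, under our own assumptions: this is a generic truncated unrolling of ADMM-style alternating primal/dual updates for a QP with polyhedral constraints, written for illustration only; it is not the authors' Alt-Diff implementation, and names such as `unrolled_admm` are hypothetical.

```python
import torch

def unrolled_admm(P, q, A, b, iters=30, rho=1.0):
    """Solve min 0.5 x^T P x + q^T x  s.t.  A x <= b by ADMM, unrolled for `iters`
    steps so autograd can backpropagate through the (truncated) iteration."""
    n, m = P.shape[0], A.shape[0]
    x = torch.zeros(n, dtype=P.dtype)
    z = torch.zeros(m, dtype=P.dtype)   # slack for the constraint A x <= b
    u = torch.zeros(m, dtype=P.dtype)   # scaled dual variable
    K = P + rho * A.T @ A               # fixed matrix of the primal subproblem
    for _ in range(iters):
        # primal update: minimize the augmented Lagrangian in x
        x = torch.linalg.solve(K, -(q + rho * A.T @ (u - z)))
        # dual-side updates: project onto {z <= b}, then ascend on the multiplier
        z = torch.minimum(A @ x + u, b)
        u = u + A @ x - z
    return x

# Example: gradient of the optimal value w.r.t. q via backprop through the unrolled loop.
P = torch.eye(3)
q = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
A = torch.tensor([[1.0, 1.0, 1.0]])
b = torch.tensor([1.0])
x_star = unrolled_admm(P, q, A, b)
(0.5 * x_star @ P @ x_star + q @ x_star).backward()
print(q.grad)
```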
Practitioners often use data from a randomized controlled trial to learn a treatment assignment policy that can be deployed on a target population. A recurring concern in doing so is that, even if the randomized trial was well executed (i.e., internal validity holds), the study participants may not represent a random sample of the target population (i.e., external validity fails) -- and this may lead to policies that perform suboptimally on the target population. We consider a model where observable attributes can impact sample selection probabilities arbitrarily but the effect of unobservable attributes is bounded by a constant, and we aim to learn policies with the best possible performance guarantees that hold under any sampling bias of this type. In particular, we derive a partial identification result for the worst-case welfare in the presence of sampling bias and show that the optimal max-min, max-min gain, and minimax regret policies depend on both the conditional average treatment effect (CATE) and the conditional value-at-risk (CVaR) of potential outcomes given covariates. To avoid the finite-sample inefficiencies of plug-in estimates, we further provide an end-to-end procedure for learning the optimal max-min and max-min gain policies that does not require the separate estimation of nuisance parameters.
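For reference, a standard definition of the CVaR quantity involved (the abstract does not fix the paper's exact convention, so the lower-tail orientation here is an assumption): for level $\alpha \in (0,1)$, the conditional value-at-risk of a potential outcome $Y$ given covariates $X = x$ is
\[
  \mathrm{CVaR}_\alpha(Y \mid X = x) \;=\; \frac{1}{\alpha} \int_0^\alpha F_{Y \mid X = x}^{-1}(u)\, du,
\]
i.e.\ the average of the outcome over its worst (lowest) $\alpha$-fraction, which is the tail quantity a max-min policy must guard against.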
We propose a novel technique for analyzing adaptive sampling called the {\em Simulator}. Our approach differs from existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective allows us to match the strength of both Fano and change-of-measure techniques without succumbing to the limitations of either method. For concreteness, we apply our techniques to a structured multi-armed bandit problem in the fixed-confidence pure exploration setting, where we show that the constraints on the means imply a substantial gap between the moderate-confidence sample complexity and the asymptotic sample complexity (as $\delta \to 0$) found in the literature. We also prove the first instance-based lower bounds for the top-k problem that incorporate the appropriate log factors. Moreover, our lower bounds zero in on the number of times each \emph{individual} arm needs to be pulled, uncovering new phenomena which are drowned out in the aggregate sample complexity. Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first {\em practical} algorithm of its kind for the latter problem, which removes extraneous log factors and outperforms the state of the art in experiments.
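For orientation in the fixed-confidence pure exploration setting, here is a generic textbook successive-elimination baseline (a hedged sketch only; it is not the near-optimal algorithm proposed in the paper, and `pull` is a hypothetical sampling callback).

```python
import numpy as np

def successive_elimination(pull, n_arms, delta=0.05, max_rounds=10_000):
    """Fixed-confidence best-arm identification: sample all surviving arms each round
    and eliminate any arm whose confidence interval falls below the empirical best."""
    active = list(range(n_arms))
    sums = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for t in range(1, max_rounds + 1):
        for i in active:
            sums[i] += pull(i)          # reward assumed to lie in [0, 1]
            counts[i] += 1
        means = sums[active] / counts[active]
        # anytime confidence radius from a union bound over arms and rounds
        radius = np.sqrt(np.log(4 * n_arms * t * t / delta) / (2 * t))
        best = means.max()
        active = [a for a, m in zip(active, means) if m + radius >= best - radius]
        if len(active) == 1:
            return active[0]
    return active[int(np.argmax(sums[active] / counts[active]))]
```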
This paper attempts to characterize the kinds of physical scenarios in which an online learning-based cognitive radar is expected to reliably outperform a fixed rule-based waveform selection strategy, as well as the converse. We seek general insights through an examination of two decision-making scenarios, namely dynamic spectrum access and multiple-target tracking. The radar scene is characterized by inducing a state-space model and examining the structure of its underlying Markov state transition matrix in terms of entropy rate and diagonality. It is found that entropy rate is a strong predictor of the value of online learning-based waveform selection, while diagonality is a better predictor of the value of fixed rule-based waveform selection. We show that these measures can be used to predict first- and second-order stochastic dominance relationships, which can allow system designers to rely on simple decision rules instead of more cumbersome learning approaches under certain conditions. We validate our findings through numerical results for each application and provide guidelines for future implementations.
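For concreteness, the two structural measures can be computed from a transition matrix $P$ as follows (a small NumPy sketch; the entropy-rate formula is standard, whereas the trace-based diagonality proxy is our assumption rather than the paper's definition).

```python
import numpy as np

def stationary_distribution(P):
    # Left eigenvector of P for eigenvalue 1, normalized to a probability vector.
    w, v = np.linalg.eig(P.T)
    pi = np.abs(np.real(v[:, np.argmin(np.abs(w - 1.0))]))
    return pi / pi.sum()

def entropy_rate(P):
    # H(P) = -sum_i pi_i sum_j P_ij log2 P_ij  (bits per transition).
    pi = stationary_distribution(P)
    logP = np.zeros_like(P)
    mask = P > 0
    logP[mask] = np.log2(P[mask])
    return float(-np.sum(pi[:, None] * P * logP))

def diagonality(P):
    # Simple proxy: average self-transition probability (higher = more "diagonal").
    return float(np.trace(P) / P.shape[0])

P = np.array([[0.9, 0.1], [0.2, 0.8]])
print(entropy_rate(P), diagonality(P))
```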
Multiplication is indispensable and is one of the core operations in many modern applications, including signal processing and neural networks. Conventional right-to-left (RL) multipliers contribute extensively to the power consumption, area utilization, and critical path delay in such applications. This paper proposes a low-latency multiplier based on online, or left-to-right (LR), arithmetic, which can increase throughput and reduce latency through digit-level pipelining. Online arithmetic enables overlapping successive operations regardless of data dependencies because of its most-significant-digit-first mode of operation. To produce the most significant digit first, it uses a redundant number system, which allows carry-free addition; as a result, the delay of each arithmetic operation is independent of the operand bit width. The operations are performed digit by digit, serially from left to right, which allows a gradual increase in slice activity and makes the design suitable for implementation on reconfigurable devices. The serial nature of the online algorithm and the gradual increment/decrement of active slices minimize the interconnects and signal activity, resulting in an overall reduction of area and power consumption. We present online multipliers with both inputs serial, and with one input serial and one parallel. Pipelined and non-pipelined designs of the proposed multipliers have been synthesized with the GSCL 45nm technology using Synopsys Design Compiler. A thorough comparative analysis has been performed using widely used performance metrics. The results show that the proposed online multipliers outperform the RL multipliers.
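A standard illustration of why redundancy enables most-significant-digit-first operation (a generic textbook example, not taken from the paper): with radix-2 signed digits $\{-1, 0, 1\}$ (writing $\bar 1$ for $-1$), the value $1/4$ admits several encodings, e.g.\ $(0.01)_2 = (0.1\bar 1)_2$ since $\tfrac12 - \tfrac14 = \tfrac14$. Because later digits can correct an early overestimate, an online algorithm can commit to the leading digits of the result after only a small, fixed number of input digits (the online delay), which is what makes digit-level pipelining of successive dependent operations possible.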
Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of the network. Even though downsampling reduces computation and the required communication bandwidth, it removes both redundant and salient information indiscriminately, which degrades accuracy. Inspired by digital signal processing theory, we analyze the spectral bias from a frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components that can be removed without accuracy loss. The proposed method of learning in the frequency domain reuses the structures of well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting frequency-domain information as the input. Experimental results show that learning in the frequency domain with static channel selection achieves higher accuracy than the conventional spatial downsampling approach while further reducing the input data size. Specifically, for ImageNet classification with the same input size, the proposed method achieves 1.41% and 0.66% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half the input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.
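A hedged sketch of the kind of frequency-domain preprocessing this suggests, written with our own assumptions: the block size, the number of retained coefficients, and the hand-picked low-frequency selection below are placeholders, whereas the paper learns which frequency channels to keep.

```python
import numpy as np
from scipy.fft import dctn

def to_frequency_channels(img, block=8):
    """Block-DCT an HxWxC image into (H/block, W/block, C*block*block) frequency channels."""
    H, W, C = img.shape
    assert H % block == 0 and W % block == 0
    out = np.empty((H // block, W // block, C * block * block), dtype=np.float32)
    for c in range(C):
        for by in range(H // block):
            for bx in range(W // block):
                patch = img[by*block:(by+1)*block, bx*block:(bx+1)*block, c]
                out[by, bx, c*block*block:(c+1)*block*block] = dctn(patch, norm="ortho").ravel()
    return out

def select_low_frequencies(freq, keep=24, block=8, channels=3):
    """Keep the `keep` lowest-frequency DCT coefficients (ordered by u+v) per color channel --
    a static, hand-picked stand-in for the learned channel selection described in the paper."""
    u, v = np.meshgrid(np.arange(block), np.arange(block), indexing="ij")
    order = np.argsort((u + v).ravel(), kind="stable")[:keep]
    idx = np.concatenate([c * block * block + order for c in range(channels)])
    return freq[..., idx]

img = np.random.rand(224, 224, 3).astype(np.float32)
channels = select_low_frequencies(to_frequency_channels(img))
print(channels.shape)  # (28, 28, 72): a smaller frequency-domain input for the network
```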