Given an array A[1: n] of n elements drawn from an ordered set, the sorted range selection problem is to build a data structure that can answer the following type of query efficiently: given a pair of indices i, j $(1\le i\le j \le n)$ and a positive integer k, report the k smallest elements of the sub-array A[i: j] in sorted order. Brodal et al. (Brodal, G.S., Fagerberg, R., Greve, M., and López-Ortiz, A., Online sorted range reporting. Algorithms and Computation (2009), pp. 173--182) introduced the problem and gave an optimal solution: after O(n log n) preprocessing time, queries are answered in O(k) time using O(n) space. In this paper, we propose the only other possible optimal trade-off for the problem. We present a linear-space solution that answers a range selection query in O(k log k) time after O(n) preprocessing time. Moreover, the proposed algorithm reports the output elements one by one in non-decreasing order. Our solution is simple and practical. We also describe an extremely simple method for range minimum queries, most of whose components are known, which takes almost (but not exactly) linear time. We believe that this method may, in practice, be faster and easier to implement in most cases.
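To make the query mechanism concrete, here is a minimal Python sketch of one standard way to obtain O(k log k)-time sorted reporting: a range-minimum structure over A plus a binary heap of candidate subranges. The sparse-table RMQ below uses O(n log n) preprocessing for simplicity (a linear-time RMQ structure would match the O(n) bound stated above); it illustrates the flavour of the approach and is not necessarily the construction proposed in the paper.

```python
import heapq

def build_sparse_table(A):
    """Sparse table for range-minimum queries (RMQ): table[p][i] holds the index
    of the minimum of A[i : i + 2**p].  O(n log n) preprocessing; a linear-time
    RMQ structure would be used to match the O(n) bound."""
    n = len(A)
    table = [list(range(n))]
    p = 1
    while (1 << p) <= n:
        prev, half, cur = table[p - 1], 1 << (p - 1), []
        for i in range(n - (1 << p) + 1):
            l, r = prev[i], prev[i + half]
            cur.append(l if A[l] <= A[r] else r)
        table.append(cur)
        p += 1
    return table

def rmq(A, table, i, j):
    """Index of the minimum of A[i..j], inclusive."""
    p = (j - i + 1).bit_length() - 1
    l, r = table[p][i], table[p][j - (1 << p) + 1]
    return l if A[l] <= A[r] else r

def sorted_range_select(A, table, i, j, k):
    """Report the k smallest elements of A[i..j] in non-decreasing order.
    Each heap pop emits one output element and pushes at most two new
    candidate subranges, so a query costs O(k log k)."""
    out, heap = [], []
    m = rmq(A, table, i, j)
    heapq.heappush(heap, (A[m], m, i, j))
    while heap and len(out) < k:
        val, m, lo, hi = heapq.heappop(heap)
        out.append(val)
        if lo <= m - 1:
            l = rmq(A, table, lo, m - 1)
            heapq.heappush(heap, (A[l], l, lo, m - 1))
        if m + 1 <= hi:
            r = rmq(A, table, m + 1, hi)
            heapq.heappush(heap, (A[r], r, m + 1, hi))
    return out

# Example: the 3 smallest elements of A[2..7], reported in sorted order.
A = [5, 1, 9, 4, 7, 2, 8, 3]
table = build_sparse_table(A)
print(sorted_range_select(A, table, 2, 7, 3))  # [2, 3, 4]
```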
A structured variable selection problem is considered in which the covariates, divided into predefined groups, activate according to sparse patterns with few nonzero entries per group. Capitalizing on the concept of the atomic norm, a composite norm can be designed to promote such exclusive group sparsity patterns. The resulting norm lends itself to efficient and flexible regularized optimization algorithms for support recovery, such as the proximal algorithm. Moreover, an active set algorithm is proposed that builds the solution by successively including structure atoms into the estimated support. It is also shown that such an algorithm can be tailored to match more rigid structures than plain exclusive group sparsity. An asymptotic consistency analysis (with both the number of parameters and the number of groups growing with the observation size) establishes the effectiveness of the proposed solution in terms of signed support recovery under conventional assumptions. Finally, a set of numerical simulations further corroborates the results.
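As a rough illustration of the algorithmic template (not the specific atomic norm or active set method of the paper), the sketch below runs proximal gradient descent on a least-squares loss with the exclusive-lasso penalty $(\lambda/2)\sum_g \|x_g\|_1^2$, a common surrogate for exclusive group sparsity; the group-wise proximal operator is computed by a sorting argument analogous to projection onto the simplex. All data and parameter choices are hypothetical.

```python
import numpy as np

def prox_exclusive_group(v, lam):
    """Proximal operator of z -> (lam/2) * (sum_i |z_i|)^2 for one group,
    computed by sorting |v| and locating the active set, analogously to
    projection onto the simplex."""
    u = np.sort(np.abs(v))[::-1]
    if u[0] == 0.0:
        return np.zeros_like(v)
    S = np.cumsum(u) / (1.0 + lam * np.arange(1, len(u) + 1))  # candidate ||z||_1
    k = np.max(np.where(u > lam * S)[0]) + 1                   # active-set size
    thresh = lam * S[k - 1]
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def proximal_gradient(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for
        min_beta  0.5 * ||y - X beta||^2 + (lam/2) * sum_g ||beta_g||_1^2,
    where `groups` is a partition of the coefficient indices."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))
        for g in groups:
            beta[g] = prox_exclusive_group(z[g], step * lam)
    return beta

# Toy usage: two groups of three coefficients, one active entry per group.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
beta_true = np.array([1.5, 0.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(np.round(proximal_gradient(X, y, [np.arange(3), np.arange(3, 6)], lam=0.5), 2))
```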
We develop a fitted value iteration (FVI) method to compute bicausal optimal transport (OT), where couplings have an adapted structure. Based on the dynamic programming formulation, FVI adopts a function class to approximate the value functions in bicausal OT. Under the concentrability condition and approximate completeness assumption, we prove sample complexity bounds using (local) Rademacher complexity. Furthermore, we demonstrate that multilayer neural networks with appropriate structures satisfy the crucial assumptions required in the sample complexity proofs. Numerical experiments reveal that FVI outperforms linear programming and adapted Sinkhorn methods in scalability as the time horizon increases, while still maintaining acceptable accuracy.
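The sketch below shows the dynamic programming backbone for bicausal OT between two finite-state Markov chains, with the inner one-step optimal transport solved by linear programming; fitted value iteration would replace the tabular value array with a regression fit (e.g., a neural network) trained on sampled states. The Markov structure, the cost convention (stage cost charged at the next state), and the toy numbers are assumptions for illustration, not the paper's setting.

```python
import numpy as np
from scipy.optimize import linprog

def ot_value(p, q, cost):
    """Discrete optimal transport between marginals p and q via linear
    programming; returns the minimal expected cost."""
    nx, ny = len(p), len(q)
    A_eq = np.zeros((nx + ny, nx * ny))
    for i in range(nx):
        A_eq[i, i * ny:(i + 1) * ny] = 1.0          # row marginals
    for j in range(ny):
        A_eq[nx + j, j::ny] = 1.0                   # column marginals
    res = linprog(cost.reshape(-1), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

def bicausal_value(P, Q, cost, T):
    """Backward induction for bicausal OT between two finite Markov chains
    with transition matrices P, Q and stage cost cost[x, y]:
        V_t(x, y) = min over couplings of P[x] and Q[y] of E[cost + V_{t+1}].
    Fitted value iteration replaces the tabular array V below with a fitted
    function approximator trained on sampled states."""
    V = np.zeros_like(cost)
    for _ in range(T):
        V_new = np.empty_like(V)
        for x in range(P.shape[0]):
            for y in range(Q.shape[0]):
                V_new[x, y] = ot_value(P[x], Q[y], cost + V)
        V = V_new
    return V

# Toy example: two 2-state chains, quadratic cost, horizon 3 (hypothetical numbers).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.7, 0.3], [0.4, 0.6]])
states = np.array([0.0, 1.0])
cost = (states[:, None] - states[None, :]) ** 2
print(bicausal_value(P, Q, cost, T=3))
```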
To enable closed-form conditioning, a common assumption in Gaussian process (GP) regression is independent and identically distributed Gaussian observation noise. This strong and simplistic assumption is often violated in practice, which leads to unreliable inferences and uncertainty quantification. Unfortunately, existing methods for robustifying GPs break closed-form conditioning, which makes them less attractive to practitioners and significantly more computationally expensive. In this paper, we demonstrate how to perform provably robust and conjugate Gaussian process (RCGP) regression at virtually no additional cost using generalised Bayesian inference. RCGP is particularly versatile as it enables exact conjugate closed-form updates in all settings where standard GPs admit them. To demonstrate its strong empirical performance, we deploy RCGP for problems ranging from Bayesian optimisation to sparse variational Gaussian processes.
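For reference, the sketch below is the standard conjugate GP regression update whose closed form RCGP is designed to preserve; the robust generalised-Bayes modification described in the abstract is not reproduced here. The kernel choice, noise level, and toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, noise_var=0.1):
    """Standard conjugate GP update under i.i.d. Gaussian noise:
        mean = K_*x (K_xx + s^2 I)^-1 y
        cov  = K_** - K_*x (K_xx + s^2 I)^-1 K_x*
    RCGP keeps this closed form but, per the abstract, replaces the Gaussian
    likelihood with a robust generalised-Bayes loss; that change is not shown."""
    Kxx = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    Ksx = rbf_kernel(Xstar, X)
    Kss = rbf_kernel(Xstar, Xstar)
    L = np.linalg.cholesky(Kxx)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ksx @ alpha
    v = np.linalg.solve(L, Ksx.T)
    cov = Kss - v.T @ v
    return mean, cov

# Toy data with one gross outlier to motivate robustness (hypothetical values).
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * np.random.default_rng(0).standard_normal(20)
y[10] += 5.0   # an outlier that the standard GP above absorbs into its fit
mean, cov = gp_posterior(X, y, X)
```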
We construct an algorithm for the accurate solution of mixed integer convex quadratic programming, which is the problem of minimizing a convex quadratic function over the mixed integer points in a polyhedron. Our algorithm is fixed-parameter tractable, with the number of integer variables as the parameter. In particular, when the number of integer variables is fixed, the running time of our algorithm is bounded by a polynomial in the size of the problem. To design our algorithm, we prove a number of fundamental structural and algorithmic results for mixed integer linear and quadratic programming.
We introduce the BREASE framework for the Bayesian analysis of randomized controlled trials with a binary treatment and a binary outcome. Approaching the problem from a causal inference perspective, we propose parameterizing the likelihood in terms of the baseline risk, efficacy, and adverse side effects of the treatment, along with a flexible yet intuitive and tractable jointly independent beta prior distribution on these parameters, which we show to be a generalization of the Dirichlet prior for the joint distribution of potential outcomes. Our approach has a number of desirable characteristics when compared to current mainstream alternatives: (i) it naturally induces prior dependence between expected outcomes in the treatment and control groups; (ii) as the baseline risk, efficacy, and risk of adverse side effects are quantities commonly present in clinicians' vocabulary, the hyperparameters of the prior are directly interpretable, thus facilitating the elicitation of prior knowledge and sensitivity analysis; and (iii) we provide analytical formulae for the marginal likelihood, Bayes factor, and other posterior quantities, as well as exact posterior sampling via simulation, in cases where traditional MCMC fails. Empirical examples demonstrate the utility of our methods for estimation, hypothesis testing, and sensitivity analysis of treatment effects.
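A minimal sketch of one plausible reading of this parameterization; the mapping from baseline risk, efficacy, and side effects to the treated-arm risk is our assumption based on the abstract, and the paper instead provides exact analytical formulae rather than the simple importance-resampling step used below. Counts and hyperparameters are hypothetical.

```python
import numpy as np
from scipy import stats

def brease_style_posterior_draws(y0, n0, y1, n1, a=(1, 1), b=(1, 1), c=(1, 1),
                                 n_draws=200_000, seed=0):
    """Importance-resampling sketch of a BREASE-style posterior.
    Assumed parameterization (consistent with the abstract, not taken from it):
      p0 : baseline risk, P(Y=1 | control)                       ~ Beta(*a)
      e  : efficacy, P(event prevented | it would have occurred) ~ Beta(*b)
      s  : adverse side effects, P(event caused | it would not)  ~ Beta(*c)
      p1 = p0 * (1 - e) + (1 - p0) * s     (risk under treatment)
    Data: y0/n0 events in the control arm, y1/n1 in the treatment arm."""
    rng = np.random.default_rng(seed)
    p0 = rng.beta(*a, n_draws)
    e = rng.beta(*b, n_draws)
    s = rng.beta(*c, n_draws)
    p1 = p0 * (1 - e) + (1 - p0) * s
    logw = stats.binom.logpmf(y0, n0, p0) + stats.binom.logpmf(y1, n1, p1)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(n_draws, size=5_000, replace=True, p=w)   # resample
    return p0[idx], e[idx], s[idx], p1[idx]

# Example: 30/100 events in control, 15/100 in treatment (hypothetical counts).
p0, e, s, p1 = brease_style_posterior_draws(30, 100, 15, 100)
print("posterior mean risk difference:", np.mean(p1 - p0))
```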
Record linkage is the task of combining records from multiple files that refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time, and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective, with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this paper, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy, as well as the computational trade-offs between the methods when compared to a Gibbs sampler, through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Supplemental materials for this article are available online.
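For context, the sketch below computes the classical two-file Fellegi-Sunter log-likelihood-ratio match weights with known m- and u-probabilities; the paper's contribution, a Bayesian multi-file model with streaming posterior updates, is not reproduced here. The agreement patterns and probabilities are hypothetical.

```python
import numpy as np

def fellegi_sunter_weights(gamma, m, u):
    """Classical Fellegi-Sunter weight for each record pair.
    gamma : (n_pairs, n_fields) binary field-agreement indicators
    m[f]  : P(field f agrees | pair is a true match)
    u[f]  : P(field f agrees | pair is a non-match)
    Pairs with high weight are declared links.  The Bayesian model in the
    paper places priors on these parameters and the linkage structure and
    updates them as files stream in; here they are simply taken as known."""
    gamma = np.asarray(gamma, dtype=float)
    agree = np.log(m / u)
    disagree = np.log((1 - m) / (1 - u))
    return gamma @ agree + (1 - gamma) @ disagree

# Two record pairs compared on three fields (hypothetical agreement patterns).
gamma = [[1, 1, 0],     # agrees on name and birth year, not on zip code
         [0, 1, 0]]
m = np.array([0.95, 0.90, 0.85])
u = np.array([0.10, 0.05, 0.20])
print(fellegi_sunter_weights(gamma, m, u))
```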
Sequential design of experiments for optimizing a reward function in causal systems can be effectively modeled by the sequential design of interventions in causal bandits (CBs). In the existing literature on CBs, a critical assumption is that the causal models remain constant over time. However, this assumption does not necessarily hold in complex systems, which constantly undergo temporal model fluctuations. This paper addresses the robustness of CBs to such model fluctuations. The focus is on causal systems with linear structural equation models (SEMs). The SEMs and the time-varying pre- and post-interventional statistical models are all unknown. Cumulative regret is adopted as the design criterion, based on which the objective is to design a sequence of interventions that incurs the smallest cumulative regret with respect to an oracle aware of the entire causal model and its fluctuations. First, it is established that the existing approaches fail to maintain regret sub-linearity with even a few instances of model deviation. Specifically, when the number of instances with model deviation is as few as $T^\frac{1}{2L}$, where $T$ is the time horizon and $L$ is the length of the longest causal path in the graph, the existing algorithms will have linear regret in $T$. Next, a robust CB algorithm is designed and its regret is analyzed, with upper and information-theoretic lower bounds on the regret established. Specifically, in a graph with $N$ nodes and maximum degree $d$, under a general measure of model deviation $C$, the cumulative regret is upper bounded by $\tilde{\mathcal{O}}(d^{L-\frac{1}{2}}(\sqrt{NT} + NC))$ and lower bounded by $\Omega(d^{\frac{L}{2}-2}\max\{\sqrt{T},d^2C\})$. Comparing these bounds establishes that the proposed algorithm achieves nearly optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret when $C$ is $o(\sqrt{T})$ and maintains sub-linear regret for a broader range of $C$.
Eigenvector continuation is a computational method for parametric eigenvalue problems that uses subspace projection with a basis derived from eigenvector snapshots computed at different parameter values. It is part of a broader class of subspace-projection techniques called reduced-basis methods. In this colloquium article, we present the development, theory, and applications of eigenvector continuation and projection-based emulators. We introduce the basic concepts, discuss the underlying theory and convergence properties, and present recent applications for quantum systems and future prospects.
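A minimal numpy sketch of the basic eigenvector continuation recipe for a one-parameter Hamiltonian H(c) = H0 + c H1: collect ground-state snapshots at a few training couplings, project H(c) onto their span, and solve the small generalized eigenvalue problem with the snapshot overlap matrix. The toy matrices and couplings are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def ground_state(H):
    """Lowest eigenpair of a real symmetric matrix."""
    vals, vecs = np.linalg.eigh(H)
    return vals[0], vecs[:, 0]

def eigenvector_continuation(H0, H1, train_couplings, target_coupling):
    """Project H(c) = H0 + c*H1 onto the span of ground-state snapshots taken
    at the training couplings, then solve the small generalized eigenvalue
    problem  H_proj x = E N x,  where N is the snapshot overlap (Gram) matrix."""
    snapshots = np.column_stack(
        [ground_state(H0 + c * H1)[1] for c in train_couplings])
    H = H0 + target_coupling * H1
    H_proj = snapshots.T @ H @ snapshots
    N = snapshots.T @ snapshots          # snapshots are not orthonormal
    vals = eigh(H_proj, N, eigvals_only=True)
    return vals[0]                       # emulated ground-state energy

# Toy example: random 200x200 symmetric matrices (hypothetical system).
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200)); H0 = (A + A.T) / 2
B = rng.standard_normal((200, 200)); H1 = (B + B.T) / 2
emulated = eigenvector_continuation(H0, H1, train_couplings=[0.0, 0.5, 1.0],
                                    target_coupling=2.0)
exact = ground_state(H0 + 2.0 * H1)[0]
print(emulated, exact)   # emulated vs. exact ground-state energy
```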
Cobordism categories are known to be compact closed. They can therefore be used to define non-degenerate models of multiplicative linear logic by combining the Int construction with double glueing. In this work we detail such a construction in the case of low-dimensional cobordisms, and exhibit a connection between these models and the Interaction Graphs models introduced by Seiller. In particular, we show how the so-called trefoil property is a consequence of the associativity of composition of higher structures, providing a first step toward establishing that the latter models can be obtained from a double glueing construction. We discuss possible extensions to higher-dimensional cobordism categories.
Traditional methods for link prediction can be categorized into three main types: graph structure feature-based, latent feature-based, and explicit feature-based. Graph structure feature methods leverage handcrafted node proximity scores, e.g., common neighbors, to estimate the likelihood of links. Latent feature methods rely on factorizing a network's matrix representation to learn an embedding for each node. Explicit feature methods train a machine learning model on two nodes' explicit attributes. Each of the three types of methods has its unique merits. In this paper, we propose SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction), a new framework for link prediction which combines the power of all three types in a single graph neural network (GNN). A GNN is a new type of neural network that directly accepts graphs as input and outputs their labels. In SEAL, the input to the GNN is a local subgraph around each target link. We prove theoretically that these local subgraphs preserve a great deal of high-order graph structure features related to link existence. Another key feature is that our GNN can naturally incorporate latent features and explicit features. This is achieved by concatenating node embeddings (latent features) and node attributes (explicit features) in the node information matrix for each subgraph, thus combining the three types of features to enhance GNN learning. Through extensive experiments, SEAL shows unprecedentedly strong performance against a wide range of baseline methods, including various link prediction heuristics and network embedding methods.
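To make the subgraph-based input concrete, here is a sketch (using networkx) of the enclosing-subgraph extraction step together with a simple distance-based node labeling in the spirit of SEAL's double-radius scheme; the exact labeling function and the GNN itself follow the paper and its reference implementation and are not reproduced here. The example graph, hop number, and candidate link are arbitrary.

```python
import networkx as nx

def enclosing_subgraph(G, u, v, num_hops=1):
    """Extract the h-hop enclosing subgraph around a candidate link (u, v):
    the union of nodes within num_hops of either endpoint.  This subgraph is
    what SEAL feeds to the GNN, one subgraph per target link."""
    nodes = set()
    for s in (u, v):
        nodes |= set(nx.single_source_shortest_path_length(G, s, cutoff=num_hops))
    sub = G.subgraph(nodes).copy()
    # Remove the target edge itself so the model cannot trivially read it off.
    if sub.has_edge(u, v):
        sub.remove_edge(u, v)
    return sub

def distance_labels(sub, u, v):
    """Distance-based structural labels: each node is tagged with its
    shortest-path distances to the two endpoints (-1 if unreachable).
    SEAL's double-radius labeling additionally computes the distance to u
    with v removed (and vice versa) and hashes the pair to an integer;
    the raw pair is kept here for simplicity."""
    du = nx.single_source_shortest_path_length(sub, u)
    dv = nx.single_source_shortest_path_length(sub, v)
    return {n: (du.get(n, -1), dv.get(n, -1)) for n in sub.nodes}

# Toy usage on a small built-in graph, for a candidate (non-)link (0, 33).
G = nx.karate_club_graph()
sub = enclosing_subgraph(G, 0, 33, num_hops=1)
print(sub.number_of_nodes(), distance_labels(sub, 0, 33)[2])
```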