We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to more general orders defined by the sum of attribute weights and establish, for each of the two problems, the corresponding decidable characterizations of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability, and establish the corresponding generalizations of our characterizations for every set of unary FDs.
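To make ranked direct access concrete, here is a toy sketch in Python, not the paper's construction (which handles general CQs and orders): for the Cartesian-product query Q(x, y) :- R(x), S(y) under the lexicographic order (x, y), sorting both relations is a quasilinear preprocessing step after which the k-th answer is obtained by index arithmetic alone.

```python
# Toy illustration (not the paper's construction): ranked direct access for the
# Cartesian-product CQ Q(x, y) :- R(x), S(y) under the lexicographic order (x, y).
def preprocess(R, S):
    """Quasilinear preprocessing: sort both relations once."""
    return sorted(R), sorted(S)

def direct_access(Rs, Ss, k):
    """Return the k-th answer (0-based) of R x S in lexicographic order on (x, y)."""
    if not 0 <= k < len(Rs) * len(Ss):
        raise IndexError("position out of range")
    i, j = divmod(k, len(Ss))
    return (Rs[i], Ss[j])

Rs, Ss = preprocess([3, 1, 2], [20, 10])
print([direct_access(Rs, Ss, k) for k in range(6)])
# [(1, 10), (1, 20), (2, 10), (2, 20), (3, 10), (3, 20)]
```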
This paper introduces the exponential substitution calculus (ESC), a new presentation of cut elimination for IMELL, based on proof terms and building on the idea that exponentials can be seen as explicit substitutions. The idea in itself is not new, but here it is pushed to a new level, inspired by Accattoli and Kesner's linear substitution calculus (LSC). One of the key properties of the LSC is that it naturally models the sub-term property of abstract machines, which is the key ingredient for the study of reasonable time cost models for the $\lambda$-calculus. The new ESC is then used to design a cut elimination strategy with the sub-term property, providing the first polynomial cost model for cut elimination with unconstrained exponentials. For the ESC, we also prove untyped confluence and typed strong normalization, showing that it is an alternative to proof nets for an advanced study of cut elimination.
The purpose of this paper is to examine the sampling problem through Euler discretization, where the potential function is assumed to be a mixture of locally smooth distributions and to be weakly dissipative. We introduce the notions of $\alpha_{G}$-mixture local smoothness and $\alpha_{H}$-mixture local Hessian smoothness, which are novel and typically satisfied by mixtures of distributions. Under these conditions, we prove convergence in Kullback-Leibler (KL) divergence, with the number of iterations needed to reach an $\epsilon$-neighborhood of the target distribution depending only polynomially on the dimension. The convergence rate improves further when the potential is $1$-smooth and $\alpha_{H}$-mixture locally Hessian smooth. Our result for potentials that are not strongly convex outside a ball of radius $R$ is obtained by convexifying the non-convex domains. In addition, we establish several theoretical properties of $p$-generalized Gaussian smoothing and prove convergence in the $L_{\beta}$-Wasserstein distance for stochastic gradients in a general setting.
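For concreteness, the following is a minimal sketch of the kind of Euler (unadjusted Langevin) discretization studied here, run on a toy two-component Gaussian mixture; the potential, step size, and iteration count are illustrative choices and not the paper's setting or conditions.

```python
# Unadjusted Langevin iteration x_{t+1} = x_t - eta * grad U(x_t) + sqrt(2 eta) * xi_t,
# i.e. the Euler discretization of Langevin dynamics, on a toy Gaussian-mixture potential.
import numpy as np

rng = np.random.default_rng(0)
MUS = (np.array([-2.0, 0.0]), np.array([2.0, 0.0]))   # illustrative mixture centers

def grad_potential(x):
    """Gradient of U(x) = -log( (1/2) * sum_i exp(-||x - mu_i||^2 / 2) )."""
    w = np.array([np.exp(-0.5 * np.sum((x - mu) ** 2)) for mu in MUS])
    w /= w.sum()
    return sum(wi * (x - mu) for wi, mu in zip(w, MUS))

def ula(x0, eta=0.05, n_iter=5000):
    """Run the Euler/unadjusted Langevin discretization and return all iterates."""
    x = np.array(x0, dtype=float)
    iterates = []
    for _ in range(n_iter):
        x = x - eta * grad_potential(x) + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)
        iterates.append(x.copy())
    return np.array(iterates)

samples = ula([0.0, 0.0])
print(samples.mean(axis=0))   # inspect the empirical mean of the chain
```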
Using techniques developed recently in the field of compressed sensing we prove new upper bounds for general (non-linear) sampling numbers of (quasi-)Banach smoothness spaces in $L^2$. In relevant cases such as mixed and isotropic weighted Wiener classes or Sobolev spaces with mixed smoothness, sampling numbers in $L^2$ can be upper bounded by best $n$-term trigonometric widths in $L^\infty$. We describe a recovery procedure based on $\ell^1$-minimization (basis pursuit denoising) using only $m$ function values. This method achieves a significant gain in the rate of convergence compared to recently developed linear recovery methods. In this deterministic worst-case setting we see an additional speed-up of $n^{-1/2}$ compared to linear methods in case of weighted Wiener spaces. For their quasi-Banach counterparts even arbitrary polynomial speed-up is possible. Surprisingly, our approach allows us to recover mixed smoothness Sobolev functions belonging to $S^r_pW(\mathbb{T}^d)$ on the $d$-torus with a logarithmically better rate of convergence than any linear method can achieve when $1 < p < 2$ and $d$ is large. This effect is not present for isotropic Sobolev spaces.
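To illustrate the flavor of $\ell^1$-based recovery from function values, here is a minimal sketch (not the paper's exact procedure or guarantees): a function that is sparse in a cosine dictionary is reconstructed from $m$ samples by solving the Lagrangian (LASSO) form of basis pursuit denoising with iterative soft-thresholding; the dictionary, sample size, sparsity, and regularization parameter are illustrative assumptions.

```python
# Sparse recovery from m function values via the LASSO form of basis pursuit denoising,
# solved with ISTA (iterative soft-thresholding); all sizes and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 256, 60, 5                      # dictionary size, number of samples, sparsity

t = rng.uniform(0.0, 2.0 * np.pi, size=m)            # sampling points on the torus
A = np.cos(np.outer(t, np.arange(n))) / np.sqrt(m)   # cosine dictionary at the samples

x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true + 1e-3 * rng.standard_normal(m)       # noisy function values

def ista(A, y, lam=1e-3, n_iter=5000):
    """Minimize 0.5 * ||A x - y||_2^2 + lam * ||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L      # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return x

x_hat = ista(A, y)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```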
A key tool to carry out inference on the unknown copula when modeling a continuous multivariate distribution is a nonparametric estimator known as the empirical copula. One popular way of approximating its sampling distribution consists of using the multiplier bootstrap, which is however characterized by a high implementation cost. Given the rank-based nature of the empirical copula, the classical empirical bootstrap of Efron does not appear to be a natural alternative, as it relies on resamples that contain ties. The aim of this work is to investigate the use of subsampling in the aforementioned framework. Subsampling consists of basing the inference on values of the statistic computed from subsamples of the initial data. One of its advantages in the rank-based context under consideration is that the formed subsamples do not contain ties; another is its asymptotic validity under minimal conditions. In this work, we show the asymptotic validity of subsampling for several (weighted, smooth) empirical copula processes, both for serially independent observations and for time series. In the former case, subsampling is observed to be substantially better than the empirical bootstrap and, overall, equivalent to the multiplier bootstrap in terms of finite-sample performance.
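The following minimal sketch illustrates the basic ingredients under illustrative choices of subsample size and statistic: pseudo-observations are formed by ranking, the empirical copula is recomputed on subsamples drawn without replacement (which therefore contain no ties), and the subsample values can then feed the inference; the recentering and rescaling required to build actual confidence intervals are omitted.

```python
# Subsampling for the empirical copula: subsamples are drawn without replacement, so
# (for continuous data) they contain no ties; recentering/rescaling is omitted here.
import numpy as np
from scipy.stats import rankdata

def empirical_copula(u_grid, data):
    """Empirical copula evaluated at the points in u_grid (shape (k, d))."""
    n = data.shape[0]
    pseudo = rankdata(data, axis=0) / (n + 1)          # pseudo-observations (no ties)
    return np.array([np.mean(np.all(pseudo <= u, axis=1)) for u in u_grid])

def subsample_values(data, statistic, b, n_rep=500, rng=None):
    """Values of `statistic` on n_rep subsamples of size b drawn without replacement."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = data.shape[0]
    return np.array([statistic(data[rng.choice(n, size=b, replace=False)])
                     for _ in range(n_rep)])

# toy usage: statistic = empirical copula at the single point (0.5, 0.5)
rng = np.random.default_rng(2)
data = rng.standard_normal((1000, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])
stat = lambda x: empirical_copula(np.array([[0.5, 0.5]]), x)[0]
vals = subsample_values(data, stat, b=200)
print(vals.mean(), vals.std())
```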
We study the top-$k$ selection problem under the differential privacy model: $m$ items are rated according to votes of a set of clients. We consider a setting in which algorithms can retrieve data via a sequence of accesses, each either a random access or a sorted access; the goal is to minimize the total number of data accesses. Our algorithm requires only $O(\sqrt{mk})$ expected accesses: to our knowledge, this is the first sublinear data-access upper bound for this problem. Accompanying this, we develop the first lower bounds for the problem, in three settings: only random accesses; only sorted accesses; and a sequence of accesses of either kind. We show that, to avoid $\Omega(m)$ access cost, supporting \emph{either} kind of access, i.e., the freedom to mix the two, is necessary, and that in this case our algorithm's access cost is almost optimal.
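As a sketch of the data-access model described above (not of the algorithm itself), the following toy wrapper exposes exactly the two kinds of accesses and counts both toward the access cost; the item names and scores are illustrative.

```python
# Toy wrapper exposing the two kinds of data accesses and counting the total access cost.
class AccessCountedScores:
    def __init__(self, scores):
        self._scores = dict(scores)                                   # item -> score
        self._order = sorted(self._scores, key=self._scores.get, reverse=True)
        self._cursor = 0
        self.cost = 0                                                 # number of accesses

    def sorted_access(self):
        """Return the next (item, score) pair in decreasing score order."""
        self.cost += 1
        item = self._order[self._cursor]
        self._cursor += 1
        return item, self._scores[item]

    def random_access(self, item):
        """Return the score of a specific, named item."""
        self.cost += 1
        return self._scores[item]

db = AccessCountedScores({"a": 10, "b": 7, "c": 12})
print(db.sorted_access(), db.random_access("b"), db.cost)   # ('c', 12) 7 2
```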
This paper presents a new strategy to deal with the excessive diffusion that standard finite volume methods for compressible Euler equations display in the limit of low Mach number. The strategy can be understood as using centered discretizations for the acoustic part of the Euler equations and stabilizing them with a leap-frog-type ("sequential explicit") time integration, a fully explicit method. This time integration takes inspiration from time-explicit staggered grid numerical methods. In this way, advantages of staggered methods carry over to collocated methods. The paper provides a number of new collocated schemes for linear acoustic/Maxwell equations that are inspired by the Yee scheme. They are then extended to an all-speed method for the full Euler equations on Cartesian grids. Taking the opposite view and drawing inspiration from collocated methods, the paper also suggests a new way of staggering the variables, which increases stability compared to the traditional Yee scheme.
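For reference, here is a minimal sketch of the classical staggered leapfrog (Yee-type) update for 1D linear acoustics, i.e., the kind of staggered scheme the new collocated schemes take inspiration from, not the schemes proposed in the paper; the grid size, wave speed, CFL number, and initial pulse are illustrative.

```python
# Classical staggered leapfrog update for 1D linear acoustics
#   p_t + c u_x = 0,  u_t + c p_x = 0   with periodic boundaries.
import numpy as np

c, N, cfl = 1.0, 200, 0.9
dx = 1.0 / N
dt = cfl * dx / c

x_p = (np.arange(N) + 0.5) * dx          # pressure at cell centers
p = np.exp(-200.0 * (x_p - 0.5) ** 2)    # initial pressure pulse
u = np.zeros(N)                          # velocity at cell faces (staggered)

for _ in range(400):
    # half-step-shifted (leapfrog) updates on the staggered grid
    u -= c * dt / dx * (p - np.roll(p, 1))      # u_i sits between p_{i-1} and p_i
    p -= c * dt / dx * (np.roll(u, -1) - u)     # p_i sits between u_i and u_{i+1}

print("max |p| after 400 steps:", np.abs(p).max())
```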
Scientists often must simultaneously localize and discover signals. For instance, in genetic fine-mapping, high correlations between nearby genetic variants make it hard to identify the exact locations of causal variants. So the statistical task is to output as many disjoint regions containing a signal as possible, each as small as possible, while controlling false positives. Similar problems arise in any application where signals cannot be perfectly localized, such as locating stars in astronomical surveys and changepoint detection in sequential data. Common Bayesian approaches to these problems involve computing a posterior distribution over signal locations. However, existing procedures to translate these posteriors into actual credible regions for the signals fail to capture all the information in the posterior, leading to lower power and (sometimes) inflated false discoveries. With this motivation, we introduce Bayesian Linear Programming (BLiP). Given a posterior distribution over signals, BLiP outputs credible regions for signals that verifiably nearly maximize expected power while controlling false positives; to do so, it overcomes an extremely high-dimensional and nonconvex optimization problem. BLiP is very computationally efficient compared to the cost of computing the posterior and can wrap around nearly any Bayesian model and algorithm. Applying BLiP to existing state-of-the-art analyses of UK Biobank data (for genetic fine-mapping) and the Sloan Digital Sky Survey (for astronomical point source detection) increased power by 30-120% in just a few minutes of additional computation. BLiP is implemented in pyblip (Python) and blipr (R).
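As a toy illustration of the linear-programming flavor of this selection problem (not the BLiP implementation, which is available in pyblip and blipr), the sketch below solves an LP relaxation that picks disjoint candidate regions to maximize a simple expected-power weight subject to a bound on the expected proportion of false discoveries; the candidate regions, posterior probabilities, weights, and error level are all illustrative assumptions.

```python
# LP relaxation: choose (fractionally) which candidate regions to report so as to
# maximize a simple expected-power weight, subject to disjointness and to a bound q on
# the expected proportion of false discoveries; an integer/rounding step would follow.
import numpy as np
from scipy.optimize import linprog

regions = [{0}, {0, 1}, {2}, {3, 4}, {4}]          # candidate regions (sets of locations)
pip = np.array([0.55, 0.95, 0.30, 0.90, 0.60])     # P(region contains at least one signal)
q = 0.1                                            # illustrative error budget

power = pip / np.array([len(r) for r in regions])  # weight: inclusion prob. / region size

# expected-false-discovery constraint: sum_i (1 - pip_i) x_i <= q * sum_i x_i
A_ub = [list((1.0 - pip) - q)]
b_ub = [0.0]
# disjointness: regions sharing a location cannot be jointly selected
for loc in sorted(set().union(*regions)):
    A_ub.append([1.0 if loc in r else 0.0 for r in regions])
    b_ub.append(1.0)

res = linprog(c=-power, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, 1.0)] * len(regions))
print(np.round(res.x, 3))   # fractional entries would still need rounding
```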
In mobile computation offloading (MCO), mobile devices (MDs) can choose to either execute tasks locally or have them executed on a remote edge server (ES). This paper addresses the problem of jointly assigning the wireless communication bandwidth and the ES capacity used for task execution, so that task completion time constraints are satisfied. The objective is to obtain these allocations so that the average power consumption of the mobile devices is minimized, subject to a cost budget constraint. The paper includes contributions for both soft and hard task completion deadline constraints. The problems are first formulated as mixed integer nonlinear programs (MINLPs). Approximate solutions are then obtained by decomposing the problems into a collection of convex subproblems that can be efficiently solved. Results are presented that demonstrate the quality of the proposed solutions, which can achieve near-optimum performance over a wide range of system parameters.
Standard neural networks struggle to generalize under distribution shifts in computer vision. Fortunately, combining multiple networks can consistently improve out-of-distribution generalization. In particular, weight averaging (WA) strategies were shown to perform best on the competitive DomainBed benchmark; they directly average the weights of multiple networks despite their nonlinearities. In this paper, we propose Diverse Weight Averaging (DiWA), a new WA strategy whose main motivation is to increase the functional diversity across averaged models. To this end, DiWA averages weights obtained from several independent training runs: indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between WA and standard functional ensembling. Moreover, this decomposition highlights that WA succeeds when the variance term dominates, which we show occurs when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on DomainBed without inference overhead.
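A minimal sketch of the weight-averaging step itself is shown below, assuming several checkpoints of the same architecture obtained from independent training runs; the checkpoint paths and model class in the usage comment are hypothetical, and DiWA's contribution additionally concerns which runs to average and why their diversity helps.

```python
# Uniform weight averaging across checkpoints of the same architecture.
import torch

def average_weights(state_dicts):
    """Average parameters/buffers from checkpoints sharing a single architecture."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    return {k: v / len(state_dicts) for k, v in avg.items()}

# hypothetical usage: independently trained runs of the same model
# state_dicts = [torch.load(p, map_location="cpu") for p in ["run1.pt", "run2.pt", "run3.pt"]]
# model = MyDomainBedNet()              # hypothetical model class
# model.load_state_dict(average_weights(state_dicts))
```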
Fundamental differences between materials originate from the unique nature of their constituent chemical elements. Before specific differences emerge according to the precise ratios of elements in a given crystal structure, a material can be represented by the set of its constituent chemical elements. By working at the level of the periodic table, assessment of materials at the level of their phase fields reduces the combinatorial complexity to accelerate screening, and circumvents the challenges associated with composition-level approaches such as poor extrapolation within phase fields and the impossibility of exhaustive sampling. This early-stage discrimination, combined with an evaluation of phase-field novelty, aligns with the outstanding experimental challenge of identifying new areas of chemistry to investigate, by prioritising which elements to combine in a reaction. Here, we demonstrate that phase fields can be assessed with respect to the maximum expected value of a target functional property and ranked according to chemical novelty. We develop and present PhaseSelect, an end-to-end machine learning model that combines the representation, classification, regression and ranking of phase fields. First, PhaseSelect constructs elemental characteristics from the co-occurrence of chemical elements in computationally and experimentally reported materials; then it employs attention mechanisms to learn representations of phase fields and assess their functional performance. At the level of the periodic table, PhaseSelect quantifies the probability of observing a functional property, estimates its value within a phase field, and ranks phase fields by novelty, which we demonstrate with significant accuracy for three avenues of materials applications: high-temperature superconductivity, high-temperature magnetism, and targeted bandgap energy.
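To illustrate the general mechanism of attention over a set of element embeddings (with hypothetical dimensions and a single property head, not the PhaseSelect architecture), a minimal sketch is given below.

```python
# Attention-based pooling over the element embeddings of a phase field (a set of
# elements), followed by a toy property head; all dimensions are hypothetical.
import torch
import torch.nn as nn

class PhaseFieldEncoder(nn.Module):
    def __init__(self, n_elements=103, dim=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_elements, dim)        # elemental representations
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)                     # e.g. probability of a property

    def forward(self, element_ids):                       # (batch, set_size) element indices
        x = self.embed(element_ids)
        x, _ = self.attn(x, x, x)                         # self-attention over the element set
        x = x.mean(dim=1)                                 # pool the set into one vector
        return torch.sigmoid(self.head(x)).squeeze(-1)

model = PhaseFieldEncoder()
phase_fields = torch.tensor([[3, 8, 29], [56, 29, 8]])    # toy element-index sets
print(model(phase_fields).shape)                          # torch.Size([2])
```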