We extend intersection types to a computational $\lambda$-calculus with algebraic operations \`a la Plotkin and Power. We achieve this by considering monadic intersections, whereby computational effects appear not only in the operational semantics, but also in the type system. Since in the effectful setting termination is not anymore the only property of interest, we want to analyze the interactive behavior of typed programs with the environment. Indeed, our type system is able to characterize the natural notion of observation, both in the finite and in the infinitary setting, and for a wide class of effects, such as output, cost, pure and probabilistic nondeterminism, and combinations thereof. The main technical tool is a novel combination of syntactic techniques with abstract relational reasoning, which allows us to lift all the required notions, e.g. of typability and logical relation, to the monadic setting.
The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of data involved, compression methods for the Euclidean $(k, z)$-Clustering problem, such as data compression and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, specifically, the number of bits required to compress the cost function within a multiplicative error $\varepsilon$, remains unclear in existing literature. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, serves as the optimal compression scheme. Furthermore, our lower bound result for $(k, z)$-Clustering establishes a tight space bound of $\Theta( n d )$ for terminal embedding, where $n$ represents the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may hold independent interest.
Automatically generating invariants, key to computer-aided analysis of probabilistic and deterministic programs and compiler optimisation, is a challenging open problem. Whilst the problem is in general undecidable, the goal is settled for restricted classes of loops. For the class of solvable loops, introduced by Kapur and Rodr\'iguez-Carbonell in 2004, one can automatically compute invariants from closed-form solutions of recurrence equations that model the loop behaviour. In this paper we establish a technique for invariant synthesis for loops that are not solvable, termed unsolvable loops. Our approach automatically partitions the program variables and identifies the so-called defective variables that characterise unsolvability. Herein we consider the following two applications. First, we present a novel technique that automatically synthesises polynomials from defective monomials, that admit closed-form solutions and thus lead to polynomial loop invariants. Second, given an unsolvable loop, we synthesise solvable loops with the following property: the invariant polynomials of the solvable loops are all invariants of the given unsolvable loop. Our implementation and experiments demonstrate both the feasibility and applicability of our approach to both deterministic and probabilistic programs.
We study the complexity of approximating the number of answers to a small query $\varphi$ in a large database $\mathcal{D}$. We establish an exhaustive classification into tractable and intractable cases if $\varphi$ is a conjunctive query with disequalities and negations: $\bullet$ If there is a constant bound on the arity of $\varphi$, and if the randomised Exponential Time Hypothesis (rETH) holds, then the problem has a fixed-parameter tractable approximation scheme (FPTRAS) if and only if the treewidth of $\varphi$ is bounded. $\bullet$ If the arity is unbounded and we allow disequalities only, then the problem has an FPTRAS if and only if the adaptive width of $\varphi$ (a width measure strictly more general than treewidth) is bounded; the lower bound relies on the rETH as well. Additionally we show that our results cannot be strengthened to achieve a fully polynomial randomised approximation scheme (FPRAS): We observe that, unless $\mathrm{NP} =\mathrm{RP}$, there is no FPRAS even if the treewidth (and the adaptive width) is $1$. However, if there are neither disequalities nor negations, we prove the existence of an FPRAS for queries of bounded fractional hypertreewidth, strictly generalising the recently established FPRAS for conjunctive queries with bounded hypertreewidth due to Arenas, Croquevielle, Jayaram and Riveros (STOC 2021).
Emulators, or reduced complexity climate models, are surrogate Earth system models that produce projections of key climate quantities with minimal computational resources. Using time-series modelling or more advanced machine learning techniques, data-driven emulators have emerged as a promising avenue of research, producing spatially resolved climate responses that are visually indistinguishable from state-of-the-art Earth system models. Yet, their lack of physical interpretability limits their wider adoption. In this work, we introduce FaIRGP, a data-driven emulator that satisfies the physical temperature response equations of an energy balance model. The result is an emulator that \textit{(i)} enjoys the flexibility of statistical machine learning models and can learn from data, and \textit{(ii)} has a robust physical grounding with interpretable parameters that can be used to make inference about the climate system. Further, our Bayesian approach allows a principled and mathematically tractable uncertainty quantification. Our model demonstrates skillful emulation of global mean surface temperature and spatial surface temperatures across realistic future scenarios. Its ability to learn from data allows it to outperform energy balance models, while its robust physical foundation safeguards against the pitfalls of purely data-driven models. We also illustrate how FaIRGP can be used to obtain estimates of top-of-atmosphere radiative forcing and discuss the benefits of its mathematical tractability for applications such as detection and attribution or precipitation emulation. We hope that this work will contribute to widening the adoption of data-driven methods in climate emulation.
We propose principled Gaussian processes (GPs) for modeling functions defined over the edge set of a simplicial 2-complex, a structure similar to a graph in which edges may form triangular faces. This approach is intended for learning flow-type data on networks where edge flows can be characterized by the discrete divergence and curl. Drawing upon the Hodge decomposition, we first develop classes of divergence-free and curl-free edge GPs, suitable for various applications. We then combine them to create \emph{Hodge-compositional edge GPs} that are expressive enough to represent any edge function. These GPs facilitate direct and independent learning for the different Hodge components of edge functions, enabling us to capture their relevance during hyperparameter optimization. To highlight their practical potential, we apply them for flow data inference in currency exchange, ocean currents and water supply networks, comparing them to alternative models.
Differentially private computation often begins with a bound on some $d$-dimensional statistic's $\ell_p$ sensitivity. For pure differential privacy, the $K$-norm mechanism can improve on this approach using statistic-specific (and possibly non-$\ell_p$) norms. However, sampling such mechanisms requires sampling from the corresponding norm balls. These are $d$-dimensional convex polytopes, for which the fastest known general sampling algorithm takes time $\tilde O(d^{3+\omega})$, where $\omega \geq 2$ is the matrix multiplication exponent. For concentrated differential privacy, elliptic Gaussian noise offers similar improvement over spherical Gaussian noise, but the general method for computing the problem-specific elliptic noise requires solving a semidefinite program for each instance. This paper considers the simple problems of sum, count, and vote and provides faster algorithms in both settings. We construct optimal pure differentially private $K$-norm mechanism samplers and derive closed-form expressions for optimal concentrated differentially private elliptic Gaussian noise. Their runtimes are, respectively, $\tilde O(d^2)$ and $O(1)$, and the resulting algorithms all yield meaningful accuracy improvements. More broadly, we suggest that problem-specific sensitivity space analysis may be an overlooked tool for private additive noise.
We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.
Data stream algorithms tackle operations on high-volume sequences of read-once data items. Data stream scenarios include inherently real-time systems like sensor networks and financial markets. They also arise in purely-computational scenarios like ordered traversal of big data or long-running iterative simulations. In this work, we develop methods to maintain running archives of stream data that are temporally representative, a task we call "stream curation." Our approach contributes to rich existing literature on data stream binning, which we extend by providing stateless (i.e., non-iterative) curation schemes that enable key optimizations to trim archive storage overhead and streamline processing of incoming observations. We also broaden support to cover new trade-offs between curated archive size and temporal coverage. We present a suite of five stream curation algorithms that span $\mathcal{O}(n)$, $\mathcal{O}(\log n)$, and $\mathcal{O}(1)$ orders of growth for retained data items. Within each order of growth, algorithms are provided to maintain even coverage across history or bias coverage toward more recent time points. More broadly, memory-efficient stream curation can boost the data stream mining capabilities of low-grade hardware in roles such as sensor nodes and data logging devices.
We initiate the study of Local Computation Algorithms on average case inputs. In the Local Computation Algorithm (LCA) model, we are given probe access to a huge graph, and asked to answer membership queries about some combinatorial structure on the graph, answering each query with sublinear work. For instance, an LCA for the $k$-spanner problem gives access to a sparse subgraph $H\subseteq G$ that preserves distances up to a factor of $k$. We build simple LCAs for this problem assuming the input graph is drawn from the well-studied Erdos-Reyni and Preferential Attachment graph models. In both cases, our spanners achieve size and stretch tradeoffs that are impossible to achieve for general graphs, while having dramatically lower query complexity than worst-case LCAs. Our second result investigates the intersection of LCAs with Local Access Generators (LAGs). Local Access Generators provide efficient query access to a random object, for instance an Erdos Reyni random graph. We explore the natural problem of generating a random graph together with a combinatorial structure on it. We show that this combination can be easier to solve than focusing on each problem by itself, by building a fast, simple algorithm that provides access to an Erdos Reyni random graph together with a maximal independent set.
Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in \causalml into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.