This paper considers the classic Online Steiner Forest problem where one is given a (weighted) graph $G$ and an arbitrary set of $k$ terminal pairs $\{\{s_1,t_1\},\ldots ,\{s_k,t_k\}\}$ that are required to be connected. The goal is to maintain a minimum-weight sub-graph that satisfies all the connectivity requirements as the pairs are revealed one by one. It has been known for a long time that no algorithm (even randomized) can be better than $\Omega(\log(k))$-competitive for this problem. Interestingly, a simple greedy algorithm is already very efficient for this problem. This algorithm can be informally described as follows: upon arrival of a new pair $\{s_i,t_i\}$, connect $s_i$ and $t_i$ with the shortest path in the current metric, contract the metric along the chosen path, and wait for the next pair. Although simple and intuitive, greedy has proved challenging to analyze, and its competitive ratio is a long-standing open problem in the area of online algorithms. The most recent progress on this question is due to an elegant analysis by Awerbuch, Azar, and Bartal [SODA~1996], who showed that greedy is $O(\log^2(k))$-competitive. Our main result is to show that greedy is in fact $O(\log(k)\log\log(k))$-competitive on a wide class of instances. In particular, this class contains all the instances that have been exhibited in the literature so far.
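The greedy rule described above can be sketched in a few lines. This is only an illustrative sketch, not the paper's formalization: the function name `greedy_steiner_forest` and the edge-list input format are hypothetical, and contraction is emulated by setting the cost of every purchased edge to zero, an assumption that gives the same shortest-path distances on a finite graph.

```python
import heapq


def greedy_steiner_forest(edges, pairs):
    """Greedy sketch: edges is a list of (u, v, weight), pairs arrive in order.

    Returns (bought_edges, total_cost). Bought edges are re-priced at 0,
    which stands in for contracting the metric along the chosen path.
    """
    cost, adj = {}, {}
    for u, v, w in edges:
        key = frozenset((u, v))
        cost[key] = min(w, cost.get(key, float("inf")))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    bought, total = set(), 0.0
    for s, t in pairs:
        # Dijkstra from s under the current (partially zeroed) edge costs.
        dist, prev, done = {s: 0.0}, {}, set()
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if u in done:
                continue
            done.add(u)
            if u == t:
                break
            for v in adj.get(u, ()):
                nd = d + cost[frozenset((u, v))]
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        if t not in dist:
            raise ValueError(f"terminals {s!r} and {t!r} are not connected")

        # Buy the path: pay the current cost, then make the edge free.
        node = t
        while node != s:
            key = frozenset((node, prev[node]))
            total += cost[key]
            cost[key] = 0.0
            bought.add(key)
            node = prev[node]
    return bought, total


example_edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 3), ("c", "d", 1)]
bought, total = greedy_steiner_forest(example_edges, [("a", "c"), ("b", "d")])
```

In the small example, the second pair reuses the already purchased edge between `b` and `c` at no cost, which is exactly the effect contraction is meant to have.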
The saddlepoint approximation gives an approximation to the density of a random variable in terms of its moment generating function. When the underlying random variable is itself the sum of $n$ unobserved i.i.d. terms, the basic classical result is that the relative error in the density is of order $1/n$. If instead the approximation is interpreted as a likelihood and maximised as a function of model parameters, the result is an approximation to the maximum likelihood estimate (MLE) that can be much faster to compute than the true MLE. This paper proves the analogous basic result for the approximation error between the saddlepoint MLE and the true MLE: subject to certain explicit identifiability conditions, the error has asymptotic size $O(1/n^2)$ for some parameters, and $O(1/n^{3/2})$ or $O(1/n)$ for others. In all three cases, the approximation errors are asymptotically negligible compared to the inferential uncertainty. The proof is based on a factorisation of the saddlepoint likelihood into an exact term and an approximate term, along with an analysis of the approximation error in the gradient of the log-likelihood. This factorisation also gives insight into alternatives to the saddlepoint approximation, including a new and simpler saddlepoint approximation, for which we derive analogous error bounds. As a corollary of our results, we also obtain the asymptotic size of the MLE approximation error when the saddlepoint approximation is replaced by the normal approximation.
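For orientation, the standard first-order saddlepoint density approximation for a scalar random variable $X$ can be written as follows. The notation is an assumption made here, not taken from the paper: $M$ is the moment generating function, $K = \log M$ is the cumulant generating function, and $\hat{s}$ is the saddlepoint.

\[
  \hat{f}(x) \;=\; \frac{\exp\bigl(K(\hat{s}) - \hat{s}\,x\bigr)}{\sqrt{2\pi\,K''(\hat{s})}},
  \qquad \text{where } \hat{s} = \hat{s}(x) \text{ solves } K'(\hat{s}) = x .
\]

When $K$ depends on model parameters $\theta$, reading $\hat{f}(x;\theta)$ as a function of $\theta$ for fixed data $x$ gives the saddlepoint likelihood whose maximiser is compared with the true MLE.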
We introduce Stochastic Asymptotical Regularization (SAR) methods for the uncertainty quantification of the stable approximate solution of ill-posed linear-operator equations, which are deterministic models for numerous inverse problems in science and engineering. We prove the regularizing properties of SAR with regard to mean-square convergence. We also show that SAR is an optimal-order regularization method for linear ill-posed problems provided that the terminating time of SAR is chosen according to the smoothness of the solution. This result is proven for both a priori and a posteriori stopping rules under general range-type source conditions. Furthermore, some converse results for SAR are verified. Two iterative schemes are developed for the numerical realization of SAR, and convergence analyses of both schemes are provided. A toy example and a real-world problem of biosensor tomography are studied to show the accuracy and the advantages of SAR: compared with conventional deterministic regularization approaches for deterministic inverse problems, SAR can provide uncertainty quantification of the quantity of interest, which can in turn be used to reveal and explicate hidden information about real-world problems that is usually obscured by incomplete mathematical modeling and the dominance of complex-structured noise.
We propose Streaming Bandits, a Restless Multi-Armed Bandit (RMAB) framework in which heterogeneous arms may arrive and leave the system after staying on for a finite lifetime. Streaming Bandits naturally capture the health intervention planning problem, where health workers must manage the health outcomes of a patient cohort while new patients join and existing patients leave the cohort each day. Our contributions are as follows: (1) We derive conditions under which our problem satisfies indexability, a precondition that guarantees the existence and asymptotic optimality of the Whittle Index solution for RMABs. We establish the conditions using a polytime reduction of the Streaming Bandit setup to regular RMABs. (2) We further prove a phenomenon that we call index decay, whereby the Whittle index values are low for short residual lifetimes, which drives the intuition underpinning our algorithm. (3) We propose a novel and efficient algorithm to compute the index-based solution for Streaming Bandits. Unlike previous methods, our algorithm does not rely on solving the costly finite-horizon problem on each arm of the RMAB, thereby lowering the computational complexity. (4) Finally, we evaluate our approach via simulations run on real-world data sets from a tuberculosis patient monitoring task and an intervention planning task for improving maternal healthcare, in addition to other synthetic domains. Across the board, our algorithm achieves a speed-up of two orders of magnitude over existing methods while maintaining the same solution quality.
We study the two inference problems of detecting and recovering an isolated community of \emph{general} structure planted in a random graph. The detection problem is formalized as a hypothesis testing problem, where under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph $\mathcal{G}(n,q)$ with edge density $q\in(0,1)$; under the alternative, there is an unknown structure $\Gamma_k$ on $k$ nodes, planted in $\mathcal{G}(n,q)$, such that it appears as an \emph{induced subgraph}. In case of a successful detection, we are concerned with the task of recovering the corresponding structure. For these problems, we investigate the fundamental limits from both the statistical and computational perspectives. Specifically, we derive lower bounds for detecting/recovering the structure $\Gamma_k$ in terms of the parameters $(n,k,q)$, as well as certain properties of $\Gamma_k$, and exhibit computationally unbounded optimal algorithms that achieve these lower bounds. We also consider the problem of testing in polynomial time. As is customary in many similar structured high-dimensional problems, our model undergoes an "easy-hard-impossible" phase transition, and computational constraints can severely penalize the statistical performance. To provide evidence for this phenomenon, we show that the class of low-degree polynomial algorithms matches the statistical performance of the polynomial-time algorithms we develop.
Join query evaluation with ordering is a fundamental data processing task in relational database management systems. SQL and custom graph query languages such as Cypher offer this functionality by allowing users to specify the order via the ORDER BY clause. In many scenarios, the users also want to see the first $k$ results quickly (expressed by the LIMIT clause), but the value of $k$ is not predetermined as user queries arrive in an online fashion. Recent work has made considerable progress in identifying optimal algorithms for ranked enumeration of join queries that do not contain any projections. In this paper, we initiate the study of the problem of enumerating results in ranked order for queries with projections. Our main result shows that for any acyclic query, it is possible to obtain a near-linear (in the size of the database) delay algorithm after only a linear-time preprocessing step for two important ranking functions: sum and lexicographic ordering. For a practical subset of acyclic queries known as star queries, we show an even stronger result that allows a user to obtain a smooth tradeoff in which more preprocessing time buys faster answering time guarantees. Our results also extend to queries containing cycles and unions. We also perform a comprehensive experimental evaluation to demonstrate that our algorithms, which are simple to implement, improve the running time by up to three orders of magnitude over state-of-the-art algorithms implemented within open-source RDBMS and specialized graph databases.
We analyze the orthogonal greedy algorithm when applied to dictionaries $\mathbb{D}$ whose convex hull has small entropy. We show that if the metric entropy of the convex hull of $\mathbb{D}$ decays at a rate of $O(n^{-\frac{1}{2}-\alpha})$ for $\alpha > 0$, then the orthogonal greedy algorithm converges at the same rate on the variation space of $\mathbb{D}$. This improves upon the well-known $O(n^{-\frac{1}{2}})$ convergence rate of the orthogonal greedy algorithm in many cases, most notably for dictionaries corresponding to shallow neural networks. These results hold under no additional assumptions on the dictionary beyond the decay rate of the entropy of its convex hull. In addition, they are robust to noise in the target function and can be extended to convergence rates on the interpolation spaces of the variation norm. We show empirically that the predicted rates are obtained for the dictionary corresponding to shallow neural networks with Heaviside activation function in two dimensions. Finally, we show that these improved rates are sharp and prove a negative result showing that the iterates generated by the orthogonal greedy algorithm cannot in general be bounded in the variation norm of $\mathbb{D}$.
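For readers unfamiliar with the method, the orthogonal greedy algorithm can be sketched as below for a finite dictionary of vectors. This is a minimal illustration under stated assumptions, not the paper's setting: the name `orthogonal_greedy`, the list-of-lists representation, and the stopping tolerance are all hypothetical, and the orthogonal projection is carried out with Gram-Schmidt.

```python
def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def orthogonal_greedy(f, dictionary, steps, tol=1e-12):
    """Sketch of the orthogonal greedy algorithm on a finite dictionary.

    At each step, select the dictionary element with the largest absolute
    inner product with the residual, then replace the residual by f minus
    its orthogonal projection onto the span of all selected elements.
    Returns (indices of selected elements, final residual).
    """
    residual = list(f)
    basis = []   # orthonormal basis of the span of the selected elements
    chosen = []
    for _ in range(steps):
        k = max(range(len(dictionary)),
                key=lambda i: abs(_dot(residual, dictionary[i])))
        if abs(_dot(residual, dictionary[k])) < tol:
            break  # residual is (numerically) orthogonal to the dictionary
        chosen.append(k)

        # Gram-Schmidt: orthogonalize the new element against the basis.
        q = list(dictionary[k])
        for b in basis:
            c = _dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = _dot(q, q) ** 0.5
        q = [qi / norm for qi in q]
        basis.append(q)

        # The residual is already orthogonal to the old basis, so removing
        # its component along q completes the projection.
        c = _dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return chosen, residual


chosen, residual = orthogonal_greedy(
    [1.0, 2.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], steps=5
)
```

The projection step is what distinguishes this scheme from the pure greedy algorithm, which only subtracts the component along the newly selected element.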
Consider a set $P$ of $n$ points in $\mathbb{R}^d$. In the discrete median line segment problem, the objective is to find a line segment bounded by a pair of points in $P$ such that the sum of the Euclidean distances from $P$ to the line segment is minimized. In the continuous median line segment problem, a real number $\ell>0$ is given, and the goal is to locate a line segment of length $\ell$ in $\mathbb{R}^d$ such that the sum of the Euclidean distances between $P$ and the line segment is minimized. To begin with, we show how to compute $(1+\epsilon\Delta)$- and $(1+\epsilon)$-approximations to a discrete median line segment in time $O(n\epsilon^{-2d}\log n)$ and $O(n^2\epsilon^{-d})$, respectively, where $\Delta$ is the spread of line segments spanned by pairs of points. While developing our algorithms, by using the principle of pair decomposition, we derive new data structures that allow us to quickly approximate the sum of the distances from a set of points to a given line segment or point. To our knowledge, our utilization of pair decompositions for solving minsum facility location problems is the first of its kind -- it is versatile and easily implementable. Furthermore, we prove that it is impossible to construct a continuous median line segment for $n\geq3$ non-collinear points in the plane by using only ruler and compass. In view of this, we present an $O(n^d\epsilon^{-d})$-time algorithm for approximating a continuous median line segment in $\mathbb{R}^d$ within a factor of $1+\epsilon$. The algorithm is based upon generalizing the point-segment pair decomposition from the discrete to the continuous domain. Last but not least, we give a $(1+\epsilon)$-approximation algorithm, whose time complexity is sub-quadratic in $n$, for solving the constrained median line segment problem in $\mathbb{R}^2$ where an endpoint or the slope of the median line segment is given as input.
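To make the discrete objective concrete, the sketch below solves the problem exactly by brute force: it evaluates every segment spanned by a pair of points, which takes $O(n^3)$ time for fixed $d$. It is a naive baseline for illustration only and is not the pair-decomposition approach of the paper; the function names are hypothetical.

```python
import math
from itertools import combinations


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a, b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    # Clamp the projection parameter to [0, 1] so we stay on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, closest)))


def discrete_median_segment(points):
    """Return (segment, cost) minimizing the sum of distances from all points."""
    def cost(segment):
        a, b = segment
        return sum(point_segment_distance(p, a, b) for p in points)

    best = min(combinations(points, 2), key=cost)
    return best, cost(best)


segment, total = discrete_median_segment([(0, 0), (4, 0), (2, 1), (2, -1)])
```

In the example, the segment from `(0, 0)` to `(4, 0)` passes within distance 1 of the two remaining points, which no other spanned segment improves on.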
The satisfaction probability $\sigma(\phi) := \Pr_{\beta:\mathrm{vars}(\phi) \to \{0,1\}}[\beta\models \phi]$ of a propositional formula $\phi$ is the likelihood that a random assignment $\beta$ makes the formula true. We study the complexity of the problem $k$sat-prob$_{>\delta} = \{ \phi$ is a $k\mathrm{cnf}$ formula $\mid \sigma(\phi) > \delta\}$ for fixed $k$ and $\delta$. While 3sat-prob$_{>0}$ = 3sat is NP-complete and sat-prob$_{>1/2}$ is PP-complete, Akmal and Williams recently showed that 3sat-prob$_{>1/2} \in$ P and that 4sat-prob$_{>1/2}$ is NP-complete; but the methods used to prove these striking results stay silent about, say, 4sat-prob$_{>1/3}$, leaving the computational complexity of $k$sat-prob$_{>\delta}$ open for most $k$ and $\delta$. In the present paper we give a complete characterization in the form of a trichotomy: $k$sat-prob$_{>\delta}$ lies in AC$^0$, is NL-complete, or is NP-complete; and given $k$ and $\delta$ we can decide which of the three applies. The proof of the trichotomy hinges on a new order-theoretic insight: Every set of $k$cnf formulas contains a formula of maximum satisfaction probability. This deceptively simple result allows us to (1) kernelize $k$sat-prob$_{\ge \delta}$, (2) show that the variables of the kernel form a strong backdoor set when the trichotomy states membership in AC$^0$ or NL, and (3) prove a new locality property for the models of second-order formulas that describe problems like $k$sat-prob$_{\ge \delta}$. The locality property will allow us to prove a conjecture of Akmal and Williams: The majority-of-majority satisfaction problem for $k$cnfs lies in P for all $k$.
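The definition of $\sigma(\phi)$ can be checked directly on small formulas by enumerating all assignments. The sketch below does exactly that and therefore runs in exponential time; it only illustrates the definition and says nothing about the complexity results. The clause encoding (signed integers, as in the DIMACS convention) and the function name are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def satisfaction_probability(clauses):
    """Exact sigma(phi) for a CNF given as lists of signed variable indices.

    A literal i > 0 means variable i, and -i means its negation. The
    probability is taken over uniform assignments to the variables that
    actually occur in the formula.
    """
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    satisfied = 0
    for bits in product((False, True), repeat=len(variables)):
        beta = dict(zip(variables, bits))
        if all(any(beta[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            satisfied += 1
    return Fraction(satisfied, 2 ** len(variables))


single_clause = satisfaction_probability([[1, 2, 3]])
conjunction = satisfaction_probability([[1], [2]])
```

A single clause on three variables is falsified by exactly one of the eight assignments, so it belongs to 3sat-prob$_{>1/2}$, whereas the conjunction of two unit clauses does not.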
We consider a participatory budgeting problem in which each voter submits a proposal for how to divide a single divisible resource (such as money or time) among several possible alternatives (such as public projects or activities) and these proposals must be aggregated into a single aggregate division. Under $\ell_1$ preferences -- for which a voter's disutility is given by the $\ell_1$ distance between the aggregate division and the division he or she most prefers -- the social welfare-maximizing mechanism, which minimizes the average $\ell_1$ distance between the outcome and each voter's proposal, is incentive compatible (Goel et al. 2016). However, it fails to satisfy the natural fairness notion of proportionality, placing too much weight on majority preferences. Leveraging a connection between market prices and the generalized median rules of Moulin (1980), we introduce the independent markets mechanism, which is both incentive compatible and proportional. We unify the social welfare-maximizing mechanism and the independent markets mechanism by defining a broad class of moving phantom mechanisms that includes both. We show that every moving phantom mechanism is incentive compatible. Finally, we characterize the social welfare-maximizing mechanism as the unique Pareto-optimal mechanism in this class, suggesting an inherent tradeoff between Pareto optimality and proportionality.
In this work, we compare three different modeling approaches for the scores of soccer matches with regard to their predictive performances based on all matches from the four previous FIFA World Cups 2002-2014: Poisson regression models, random forests and ranking methods. While the former two are based on the teams' covariate information, the latter method estimates ability parameters that best reflect the current strength of the teams. Within this comparison, the best-performing prediction methods on the training data turn out to be the ranking methods and the random forests. However, we show that by combining the random forest with the team ability parameters from the ranking methods as an additional covariate, we can improve the predictive power substantially. Finally, this combination of methods is chosen as the final model and, based on its estimates, the FIFA World Cup 2018 is simulated repeatedly and winning probabilities are obtained for all teams. The model slightly favors Spain ahead of the defending champion Germany. Additionally, we provide survival probabilities for all teams and at all tournament stages as well as the most probable tournament outcome.