We describe a new dependent-rounding algorithmic framework for bipartite graphs. Given a fractional assignment $y$ of values to edges of a graph $G = (U \cup V, E)$, the algorithms return an integral solution $Y$ such that each right-node $v \in V$ has at most one neighboring edge $f$ with $Y_f = 1$, and such that the variables $Y_e$ satisfy broad nonpositive-correlation properties. In particular, for any edges $e_1, e_2$ sharing a left-node $u \in U$, the variables $Y_{e_1}, Y_{e_2}$ have strong negative-correlation properties, i.e. the expectation of $Y_{e_1} Y_{e_2}$ is significantly below $y_{e_1} y_{e_2}$. This algorithm is a refinement of a dependent-rounding algorithm of Im & Shadloo (2020) based on simulation of Poisson processes. Our algorithm allows greater flexibility; in particular, it allows ``irregular'' fractional assignments, and it gives more refined bounds on the negative correlation. Dependent-rounding schemes with negative-correlation properties have been used in approximation algorithms for job scheduling on unrelated machines to minimize weighted completion times (Bansal, Srinivasan, & Svensson (2021); Im & Shadloo (2020); Im & Li (2023)). Using our new dependent-rounding algorithm, among other improvements, we obtain a $1.407$-approximation for this problem. This significantly improves over the prior $1.45$-approximation ratio of Im & Li (2023).
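As a toy illustration of the input/output contract only (not the authors' Poisson-process-based algorithm, and without its left-node negative-correlation guarantees), the hedged Python sketch below rounds a fractional assignment independently at each right-node, so that each right-node keeps at most one incident edge and every marginal $\Pr[Y_e = 1] = y_e$ is preserved; all names are illustrative.

```python
# Toy illustration only: independent per-right-node rounding of a fractional
# bipartite assignment. It preserves marginals and the "at most one chosen
# edge per right-node" constraint, but it does NOT provide the left-node
# negative-correlation guarantees that are the point of the paper.
import random
from collections import defaultdict

def toy_round(y):
    """y: dict mapping edges (u, v) -> fractional value, with
    sum_u y[(u, v)] <= 1 for every right-node v."""
    by_right = defaultdict(list)
    for (u, v), val in y.items():
        by_right[v].append(((u, v), val))

    Y = {e: 0 for e in y}
    for v, edges in by_right.items():
        r = random.random()          # one uniform draw per right-node
        acc = 0.0
        for e, val in edges:         # edge e is picked with probability y_e
            acc += val
            if r < acc:
                Y[e] = 1
                break                # at most one incident edge is kept
    return Y

# Example: two left-nodes sharing right-nodes; marginals are preserved.
y = {("u1", "v1"): 0.5, ("u2", "v1"): 0.3, ("u1", "v2"): 0.4}
print(toy_round(y))
```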
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data-selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, together with access to a `surrogate model' that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and use them to train a model via regularized empirical risk minimization. Using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: $(i)$~data selection can be very effective, in particular beating training on the full sample in some cases; $(ii)$~certain popular data-selection methods (e.g. unbiased reweighted subsampling, or influence-function-based subsampling) can be substantially suboptimal.
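A minimal sketch of the pipeline described above, assuming a scikit-learn surrogate classifier and using a margin-based (uncertainty) score as one illustrative selection rule; the seed-label setup, the scoring rule, and all names are placeholders, not the paper's prescriptions.

```python
# Minimal sketch of surrogate-guided data selection followed by regularized
# empirical risk minimization (here: L2-regularized logistic regression).
# The margin-based score is only one illustrative choice of selection rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, d, n = 5000, 20, 500

# Unlabeled pool {x_i}_{i <= N}; labels exist but are only "bought" for G
# (plus a small seed set used to fit the surrogate, one possible setup).
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = (X @ w_true + 0.5 * rng.standard_normal(N) > 0).astype(int)

# Surrogate model: anything that predicts labels better than chance.
surrogate = LogisticRegression(C=0.1).fit(X[:200], y[:200])

# Score the pool; here we keep the points the surrogate is least sure about.
margin = np.abs(surrogate.decision_function(X))
G = np.argsort(margin)[:n]            # selected subset, |G| = n < N

# Acquire labels only for G and train via regularized ERM.
model = LogisticRegression(C=1.0).fit(X[G], y[G])
print("accuracy on the full pool:", model.score(X, y))
```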
Recently, Sato et al. proposed a publicly verifiable blind quantum computation (BQC) protocol that introduces a third-party arbiter. However, it is not publicly verifiable in the true sense, because the arbiter is determined in advance and participates in the whole process. In this paper, a publicly verifiable protocol for measurement-only BQC is proposed. The fidelity between the actually prepared states and the target graph states of 2-colorable graphs is estimated by measuring entanglement witnesses of the graph states, so as to verify the correctness of the prepared graph states. Compared with the previous protocol, our protocol is publicly verifiable in the true sense, since randomly chosen clients can carry out the public verification. It also offers better efficiency: the number of local measurements is $O(n^3 \log n)$ and the number of copies of the graph states is $O(n^2 \log n)$.
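For context, one standard projector-based entanglement witness for a graph state $|G\rangle$ (a textbook construction, not necessarily the specific witnesses employed in this protocol) makes the link between witness measurements and fidelity explicit:
\[
  W \;=\; \tfrac{1}{2}\,\mathbb{I} \;-\; |G\rangle\langle G|,
  \qquad
  \operatorname{Tr}(W\rho) \;=\; \tfrac{1}{2} - \langle G|\rho|G\rangle \;=\; \tfrac{1}{2} - F,
\]
so measuring $W$ on the prepared state $\rho$ estimates the fidelity $F$ directly, and a negative expectation value certifies $F > \tfrac{1}{2}$; since $|G\rangle\langle G|$ expands into products of the graph state's stabilizers, $W$ can be estimated by local Pauli measurements.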
We present a new and straightforward derivation of a family $\mathcal{F}(h,\tau)$ of exponential splittings of Strang type for the general linear evolutionary equation with two linear components. One component is assumed to be a time-independent, unbounded operator, while the other is a bounded one with explicit time dependence. The family $\mathcal{F}(h,\tau)$ is characterized by the length of the time step $h$ and a continuous parameter $\tau$, which defines each member of the family. It is shown that the derivation and error analysis follow from two elementary arguments: the variation-of-constants formula and specific quadratures for integrals over simplices. For these Strang-type splittings, we prove convergence which, depending on certain commutators of the relevant operators, may be of first or second order. As a result, error bounds appear in terms of commutator bounds. Based on the explicit form of the error terms, we establish the influence of $\tau$ on the accuracy of $\mathcal{F}(h,\tau)$, allowing us to investigate the optimal value of $\tau$. This simple yet powerful approach establishes the connection between exponential integrators and splitting methods. Furthermore, the present approach can easily be applied to the derivation of higher-order splitting methods under similar considerations. Needless to say, the obtained results also apply to Strang-type splittings in the case of time-independent operators. To complement the rigorous results, we present numerical experiments with various values of $\tau$ based on the linear Schr\"odinger equation.
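For orientation, the two elementary ingredients can be written out for $u'(t) = A u(t) + B(t) u(t)$ (standard background; the specific parameterization of $\mathcal{F}(h,\tau)$ is not reproduced here). The variation-of-constants formula over one step of length $h$ reads
\[
  u(t_n + h) \;=\; e^{hA} u(t_n) \;+\; \int_0^h e^{(h-s)A}\, B(t_n + s)\, u(t_n + s)\, \mathrm{d}s,
\]
and approximating the integral by a suitable quadrature leads to compositions of the classical Strang type, which for a time-independent $B$ take the familiar form
\[
  u_{n+1} \;=\; e^{\frac{h}{2} A}\, e^{h B}\, e^{\frac{h}{2} A}\, u_n,
\]
whose local error is $O(h^3)$ for sufficiently regular data, with leading terms governed by the commutators $[A,[A,B]]$ and $[B,[A,B]]$.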
Generalized cross-validation (GCV) is a widely used method for estimating the squared out-of-sample prediction risk that applies a scalar degrees-of-freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar adjustment (in an additive sense) based on degrees-of-freedom-adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, nor out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. In the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV.
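As background, the scalar GCV adjustment for a single ridge estimator (the uncorrected quantity the paper starts from, not the proposed CGCV correction for ensembles) can be computed as in this sketch; the penalty scaling is one common convention.

```python
# Background sketch: standard GCV for a single ridge regression estimator,
#   GCV = (training MSE) / (1 - df/n)^2,  df = trace of the ridge hat matrix.
# This is the uncorrected estimator the paper starts from, not CGCV itself.
import numpy as np

def ridge_gcv(X, y, lam):
    n, p = X.shape
    G = X.T @ X + n * lam * np.eye(p)            # one common penalty scaling
    beta = np.linalg.solve(G, X.T @ y)
    resid = y - X @ beta
    df = np.trace(X @ np.linalg.solve(G, X.T))   # effective degrees of freedom
    return np.mean(resid**2) / (1.0 - df / n) ** 2

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) / np.sqrt(p) + rng.standard_normal(n)
print(ridge_gcv(X, y, lam=0.1))
```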
Fitted finite element methods are constructed for a singularly perturbed convection-diffusion problem in two space dimensions. Exponential splines are used as basis functions and are combined with Shishkin meshes to obtain a stable, parameter-uniform numerical method. These schemes satisfy a discrete maximum principle. In the classical case, the numerical approximations converge at a rate of second order in the maximum pointwise norm, and they converge at a rate of first order in this norm uniformly over all values of the singular perturbation parameter.
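As a one-dimensional illustration of the Shishkin-mesh ingredient (a standard textbook construction with one common choice of transition parameter, not the paper's two-dimensional meshes or exponential-spline bases):

```python
# One-dimensional Shishkin mesh for a convection-diffusion problem
#   -eps*u'' + a*u' = f on (0,1), with a boundary layer at x = 1.
# Piecewise-uniform mesh: N/2 intervals on [0, 1-tau], N/2 on [1-tau, 1],
# with transition point tau = min(1/2, (2*eps/beta) * ln(N)), beta <= a.
# Standard construction, written here only to illustrate the idea.
import numpy as np

def shishkin_mesh(N, eps, beta=1.0):
    assert N % 2 == 0, "N must be even"
    tau = min(0.5, 2.0 * eps / beta * np.log(N))
    coarse = np.linspace(0.0, 1.0 - tau, N // 2 + 1)
    fine = np.linspace(1.0 - tau, 1.0, N // 2 + 1)
    return np.concatenate([coarse, fine[1:]])    # N+1 mesh points

x = shishkin_mesh(N=16, eps=1e-3)
print(np.round(x, 4))
```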
We consider the problem of fitting a centered ellipsoid to $n$ standard Gaussian random vectors in $\mathbb{R}^d$, as $n, d \to \infty$ with $n/d^2 \to \alpha > 0$. It has been conjectured that this problem is, with high probability, satisfiable (SAT; that is, there exists an ellipsoid passing through all $n$ points) for $\alpha < 1/4$, and unsatisfiable (UNSAT) for $\alpha > 1/4$. In this work we give a precise analytical argument, based on the non-rigorous replica method of statistical physics, that indeed predicts a SAT/UNSAT transition at $\alpha = 1/4$, as well as the shape of a typical fitting ellipsoid in the SAT phase (i.e., the lengths of its principal axes). Besides the replica method, our main tool is the dilute limit of extensive-rank "HCIZ integrals" of random matrix theory. We further study different explicit algorithmic constructions of the matrix characterizing the ellipsoid. In particular, we show that a procedure based on minimizing its nuclear norm yields a solution in the whole SAT phase. Finally, we characterize the SAT/UNSAT transition for ellipsoid fitting of a large class of rotationally-invariant random vectors. Our work suggests mathematically rigorous ways to analyze fitting ellipsoids to random vectors, which is the topic of a companion work.
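A minimal sketch (using cvxpy, under the common normalization $x_i^\top A x_i = 1$; not necessarily the exact program analyzed in the paper) of the nuclear-norm-based construction:

```python
# Sketch of ellipsoid fitting by nuclear-norm minimization: find a PSD
# matrix A with x_i^T A x_i = 1 for all i (one common normalization),
# minimizing ||A||_*, which equals trace(A) for PSD A.
# Illustrative only; not necessarily the exact program analyzed in the paper.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d = 20
n = int(0.15 * d**2)                       # alpha = n/d^2 below 1/4 (SAT regime)
X = rng.standard_normal((n, d))

A = cp.Variable((d, d), PSD=True)
constraints = [cp.sum(cp.multiply(A, np.outer(x, x))) == 1 for x in X]
prob = cp.Problem(cp.Minimize(cp.normNuc(A)), constraints)
prob.solve()

print("status:", prob.status)
print("eigenvalues of A:", np.round(np.linalg.eigvalsh(A.value), 3))
```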
We propose a new method to compare survival data based on Higher Criticism (HC) of P-values obtained from many exact hypergeometric tests. The method can accommodate censorship and is sensitive to moderate differences in a few unknown time intervals, attaining much better power against such differences than the log-rank test and other tests that are popular under non-proportional hazard alternatives. We demonstrate the usefulness of the HC-based test in detecting rare differences, compared to existing tests, using simulated data and actual gene expression data. Additionally, we analyze the asymptotic power of our method under a piecewise homogeneous exponential decay model with rare and weak departures, describing two groups whose failure rates are identical over time except in a few unknown instances in which the second group's failure rate is higher. Under an asymptotic calibration of the model's parameters, the power of the HC-based test experiences a phase transition across the plane of the rarity and intensity parameters that mirrors the phase transition in a two-sample rare and weak normal-means setting. In particular, the phase-transition curve of our test delineates a larger fully powered region than that of the log-rank test.
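A minimal sketch of the Higher Criticism statistic applied to a vector of per-interval P-values (the standard Donoho-Jin form; the exact hypergeometric P-values and the paper's calibration are not reproduced here):

```python
# Higher Criticism of a vector of P-values (standard Donoho-Jin form):
#   HC = max_{i <= alpha0*n} sqrt(n) * (i/n - p_(i)) / sqrt(p_(i)(1 - p_(i))).
# In the paper the P-values come from per-interval exact hypergeometric
# tests; this sketch just uses a generic P-value vector.
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    p = np.sort(np.asarray(pvals))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
    k = max(1, int(alpha0 * n))
    return hc[:k].max()

rng = np.random.default_rng(0)
null_p = rng.uniform(size=1000)                    # pure noise
spiked = np.concatenate([rng.uniform(size=990),    # a few very small P-values
                         rng.uniform(0, 1e-4, size=10)])
print("HC under the null:  ", round(higher_criticism(null_p), 2))
print("HC with rare signal:", round(higher_criticism(spiked), 2))
```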
The alpha complex is a fundamental data structure from computational geometry, which encodes the topological type of a union of balls $B(x; r) \subset \mathbb{R}^m$ for $x\in S$, including a weighted version that allows for varying radii. It consists of the collection of "simplices" $\sigma = \{x_0, \ldots, x_k \} \subset S$, which correspond to nonempty $(k + 1)$-fold intersections of cells in a radius-restricted version of the Voronoi diagram. Existing algorithms for computing the alpha complex require the points to reside in low dimension, because they begin by computing the entire Delaunay complex, which rapidly becomes intractable even when the alpha complex itself is of reasonable size. This paper presents a method for computing the alpha complex without computing the full Delaunay triangulation by applying Lagrangian duality, specifically an algorithm based on dual quadratic programming that seeks to rule simplices out rather than ruling them in.
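For contrast, the conventional Delaunay-first route, whose cost the dual-QP approach is designed to avoid in higher dimensions, can be sketched with the gudhi library (assuming it is installed):

```python
# Baseline sketch: the conventional Delaunay-first computation of an alpha
# complex in low dimension (here via the gudhi library), i.e. exactly the
# route whose cost the paper's dual-QP approach is designed to avoid.
import numpy as np
import gudhi

rng = np.random.default_rng(0)
points = rng.standard_normal((100, 3))             # low-dimensional example

alpha = gudhi.AlphaComplex(points=points)
st = alpha.create_simplex_tree(max_alpha_square=0.5**2)   # radius r = 0.5

print("number of simplices:", st.num_simplices())
for simplex, filtration in list(st.get_simplices())[:5]:
    print(simplex, round(filtration, 4))
```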
I propose an alternative algorithm to compute the MMS (maximin support) voting rule. Instead of using linear programming, the new algorithm computes the maximin support value of a committee by solving a sequence of maximum-flow problems.
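A hedged sketch of the reduction to maximum flow for a fixed committee, wrapped in a simple bisection; this illustrates the flow-based idea but is not necessarily the exact sequence of flow problems in the proposed algorithm.

```python
# Sketch: maximin support of a fixed committee via maximum flow.
# Each voter has one unit of support to split among approved committee
# members; the committee's maximin support is the largest t such that every
# member can receive support >= t.  Feasibility of a given t is a max-flow
# question, and t is located here by bisection (illustrative only, not
# necessarily the paper's exact sequence of flow problems).
import networkx as nx

def feasible(ballots, committee, t):
    G = nx.DiGraph()
    for i, approved in enumerate(ballots):
        G.add_edge("s", ("voter", i), capacity=1.0)
        for c in approved:
            if c in committee:
                G.add_edge(("voter", i), ("cand", c), capacity=1.0)
    for c in committee:
        G.add_edge(("cand", c), "t", capacity=t)
    value, _ = nx.maximum_flow(G, "s", "t")
    return value >= t * len(committee) - 1e-9

def maximin_support(ballots, committee, tol=1e-6):
    lo, hi = 0.0, float(len(ballots))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if feasible(ballots, committee, mid) else (lo, mid)
    return lo

ballots = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
print(maximin_support(ballots, committee={"a", "c"}))   # expected: ~2.0
```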
We introduce a flexible method to simultaneously infer both the drift and volatility functions of a discretely observed scalar diffusion. We introduce spline bases to represent these functions and develop a Markov chain Monte Carlo algorithm to infer, a posteriori, the coefficients of these functions in the spline basis. A key innovation is that we use spline bases to model transformed versions of the drift and volatility functions rather than the functions themselves. The output of the algorithm is a posterior sample of plausible drift and volatility functions that are not constrained to any particular parametric family. The flexibility of this approach provides practitioners with a powerful investigative tool, allowing them to posit a variety of parametric models to better capture the underlying dynamics of their processes of interest. We illustrate the versatility of our method by applying it to challenging datasets from finance, paleoclimatology, and astrophysics. In view of the parametric diffusion models widely employed in the literature for these examples, some of our results are surprising, since they call into question aspects of those models.
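A much-reduced sketch of the general strategy (basis-expanded drift and log-volatility, an Euler pseudo-likelihood, and random-walk Metropolis on the coefficients); it uses a small hand-rolled basis in place of the paper's transformed spline bases and is illustrative only.

```python
# Much-reduced sketch of Bayesian nonparametric inference for a scalar
# diffusion dX = mu(X) dt + sigma(X) dW from discrete observations:
# drift mu and log-volatility log(sigma) are expanded in a small basis,
# the likelihood is approximated by an Euler scheme, and the coefficients
# are sampled with random-walk Metropolis.  Illustrative only: the paper
# uses transformed spline bases and a more careful MCMC scheme.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic Ornstein-Uhlenbeck data: dX = -X dt + 0.5 dW.
dt, T = 0.01, 2000
X = np.zeros(T)
for t in range(T - 1):
    X[t + 1] = X[t] - X[t] * dt + 0.5 * np.sqrt(dt) * rng.standard_normal()

def basis(x):                              # tiny stand-in for a spline basis
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

B = basis(X[:-1])
dX = np.diff(X)

def log_post(theta):
    a, b = theta[:3], theta[3:]            # drift and log-vol coefficients
    mu = B @ a
    sig = np.exp(np.clip(B @ b, -10, 10))
    var = sig**2 * dt
    loglik = -0.5 * np.sum((dX - mu * dt) ** 2 / var + np.log(2 * np.pi * var))
    return loglik - 0.5 * np.sum(theta**2) / 10.0   # weak Gaussian prior

theta = np.zeros(6)
lp = log_post(theta)
samples = []
for it in range(5000):                     # random-walk Metropolis
    prop = theta + 0.02 * rng.standard_normal(6)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta.copy())

post = np.array(samples[2500:])            # discard burn-in
print("posterior mean coefficients:", np.round(post.mean(axis=0), 3))
```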