We study the problem of hypothesis selection under the constraint of local differential privacy. Given a class $\mathcal{F}$ of $k$ distributions and a set of i.i.d. samples from an unknown distribution $h$, the goal of hypothesis selection is to pick a distribution $\hat{f}$ whose total variation distance to $h$ is comparable with that of the best distribution in $\mathcal{F}$ (with high probability). We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k}{\alpha^2\min \{\varepsilon^2,1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f})\leq \alpha + 9 \min_{f\in \mathcal{F}}d_{TV}(h,f)$ with high probability. This sample complexity is optimal for $\varepsilon<1$, matching the lower bound of Gopi et al. (2020). All previously known algorithms for this problem required $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ samples. Moreover, our result demonstrates the power of interaction for $\varepsilon$-LDP hypothesis selection: it breaks the known lower bound of $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ on the sample complexity of non-interactive hypothesis selection, using only $\Theta(\log \log k)$ rounds of interaction. To prove our results, we define the notion of \emph{critical queries} for a Statistical Query Algorithm (SQA), which may be of independent interest. Informally, an SQA is said to use a small number of critical queries if its success relies on the accuracy of only a small number of the queries it asks. We then design an LDP algorithm that uses only a small number of critical queries.
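For intuition, here is a minimal sketch of the classical Scheffé comparison between two hypotheses, privatized with randomized response. It is not the paper's interactive, $\Theta(\log\log k)$-round algorithm; the finite domain, the helper names, and the sample size are illustrative assumptions.

```python
import numpy as np

def randomized_response(bits, eps, rng):
    """epsilon-LDP randomized response on a vector of binary reports."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    keep = rng.random(bits.shape) < p_keep
    return np.where(keep, bits, 1 - bits)

def debiased_mean(priv_bits, eps):
    """Unbiased estimate of the true mean of the bits from the noisy reports."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    return (priv_bits.mean() + p_keep - 1.0) / (2.0 * p_keep - 1.0)

def private_scheffe_winner(samples, f1, f2, eps, rng):
    """Compare two hypotheses (probability vectors over a finite domain) with a
    single eps-LDP statistical query on their Scheffe set."""
    A = f1 > f2                                   # Scheffe set of f1 vs f2
    bits = A[samples].astype(int)                 # each user reports 1{x in A}, privatized
    q_hat = debiased_mean(randomized_response(bits, eps, rng), eps)
    p1, p2 = f1[A].sum(), f2[A].sum()             # mass each hypothesis puts on A
    return 0 if abs(p1 - q_hat) <= abs(p2 - q_hat) else 1

rng = np.random.default_rng(0)
f1 = np.array([0.7, 0.1, 0.1, 0.1])
f2 = np.array([0.25, 0.25, 0.25, 0.25])
samples = rng.choice(4, size=20000, p=f1)         # here the unknown h equals f1
print(private_scheffe_winner(samples, f1, f2, eps=0.5, rng=rng))   # expected: 0
```

Running such pairwise comparisons for all pairs is the non-interactive baseline that incurs the extra $\log k$ factor; the paper's contribution is an interactive scheme that avoids it.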
We consider a boundary value problem (BVP) modelling one-dimensional heat conduction with radiation, derived from the Stefan-Boltzmann law. The problem depends strongly on its parameters, which makes the solution difficult to estimate. We use an analytical approach to determine upper and lower bounds on the exact solution of the BVP, which allows us to estimate it. Finally, we support our theoretical arguments with numerical data obtained by implementing them in the Maple computer algebra system.
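As a rough numerical companion (our own illustration: the abstract does not state the exact equation, so the form $u'' = \lambda u^4$, the interval, and the boundary data below are assumptions), such a radiative BVP can be solved with SciPy and the output compared against analytical upper and lower bounds.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Assumed illustrative radiative BVP: u''(x) = lam * u(x)**4 on [0, 1], with
# Dirichlet data u(0) = 1 and u(1) = 0.5; lam lumps the Stefan-Boltzmann
# constant together with the material parameters.
lam = 10.0

def rhs(x, y):
    u, du = y
    return np.vstack([du, lam * u**4])

def bc(ya, yb):
    return np.array([ya[0] - 1.0, yb[0] - 0.5])

x = np.linspace(0.0, 1.0, 50)
y_init = np.vstack([np.linspace(1.0, 0.5, x.size), np.full(x.size, -0.5)])
sol = solve_bvp(rhs, bc, x, y_init)
print(sol.status, sol.y[0, ::10])   # status 0 means the collocation solver converged
```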
The convergence of expectation-maximization (EM)-based algorithms typically requires continuity of the likelihood function with respect to all the unknown parameters (optimization variables). This requirement is not met when the parameters comprise both discrete and continuous variables, which makes the convergence analysis nontrivial. This paper introduces a set of conditions that ensure the convergence of a specific class of EM algorithms that estimate a mixture of discrete and continuous parameters. Our results offer a new analysis technique for iterative algorithms that solve mixed-integer nonlinear optimization problems. As a concrete example, we prove the convergence of the EM-based sparse Bayesian learning algorithm in [1], which estimates the state of a linear dynamical system with jointly sparse inputs and bursty missing observations. Our results establish that the algorithm in [1] converges to the set of stationary points of the maximum likelihood cost with respect to the continuous optimization variables.
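The following toy EM loop for a two-component Gaussian mixture is not the sparse Bayesian learning algorithm of [1]; it merely illustrates the monotone-likelihood property that such convergence analyses build on (the model, initialization, and helper name are assumptions).

```python
import numpy as np

def em_gmm_1d(y, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (means and mixing weight);
    returns the estimates and the log-likelihood trajectory."""
    mu = np.array([y.min(), y.max()])
    pi, sigma = 0.5, y.std()           # sigma held fixed for simplicity
    lls = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each sample.
        d0 = np.exp(-0.5 * ((y - mu[0]) / sigma) ** 2) * (1 - pi)
        d1 = np.exp(-0.5 * ((y - mu[1]) / sigma) ** 2) * pi
        r = d1 / (d0 + d1)
        # Log-likelihood at the current parameters (before the update).
        lls.append(np.sum(np.log((d0 + d1) / (np.sqrt(2 * np.pi) * sigma))))
        # M-step: re-estimate the component means and the mixing weight.
        mu = np.array([np.sum((1 - r) * y) / np.sum(1 - r),
                       np.sum(r * y) / np.sum(r)])
        pi = r.mean()
    return mu, pi, np.array(lls)

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 1500)])
mu, pi, lls = em_gmm_1d(y)
print(mu.round(2), round(pi, 2), bool(np.all(np.diff(lls) >= -1e-9)))  # monotone log-likelihood
```

The monotonicity check is exactly the property that becomes delicate once some optimization variables are discrete, which is the gap the paper's conditions address.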
This work focuses on non-adaptive group testing, whose primary goal is to efficiently identify a set of at most $d$ defective elements within a larger set of elements using the fewest possible tests. Non-adaptive combinatorial group testing often employs disjunctive codes and union-free codes. This paper discusses union-free codes with fast decoding (UFFD codes), a recently introduced class of union-free codes that combine the best of both worlds -- the linear decoding complexity of disjunctive codes and the smaller number of tests of union-free codes. In our study, we distinguish two subclasses of these codes: $(=d)$-UFFD codes, which can be used when the number of defectives $d$ is known a priori, and $(\le d)$-UFFD codes, which work for any subset of at most $d$ defectives. Previous studies established a lower bound on the rate of these codes for $d=2$. Our contribution lies in deriving new lower bounds on the rate of both $(=d)$- and $(\le d)$-UFFD codes for an arbitrary number $d \ge 2$ of defectives. Our results show that as $d\to\infty$, the rate of $(=d)$-UFFD codes is twice the best-known lower bound on the rate of $d$-disjunctive codes. In addition, the rate of $(\le d)$-UFFD codes is shown to exceed the known lower bound on the rate of $d$-disjunctive codes for small values of $d$.
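To illustrate the kind of linear-time decoding referred to above, here is a minimal sketch of the classical COMP rule for a disjunctive-style design (not the UFFD construction itself); the random test matrix and its parameters are assumptions.

```python
import numpy as np

def comp_decode(tests, outcomes):
    """COMP decoding: any item that appears in a negative test is cleared;
    the remaining items form a superset of the defective set."""
    suspected = np.ones(tests.shape[1], dtype=bool)
    for row, res in zip(tests, outcomes):
        if res == 0:
            suspected &= ~row.astype(bool)
    return np.flatnonzero(suspected)

rng = np.random.default_rng(0)
n, T, d = 60, 40, 2
A = (rng.random((T, n)) < 0.2).astype(int)                 # random non-adaptive design
defectives = np.sort(rng.choice(n, size=d, replace=False))
y = (A[:, defectives].sum(axis=1) > 0).astype(int)         # OR of the defective columns
print("true:", defectives, "decoded superset:", comp_decode(A, y))
```

Each test outcome is inspected once per item, so the decoder runs in time linear in the size of the test matrix; the UFFD codes studied in the paper aim for this decoding speed while keeping the test count of union-free codes.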
Given that no existing graph construction method can generate a perfect graph for a given dataset, graph-based algorithms are invariably affected by the plethora of redundant and erroneous edges present within the constructed graphs. In this paper, we propose treating these noisy edges as adversarial attacks and using a spectral adversarial robustness evaluation method to diminish their impact on the performance of graph algorithms. Our method identifies the points that are least vulnerable to noisy edges and runs graph-based algorithms only on these robust points. Our experiments with spectral clustering, one of the most representative and widely used graph algorithms, reveal that our methodology not only substantially improves the precision of the algorithm but also greatly improves its computational efficiency by leveraging only a small set of robust data points.
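A minimal sketch of the overall pipeline, with a hypothetical robustness proxy (embedding drift under random edge drops) standing in for the paper's spectral adversarial robustness evaluation; `W` is assumed to be a dense symmetric affinity matrix, and all names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def spectral_embed(W, k):
    """First k non-trivial eigenvectors of the normalized graph Laplacian."""
    vals, vecs = np.linalg.eigh(laplacian(W, normed=True))
    return vecs[:, 1:k + 1]

def robustness_scores(W, k, n_trials=5, drop=0.1, rng=None):
    """Hypothetical robustness proxy (not the paper's method): how little a
    node's spectral embedding moves when random edges are dropped."""
    rng = rng or np.random.default_rng(0)
    base = spectral_embed(W, k)
    drift = np.zeros(W.shape[0])
    for _ in range(n_trials):
        mask = np.triu(rng.random(W.shape) < drop, 1)
        Wp = W * ~(mask | mask.T)                      # drop a random edge subset
        emb = spectral_embed(Wp, k)
        emb *= np.sign(np.sum(emb * base, axis=0))     # fix eigenvector sign ambiguity
        drift += np.linalg.norm(emb - base, axis=1)
    return drift / n_trials

def robust_spectral_clustering(W, k, keep=0.6):
    """Cluster only the most robust points; the rest can be attached afterwards."""
    scores = robustness_scores(W, k)
    robust = np.argsort(scores)[: int(keep * len(scores))]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(spectral_embed(W, k)[robust])
    return robust, labels
```

The per-eigenvector sign alignment is only there to make the drift comparison meaningful across perturbed runs; clustering on the retained subset is what yields the speed-up described above.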
Matrix completion is one of the crucial tools in modern data science research. Recently, a novel sampling model for matrix completion, coined cross-concentrated sampling (CCS), has attracted considerable attention. However, the robustness of the CCS model against sparse outliers remains unclear in the existing studies. In this paper, we aim to answer this question by exploring a novel Robust CCS Completion problem. We propose a highly efficient non-convex iterative algorithm, dubbed Robust CUR Completion (RCURC). The empirical performance of the proposed algorithm, in terms of both efficiency and robustness, is verified on synthetic and real datasets.
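For context, here is a minimal CUR sketch on an exactly low-rank matrix, a building block behind cross-concentrated sampling rather than RCURC itself; the dimensions and sampling sizes are arbitrary assumptions.

```python
import numpy as np

# Minimal CUR sketch: approximate a rank-r matrix from sampled columns C,
# sampled rows R, and the pseudo-inverse of their intersection.
rng = np.random.default_rng(1)
n, r = 60, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # exactly rank-r

rows = rng.choice(n, 12, replace=False)
cols = rng.choice(n, 12, replace=False)
C, R = M[:, cols], M[rows, :]
U = np.linalg.pinv(M[np.ix_(rows, cols)])                        # core pseudo-inverse
M_hat = C @ U @ R

print(np.linalg.norm(M_hat - M) / np.linalg.norm(M))   # ~0 for an exactly rank-r matrix
```

Robust CCS completion additionally has to separate a sparse outlier component from the observed cross-concentrated entries, which is what the iterative RCURC algorithm is designed for.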
Missing data is a pernicious problem in epidemiologic research. Research on the validity of complete case analysis for missing data has typically focused on estimating the average treatment effect (ATE) in the whole population. However, other target populations, such as the treated (ATT) or external targets, can be of substantive interest. In such cases, whether missing covariate data occurs within or outside the target population may impact the validity of complete case analysis. We sought to assess bias in complete case analysis when covariate data is missing outside the target (e.g., missing covariate data among the untreated when estimating the ATT). We simulated a study of the effect of a binary treatment X on a binary outcome Y in the presence of three confounders C1-C3 that modified the risk difference (RD). We induced missingness in C1 only among the untreated under four scenarios: completely at random (similar to MCAR); at random based on C2 and C3 (similar to MAR); at random based on C1 (similar to MNAR); or at random based on Y (similar to MAR). We estimated the ATE and ATT using weighting and averaged results across the replicates. We conducted a parallel simulation transporting trial results to a target population in the presence of missing covariate data in the trial. In the complete case analysis, the estimated ATE was unbiased only when C1 was MCAR among the untreated. The estimated ATT, on the other hand, was unbiased in all scenarios except when Y caused missingness. The parallel simulation of generalizing and transporting trial results showed similar bias patterns. If missing covariate data is only present outside the target population, complete case analysis is unbiased except when missingness is associated with the outcome.
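A toy version of one scenario (our own simplified data-generating model with a single confounder, not the paper's three-confounder simulation): induce MCAR-like missingness in C1 among the untreated only, then estimate the ATT by odds weighting on the complete cases.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 400_000

# Assumed toy model: binary confounder C1 that also modifies the risk difference,
# binary treatment X, binary outcome Y with RD(C1) = 0.1 + 0.1*C1.
C1 = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.3 + 0.4 * C1, n)                          # confounding: C1 -> X
Y = rng.binomial(1, 0.1 + 0.1 * X + 0.2 * C1 + 0.1 * X * C1, n)

# Missingness in C1 only among the untreated, completely at random.
observed = ~((X == 0) & (rng.random(n) < 0.5))

# Complete-case ATT via odds weighting: treated keep weight 1, untreated get the
# odds of treatment given C1, with the propensity model fit on complete cases.
x, y, c = X[observed], Y[observed], C1[observed]
ps = np.where(c == 1, x[c == 1].mean(), x[c == 0].mean())        # P(X=1 | C1)
w = np.where(x == 1, 1.0, ps / (1.0 - ps))
att_hat = np.average(y[x == 1]) - np.average(y[x == 0], weights=w[x == 0])

true_att = 0.1 + 0.1 * 0.7      # E[RD | X=1], since P(C1=1 | X=1) = 0.7 in this model
print(f"complete-case ATT estimate: {att_hat:.3f} (true ATT = {true_att:.2f})")
```

Because the treated are fully observed and missingness among the untreated is unrelated to Y, the odds-weighted complete-case ATT recovers the target in this scenario, consistent with the pattern described above.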
In this work, we construct a $4$-phase Golay complementary sequence (GCS) set of cardinality $2^{3+\lceil \log_2 r \rceil}$ with arbitrary sequence length $n$, where the base-$10^{13}$ expansion of $n$ has $r$ nonzero digits. Specifically, the GCS octets (sets of eight sequences) cover all lengths no greater than $10^{13}$. In addition, based on the representation theory of the signed symmetric group, we construct Hadamard matrices from certain special GCS to improve their asymptotic existence: there exist Hadamard matrices of order $2^t m$ for any odd number $m$, where $t = 6\lfloor \frac{1}{40}\log_{2}m\rfloor + 10$.
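For reference, a short sketch of the defining complementarity property (aperiodic autocorrelations summing to zero at all nonzero shifts) and one step of Golay's classical doubling construction; the paper's 4-phase sets of cardinality $2^{3+\lceil \log_2 r \rceil}$ require a substantially more elaborate construction.

```python
import numpy as np

def aperiodic_autocorr(seq, shift):
    """Aperiodic autocorrelation of a complex sequence at a given shift."""
    return np.sum(seq[shift:] * np.conj(seq[:len(seq) - shift]))

def is_complementary(seqs):
    """A set is (Golay) complementary if its autocorrelations sum to zero
    at every nonzero shift."""
    n = len(seqs[0])
    return all(abs(sum(aperiodic_autocorr(s, t) for s in seqs)) < 1e-9
               for t in range(1, n))

# Classical length-2 Golay pair and one doubling step (Golay's concatenation):
a, b = np.array([1, 1], dtype=complex), np.array([1, -1], dtype=complex)
a2, b2 = np.concatenate([a, b]), np.concatenate([a, -b])
print(is_complementary([a, b]), is_complementary([a2, b2]))   # True True
```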
We prove that to each real singularity $f: (\mathbb{R}^{n+1}, 0) \to (\mathbb{R}, 0)$ one can associate two systems of differential equations $\mathfrak{g}^{k\pm}_f$ which are pushforwards, in the category of $\mathcal{D}$-modules over $\mathbb{R}^{\pm}$, of the sheaf of real analytic functions on the total space of the positive, respectively negative, Milnor fibration. We prove that for $k=0$, if $f$ is an isolated singularity, then $\mathfrak{g}^{\pm}$ determines the $n$-th homology groups of the positive, respectively negative, Milnor fibre. We then calculate $\mathfrak{g}^{+}$ for ordinary quadratic singularities and prove that, under certain conditions on the choice of morsification, one recovers the top homology groups of the Milnor fibres of any isolated singularity $f$. As an application, we construct a public-key encryption scheme based on the morsification of singularities.
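For the quadratic case mentioned above, a standard computation (our own illustration, not taken from the paper) of the Milnor fibre topology: for the ordinary quadratic singularity
\[
f(x) = x_1^2 + \dots + x_p^2 - x_{p+1}^2 - \dots - x_{n+1}^2, \qquad 1 \le p \le n+1,
\]
the positive Milnor fibre $\{f = \delta\} \cap B_\epsilon$ (for small $\delta > 0$) is diffeomorphic to $S^{p-1} \times D^{n+1-p}$, so its reduced homology is $\mathbb{Z}$ in degree $p-1$ and zero otherwise; the negative fibre is obtained by exchanging $p \leftrightarrow n+1-p$. In particular, the top ($n$-th) homology is non-trivial exactly when the quadratic form is definite.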
Markov processes are widely used mathematical models for describing dynamic systems in various fields. However, accurately simulating large-scale systems at long time scales is computationally expensive due to the short time steps required for accurate integration. In this paper, we introduce an inference process that maps complex systems into a simplified representational space and models large jumps in time. To achieve this, we propose Time-lagged Information Bottleneck (T-IB), a principled objective rooted in information theory, which aims to capture relevant temporal features while discarding high-frequency information to simplify the simulation task and minimize the inference error. Our experiments demonstrate that T-IB learns information-optimal representations for accurately modeling the statistical properties and dynamics of the original process at a selected time lag, outperforming existing time-lagged dimensionality reduction methods.
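For contrast with the learned representations, here is a minimal sketch of a classical linear time-lagged baseline (TICA-style, an example of the existing time-lagged dimensionality reduction methods mentioned above, not T-IB itself); the toy trajectory and the regularization constant are assumptions.

```python
import numpy as np

def tica(X, lag, dim):
    """Linear time-lagged baseline: find projections whose autocovariance
    at the chosen lag is maximal (TICA-style, not T-IB)."""
    X = X - X.mean(axis=0)
    X0, Xt = X[:-lag], X[lag:]
    C0 = X0.T @ X0 / len(X0)                          # instantaneous covariance
    Ct = (X0.T @ Xt + Xt.T @ X0) / (2 * len(X0))      # symmetrized lagged covariance
    vals, vecs = np.linalg.eig(np.linalg.solve(C0 + 1e-8 * np.eye(X.shape[1]), Ct))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:dim]]

# Toy trajectory: one slow mode buried in higher-dimensional noise.
rng = np.random.default_rng(0)
slow = np.cumsum(rng.standard_normal(5000)) * 0.05
X = np.outer(slow, rng.standard_normal(5)) + rng.standard_normal((5000, 5))
proj = tica(X, lag=10, dim=1)
print(abs(np.corrcoef((X @ proj).ravel(), slow)[0, 1]))   # high if the slow mode is recovered
```

T-IB replaces this fixed linear objective with an information-theoretic one, trading off predictiveness at the chosen lag against compression of high-frequency detail.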
Consider the supervised learning setting where the goal is to learn to predict labels $\mathbf y$ given points $\mathbf x$ from a distribution. An \textit{omnipredictor} for a class $\mathcal L$ of loss functions and a class $\mathcal C$ of hypotheses is a predictor whose predictions incur less expected loss than the best hypothesis in $\mathcal C$ for every loss in $\mathcal L$. Since the work of [GKR+21], which introduced the notion, there has been a large body of work in the setting of binary labels where $\mathbf y \in \{0, 1\}$, but much less is known about the regression setting where $\mathbf y \in [0,1]$ can be continuous. Our main conceptual contribution is the notion of \textit{sufficient statistics} for loss minimization over a family of loss functions: these are statistics about a distribution such that knowing them allows one to take actions that minimize the expected loss for any loss in the family. The notion of sufficient statistics relates directly to the approximate rank of the family of loss functions. Our key technical contribution is a bound of $O(1/\varepsilon^{2/3})$ on the $\varepsilon$-approximate rank of convex, Lipschitz functions on the interval $[0,1]$, which we show is tight up to a factor of $\mathrm{polylog}(1/\varepsilon)$. This yields improved runtimes for learning omnipredictors for the class of all convex, Lipschitz loss functions under weak learnability assumptions about the class $\mathcal C$. We also give efficient omnipredictors when the loss families have low-degree polynomial approximations or arise from generalized linear models (GLMs). This translation from sufficient statistics to faster omnipredictors is made possible by lifting the technique of loss outcome indistinguishability, introduced by [GKH+23] for Boolean labels, to the regression setting.
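As a rough numerical check of the approximate-rank phenomenon (our own Frobenius-type proxy on a grid, not the sup-norm notion used in the result above), one can discretize the hinge losses that generate the convex Lipschitz class and inspect how many singular values are needed at each accuracy level.

```python
import numpy as np

# Discretize the family of "hinge" losses l_c(t) = max(0, t - c) on [0,1]
# (the extreme rays of the convex, 1-Lipschitz class, up to affine parts) and
# count how many singular values are needed before the residual drops below eps.
ts = np.linspace(0.0, 1.0, 400)                     # prediction grid
cs = np.linspace(0.0, 1.0, 400)                     # hinge locations, one loss per row
L = np.maximum(0.0, ts[None, :] - cs[:, None])

s = np.linalg.svd(L, compute_uv=False)
resid = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1] / np.sqrt(ts.size)   # scaled tail norm
for eps in (1e-1, 1e-2, 1e-3):
    k = int(np.argmax(resid < eps))                  # smallest k with small residual
    print(f"eps = {eps:g}: approximate-rank proxy ~ {k}")
```

On this proxy the counts grow roughly like $\varepsilon^{-2/3}$ as $\varepsilon$ shrinks, in line with the $O(1/\varepsilon^{2/3})$ bound stated above.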