We introduce a deep learning model that can universally approximate regular conditional distributions (RCDs). The proposed model operates in three phases: first, it linearizes inputs from a given metric space $\mathcal{X}$ to $\mathbb{R}^d$ via a feature map, then a deep feedforward neural network processes these linearized features, and then the network's outputs are then transformed to the $1$-Wasserstein space $\mathcal{P}_1(\mathbb{R}^D)$ via a probabilistic extension of the attention mechanism of Bahdanau et al.\ (2014). Our model, called the \textit{probabilistic transformer (PT)}, can approximate any continuous function from $\mathbb{R}^d $ to $\mathcal{P}_1(\mathbb{R}^D)$ uniformly on compact sets, quantitatively. We identify two ways in which the PT avoids the curse of dimensionality when approximating $\mathcal{P}_1(\mathbb{R}^D)$-valued functions. The first strategy builds functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ which can be efficiently approximated by a PT, uniformly on any given compact subset of $\mathbb{R}^d$. In the second approach, given any function $f$ in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$, we build compact subsets of $\mathbb{R}^d$ whereon $f$ can be efficiently approximated by a PT.
The Fr\'{e}chet distance is one of the most studied distance measures between curves $P$ and $Q$. The data structure variant of the problem is a longstanding open problem: Efficiently preprocess $P$, so that for any $Q$ given at query time, one can efficiently approximate their Fr\'{e}chet distance. There exist conditional lower bounds that prohibit $(1 + \varepsilon)$-approximate Fr\'{e}chet distance computations in subquadratic time, even when preprocessing $P$ using any polynomial amount of time and space. As a consequence, the problem has been studied under various restrictions: restricting $Q$ to be a (horizontal) segment, or requiring $P$ and $Q$ to be so-called \emph{realistic} input curves. We give a data structure for $(1+\varepsilon)$-approximate discrete Fr\'{e}chet distance in any metric space $\mathcal{X}$ between a realistic input curve $P$ and any query curve $Q$. After preprocessing the input curve $P$ (of length $|P|=n$) in $O(n \log n)$ time, we may answer queries specifying a query curve $Q$ and an $\varepsilon$, and output a value $d(P,Q)$ which is at most a $(1+\varepsilon)$-factor away from the true Fr\'{e}chet distance between $Q$ and $P$. Our query time is asymptotically linear in $|Q|=m$, $\frac{1}{\varepsilon}$, $\log n$, and the realism parameter $c$ or $\kappa$. Our data structure is the first to: adapt to the approximation parameter $\varepsilon$ at query time, handle query curves with arbitrarily many vertices, work for any ambient space of the curves, or be dynamic. The method presented in this paper simplifies and generalizes previous contributions to the static problem variant. We obtain efficient queries (and therefore static algorithms) for Fr\'{e}chet distance computation in high-dimensional spaces and other ambient metric spaces.
Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained to total variance can be summarized by the coefficient of determination or in-sample $R^2$, which is easy to interpret and to compare across different outcome variables. As opposed to the in-sample $R^2$, the out-of-sample $R^2$ has not been well defined and the variability on the out-of-sample $\hat{R}^2$ has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of predictability of different outcome variables. Here we explicitly define the out-of-sample $R^2$ as a comparison of two predictive models, provide an unbiased estimator and exploit recent theoretical advances on uncertainty of data splitting estimates to provide a standard error for the $\hat{R}^2$. The performance of the estimators for the $R^2$ and its standard error are investigated in a simulation study. We demonstrate our new method by constructing confidence intervals and comparing models for prediction of quantitative $\text{Brassica napus}$ and $\text{Zea mays}$ phenotypes based on gene expression data.
The ubiquity of distributed machine learning (ML) in sensitive public domain applications calls for algorithms that protect data privacy, while being robust to faults and adversarial behaviors. Although privacy and robustness have been extensively studied independently in distributed ML, their synthesis remains poorly understood. We present the first tight analysis of the error incurred by any algorithm ensuring robustness against a fraction of adversarial machines, as well as differential privacy (DP) for honest machines' data against any other curious entity. Our analysis exhibits a fundamental trade-off between privacy, robustness, and utility. Surprisingly, we show that the cost of this trade-off is marginal compared to that of the classical privacy-utility trade-off. To prove our lower bound, we consider the case of mean estimation, subject to distributed DP and robustness constraints, and devise reductions to centralized estimation of one-way marginals. We prove our matching upper bound by presenting a new distributed ML algorithm using a high-dimensional robust aggregation rule. The latter amortizes the dependence on the dimension in the error (caused by adversarial workers and DP), while being agnostic to the statistical properties of the data.
We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method suffers from an important lack of performance in the case of peaked distributions, which can degrade up to a potentially catastrophic impact in the presence of atoms. Its smoothed version (i.e. by applying a max kernel to its output density) would solve this problem, but remains an open challenge to implement. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
Motivated by the increasing need for fast processing of large-scale graphs, we study a number of fundamental graph problems in a message-passing model for distributed computing, called $k$-machine model, where we have $k$ machines that jointly perform computations on $n$-node graphs. The graph is assumed to be partitioned in a balanced fashion among the $k$ machines, a common implementation in many real-world systems. Communication is point-to-point via bandwidth-constrained links, and the goal is to minimize the round complexity, i.e., the number of communication rounds required to finish a computation. We present a generic methodology that allows to obtain efficient algorithms in the $k$-machine model using distributed algorithms for the classical CONGEST model of distributed computing. Using this methodology, we obtain algorithms for various fundamental graph problems such as connectivity, minimum spanning trees, shortest paths, maximal independent sets, and finding subgraphs, showing that many of these problems can be solved in $\tilde{O}(n/k)$ rounds; this shows that one can achieve speedup nearly linear in $k$. To complement our upper bounds, we present lower bounds on the round complexity that quantify the fundamental limitations of solving graph problems distributively. We first show a lower bound of $\Omega(n/k)$ rounds for computing a spanning tree of the input graph. This result implies the same bound for other fundamental problems such as computing a minimum spanning tree, breadth-first tree, or shortest paths tree. We also show a $\tilde \Omega(n/k^2)$ lower bound for connectivity, spanning tree verification and other related problems. The latter lower bounds follow from the development and application of novel results in a random-partition variant of the classical communication complexity model.
We address the weak numerical solution of stochastic differential equations driven by independent Brownian motions (SDEs for short). This paper develops a new methodology to design adaptive strategies for determining automatically the step-sizes of the numerical schemes that compute the mean values of smooth functions of the solutions of SDEs. First, we introduce a general method for constructing variable step-size weak schemes for SDEs, which is based on controlling the match between the first conditional moments of the increments of the numerical integrator and the ones corresponding to an additional weak approximation. To this end, we use certain local discrepancy functions that do not involve sampling random variables. Precise directions for designing suitable discrepancy functions and for selecting starting step-sizes are given. Second, we introduce a variable step-size Euler scheme, together with a variable step-size second order weak scheme via extrapolation. Finally, numerical simulations are presented to show the potential of the introduced variable step-size strategy and the adaptive scheme to overcome known instability problems of the conventional fixed step-size schemes in the computation of diffusion functional expectations.
Federated learning (FL) is increasingly deployed among multiple clients (e.g., mobile devices) to train a shared model over decentralized data. To address the privacy concerns, FL systems need to protect the clients' data from being revealed during training, and also control the data leakage through trained models when exposed to untrusted domains. Distributed differential privacy (DP) offers an appealing solution in this regard as it achieves an informed tradeoff between privacy and utility without a trusted server. However, existing distributed DP mechanisms work impractically in the presence of client dropout, resulting in either poor privacy guarantees or degraded training accuracy. In addition, these mechanisms also suffer from severe efficiency issues with long time-to-accuracy training performance. We present Hyades, a distributed differentially private FL framework that is highly efficient and resilient to client dropout. Specifically, we develop a novel 'add-then-remove' scheme where a required noise level can be enforced in each FL training round even though some sampled clients may drop out in the end; therefore, the privacy budget is consumed carefully even in the presence of client dropout. To boost performance, Hyades runs as a distributed pipeline architecture via encapsulating the communication and computation operations into stages. It automatically divides the global model aggregation into several chunk-aggregation tasks and pipelines them for optimal speedup. Evaluation through large-scale cloud deployment shows that Hyades can efficiently handle client dropout in various realistic FL scenarios, attaining the optimal privacy-utility tradeoff and accelerating the training by up to 2.1$\times$ compared to existing solutions.
We study the distributed multi-user secret sharing (DMUSS) problem under the perfect privacy condition. In a DMUSS problem, multiple secret messages are deployed and the shares are offloaded to the storage nodes. Moreover, the access structure is extremely incomplete, as the decoding collection of each secret message has only one set, and by the perfect privacy condition such collection is also the colluding collection of all other secret messages. The secret message rate is defined as the size of the secret message normalized by the size of a share. We characterize the capacity region of the DMUSS problem when given an access structure, defined as the set of all achievable rate tuples. In the achievable scheme, we assume all shares are mutually independent and then design the decoding function based on the fact that the decoding collection of each secret message has only one set. Then it turns out that the perfect privacy condition is equivalent to the full rank property of some matrices consisting of different indeterminates and zeros. Such a solution does exist if the field size is bigger than the number of secret messages. Finally with a matching converse saying that the size of the secret is upper bounded by the sum of sizes of non-colluding shares, we characterize the capacity region of DMUSS problem under the perfect privacy condition.
Differential private optimization for nonconvex smooth objective is considered. In the previous work, the best known utility bound is $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ in terms of the squared full gradient norm, which is achieved by Differential Private Gradient Descent (DP-GD) as an instance, where $n$ is the sample size, $d$ is the problem dimensionality and $\varepsilon_\mathrm{DP}$ is the differential privacy parameter. To improve the best known utility bound, we propose a new differential private optimization framework called \emph{DIFF2 (DIFFerential private optimization via gradient DIFFerences)} that constructs a differential private global gradient estimator with possibly quite small variance based on communicated \emph{gradient differences} rather than gradients themselves. It is shown that DIFF2 with a gradient descent subroutine achieves the utility of $\widetilde O(d^{2/3}/(n\varepsilon_\mathrm{DP})^{4/3})$, which can be significantly better than the previous one in terms of the dependence on the sample size $n$. To the best of our knowledge, this is the first fundamental result to improve the standard utility $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ for nonconvex objectives. Additionally, a more computational and communication efficient subroutine is combined with DIFF2 and its theoretical analysis is also given. Numerical experiments are conducted to validate the superiority of DIFF2 framework.
Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills which aim to reach diverse goals. In particular, offline GCRL only requires purely pre-collected datasets to perform training tasks without additional interactions with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we propose a novel provably efficient algorithm (the sample complexity is $\tilde{O}({\rm poly}(1/\epsilon))$ where $\epsilon$ is the desired suboptimality of the learned policy) with general function approximation. Our algorithm only requires nearly minimal assumptions of the dataset (single-policy concentrability) and the function class (realizability). Moreover, our algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. To the best of our knowledge, this is the first algorithm with general function approximation and single-policy concentrability that is both statistically efficient and computationally stable.