In this paper, we put forward a model of a zero-error distributed function compression system for two binary memoryless sources X and Y, with two encoders En1 and En2 and one decoder De, connected by two channels (En1, De) and (En2, De) with capacity constraints C1 and C2, respectively. The encoder En1 observes X or (X,Y) and the encoder En2 observes Y or (X,Y), according to whether the two switches s1 and s2 are open or closed (taking values 0 or 1, respectively). The decoder De is required to compress the binary arithmetic sum f(X,Y)=X+Y with zero error by using the system multiple times. We use (s1s2;C1,C2;f) to denote the model, where we assume C1 \geq C2 by symmetry. The compression capacity of the model is defined as the maximum average number of times that the function f can be compressed with zero error per use of the system, which measures the efficiency of using the system. We fully characterize the compression capacities for all four cases s1s2 = 00, 01, 10, 11 of the model (s1s2;C1,C2;f). The characterization for the case (01;C1,C2;f) with C1>C2 is highly nontrivial, and we develop a novel graph coloring approach for it. Furthermore, we apply the compression capacity of (01;C1,C2;f) to an open problem in network function computation: whether the best known upper bound of Guang et al. on the computing capacity is tight in general.
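As a back-of-the-envelope illustration (our own, not one of the paper's theorems): since X and Y are binary, f(X,Y)=X+Y takes the three values 0, 1, 2, and when both sources have full support all $3^n$ sequences of $n$ evaluations can occur. If fixed-length zero-error codes are used on a single link of capacity $C$ bits per use, that link alone can therefore support at most $C/\log_2 3$ evaluations per use:
\begin{align*}
f(X,Y) \in \{0,1,2\}, \qquad n \text{ evaluations require } \lceil n \log_2 3 \rceil \text{ bits}, \qquad \frac{\text{evaluations}}{\text{use}} \le \frac{C}{\log_2 3}.
\end{align*}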
Gaussian process regression (GPR) is a non-parametric model that has been used in many real-world applications involving sensitive personal data (e.g., healthcare and finance) from multiple data owners. To fully and securely exploit the value of different data sources, this paper proposes a privacy-preserving GPR method based on secret sharing (SS), a secure multi-party computation (SMPC) technique. In contrast to existing studies that protect the data privacy of GPR via homomorphic encryption, differential privacy, or federated learning, our proposed method is more practical and can preserve the data privacy of both the model inputs and outputs across various data-sharing scenarios (e.g., horizontally/vertically-partitioned data). However, it is non-trivial to apply SS directly to the conventional GPR algorithm, which includes operations that current SMPC protocols do not support with adequate accuracy and/or efficiency. To address this issue, we derive a new SS-based exponentiation operation through the idea of 'confusion-correction' and construct an SS-based matrix inversion algorithm based on Cholesky decomposition. More importantly, we theoretically analyze the communication cost and the security of the proposed SS-based operations. Empirical results show that our proposed method achieves reasonable accuracy and efficiency while preserving data privacy.
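Since the abstract leans on secret sharing as its core primitive, a minimal sketch of two-party additive secret sharing over a prime field may help fix ideas. The modulus, the two-party setting, and the Beaver-triple multiplication below are standard textbook choices, not the paper's exact protocol.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (an assumption, not the paper's choice)

def share(x):
    """Split secret x into two additive shares: x = x0 + x1 (mod P)."""
    x0 = secrets.randbelow(P)
    return x0, (x - x0) % P

def reconstruct(x0, x1):
    """Recover the secret from both shares."""
    return (x0 + x1) % P

def add_shares(a, b):
    """Addition is local: each party adds its own shares."""
    return tuple((ai + bi) % P for ai, bi in zip(a, b))

def mul_shares(a, b, triple):
    """Multiply shared values with a precomputed Beaver triple (u, v, w), w = u*v.
    The parties open d = a - u and e = b - v, then combine locally."""
    u, v, w = triple
    d = (reconstruct(*a) - reconstruct(*u)) % P  # opened jointly in a real protocol
    e = (reconstruct(*b) - reconstruct(*v)) % P
    z0 = (w[0] + d * v[0] + e * u[0] + d * e) % P  # party 0 adds the public d*e term
    z1 = (w[1] + d * v[1] + e * u[1]) % P
    return z0, z1

# demo: 3 * 5 computed on shares
a, b = share(3), share(5)
u0, v0 = secrets.randbelow(P), secrets.randbelow(P)
triple = (share(u0), share(v0), share(u0 * v0 % P))
print(reconstruct(*mul_shares(a, b, triple)))  # 15
```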
Motivated by its practical value, we consider the coded distributed matrix multiplication problem of computing $AA^\top$ in a distributed computing system with $N$ worker nodes and a master node, where the input matrices $A$ and $A^\top$ are partitioned into $m$-by-$p$ and $p$-by-$m$ grids of equal-size sub-matrices, respectively. For effective straggler mitigation, we propose a novel computation strategy, named \emph{folded polynomial code}, obtained by modifying the entangled polynomial codes. Moreover, we characterize a lower bound on the optimal recovery threshold among all linear computation strategies when the underlying field is the real numbers, and our folded polynomial codes achieve this bound in the case $m=1$. Our folded polynomial codes outperform all known computation strategies for coded distributed matrix multiplication in terms of recovery threshold, download cost, and decoding complexity.
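To make the encode/compute/interpolate pipeline concrete, here is a plain entangled-polynomial-style baseline for the $m=1$, $AA^\top$ case, not the folded code itself; the block count $p$, the evaluation points, and the worker count $2p-1$ are illustrative.

```python
import numpy as np

def encode(A, p, x):
    """Encode the p column blocks of A at evaluation point x (one pair per worker).
    pA has degree p-1; pB reverses the exponents so that the coefficient of
    x^(p-1) in pA @ pB is exactly sum_k A_k A_k^T = A A^T."""
    blocks = np.hsplit(A, p)
    pA = sum(Ak * x**k for k, Ak in enumerate(blocks))
    pB = sum(Ak.T * x**(p - 1 - k) for k, Ak in enumerate(blocks))
    return pA, pB

def worker(pA, pB):
    """Each worker multiplies its two small encoded matrices."""
    return pA @ pB

def decode(points, results, p):
    """Interpolate the degree-(2p-2) matrix polynomial from 2p-1 worker results
    and read off the coefficient of x^(p-1), which equals A A^T."""
    V = np.vander(np.asarray(points, dtype=float), N=2 * p - 1, increasing=True)
    flat = np.stack([R.ravel() for R in results])
    coeffs = np.linalg.solve(V, flat)  # per-entry polynomial interpolation
    return coeffs[p - 1].reshape(results[0].shape)

# toy run: 4x6 matrix, p = 3 column blocks, 2p-1 = 5 responding workers
A = np.random.randn(4, 6)
xs = np.linspace(-1.0, 1.0, 5)
outs = [worker(*encode(A, 3, x)) for x in xs]
print(np.allclose(decode(xs, outs, 3), A @ A.T))  # True
```

With $N > 2p-1$ workers, the master can decode from whichever $2p-1$ responses arrive first, which is the straggler-mitigation point of the construction.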
Unobserved confounding is a fundamental obstacle to establishing valid causal conclusions from observational data. Two complementary types of approaches have been developed to address this obstacle: obtaining identification through fortuitous external aids, such as instrumental variables or proxies, or via the ID algorithm, using Markov restrictions on the full data distribution encoded in graphical causal models. In this paper we develop a synthesis of these two approaches to identification in causal inference, yielding the most general identification algorithm in multivariate systems currently known: the proximal ID algorithm. In addition to obtaining nonparametric identification in all cases where the ID algorithm succeeds, our approach systematically exploits proxies to adjust for the presence of unobserved confounders that would otherwise prevent identification. Furthermore, we outline a class of estimation strategies for causal parameters identified by our method in an important special case. We illustrate our approach with simulation studies and a data application.
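For readers new to the proximal piece of this synthesis, the basic identification step from earlier proximal causal inference work can be sketched as follows (generic notation, not the paper's): with treatment $A$, outcome $Y$, observed covariates $X$, and proxies $Z$ and $W$ of the unobserved confounder, an outcome bridge function $h$ solving the first equation yields identification of the counterfactual mean under suitable completeness conditions:
\begin{align*}
\mathbb{E}[Y \mid Z, A, X] &= \mathbb{E}[h(W, A, X) \mid Z, A, X], \\
\mathbb{E}[Y(a)] &= \mathbb{E}[h(W, a, X)].
\end{align*}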
This paper develops a fully distributed differentially private learning algorithm for nonsmooth optimization problems. We adapt the Alternating Direction Method of Multipliers (ADMM) to the distributed setting and employ an approximation of the augmented Lagrangian to handle nonsmooth objective functions. Furthermore, we ensure zero-concentrated differential privacy (zCDP) by perturbing the outcome of the computation at each agent with Gaussian noise of decreasing variance. This privacy notion allows for better accuracy than the conventional $(\epsilon, \delta)$-DP and stronger guarantees than the more recent R\'enyi-DP. The developed fully distributed algorithm has a competitive privacy-accuracy trade-off and handles nonsmooth and not necessarily strongly convex problems. We provide complete theoretical proofs of the privacy guarantees and of the convergence of the algorithm to the exact solution. We also prove, under additional assumptions, that the algorithm converges at a linear rate. Finally, we observe in simulations that the developed algorithm outperforms existing methods.
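A minimal sketch of the per-iteration perturbation and its zCDP accounting, using the standard Gaussian-mechanism bound $\rho = \Delta^2/(2\sigma^2)$ and additive composition; the geometric decay schedule, sensitivity, and step rule below are illustrative placeholders, not the paper's tuned values. For reference, any $\rho$-zCDP guarantee converts to $(\rho + 2\sqrt{\rho\ln(1/\delta)}, \delta)$-DP for every $\delta > 0$.

```python
import numpy as np

def private_updates(theta0, grad_step, T, delta_sens, sigma0, decay=0.95, seed=0):
    """Run T perturbed updates; the noise std decays geometrically
    (sigma_t = sigma0 * decay^t, so the variance decreases over iterations).
    Returns the final iterate and the total zCDP budget
    rho = sum_t delta_sens^2 / (2 * sigma_t^2)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    rho = 0.0
    for t in range(T):
        sigma_t = sigma0 * decay**t
        theta = grad_step(theta)                               # local (e.g., ADMM primal) update
        theta = theta + rng.normal(0.0, sigma_t, theta.shape)  # Gaussian perturbation
        rho += delta_sens**2 / (2.0 * sigma_t**2)              # zCDP composes additively
    return theta, rho

# toy quadratic objective with a placeholder gradient step
theta, rho = private_updates(np.zeros(3), lambda th: th - 0.1 * (th - 1.0),
                             T=50, delta_sens=0.1, sigma0=1.0)
print(theta, rho)
```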
We consider the problem of testing the martingale difference hypothesis for univariate strictly stationary time series by implementing a novel test for conditional mean independence based on the concept of martingale difference divergence. The martingale difference divergence function measures the degree to which a variable is conditionally mean dependent upon its past values: specifically, it computes a weighted norm of the covariance between the current value of the variable and the characteristic function of its past values. In this paper, we use this concept, together with the theoretical framework of generalized spectral density, to construct a Ljung-Box type test for the martingale difference hypothesis. In addition to the results obtained by implementing the test statistic, we establish asymptotic results for martingale difference divergence in the time series framework.
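A minimal sketch of the sample martingale difference divergence between a series and its lag-$j$ past, using the V-statistic form $\mathrm{MDD}_n^2 = -\frac{1}{n^2}\sum_{i,k}(Y_i - \bar{Y})(Y_k - \bar{Y})\,|X_i - X_k|$ from the martingale difference divergence literature; the lag handling is an illustrative choice, not the paper's full spectral construction.

```python
import numpy as np

def mdd_sq(y, x):
    """Sample MDD^2(y | x) in V-statistic form:
    -(1/n^2) * sum_{i,k} (y_i - ybar)(y_k - ybar) * |x_i - x_k|."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    yc = y - y.mean()
    D = np.abs(x[:, None] - x[None, :])  # pairwise distances of the conditioning values
    return -(yc[:, None] * yc[None, :] * D).mean()

def mdd_at_lag(series, j):
    """MDD^2 of the series given its value j steps back."""
    s = np.asarray(series, dtype=float)
    return mdd_sq(s[j:], s[:-j])

# under the martingale difference hypothesis these should all be near 0
rng = np.random.default_rng(1)
print([round(mdd_at_lag(rng.standard_normal(500), j), 4) for j in (1, 2, 3)])
```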
This paper proposes an adaptive penalized weighted mean regression for outlier detection in high-dimensional data. In comparison to existing approaches based on the mean shift model, the proposed estimators are robust against outliers in the responses, the covariates, or both. By utilizing the adaptive Huber loss function, the proposed method is effective in high-dimensional linear models with heavy-tailed and heteroscedastic error distributions. The proposed framework enables simultaneous and collaborative estimation of regression parameters and outlier detection. Under regularity conditions, outlier detection consistency and oracle inequalities for the robust estimates in high-dimensional settings are established. Additionally, theoretical robustness properties, such as the breakdown point and a smoothed limiting influence function, are derived. Extensive simulation studies and a breast cancer survival dataset are used to evaluate the numerical performance of the proposed method, demonstrating comparable or superior variable selection and outlier detection capabilities.
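To illustrate the mean-shift idea the paper builds on (a bare-bones alternating scheme with squared loss, not the paper's adaptive-Huber estimator): in the model $y = X\beta + \gamma + \varepsilon$, soft-thresholding the residuals estimates the shift vector $\gamma$, and nonzero entries flag outliers. The threshold $\lambda$ below is an illustrative tuning choice.

```python
import numpy as np

def soft_threshold(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def mean_shift_outliers(X, y, lam, n_iter=50):
    """Alternate between beta (least squares on y - gamma) and
    gamma (soft-thresholded residuals); nonzero gamma_i flags observation i."""
    gamma = np.zeros_like(y)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        gamma = soft_threshold(y - X @ beta, lam)
    return beta, gamma

# toy example: contaminate 3 of 100 responses
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.ones(5) + 0.1 * rng.standard_normal(100)
y[:3] += 8.0
beta, gamma = mean_shift_outliers(X, y, lam=2.0)
print(np.nonzero(gamma)[0])  # expect [0 1 2]
```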
Penalized regression methods such as ridge regression rely heavily on the choice of a tuning or penalty parameter, which is often computed via cross-validation. Discrepancies in the value of the penalty parameter may lead to substantial differences in regression coefficient estimates and predictions. In this paper, we investigate the effect of single observations on the optimal choice of the tuning parameter, showing how the presence of influential points can change it dramatically. We classify points as ``expanders'' or ``shrinkers'', based on their effect on the model complexity. Our approach supplies a visual exploratory tool to identify influential points, naturally implementable for high-dimensional data where traditional approaches usually fail. Applications to simulated and real data examples, both low- and high-dimensional, are presented. The visual tool is implemented in the R package influridge.
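A minimal sketch of the diagnostic idea (not the influridge implementation): recompute the cross-validated optimal ridge penalty with each observation left out and compare with the full-data optimum; the sign and size of the change show which points drag the tuning parameter up or down. The grid and cross-validation details are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

LAMBDAS = np.logspace(-3, 3, 61)

def opt_lambda(X, y):
    """Cross-validated optimal ridge penalty over a fixed grid
    (RidgeCV defaults to efficient leave-one-out CV)."""
    return RidgeCV(alphas=LAMBDAS).fit(X, y).alpha_

def lambda_influence(X, y):
    """Optimal penalty after deleting each observation in turn."""
    full = opt_lambda(X, y)
    loo = np.array([opt_lambda(np.delete(X, i, 0), np.delete(y, i))
                    for i in range(len(y))])
    return full, loo

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 10))
y = X[:, 0] + 0.5 * rng.standard_normal(40)
y[0] += 6.0                  # plant one influential point
full, loo = lambda_influence(X, y)
print(full, loo[0])          # deleting the planted point shifts the optimum
```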
We consider a new framework in which a continuous, though bounded, random variable has unobserved bounds that vary over time. In the context of univariate time series, we treat the bounds as parameters of the distribution of the bounded random variable. We introduce an extended log-likelihood and design algorithms to track the bounds through online maximum likelihood estimation. Since the resulting optimization problem is not convex, we draw on recent theoretical results on Normalized Gradient Descent (NGD) for quasiconvex optimization to derive an Online Normalized Gradient Descent (ONGD) algorithm. We illustrate and discuss the workings of our approach on both simulation studies and a real-world wind power forecasting problem.
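A generic sketch of the online normalized gradient descent update, $\theta_{t+1} = \theta_t - \eta\, g_t/\|g_t\|$, applied to a stream of per-observation losses; the Gaussian toy loss and step size below are placeholders, not the paper's bound-tracking likelihood.

```python
import numpy as np

def ongd(theta0, grad_stream, eta=0.05):
    """Online NGD: one normalized gradient step per incoming observation.
    Normalizing the gradient keeps only its direction, which is what the
    quasiconvexity theory for NGD requires."""
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for grad in grad_stream:
        g = grad(theta)
        norm = np.linalg.norm(g)
        if norm > 0:
            theta = theta - eta * g / norm
        path.append(theta.copy())
    return np.array(path)

# toy stream: track a slowly drifting location parameter; the gradient of the
# Gaussian negative log-likelihood in the mean is proportional to (theta - y)
rng = np.random.default_rng(3)
truth = [1.0 + 0.002 * t for t in range(2000)]
stream = [(lambda th, y=rng.normal(mu, 0.3): th - y) for mu in truth]
path = ongd(np.array([0.0]), stream)
print(path[-1], truth[-1])  # tracked estimate vs. final true value
```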
We consider a symmetric mixture of linear regressions with random samples from the pairwise comparison design, which can be seen as a noisy version of a type of Euclidean distance geometry problem. We analyze the expectation-maximization (EM) algorithm locally around the ground truth and establish that the sequence converges linearly, providing an $\ell_\infty$-norm guarantee on the estimation error of the iterates. Furthermore, we show that the limit of the EM sequence achieves the sharp rate of estimation in the $\ell_2$-norm, matching the information-theoretically optimal constant. We also argue through simulation that convergence from a random initialization is much more delicate in this setting, and does not appear to occur in general. Our results show that the EM algorithm can exhibit several unique behaviors when the covariate distribution is suitably structured.
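A minimal sketch of the textbook EM update for the symmetric two-component model $y_i = z_i \langle x_i, \beta^* \rangle + \varepsilon_i$ with $z_i$ uniform on $\{\pm 1\}$: the E-step computes posterior weights $w_i = \tanh(y_i \langle x_i, \beta \rangle / \sigma^2)$ and the M-step is least squares on the de-signed responses. The Gaussian design here stands in for the paper's pairwise comparison design, and the local initialization mirrors the local analysis in the abstract.

```python
import numpy as np

def em_symmetric_mlr(X, y, beta0, sigma2, n_iter=100):
    """EM for y_i = z_i * <x_i, beta> + noise, z_i = +/-1 with prob 1/2.
    E-step: w_i = tanh(y_i <x_i, beta> / sigma2) is E[z_i | data, beta].
    M-step: regress the de-signed responses w_i * y_i on X."""
    beta = np.asarray(beta0, dtype=float)
    solver = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T, reused each step
    for _ in range(n_iter):
        w = np.tanh(y * (X @ beta) / sigma2)
        beta = solver @ (w * y)
    return beta

rng = np.random.default_rng(4)
n, d = 2000, 5
X = rng.standard_normal((n, d))
beta_star = np.ones(d) / np.sqrt(d)
z = rng.choice([-1.0, 1.0], n)
y = z * (X @ beta_star) + 0.1 * rng.standard_normal(n)
beta0 = beta_star + 0.2 * rng.standard_normal(d)   # local initialization
beta_hat = em_symmetric_mlr(X, y, beta0, sigma2=0.01)
print(min(np.linalg.norm(beta_hat - beta_star),
          np.linalg.norm(beta_hat + beta_star)))   # small, up to global sign
```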
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
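The effective-number formula in the abstract translates directly into per-class loss weights. A minimal sketch follows; normalizing the weights to sum to the number of classes is a common convention, not something the abstract specifies.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Weights proportional to the inverse effective number of samples
    E_n = (1 - beta^n) / (1 - beta), normalized to sum to the class count.
    These weights then rescale the per-class terms of the training loss."""
    counts = np.asarray(counts, dtype=float)
    eff_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff_num
    return w * len(counts) / w.sum()

# long-tailed toy: the head class has 10k samples, the tail class 10;
# the tail class receives roughly 100x the weight of the head class
print(class_balanced_weights([10000, 1000, 100, 10]))
```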