In this paper, tests of symmetry for bivariate copulas are introduced and studied using the empirical Bernstein copula. Three statistics are proposed and their asymptotic properties are established. In addition, a multiplier bootstrap Bernstein version is investigated for implementation purposes. A simulation study and a real data application show that the Bernstein tests outperform the tests based on the empirical copula.
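For intuition, a minimal sketch of a Cramér-von Mises-type symmetry statistic is given below. It compares the copula estimate at (u, v) with its value at (v, u); the sketch uses the raw empirical copula rather than the Bernstein-smoothed version studied in the paper, and the function names (empirical_copula, symmetry_statistic) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula(u_grid, v_grid, x, y):
    """Empirical copula C_n evaluated on a grid, from a bivariate sample (x, y)."""
    n = len(x)
    u = rankdata(x) / (n + 1)          # pseudo-observations
    v = rankdata(y) / (n + 1)
    return np.array([[np.mean((u <= a) & (v <= b)) for b in v_grid] for a in u_grid])

def symmetry_statistic(x, y, m=20):
    """Cramér-von Mises-type discrepancy between C_n(u, v) and C_n(v, u)."""
    grid = (np.arange(1, m + 1) - 0.5) / m
    C = empirical_copula(grid, grid, x, y)
    return len(x) * np.mean((C - C.T) ** 2)
```

Large values of the statistic indicate asymmetry; in the paper, critical values are obtained via a multiplier bootstrap rather than from this plug-in construction.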
This paper introduces subject granular privacy in the Federated Learning (FL) setting, where a subject is an individual whose private information is embodied by several data items either confined within a single federation user or distributed across multiple federation users. We formally define the notion of subject level differential privacy for FL. We propose three new algorithms that enforce subject level DP. Two of these algorithms are based on notions of user level local differential privacy (LDP) and group differential privacy respectively. The third algorithm is based on a novel idea of hierarchical gradient averaging (HiGradAvgDP) for subjects participating in a training mini-batch. We also introduce horizontal composition of privacy loss for a subject across multiple federation users. We show that horizontal composition is equivalent to sequential composition in the worst case. We prove the subject level DP guarantee for all our algorithms and empirically analyze them using the FEMNIST and Shakespeare datasets. Our evaluation shows that, of our three algorithms, HiGradAvgDP delivers the best model performance, approaching that of a model trained using a DP-SGD based algorithm that provides a weaker item level privacy guarantee.
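The hierarchical gradient averaging idea can be sketched as follows. The per-subject clipping and noise calibration shown here, along with the names higrad_avg_dp_step, clip_norm, and noise_mult, are assumptions for illustration and not the authors' exact HiGradAvgDP procedure.

```python
import numpy as np

def higrad_avg_dp_step(per_item_grads, subject_ids, clip_norm=1.0, noise_mult=1.0,
                       rng=None):
    """Sketch of hierarchical gradient averaging for subject-level DP.

    per_item_grads: (batch, dim) array, one gradient per data item.
    subject_ids:    (batch,) array giving the subject that owns each item.
    """
    rng = np.random.default_rng(rng)
    subjects = np.unique(subject_ids)
    # 1. Average item gradients within each subject, so every subject
    #    contributes exactly one aggregated gradient to the mini-batch.
    subject_grads = np.stack([per_item_grads[subject_ids == s].mean(axis=0)
                              for s in subjects])
    # 2. Clip each subject-level gradient to bound any single subject's influence.
    norms = np.linalg.norm(subject_grads, axis=1, keepdims=True)
    subject_grads = subject_grads / np.maximum(1.0, norms / clip_norm)
    # 3. Average across subjects and add Gaussian noise scaled to the
    #    subject-level sensitivity (clip_norm divided by the number of subjects).
    noise = rng.normal(0.0, noise_mult * clip_norm / len(subjects),
                       size=subject_grads.shape[1])
    return subject_grads.mean(axis=0) + noise
```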
We develop an \textit{a posteriori} error analysis for a numerical estimate of the time at which a functional of the solution to a partial differential equation (PDE) first achieves a threshold value on a given time interval. This quantity of interest (QoI) differs from classical QoIs, which are modeled as bounded linear (or nonlinear) functionals of the solution. Taylor's theorem and an adjoint-based \textit{a posteriori} analysis are used to derive computable and accurate error estimates in the case of semi-linear parabolic and hyperbolic PDEs. The accuracy of the error estimates is demonstrated through numerical solutions of the one-dimensional heat equation and the linearized shallow water equations (SWE), representing the parabolic and hyperbolic cases, respectively.
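A first-order version of such an estimate can be written as follows; the notation is illustrative and assumed here, not necessarily that of the paper. Let $q(t) = Q(u(t))$ be the exact functional, $\tilde q(t) = Q(\tilde u(t))$ its numerical counterpart, $R$ the threshold, and $T$, $\tilde T$ the corresponding crossing times, so that $q(T) = \tilde q(\tilde T) = R$. Expanding $q$ about $\tilde T$ gives
\[
  T - \tilde T \;\approx\; \frac{\tilde q(\tilde T) - q(\tilde T)}{q'(\tilde T)}
  \;=\; -\,\frac{Q\bigl(u(\tilde T)\bigr) - Q\bigl(\tilde u(\tilde T)\bigr)}{q'(\tilde T)},
\]
where the numerator is the error in the functional at the computed crossing time (the quantity estimated with an adjoint solution) and the derivative in the denominator is approximated from the numerical solution.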
Considering two random variables with different laws to which we only have access through finite-size i.i.d. samples, we address how to reweight the first sample so that its empirical distribution converges towards the true law of the second sample as the size of both samples goes to infinity. We study an optimal reweighting that minimizes the Wasserstein distance between the empirical measures of the two samples and leads to an expression of the weights in terms of nearest neighbors. The consistency and some asymptotic convergence rates in terms of expected Wasserstein distance are derived without requiring absolute continuity of one random variable with respect to the other. These results have applications in uncertainty quantification for decoupled estimation and in bounding the generalization error of nearest neighbor regression under covariate shift.
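A minimal sketch of this nearest-neighbor reweighting is given below: each point of the first sample receives weight proportional to the number of second-sample points for which it is the nearest neighbor, which is the optimal coupling when the reweighted measure is constrained to the support of the first sample. The function name nearest_neighbor_weights is an assumption for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_weights(x_sample, y_sample):
    """Weight each point of x_sample by the fraction of y_sample points
    for which it is the nearest neighbor, pushing the weighted empirical
    measure of x_sample toward the law generating y_sample."""
    tree = cKDTree(x_sample)
    _, nn_index = tree.query(y_sample)                     # nearest x for each y
    counts = np.bincount(nn_index, minlength=len(x_sample))
    return counts / len(y_sample)                          # weights sum to one

# usage sketch (synthetic data, assumed for illustration)
# x = np.random.normal(0.0, 1.0, size=(500, 2))   # sample from the first law
# y = np.random.normal(0.5, 1.0, size=(800, 2))   # sample from the second law
# w = nearest_neighbor_weights(x, y)              # reweights x toward the law of y
```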
We propose a new model-free feature screening method based on energy distances for ultrahigh-dimensional binary classification problems. Unlike existing methods, the cut-off involved in our procedure is data adaptive. With high probability, the proposed method retains only the relevant features after discarding all the noise variables. The proposed screening method is also extended to identify pairs of variables that are marginally undetectable but differ in their joint distributions. Finally, we build a classifier that maintains coherence between the proposed feature selection criteria and the discrimination method, and we establish its risk consistency. An extensive numerical study with simulated data sets and real benchmark data sets shows clear and convincing advantages of our classifier over state-of-the-art methods.
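The marginal screening step can be sketched as follows: each feature is scored by the energy distance between its two class-conditional samples. The data-adaptive cut-off and the pairwise extension described above are not reproduced in this sketch, and the function names are illustrative assumptions.

```python
import numpy as np

def energy_distance_1d(x, y):
    """Energy distance between two univariate samples x and y:
    2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * xy - xx - yy

def marginal_screening_scores(X, labels):
    """Score each feature by the energy distance between the two classes."""
    X0, X1 = X[labels == 0], X[labels == 1]
    return np.array([energy_distance_1d(X0[:, j], X1[:, j])
                     for j in range(X.shape[1])])
```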
The Volume-Averaged Navier-Stokes equations are used to study fluid flow in the presence of fixed or moving solids, such as packed or fluidized beds. We develop a high-order finite element solver using both forms A and B of these equations. We introduce tailored stabilization techniques to prevent oscillations in regions of sharp gradients, to relax the Ladyzhenskaya-Babuška-Brezzi inf-sup condition, and to enhance the local mass conservation and robustness of the formulation. We calculate the void fraction using the Particle Centroid Method and, using different drag models, we calculate the drag force exerted by the solids on the fluid. We use the method of manufactured solutions to verify our solver and demonstrate that the model preserves the order of convergence of the underlying finite element discretization. Finally, we simulate gas flow through a randomly packed bed and study the pressure drop and mass conservation properties to validate our model.
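The Particle Centroid Method step can be sketched as follows: the full volume of each particle is assigned to the cell containing its centroid, and the void fraction of a cell is one minus the resulting solid fraction. The array cell_of_particle (the index of the cell containing each centroid) is assumed precomputed here; in a finite element solver this would come from locating centroids within elements.

```python
import numpy as np

def void_fraction_pcm(cell_volumes, cell_of_particle, particle_volumes):
    """Sketch of the Particle Centroid Method for the void fraction.

    cell_volumes:     (n_cells,) volumes of the mesh cells
    cell_of_particle: (n_particles,) index of the cell containing each centroid
    particle_volumes: (n_particles,) volumes of the particles
    """
    solid = np.zeros_like(cell_volumes)
    np.add.at(solid, cell_of_particle, particle_volumes)  # accumulate solid volume per cell
    return 1.0 - solid / cell_volumes                      # void fraction per cell
```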
Analyzing time series in the frequency domain enables the development of powerful tools for investigating the second-order characteristics of multivariate stochastic processes. Parameters like the spectral density matrix and its inverse, the coherence, or the partial coherence comprehensively encode the complex linear relations between the component processes of the multivariate system. In this paper, we develop inference procedures for such parameters in a high-dimensional time series setup. In particular, we first focus on the derivation of consistent estimators of the coherence and, more importantly, of the partial coherence, which possess manageable limiting distributions suitable for testing purposes. Statistical tests of the hypothesis that the maximum over frequencies of the coherence, respectively of the partial coherence, does not exceed a prespecified threshold value are developed. Our approach allows for testing hypotheses for individual coherences and/or partial coherences as well as for multiple testing of large sets of such parameters. In the latter case, a consistent procedure to control the false discovery rate is developed. The finite sample performance of the proposed inference procedures is investigated by means of simulations, and applications to the construction of graphical interaction models for brain connectivity based on EEG data are presented.
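For reference, the plug-in identities relating these quantities at a fixed frequency can be sketched as follows. This is only the low-dimensional relation between a spectral density matrix and the (squared) coherence and partial coherence; the consistent high-dimensional estimators developed in the paper, and the maximum-over-frequencies test statistics, are not reproduced here.

```python
import numpy as np

def coherence_and_partial_coherence(f):
    """Squared coherence and squared partial coherence from a complex d x d
    spectral density matrix f at a fixed frequency (assumed well conditioned)."""
    d = np.real(np.diag(f))
    coh = np.abs(f) ** 2 / np.outer(d, d)        # squared coherence |f_jk|^2 / (f_jj f_kk)
    g = np.linalg.inv(f)                          # inverse spectral density matrix
    dg = np.real(np.diag(g))
    pcoh = np.abs(g) ** 2 / np.outer(dg, dg)      # squared partial coherence from g
    return coh, pcoh
```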
Existing visual SLAM approaches are sensitive to illumination, with their precision falling drastically in dark conditions due to feature extractor limitations. The algorithms currently used to overcome this issue are not able to provide reliable results due to poor performance and noisiness, and the localization quality in dark conditions is still insufficient for practical use. In this paper, we present a novel SLAM method capable of working in low light using a Generative Adversarial Network (GAN) preprocessing module to enhance the lighting conditions of input images, thus improving localization robustness. The proposed algorithm was evaluated on a custom indoor dataset consisting of 14 sequences with varying illumination levels and ground truth data collected using a motion capture system. According to the experimental results, the reliability of the proposed approach remains high even in extremely low light conditions, providing 25.1% tracking time on the darkest sequences, whereas existing approaches achieve tracking for only 0.6% of the sequence time.
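The overall pipeline reduces to enhancing each frame before handing it to the tracker. The sketch below shows only this control flow; the enhancer and slam objects are placeholder interfaces assumed for illustration, not the authors' implementation.

```python
def run_low_light_slam(frames, enhancer, slam):
    """Sketch of the preprocessing-then-tracking loop: a GAN module brightens
    each low-light frame, and feature extraction/tracking runs on the result."""
    trajectory = []
    for frame in frames:
        bright = enhancer(frame)      # GAN preprocessing of the raw image
        pose = slam.track(bright)     # SLAM tracking on the enhanced image
        trajectory.append(pose)
    return trajectory
```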
Data collected in clinical trials are often composed of multiple types of variables. For example, laboratory measurements and vital signs are longitudinal data of continuous or categorical variables, adverse events may be recurrent events, and death is a time-to-event variable. Missing data, due to patients' discontinuation from the study or as a result of handling intercurrent events using a hypothetical strategy, almost always occur during any clinical trial. Imputing such data with mixed types of variables simultaneously is a challenge that has not been studied. In this article, we propose using an approximate fully conditional specification to impute the missing data. Simulations show that the proposed method provides satisfactory results under the missing-at-random assumption. Finally, real data from a major diabetes clinical trial are analyzed to illustrate the potential benefit of the proposed method.
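For orientation, the generic fully conditional specification (chained-equations) skeleton is sketched below for continuous variables only: each variable with missing entries is regressed in turn on all the others and its missing values are refilled from the fit. The paper's approximate FCS for mixed variable types (longitudinal, recurrent event, and time-to-event data) goes well beyond this sketch, and the function name is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_equations_impute(X, n_iter=10):
    """Minimal FCS/chained-equations skeleton for a continuous data matrix X
    (missing entries encoded as NaN)."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])   # initial mean fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            others = np.delete(X, j, axis=1)
            model = LinearRegression().fit(others[obs], X[obs, j])
            X[missing[:, j], j] = model.predict(others[missing[:, j]])
    return X
```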
In reliability and life data analysis, the Weibull distribution is widely used because changing its parameter values allows it to accommodate a broad range of data characteristics. In reliability and life testing experiments, we frequently observe many zero or near-zero data points; we call this phenomenon a nearly instantaneous failure. Many researchers have modified commonly used univariate parametric models, such as the exponential, gamma, Weibull, and log-normal distributions, to appropriately fit data containing instantaneous failure observations. Researchers also encounter bivariate correlated life testing data with many observations near a particular point while the remaining observations follow some continuous distribution. We describe this situation as early failures for such bivariate responses; if the point is the origin, we call it a nearly instantaneous failure of the responses. Here, we propose a modified bivariate Weibull distribution that allows for early failures by combining a bivariate uniform distribution with a bivariate Weibull distribution. The bivariate Weibull distribution is constructed using a two-dimensional copula, assuming the two marginal distributions to be parametric Weibull distributions. We derive some properties of this modified bivariate Weibull distribution, mainly the joint probability density function, the survival (reliability) function, and the hazard (failure rate) function. The model's unknown parameters are estimated using the maximum likelihood estimation (MLE) technique combined with a machine learning clustering algorithm. Numerical examples using simulated data illustrate and test the performance of the proposed methodologies. The method is also applied to real data and compared with existing approaches in the literature for modeling such data.
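The mixture construction can be sketched as a sampler: with some probability an observation is an early failure drawn from a bivariate uniform block near the origin, and otherwise it is drawn from a copula-based bivariate Weibull. The abstract does not fix the copula family, so a Clayton copula is used here purely for illustration, and all parameter names (p_early, eps, theta) are assumptions.

```python
import numpy as np

def sample_modified_bivariate_weibull(n, p_early, shapes, scales, theta, eps=0.05,
                                      rng=None):
    """Sketch: mixture of a bivariate uniform on [0, eps]^2 (early failures,
    probability p_early) and a Clayton-copula bivariate Weibull (probability
    1 - p_early) with the given marginal shape and scale parameters."""
    rng = np.random.default_rng(rng)
    early = rng.random(n) < p_early
    x = rng.uniform(0.0, eps, size=(n, 2))           # early failures near the origin
    m = int((~early).sum())
    # Clayton copula draw via conditional inversion (requires theta > 0)
    u1 = rng.random(m)
    w = rng.random(m)
    u2 = (1.0 + u1 ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0)) ** (-1.0 / theta)
    # Weibull marginal quantile transform: x = scale * (-log(1 - u))^(1/shape)
    weib = np.column_stack([
        scales[0] * (-np.log(1.0 - u1)) ** (1.0 / shapes[0]),
        scales[1] * (-np.log(1.0 - u2)) ** (1.0 / shapes[1]),
    ])
    x[~early] = weib
    return x
```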
We give the first polynomial-time algorithm to estimate the mean of a $d$-variate probability distribution with bounded covariance from $\tilde{O}(d)$ independent samples subject to pure differential privacy. Prior algorithms for this problem either incur exponential running time, require $\Omega(d^{1.5})$ samples, or satisfy only the weaker concentrated or approximate differential privacy conditions. In particular, all prior polynomial-time algorithms require $d^{1+\Omega(1)}$ samples to guarantee small privacy loss with "cryptographically" high probability, $1-2^{-d^{\Omega(1)}}$, while our algorithm retains $\tilde{O}(d)$ sample complexity even in this stringent setting. Our main technique is a new approach to using the powerful Sum of Squares (SoS) method to design differentially private algorithms. Turning SoS proofs into algorithms is a key theme in numerous recent works in high-dimensional algorithmic statistics: estimators which apparently require exponential running time, but whose analysis can be captured by low-degree Sum of Squares proofs, can be automatically turned into polynomial-time algorithms with the same provable guarantees. We demonstrate a similar proofs-to-private-algorithms phenomenon: instances of the workhorse exponential mechanism which apparently require exponential time, but which can be analyzed with low-degree SoS proofs, can be automatically turned into polynomial-time differentially private algorithms. We prove a meta-theorem capturing this phenomenon, which we expect to be of broad use in private algorithm design. Our techniques also draw new connections between differentially private and robust statistics in high dimensions. In particular, viewed through our proofs-to-private-algorithms lens, several well-studied SoS proofs from recent works in algorithmic robust statistics directly yield key components of our differentially private mean estimation algorithm.