Physical variables of the global ocean can reveal the effects of the warming climate, since the oceans absorb huge amounts of solar energy. Information on the joint spatial distribution of ocean variables is therefore critical for climate monitoring. In this paper, we investigate the spatial correlation structure between ocean temperature and salinity using data from the Argo program and construct a model that captures their bivariate spatial dependence from the surface to the ocean's interior. We develop a flexible class of multivariate nonstationary covariance models defined in three-dimensional (3D) space (longitude $\times$ latitude $\times$ depth) that allows the variances and correlation to change along the vertical pressure dimension. These models describe the joint spatial distribution of the two variables while incorporating the underlying vertical structure of the ocean. We demonstrate that the proposed cross-covariance models capture the complex vertical cross-covariance structure well, whereas existing cross-covariance models, including bivariate Mat\'{e}rn models, fit the empirical cross-covariance structure poorly. Furthermore, the results show that incorporating one variable significantly enhances the prediction of the other and that the estimated spatial dependence structures are consistent with the ocean stratification.
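As a rough illustration of the idea (not the authors' model), the sketch below lets the marginal variances and the temperature-salinity correlation vary with depth inside a simple exponential spatial correlation. The depth profiles `var_t`, `var_s`, and `rho` are made-up toy functions, coordinate units are arbitrary, and the construction is not checked for positive-definiteness; it only shows how a depth-dependent cross-covariance could be parameterized.

```python
import numpy as np

def exp_corr(d, range_):
    """Exponential (Matern nu = 1/2) spatial correlation as a function of distance."""
    return np.exp(-d / range_)

def bivariate_cross_cov(s1, s2, var_t, var_s, rho, range_):
    """Toy cross-covariance between temperature and salinity at 3D points
    s1, s2 = (lon, lat, depth). var_t, var_s, rho are callables of depth,
    so variances and the between-variable correlation change vertically.
    Returns the 2x2 cross-covariance matrix of the two variables."""
    d = np.linalg.norm(np.asarray(s1, float) - np.asarray(s2, float))
    r = exp_corr(d, range_)
    z1, z2 = s1[2], s2[2]
    sd1 = np.diag([np.sqrt(var_t(z1)), np.sqrt(var_s(z1))])
    sd2 = np.diag([np.sqrt(var_t(z2)), np.sqrt(var_s(z2))])
    rho12 = rho(0.5 * (z1 + z2))          # correlation at the mean depth
    R = np.array([[1.0, rho12], [rho12, 1.0]])
    return sd1 @ (r * R) @ sd2

# toy profiles: variances decay with depth, correlation weakens with depth
cov = bivariate_cross_cov((0.0, 0.0, 50.0), (0.1, 0.0, 150.0),
                          var_t=lambda z: np.exp(-z / 500.0),
                          var_s=lambda z: 0.5 * np.exp(-z / 800.0),
                          rho=lambda z: 0.6 - 0.001 * z,
                          range_=200.0)
print(cov)
```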
We present a method for inference on joint regression coefficients obtained from marginal regressions with the help of a reference panel. This scenario is common in genetic fine-mapping, where estimated marginal associations are reported in genome-wide association studies (GWAS) and a reference panel is used for inference on the associations in a joint regression model. We show that ignoring the uncertainty due to the use of a reference panel instead of the original design matrix can lead to a severe inflation of false discoveries and a lack of replicable findings. We derive the asymptotic distribution of the estimated coefficients in the joint regression model and show how it can be used to produce valid inference. We address two settings: inference within regions that are pre-selected, and within regions that are selected based on the same data. By means of real data examples and simulations, we demonstrate the usefulness of the suggested methodology.
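A minimal sketch of the identity this setting builds on: with standardized covariates, joint coefficients can be reconstructed from marginal ones via the inverse of the covariate correlation matrix, with that matrix estimated from a reference panel (here `X_ref`, a simulated toy panel). The extra noise from the finite reference panel is exactly the uncertainty whose neglect the abstract warns about; this snippet is illustrative only and does not implement the paper's corrected inference.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_ref, p = 5000, 500, 5
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))   # equicorrelated covariates
L = np.linalg.cholesky(Sigma)

X = rng.standard_normal((n, p)) @ L.T             # study design matrix
X_ref = rng.standard_normal((n_ref, p)) @ L.T     # independent reference panel
beta = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.standard_normal(n)

# marginal (one-covariate-at-a-time) coefficients, as reported by a GWAS
beta_marg = (X.T @ y) / (X ** 2).sum(axis=0)

# joint coefficients reconstructed with the reference-panel correlation matrix
R_ref = np.corrcoef(X_ref, rowvar=False)
beta_joint_hat = np.linalg.solve(R_ref, beta_marg)
print(beta_joint_hat)     # roughly recovers beta, with extra reference-panel noise
```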
An inference procedure is proposed to provide consistent estimators of the parameters in a modal regression model with a covariate prone to measurement error. A score-based diagnostic tool exploiting the parametric bootstrap is developed to assess the adequacy of the parametric assumptions imposed on the regression model. The proposed estimation method and diagnostic tool are applied to synthetic data from simulation experiments and to data from real-world applications to demonstrate their implementation and performance. These empirical examples illustrate the importance of adequately accounting for measurement error in the error-prone covariate when inferring the association between a response and covariates based on a modal regression model, which is especially suitable for skewed and heavy-tailed response data.
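For concreteness, here is a bare-bones kernel-based modal regression fit with an error-free covariate; it omits the measurement-error correction that is the paper's actual contribution. The Gaussian kernel, bandwidth `h`, and toy skewed data are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def modal_regression_fit(X, y, h=1.0):
    """Kernel-based modal regression: choose beta to maximize the average
    Gaussian kernel of the residuals, placing the fitted line at the
    conditional mode rather than the conditional mean."""
    Xd = np.column_stack([np.ones(len(y)), X])     # add intercept

    def neg_objective(beta):
        r = y - Xd @ beta
        return -np.mean(np.exp(-0.5 * (r / h) ** 2))

    beta0 = np.linalg.lstsq(Xd, y, rcond=None)[0]  # start from least squares
    return minimize(neg_objective, beta0, method="Nelder-Mead").x

# skewed noise: the conditional mode differs from the conditional mean
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 400)
y = 1.0 + 2.0 * x + rng.exponential(1.0, 400)      # noise mode 0, noise mean 1
print(modal_regression_fit(x, y, h=0.3))           # intercept near 1, slope near 2
```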
The problem of covariate-shift generalization has attracted intensive research attention. Previous stable learning algorithms employ sample reweighting schemes to decorrelate the covariates when no explicit domain information about the training data is available. However, with finite samples, it is difficult to achieve the weights that ensure perfect independence and thereby remove the influence of the unstable variables. Moreover, decorrelating within the stable variables can produce high variance in the learned models because of the over-reduced effective sample size, so these algorithms require very large samples to work well. In this paper, with theoretical justification, we propose SVI (Sparse Variable Independence) for the covariate-shift generalization problem. We introduce a sparsity constraint to compensate for the imperfectness of sample reweighting in the finite-sample setting of previous methods. Furthermore, we combine independence-based sample reweighting and sparsity-based variable selection in an iterative procedure to avoid decorrelating within the stable variables, increasing the effective sample size and alleviating variance inflation. Experiments on both synthetic and real-world datasets demonstrate the improvement in covariate-shift generalization performance brought by SVI.
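The following is a loose sketch of the kind of loop the abstract describes, not the authors' algorithm: learn sample weights that shrink covariate correlations, select variables with a weighted Lasso, and iterate only over the selected set so that reweighting no longer decorrelates within the retained variables. The Frobenius-norm decorrelation objective, all function names, and the toy data are my own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso

def decorrelation_weights(X, maxiter=50):
    """Learn sample weights that shrink pairwise covariate covariances
    (a simple stand-in for independence-based reweighting)."""
    n, p = X.shape

    def loss(logw):
        w = np.exp(logw); w = w / w.sum() * n
        Xc = X - (w[:, None] * X).sum(0) / n
        C = (w[:, None] * Xc).T @ Xc / n           # weighted covariance
        off = C - np.diag(np.diag(C))
        return np.sum(off ** 2)

    res = minimize(loss, np.zeros(n), method="L-BFGS-B",
                   options={"maxiter": maxiter})
    w = np.exp(res.x)
    return w / w.sum() * n

def svi_like_fit(X, y, alpha=0.05, n_rounds=3):
    """Illustrative loop: reweight, select with a weighted Lasso, repeat
    only on the selected (candidate stable) variables."""
    active = np.arange(X.shape[1])
    for _ in range(n_rounds):
        w = decorrelation_weights(X[:, active])
        model = Lasso(alpha=alpha).fit(X[:, active] * np.sqrt(w)[:, None],
                                       y * np.sqrt(w))
        keep = np.flatnonzero(np.abs(model.coef_) > 1e-6)
        if len(keep) == 0 or len(keep) == len(active):
            break
        active = active[keep]
    return active

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)
print(svi_like_fit(X, y))        # ideally selects columns 0 and 1
```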
We propose new differential privacy solutions for settings in which external \emph{invariants} and \emph{integer} constraints are simultaneously enforced on the data product. These requirements arise in real-world applications of private data curation, including the public release of the 2020 U.S. Decennial Census, and they pose a great challenge to the production of provably private data products with adequate statistical usability. We propose \emph{integer subspace differential privacy} to rigorously articulate the privacy guarantee when data products maintain both the invariants and the integer characteristics, and we demonstrate the composition and post-processing properties of our proposal. To address the challenge of sampling from a potentially highly restricted discrete space, we devise a pair of unbiased additive mechanisms, the generalized Laplace and the generalized Gaussian mechanisms, by solving the Diophantine equations defined by the constraints. The proposed mechanisms have good accuracy, with errors exhibiting sub-exponential and sub-Gaussian tail probabilities, respectively. To implement our proposal, we design an MCMC algorithm and supply empirical convergence assessment using estimated upper bounds on the total variation distance via $L$-lag coupling. We demonstrate the efficacy of our proposal with applications to a synthetic problem with intersecting invariants, a sensitive contingency table with known margins, and the 2010 Census county-level demonstration data with mandated fixed state population totals.
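To make the "integer noise that respects invariants" idea concrete, here is a toy sketch (not the paper's mechanism): integer-valued noise is drawn along integer basis vectors of the constraint matrix's null space, so the invariants (row sums, in this example) are preserved exactly. The discrete-Laplace sampler and the hand-written basis are illustrative assumptions, and the sketch ignores non-negativity constraints and the MCMC machinery the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_sided_geometric(scale, size, rng):
    """Discrete Laplace noise: difference of two i.i.d. geometric variables."""
    p = 1.0 - np.exp(-1.0 / scale)
    return rng.geometric(p, size) - rng.geometric(p, size)

def add_invariant_preserving_noise(x, basis, scale, rng):
    """Add integer noise in the span of integer null-space basis vectors of
    the constraint matrix A, so that A @ (x + noise) == A @ x."""
    coefs = two_sided_geometric(scale, len(basis), rng)
    return x + coefs @ basis

# toy example: a 2x3 table (flattened) with both row sums held invariant
x = np.array([3, 5, 2, 4, 1, 6])
A = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]])           # row-sum constraints
basis = np.array([[1, -1, 0, 0, 0, 0],       # integer moves within row 1
                  [0, 1, -1, 0, 0, 0],
                  [0, 0, 0, 1, -1, 0],       # integer moves within row 2
                  [0, 0, 0, 0, 1, -1]])
x_noisy = add_invariant_preserving_noise(x, basis, scale=2.0, rng=rng)
print(A @ x == A @ x_noisy)                   # invariants preserved exactly
```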
Models trained via empirical risk minimization (ERM) are known to rely on spurious correlations between labels and task-independent input features, resulting in poor generalization under distributional shift. Group distributionally robust optimization (G-DRO) can alleviate this problem by minimizing the worst-case loss over a set of pre-defined groups of training data. G-DRO successfully improves performance on the worst group, where the correlation does not hold. However, G-DRO assumes that the spurious correlations and the associated worst groups are known in advance, which makes it challenging to apply to new tasks with potentially multiple unknown spurious correlations. We propose AGRO -- Adversarial Group discovery for Distributionally Robust Optimization -- an end-to-end approach that jointly identifies error-prone groups and improves accuracy on them. AGRO equips G-DRO with an adversarial slicing model that finds a group assignment for training examples which maximizes the worst-case loss over the discovered groups. On the WILDS benchmark, AGRO yields 8% higher model performance on average on known worst groups compared to prior group discovery approaches used with G-DRO. AGRO also improves out-of-distribution performance on SST2, QQP, and MS-COCO, datasets for which potential spurious correlations are as yet uncharacterized. Human evaluation of AGRO groups shows that they contain well-defined, yet previously unstudied, spurious correlations that lead to model errors.
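A rough sketch of the kind of soft worst-group objective an adversarial slicer could plug into G-DRO; this is my simplification, not the paper's exact formulation. The adversary's soft group assignments define per-group losses, the adversary maximizes the worst of them with respect to its own parameters, and the task model minimizes it with respect to the network weights.

```python
import torch

def worst_group_loss(per_example_loss, group_probs):
    """per_example_loss: (n,) losses; group_probs: (n, K) soft assignments of
    each example to one of K discovered groups. Returns the loss of the
    worst (highest-loss) group under the soft assignment."""
    group_loss = (group_probs * per_example_loss.unsqueeze(1)).sum(0) \
                 / group_probs.sum(0).clamp_min(1e-8)
    return group_loss.max()

losses = torch.tensor([0.2, 1.5, 0.3, 2.0])
probs = torch.softmax(torch.randn(4, 3), dim=1)
print(worst_group_loss(losses, probs))
```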
Accurate and ubiquitous localization is crucial for a variety of applications such as logistics, navigation, intelligent transport, monitoring, and control, and it also benefits communications. Exploiting millimeter-wave (mmWave) signals in 5G and beyond-5G systems can provide accurate localization with limited infrastructure. We consider the single-base-station (BS) localization problem and extend it to the 3D position and 3D orientation estimation of an unsynchronized multi-antenna user equipment (UE), using downlink multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) signals. Through a Fisher information analysis, we show that the problem is often identifiable, provided that there is at least one multipath component in addition to the line-of-sight (LoS), even if the position of the corresponding incidence point (IP) is a priori unknown. Subsequently, we pose a maximum likelihood (ML) estimation problem to jointly estimate the 3D position and 3D orientation of the UE as well as several nuisance parameters (the UE clock offset and the positions of the IPs corresponding to the multipath). The ML problem is a high-dimensional non-convex optimization problem over a product of Euclidean and non-Euclidean manifolds. To avoid complex exhaustive search procedures, we propose a geometric initial estimate of all parameters, which reduces the problem to a one-dimensional search over a finite interval. Numerical results show the efficiency of the proposed ad hoc estimation, whose gap to the Cram\'er-Rao bound (CRB) is narrowed by the subsequent ML estimation.
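For readers unfamiliar with the metric behind such Fisher-information analyses, the position error bound is simply the trace of the position block of the inverse Fisher information matrix. The snippet below is a generic illustration with a made-up diagonal matrix, not the paper's FIM.

```python
import numpy as np

def position_error_bound(J):
    """CRB-based position error bound: invert the Fisher information matrix
    of the full parameter vector (position, orientation, clock offset,
    incidence points, ...) and read off the 3D position block."""
    cov_bound = np.linalg.inv(J)
    return np.sqrt(np.trace(cov_bound[:3, :3]))

# toy FIM for a 7-parameter problem; a near-singular J would signal that the
# geometry (e.g., LoS only, no additional multipath) is not identifiable
J = np.diag([50.0, 40.0, 30.0, 10.0, 10.0, 10.0, 5.0])
print(position_error_bound(J))
```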
Incorporating equivariance to symmetry groups as a constraint during neural network training can improve performance and generalization on tasks exhibiting those symmetries, but such symmetries are often neither perfectly nor explicitly present. This motivates algorithmically optimizing the architectural constraints imposed by equivariance. We propose the equivariance relaxation morphism, which preserves functionality while reparameterizing a group-equivariant layer to operate with equivariance constraints on a subgroup, as well as the [G]-mixed equivariant layer, which mixes layers constrained to different groups to enable within-layer equivariance optimization. We further present evolutionary and differentiable neural architecture search (NAS) algorithms that utilize these mechanisms, respectively, for equivariance-aware architectural optimization. Experiments across a variety of datasets show the benefit of dynamically constrained equivariance for finding effective architectures with approximate equivariance.
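A toy PyTorch sketch of the mixing idea, as a simplification rather than the paper's construction: one branch is an unconstrained convolution, another is the same convolution averaged over the C4 rotation group (hence rotation-equivariant), and learnable architecture weights interpolate between the two, so the degree of imposed equivariance can itself be optimized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEquivariantConv(nn.Module):
    """Mix a plain conv with its C4 group-averaged (rotation-equivariant)
    counterpart using learnable architecture weights."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.alpha = nn.Parameter(torch.zeros(2))   # architecture weights

    def forward(self, x):
        plain = self.conv(x)
        # apply the conv to all four rotations and rotate each output back;
        # the average is equivariant under 90-degree rotations
        rot = torch.stack([torch.rot90(self.conv(torch.rot90(x, k, (2, 3))),
                                       -k, (2, 3)) for k in range(4)]).mean(0)
        w = F.softmax(self.alpha, dim=0)
        return w[0] * plain + w[1] * rot

layer = MixedEquivariantConv(3, 8)
print(layer(torch.randn(2, 3, 16, 16)).shape)      # torch.Size([2, 8, 16, 16])
```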
A simultaneously transmitting and reflecting surface (STARS)-aided terahertz (THz) communication system is proposed. A novel power consumption model that depends on the type and resolution of the individual elements is proposed for the STARS. Then, the system energy efficiency (EE) and spectral efficiency (SE) are maximized in both narrowband and wideband THz systems. 1) For the narrowband system, an iterative algorithm based on penalty dual decomposition is proposed to jointly optimize the hybrid beamforming at the base station (BS) and the independent phase-shift coefficients at the STARS. The proposed algorithm is then extended to the coupled phase-shift STARS. 2) For the wideband system, to eliminate the beam-split effect, a time-delay (TD) network implemented with true time delayers is applied in the hybrid beamforming structure. An iterative algorithm based on the quasi-Newton method is proposed to design the coefficients of the TD network. Finally, our numerical results reveal that i) the coupled phase shifts of the STARS cause only a slight EE and SE performance loss in both narrowband and wideband systems, and ii) conventional hybrid beamforming achieves EE and SE performance close to that of fully digital beamforming in the narrowband system, but not in the wideband system, where the TD-based hybrid beamforming is more efficient.
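A purely illustrative back-of-the-envelope of the EE versus element-resolution trade-off that such a power consumption model induces; every number and the linear per-bit power model below are made-up placeholders, not values or formulas from the paper.

```python
def stars_power_w(n_elements, bits, p_per_bit_mw=1.5, p_static_mw=10.0):
    """Toy power model: per-element power grows with phase-shift resolution."""
    return (n_elements * bits * p_per_bit_mw + p_static_mw) / 1000.0

def energy_efficiency(rate_bps, p_tx_w, p_circuit_w, n_elements, bits):
    """EE = achievable rate / total consumed power (bit/Joule)."""
    return rate_bps / (p_tx_w + p_circuit_w + stars_power_w(n_elements, bits))

se = 10.0        # spectral efficiency, bit/s/Hz (placeholder)
bw = 100e6       # bandwidth, Hz (placeholder)
for bits in (1, 2, 4, 8):
    ee = energy_efficiency(se * bw, p_tx_w=1.0, p_circuit_w=5.0,
                           n_elements=256, bits=bits)
    print(bits, f"{ee / 1e6:.1f} Mbit/J")   # higher resolution, lower EE here
```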
This paper makes three contributions. First, it generalizes the Lindeberg\textendash Feller and Lyapunov central limit theorems to Hilbert spaces by way of $L^2$. Second, it generalizes these results to spaces in which sample failure and missingness can occur. Finally, it shows that satisfaction of the Lindeberg\textendash Feller and Lyapunov conditions in such spaces implies satisfaction of the conditions in the completely observed space, and how this guarantees the consistency of inferences drawn from the partial functional data. The latter two results are especially important given the increasing attention to statistical inference with partially observed functional data. This paper goes beyond previous research by providing simple boundedness conditions which guarantee that \textit{all} inferences, as opposed to some proper subset of them, are consistently estimated. This is shown primarily by aggregating conditional expectations with respect to the space of missingness patterns; this paper appears to be the first to apply this technique.
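For reference, the norm-based analogue of the classical Lindeberg condition for independent, mean-zero random elements $X_1,\dots,X_n$ of a Hilbert space $H$, with $s_n^2 = \sum_{i=1}^{n} \mathbb{E}\|X_i\|^2$, is the standard statement
\[
\frac{1}{s_n^2}\sum_{i=1}^{n}\mathbb{E}\!\left[\|X_i\|^2\,\mathbf{1}\{\|X_i\| > \varepsilon s_n\}\right] \longrightarrow 0 \qquad \text{for every } \varepsilon > 0,
\]
which conveys the form of condition being generalized; the paper's exact conditions in the partially observed setting may differ in detail.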
Over the past few years, we have seen fundamental breakthroughs in core problems in machine learning, largely driven by advances in deep neural networks. At the same time, the amount of data collected in a wide array of scientific domains is dramatically increasing in both size and complexity. Taken together, this suggests many exciting opportunities for deep learning applications in scientific settings. But a significant challenge is simply knowing where to start. The sheer breadth and diversity of deep learning techniques makes it difficult to determine which scientific problems might be most amenable to these methods, or which specific combination of methods might offer the most promising first approach. In this survey, we focus on addressing this central issue, providing an overview of many widely used deep learning models, spanning visual, sequential, and graph-structured data, the associated tasks, and different training methods, along with techniques for using deep learning with less data and better interpreting these complex models, two central considerations for many scientific use cases. We also include overviews of the full design process, implementation tips, and links to a wealth of tutorials, research summaries, and open-source deep learning pipelines and pretrained models developed by the community. We hope that this survey will help accelerate the use of deep learning across different scientific domains.