Joint modeling of a large number of variables often requires dimension reduction strategies that lead to structural assumptions of the underlying correlation matrix, such as equal pair-wise correlations within subsets of variables. The underlying correlation matrix is thus of interest for both model specification and model validation. In this paper, we develop tests of the hypothesis that the entries of the Kendall rank correlation matrix are linear combinations of a smaller number of parameters. The asymptotic behavior of the proposed test statistics is investigated both when the dimension is fixed and when it grows with the sample size. We pay special attention to the restricted hypothesis of partial exchangeability, which contains full exchangeability as a special case. We show that under partial exchangeability, the test statistics and their large-sample distributions simplify, which leads to computational advantages and better performance of the tests. We propose various scalable numerical strategies for implementation of the proposed procedures, investigate their behavior through simulations and power calculations under local alternatives, and demonstrate their use on a real dataset of mean sea levels at various geographical locations.
Ensemble Kalman inversion (EKI) is a technique for the numerical solution of inverse problems. A great advantage of the EKI's ensemble approach is that derivatives are not required in its implementation. But theoretically speaking, EKI's ensemble size needs to surpass the dimension of the problem. This is because of EKI's "subspace property", i.e., that the EKI solution is a linear combination of the initial ensemble it starts off with. We show that the ensemble can break out of this initial subspace when "localization" is applied. In essence, localization enforces an assumed correlation structure onto the problem, and is heavily used in ensemble Kalman filtering and data assimilation. We describe and analyze how to apply localization to the EKI, and how localization helps the EKI ensemble break out of the initial subspace. Specifically, we show that the localized EKI (LEKI) ensemble will collapse to a single point (as intended) and that the LEKI ensemble mean will converge to the global optimum at a sublinear rate. Under strict assumptions on the localization procedure and observation process, we further show that the data misfit decays uniformly. We illustrate our ideas and theoretical developments with numerical examples with simplified toy problems, a Lorenz model, and an inversion of electromagnetic data, where some of our mathematical assumptions may only be approximately valid.
Recently, conditional average treatment effect (CATE) estimation has been attracting much attention due to its importance in various fields such as statistics, social and biomedical sciences. This study proposes a partially linear nonparametric Bayes model for the heterogeneous treatment effect estimation. A partially linear model is a semiparametric model that consists of linear and nonparametric components in an additive form. A nonparametric Bayes model that uses a Gaussian process to model the nonparametric component has already been studied. However, this model cannot handle the heterogeneity of the treatment effect. In our proposed model, not only the nonparametric component of the model but also the heterogeneous treatment effect of the treatment variable is modeled by a Gaussian process prior. We derive the analytic form of the posterior distribution of the CATE and prove that the posterior has the consistency property. That is, it concentrates around the true distribution. We show the effectiveness of the proposed method through numerical experiments based on synthetic data.
Directed Acyclic Graphs (DAGs) provide a powerful framework to model causal relationships among variables in multivariate settings; in addition, through the do-calculus theory, they allow for the identification and estimation of causal effects between variables also from pure observational data. In this setting, the process of inferring the DAG structure from the data is referred to as causal structure learning or causal discovery. We introduce BCDAG, an R package for Bayesian causal discovery and causal effect estimation from Gaussian observational data, implementing the Markov chain Monte Carlo (MCMC) scheme proposed by Castelletti & Mascaro (2021). Our implementation scales efficiently with the number of observations and, whenever the DAGs are sufficiently sparse, with the number of variables in the dataset. The package also provides functions for convergence diagnostics and for visualizing and summarizing posterior inference. In this paper, we present the key features of the underlying methodology along with its implementation in BCDAG. We then illustrate the main functions and algorithms on both real and simulated datasets.
Isogeometric Analysis generalizes classical finite element analysis and intends to integrate it with the field of Computer-Aided Design. A central problem in achieving this objective is the reconstruction of analysis-suitable models from Computer-Aided Design models, which is in general a non-trivial and time-consuming task. In this article, we present a novel spline construction, that enables model reconstruction as well as simulation of high-order PDEs on the reconstructed models. The proposed almost-$C^1$ are biquadratic splines on fully unstructured quadrilateral meshes (without restrictions on placements or number of extraordinary vertices). They are $C^1$ smooth almost everywhere, that is, at all vertices and across most edges, and in addition almost (i.e. approximately) $C^1$ smooth across all other edges. Thus, the splines form $H^2$-nonconforming analysis-suitable discretization spaces. This is the lowest-degree unstructured spline construction that can be used to solve fourth-order problems. The associated spline basis is non-singular and has several B-spline-like properties (e.g., partition of unity, non-negativity, local support), the almost-$C^1$ splines are described in an explicit B\'ezier-extraction-based framework that can be easily implemented. Numerical tests suggest that the basis is well-conditioned and exhibits optimal approximation behavior.
While self-training has advanced semi-supervised semantic segmentation, it severely suffers from the long-tailed class distribution on real-world semantic segmentation datasets that make the pseudo-labeled data bias toward majority classes. In this paper, we present a simple and yet effective Distribution Alignment and Random Sampling (DARS) method to produce unbiased pseudo labels that match the true class distribution estimated from the labeled data. Besides, we also contribute a progressive data augmentation and labeling strategy to facilitate model training with pseudo-labeled data. Experiments on both Cityscapes and PASCAL VOC 2012 datasets demonstrate the effectiveness of our approach. Albeit simple, our method performs favorably in comparison with state-of-the-art approaches. Code will be available at //github.com/CVMI-Lab/DARS.
The focus of disentanglement approaches has been on identifying independent factors of variation in data. However, the causal variables underlying real-world observations are often not statistically independent. In this work, we bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data in a large-scale empirical study (including 4260 models). We show and quantify that systematically induced correlations in the dataset are being learned and reflected in the latent representations, which has implications for downstream applications of disentanglement such as fairness. We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress in the past couple of decades. However, most work has been devoted to pointsets whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments to time-series in general dimension. For $\ell_p$-products of Euclidean metrics, for any $p$, we design simple and efficient data structures for ANN, based on randomized projections, which are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves, which generalizes both discrete Fr\'echet and Dynamic Time Warping distances. These are the most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with arbitrarily good approximation factor, at the expense of increasing space usage and preprocessing time over existing methods. Query time complexity is comparable or significantly improved by our algorithms, our algorithm is especially efficient when the length of the curves is bounded.
In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but is also able to annotate the posterior probability of entity spans for a given text through relations. Experiments demonstrate empirical improvements over both a word-based baseline language model and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model for which query access comes in the form of $\langle X_i, A\rangle:=tr(X_i^\top A)$; perhaps surprisingly these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.