Inference from limited data requires a notion of measure on parameter space, most explicit in the Bayesian framework as a prior. Here we demonstrate that Jeffreys prior, the best-known uninformative choice, introduces enormous bias when applied to typical scientific models. Such models have a relevant effective dimensionality much smaller than the number of microscopic parameters. Because Jeffreys prior treats all microscopic parameters equally, it is far from uniform when projected onto the sub-space of relevant parameters, due to variations in the local co-volume of irrelevant directions. We present results on a principled choice of measure which avoids this issue, leading to unbiased inference in complex models. This optimal prior depends on the quantity of data to be gathered, and approaches Jeffreys prior in the asymptotic limit. However, this limit cannot be justified without an impossibly large amount of data, exponential in the number of microscopic parameters.
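For orientation, the standard definition of Jeffreys prior can be written out as follows. The symbols $\theta$, $g_{\mu\nu}$, and the split into relevant and irrelevant blocks are notation assumed here for illustration; the block-diagonal factorisation is a simplifying assumption, not a statement taken from the abstract.

```latex
% Jeffreys prior: volume element of the Fisher information metric g
p_J(\theta) \propto \sqrt{\det g(\theta)},
\qquad
g_{\mu\nu}(\theta)
  = \mathbb{E}_{x \sim p(x \mid \theta)}
    \left[ \partial_\mu \log p(x \mid \theta)\,
           \partial_\nu \log p(x \mid \theta) \right].

% Illustrative assumption: g is block diagonal in
% (relevant, irrelevant) coordinates, so the density factorises
\sqrt{\det g(\theta)}
  = \sqrt{\det g_{\mathrm{rel}}(\theta)}\;
    \sqrt{\det g_{\mathrm{irr}}(\theta)}.
```

Under that assumption, the second factor plays the role of the local co-volume of irrelevant directions: wherever it varies with the relevant coordinates, the prior restricted to the relevant sub-space is not uniform with respect to $g_{\mathrm{rel}}$.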
Natural language generation models reproduce and often amplify the biases present in their training data. Previous research explored using sequence-to-sequence rewriting models to transform biased model outputs (or original texts) into more gender-fair language by creating pseudo training data through linguistic rules. However, this approach is not practical for languages with more complex morphology than English. We hypothesise that creating training data in the reverse direction, i.e. starting from gender-fair text, is easier for morphologically complex languages and show that it matches the performance of state-of-the-art rewriting models for English. To eliminate the rule-based nature of data creation, we instead propose using machine translation models to create gender-biased text from real gender-fair text via round-trip translation. Our approach allows us to train a rewriting model for German without the need for elaborate handcrafted rules. The outputs of this model increased gender-fairness as shown in a human evaluation study.
We study the problem of change point (CP) detection in high-dimensional time series in the frequency domain. The overarching goal is to locate all change points and, for each change point, delineate which series are activated by the change and over which set of frequencies. The working assumption is that only a few series are activated per change and frequency. We solve the problem by computing a CUSUM tensor based on spectra estimated from blocks of the observed time series. A frequency-specific projection approach is applied to the CUSUM tensor for dimension reduction. The projection direction is estimated by a proposed sparse tensor decomposition algorithm. Finally, the projected CUSUM vectors across frequencies are aggregated by a sparsified wild binary segmentation for change point detection. We provide theoretical guarantees on the number of estimated change points and the convergence rate of their locations. We derive error bounds for the estimated projection direction for identifying the frequency-specific series that are activated in a change. We provide data-driven rules for the choice of parameters. We illustrate the efficacy of the proposed method by simulation and a stock returns application.
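As a rough illustration of the CUSUM building block mentioned above, the following sketch computes the classical mean-change CUSUM statistic along the first axis of an array of block-wise features. It is not the CUSUM tensor, sparse projection, or sparsified wild binary segmentation of the abstract; the function names `cusum` and `estimate_change_point`, and the choice to aggregate columns by their maximum, are assumptions made for this example.

```python
import numpy as np

def cusum(x):
    """Classical CUSUM statistic for a change in mean.

    x has shape (n,) or (n, d): n ordered blocks, d features per block.
    Returns an array of shape (n - 1, d); row t - 1 holds the statistic
    for a split after the first t blocks, t = 1, ..., n - 1.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two blocks")
    t = np.arange(1, n)[:, None]            # candidate split points
    csum = np.cumsum(x, axis=0)
    left_mean = csum[:-1] / t                # mean of the first t blocks
    right_mean = (csum[-1] - csum[:-1]) / (n - t)
    return np.sqrt(t * (n - t) / n) * np.abs(left_mean - right_mean)

def estimate_change_point(x):
    """Return the split t maximising the CUSUM statistic.

    Columns are aggregated by taking the maximum, which is an
    assumption of this sketch rather than the abstract's method.
    """
    stat = cusum(x).max(axis=1)
    return int(np.argmax(stat)) + 1
```

In a frequency-domain setting one could, for example, pass block-wise spectral estimates at a fixed frequency as the columns of `x`; how those estimates are formed is outside the scope of this sketch.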
In this work, we propose a new stochastic domain decomposition method for solving steady-state partial differential equations (PDEs) with random inputs. Based on the efficiency of the Variable-separation (VS) method in simulating stochastic partial differential equations (SPDEs), we extend it to stochastic algebraic systems and apply it to stochastic domain decomposition. The resulting Stochastic Domain Decomposition based on the Variable-separation method (SDD-VS) effectively addresses the ``curse of dimensionality'' by leveraging the explicit representation of stochastic functions derived from physical systems. The SDD-VS method aims to obtain a separated representation of the solution for the stochastic interface problem. To enhance efficiency, an offline-online computational decomposition is introduced. In the offline phase, the affine representation of stochastic algebraic systems is obtained through the successive application of the VS method. This serves as a crucial foundation for the SDD-VS method. In the online phase, the interface unknowns of SPDEs are estimated using a quasi-optimal separated representation, enabling the construction of efficient surrogate models for subproblems. The effectiveness of the proposed method is demonstrated via the numerical results of three concrete examples.
Extended Dynamic Mode Decomposition (EDMD) is a data-driven tool for forecasting and model reduction of dynamics, which has been extensively taken up in the physical sciences. While the method is conceptually simple, in deterministic chaos it is unclear what its properties are or even what it converges to. In particular, it is not clear how EDMD's least-squares approximation treats the classes of regular functions needed to make sense of chaotic dynamics. In this paper we develop a general, rigorous theory of EDMD on the simplest examples of chaotic maps: analytic expanding maps of the circle. To do this, we prove a new result in the theory of orthogonal polynomials on the unit circle (OPUC) and apply methods from transfer operator theory. We show that in the infinite-data limit, the least-squares projection is exponentially efficient for trigonometric polynomial observable dictionaries. As a result, we show that the forecasts and Koopman spectral data produced using EDMD in this setting converge to the physically meaningful limits, exponentially quickly in the size of the dictionary. This demonstrates that with only a relatively small polynomial dictionary, EDMD can be very effective, even when the sampling measure is not uniform. Furthermore, our OPUC result suggests that data-based least-squares projections may be a very effective approximation strategy.
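The least-squares step of EDMD with a trigonometric polynomial dictionary can be sketched as below. This is a minimal illustration under stated assumptions, not the construction analysed in the paper: the helper names `trig_dictionary`, `edmd_matrix`, and `expanding_map`, the particular map, and the sample size are all hypothetical choices made for this example.

```python
import numpy as np

def trig_dictionary(x, max_freq):
    """Evaluate psi_k(x) = exp(2*pi*i*k*x) for k = -max_freq..max_freq.

    Returns a complex array of shape (len(x), 2 * max_freq + 1).
    """
    x = np.asarray(x, dtype=float)
    freqs = np.arange(-max_freq, max_freq + 1)
    return np.exp(2j * np.pi * np.outer(x, freqs))

def edmd_matrix(x, y, max_freq):
    """Least-squares matrix K with Psi(x) @ K ~= Psi(y), where y = f(x).

    Column a of K holds the dictionary coefficients of the
    least-squares approximation of psi_a composed with f.
    """
    psi_x = trig_dictionary(x, max_freq)
    psi_y = trig_dictionary(y, max_freq)
    k_mat, *_ = np.linalg.lstsq(psi_x, psi_y, rcond=None)
    return k_mat

def expanding_map(x, a=0.05):
    """Hypothetical analytic expanding circle map (assumed example).

    The derivative 2 + 2*pi*a*cos(2*pi*x) is at least 2 - 2*pi*a > 1
    for a = 0.05, so the map is uniformly expanding.
    """
    return (2.0 * x + a * np.sin(2.0 * np.pi * x)) % 1.0

# Example: sample points, apply the map once, and fit the matrix.
rng = np.random.default_rng(0)
samples = rng.random(5000)
k_mat = edmd_matrix(samples, expanding_map(samples), max_freq=8)
eigenvalues = np.linalg.eigvals(k_mat)  # EDMD spectral estimates
```

Increasing `max_freq` enlarges the dictionary; the abstract's claim concerns how quickly quantities such as `eigenvalues` stabilise as that size grows, which this sketch does not attempt to verify.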
The empirical interpolation method (EIM) is a well-known technique for efficiently approximating parameterized functions. This paper proposes using the EIM algorithm to efficiently reduce the dimension of the training data in supervised machine learning. The resulting approach is termed DNN-EIM. Applications in data science (e.g., MNIST) and parameterized (and time-dependent) partial differential equations (PDEs) are considered. In the classification case, the proposed DNNs are trained in parallel for each class. This approach is sequential, i.e., new classes can be added without having to retrain the network. In the case of PDEs, a DNN is designed for each EIM point. Again, these networks can be trained in parallel, for each EIM point. In all cases, the parallel networks require fewer than ten times the number of training weights. Significant gains are observed in terms of training times, without sacrificing accuracy.
L-moments are expected values of linear combinations of order statistics that provide robust alternatives to traditional moments. The estimation of parametric models by matching sample L-moments -- a procedure known as ``method of L-moments'' -- has been shown to outperform maximum likelihood estimation (MLE) in small samples from popular distributions. The choice of the number of L-moments to be used in estimation remains \textit{ad-hoc}, though: researchers typically set the number of L-moments equal to the number of parameters, so as to achieve an order condition for identification. This approach is generally inefficient in larger sample sizes. In this paper, we show that, by properly choosing the number of L-moments and weighting these accordingly, we are able to construct an estimator that outperforms MLE in finite samples, and yet does not suffer from efficiency losses asymptotically. We do so by considering a ``generalised'' method of L-moments estimator and deriving its asymptotic properties in a framework where the number of L-moments varies with sample size. We then propose methods to automatically select the number of L-moments in a given sample. Monte Carlo evidence shows our proposed approach is able to outperform (in a mean-squared error sense) MLE in smaller samples, whilst performing as well as MLE in larger samples.
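To make the classical procedure concrete, the sketch below computes the first four sample L-moments from unbiased probability-weighted moments and then fits a normal distribution by matching the first two, i.e. the just-identified case where the number of L-moments equals the number of parameters. It does not implement the generalised, weighted estimator proposed in the abstract. The function names are assumptions for this example; the normal-distribution relation $\lambda_2 = \sigma / \sqrt{\pi}$ is a standard result.

```python
import numpy as np

def sample_l_moments(x):
    """First four sample L-moments via probability-weighted moments."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 4:
        raise ValueError("need at least four observations")
    i = np.arange(1, n + 1)                  # ranks of the order statistics
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3, l4

def fit_normal_by_l_moments(x):
    """Just-identified method of L-moments for N(mu, sigma^2).

    Matches the first two L-moments: mu = l1, sigma = sqrt(pi) * l2.
    """
    l1, l2, _, _ = sample_l_moments(x)
    return l1, np.sqrt(np.pi) * l2
```

Using more L-moments than parameters, as the abstract proposes, would turn the exact matching in `fit_normal_by_l_moments` into a weighted minimum-distance problem; that extension is not shown here.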
Collecting supporting evidence from large corpora of text (e.g., Wikipedia) is a great challenge for open-domain Question Answering (QA). In particular, for multi-hop open-domain QA, scattered pieces of evidence must be gathered together to support answer extraction. In this paper, we propose a new retrieval target, hop, to collect the hidden reasoning evidence from Wikipedia for complex question answering. Specifically, the hop in this paper is defined as the combination of a hyperlink and the corresponding outbound link document. The hyperlink is encoded as the mention embedding which models the structured knowledge of how the outbound link entity is mentioned in the textual context, and the corresponding outbound link document is encoded as the document embedding representing the unstructured knowledge within it. Accordingly, we build HopRetriever which retrieves hops over Wikipedia to answer complex questions. Experiments on the HotpotQA dataset demonstrate that HopRetriever outperforms previously published evidence retrieval methods by large margins. Moreover, our approach also yields quantifiable interpretations of the evidence collection process.
Person re-identification (\textit{re-id}) refers to matching pedestrians across disjoint, non-overlapping camera views. The most effective way to match pedestrians undergoing significant visual variations is to seek reliably invariant features that can describe the person of interest faithfully. Most existing methods are supervised and produce discriminative features by relying on labeled pairs of corresponding images. However, annotating pair-wise images is prohibitively labor-intensive, and thus not practical in large-scale camera networks. Moreover, seeking comparable representations across camera views demands a flexible model to address the complex distributions of images. In this work, we study the co-occurrence statistic patterns between pairs of images, and propose a crossing Generative Adversarial Network (Cross-GAN) for learning a joint distribution for cross-image representations in an unsupervised manner. Given a pair of person images, the proposed model consists of a variational auto-encoder to encode the pair into respective latent variables, a proposed cross-view alignment to reduce the view disparity, and an adversarial layer to seek the joint distribution of latent representations. The learned latent representations are well-aligned to reflect the co-occurrence patterns of paired images. We empirically evaluate the proposed model on challenging datasets, and our results show the importance of joint invariant features in improving matching rates of person re-id in comparison with semi-supervised and unsupervised state-of-the-art methods.