The Linear Model of Co-regionalization (LMC) is a very general model of multitask Gaussian processes for regression or classification. While its expressivity and conceptual simplicity are appealing, naive implementations have cubic complexity in the number of datapoints and the number of tasks, making approximations mandatory for most applications. However, recent work has shown that under some conditions the latent processes of the model can be decoupled, leading to a complexity that is only linear in the number of those processes. Here we extend these results, showing from the most general assumptions that the only condition necessary for an efficient exact computation of the LMC is a mild hypothesis on the noise model. We introduce a full parametrization of the resulting \emph{projected LMC} model, together with an expression of the marginal likelihood enabling efficient optimization. We perform a parametric study on synthetic data to show the excellent performance of our approach, compared to an unrestricted exact LMC and approximations of the latter. Overall, the projected LMC appears as a credible and simpler alternative to state-of-the-art models, and it greatly facilitates computations such as leave-one-out cross-validation and fantasization.
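For context, a minimal sketch (of the standard LMC covariance, not the paper's projected formulation) makes the cubic cost visible: the joint covariance over T tasks and N inputs is a sum of Kronecker products between per-process coregionalization matrices and input kernels, giving a TN x TN matrix. The function names `lmc_covariance` and `rbf_kernel` and the kernel choice below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel between two sets of inputs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def lmc_covariance(X, A_list, lengthscales):
    """Full multitask covariance of a Linear Model of Co-regionalization.

    Each latent process q has its own kernel k_q and a coregionalization
    matrix B_q = A_q A_q^T; the joint covariance over all tasks and inputs
    is sum_q kron(B_q, k_q(X, X)), of size (T*N, T*N).
    """
    N, T = X.shape[0], A_list[0].shape[0]
    K = np.zeros((T * N, T * N))
    for A_q, ell_q in zip(A_list, lengthscales):
        B_q = A_q @ A_q.T                      # T x T coregionalization matrix
        K += np.kron(B_q, rbf_kernel(X, X, ell_q))
    return K

# toy example: 3 tasks, 2 latent processes, 50 inputs
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
A_list = [rng.normal(size=(3, 1)) for _ in range(2)]
K = lmc_covariance(X, A_list, lengthscales=[0.2, 0.5])
print(K.shape)  # (150, 150)
```

Inverting this matrix directly costs O((TN)^3), which is what motivates approximations or the decoupling of latent processes exploited by the projected LMC.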
The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold embedded in high-dimensional space. This phenomenon is observed empirically in many real-world situations, has led to the development of a wide range of statistical methods over the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model, we derive procedures to discover and interpret the geometry of high-dimensional data and to explore hypotheses about the data-generating mechanism. These procedures operate under minimal assumptions and make use of well-known, scalable graph-analytic algorithms.
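As one example of the kind of well-known graph-analytic algorithm alluded to (a generic Laplacian-eigenmap sketch, not necessarily the authors' procedure), low-dimensional geometry can be probed by building a nearest-neighbour graph and embedding it spectrally. The helper name `spectral_embedding` and all parameter values are illustrative; the dense eigensolver below is for small toy data, whereas large graphs would use sparse solvers.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

def spectral_embedding(X, n_neighbors=10, n_components=2):
    """Laplacian-eigenmap style embedding of high-dimensional data.

    Builds a symmetric k-nearest-neighbour graph and returns the eigenvectors
    of its normalized Laplacian with the smallest non-trivial eigenvalues,
    which often reveal low-dimensional manifold structure.
    """
    G = kneighbors_graph(X, n_neighbors, mode="connectivity", include_self=False)
    G = 0.5 * (G + G.T)                          # symmetrize the graph
    L = laplacian(G, normed=True)
    vals, vecs = np.linalg.eigh(L.toarray())     # dense solve; fine for toy sizes
    return vecs[:, 1:n_components + 1]           # skip the trivial first mode

# toy example: a noisy circle embedded in 50 ambient dimensions
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
low = np.c_[np.cos(theta), np.sin(theta)]
X = low @ rng.normal(size=(2, 50)) + 0.01 * rng.normal(size=(500, 50))
Y = spectral_embedding(X)
print(Y.shape)  # (500, 2)
```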
We tackle the problem of Non-stochastic Control (NSC) with the aim of obtaining algorithms whose policy regret is proportional to the difficulty of the controlled environment. Namely, we tailor the Follow The Regularized Leader (FTRL) framework to dynamical systems by using regularizers that are proportional to the actually witnessed costs. The main challenge arises from using the proposed adaptive regularizers in the presence of a state, or equivalently a memory, which couples the effects of the online decisions and requires new tools for bounding the regret. Via new analysis techniques for integrating FTRL into NSC, we obtain novel disturbance action controllers (DAC) with sub-linear, data-adaptive policy regret bounds that shrink when the trajectory of costs has small gradients, while remaining sub-linear even in the worst case.
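To fix ideas, here is a minimal sketch of FTRL with a data-adaptive quadratic regularizer on a Euclidean ball for memoryless linear losses; it only illustrates the "regularization proportional to witnessed costs" idea and is not the paper's DAC controller for systems with state. The function name `ftrl_adaptive` and the AdaGrad-style choice of the regularization strength are assumptions for illustration.

```python
import numpy as np

def ftrl_adaptive(gradients, radius=1.0, eps=1e-8):
    """Follow-The-Regularized-Leader with a data-adaptive quadratic regularizer.

    At round t the iterate minimizes <sum of past gradients, x> + (sigma_t/2)||x||^2
    over the Euclidean ball of the given radius, where sigma_t grows with the
    magnitude of the gradients seen so far, so regularization tracks the
    actually witnessed costs.
    """
    d = gradients.shape[1]
    x = np.zeros(d)
    grad_sum = np.zeros(d)
    sq_norm_sum = 0.0
    iterates = []
    for g in gradients:
        iterates.append(x.copy())                 # play before seeing the loss
        grad_sum += g
        sq_norm_sum += float(g @ g)
        sigma = np.sqrt(sq_norm_sum) / radius + eps
        x = -grad_sum / sigma                     # unconstrained FTRL minimizer
        nrm = np.linalg.norm(x)
        if nrm > radius:                          # project back onto the ball
            x *= radius / nrm
    return np.array(iterates)

# toy example: linear losses with small gradients -> small accumulated regret
rng = np.random.default_rng(0)
G = 0.1 * rng.normal(size=(200, 5))
xs = ftrl_adaptive(G)
print(xs.shape)  # (200, 5)
```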
We characterize the strength, in terms of Weihrauch degrees, of certain problems related to Ramsey-like theorems concerning colourings of the rationals and of the natural numbers. The theorems we are chiefly interested in assert the existence of almost-homogeneous sets for colourings of pairs of rationals or of natural numbers, respectively, satisfying properties determined by some additional algebraic structure on the set of colours. In the context of reverse mathematics, most of the principles we study are equivalent to $\Sigma^0_2$-induction over $\mathrm{RCA}_0$. The associated problems in the Weihrauch lattice are related to $\mathrm{TC}_\mathbb{N}^*$, $(\mathrm{LPO}')^*$, or their product, depending on their precise formalizations.
We consider the application of the generalized Convolution Quadrature (gCQ) to approximate the solution of an important class of sectorial problems. The gCQ is a generalization of Lubich's Convolution Quadrature (CQ) that allows for variable time steps. The available stability and convergence theory for the gCQ requires unrealistic regularity assumptions on the data, which do not hold in many applications of interest, such as the approximation of subdiffusion equations. It is well known that for insufficiently smooth data the original CQ, with uniform steps, exhibits an order reduction close to the singularity. We generalize the analysis of the gCQ to data satisfying realistic regularity assumptions and provide sufficient conditions for stability and convergence on arbitrary sequences of time points. We consider the particular case of graded meshes and show how to choose them optimally, according to the behaviour of the data. An important advantage of the gCQ method is that it allows for a fast implementation with reduced memory requirements. We describe how the fast and oblivious gCQ can be implemented and illustrate our theoretical results with several numerical experiments.
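For illustration, a graded mesh of the type referred to can be written as t_k = T (k/N)^gamma; the sketch below only shows the mesh construction under that standard parametrization and does not reproduce the paper's criterion for choosing the grading optimally.

```python
import numpy as np

def graded_mesh(T, N, gamma):
    """Graded time mesh t_k = T * (k / N)**gamma on [0, T].

    gamma = 1 gives uniform steps; gamma > 1 clusters points near t = 0,
    where solutions of subdiffusion-type problems are typically non-smooth.
    The local step size then behaves like t_k^(1 - 1/gamma).
    """
    k = np.arange(N + 1)
    return T * (k / N) ** gamma

t_uniform = graded_mesh(T=1.0, N=10, gamma=1.0)
t_graded = graded_mesh(T=1.0, N=10, gamma=3.0)
print(np.round(t_graded, 4))  # steps are much smaller near the initial time
```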
We develop an online learning MCMC approach applicable to hierarchical Bayesian models and GLMs. We also develop a fat-tailed LTV model that generalizes over several kinds of fat and thin tails. We demonstrate both developments on commercial LTV data from a large mobile app.
Experimental and observational studies often lead to spurious associations between the outcome and the independent variables describing the intervention, because of confounding by third-party factors. Even in randomized clinical trials, confounding might be unavoidable due to small sample sizes. Practically, this poses a problem, because it is either expensive to re-design and conduct a new study, or even impossible to alleviate the contribution of some confounders due to, e.g., ethical concerns. Here, we propose a method to consistently derive hypothetical studies that retain as many of the dependencies in the original study as mathematically possible, while removing any association of observed confounders with the independent variables. Using historic studies, we illustrate how the confounding-free scenario re-estimates the effect size of the intervention. The new effect size estimate represents a concise prediction in the hypothetical scenario, which paves the way from the original data towards the design of future studies.
We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker's voice and accent. We present extensive results in monolingual and multilingual scenarios. ParrotTTS outperforms state-of-the-art multilingual TTS models while using only a fraction of the paired data required by the latter.
High-dimensional variable selection, with many more covariates than observations, is widely documented in standard regression models, but there are still few tools to address it in non-linear mixed-effects models, where data are collected repeatedly on several individuals. In this work, variable selection is approached from a Bayesian perspective and a selection procedure is proposed, combining the use of a spike-and-slab prior and the Stochastic Approximation version of the Expectation Maximisation (SAEM) algorithm. Similarly to Lasso regression, the set of relevant covariates is selected by exploring a grid of values for the penalisation parameter. The SAEM approach is much faster than a classical MCMC (Markov chain Monte Carlo) algorithm, and our method shows very good selection performance on simulated data. Its flexibility is demonstrated by implementing it for a variety of non-linear mixed-effects models. The usefulness of the proposed method is illustrated on a problem of genetic marker identification, relevant for genomic-assisted selection in plant breeding.
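As a small illustration of the prior involved (not the paper's SAEM procedure), a continuous spike-and-slab prior draws each coefficient from a mixture of a narrow "spike" around zero and a wide "slab"; the parameter names `pi`, `spike_sd` and `slab_sd` below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def spike_and_slab_logpdf(beta, pi=0.1, spike_sd=1e-3, slab_sd=1.0):
    """Log-density of a continuous spike-and-slab prior on each coefficient.

    With probability 1 - pi a coefficient comes from a narrow Gaussian "spike"
    centred at zero, and with probability pi from a wide Gaussian "slab".
    A small spike_sd shrinks irrelevant covariates towards zero, playing a
    role analogous to a Lasso-type penalty.
    """
    spike = norm.logpdf(beta, scale=spike_sd) + np.log(1.0 - pi)
    slab = norm.logpdf(beta, scale=slab_sd) + np.log(pi)
    return np.logaddexp(spike, slab)

beta = np.array([0.0, 0.05, 1.5])
print(spike_and_slab_logpdf(beta))
# sweeping the spike/slab settings over a grid plays the role of the
# penalisation path mentioned in the abstract
```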
The simultaneous orthogonal matching pursuit (SOMP) is a popular greedy approach for common support recovery of a row-sparse matrix. However, compared to the noiseless scenario, the performance analysis of noisy SOMP is still nascent, especially in the scenario of unbounded noise. In this paper, we present a new study, based on the mutual incoherence property (MIP), of the performance of noisy SOMP. Specifically, when the noise is bounded, we provide the condition under which exact support recovery is guaranteed in terms of the MIP. When the noise is unbounded, we instead derive a bound on the successful recovery probability (SRP) that depends on the specific distribution of the $\ell_2$-norm of the noise matrix. We then focus on the common case of random Gaussian noise and show that the lower bound of the SRP follows a Tracy-Widom law. The analysis reveals how the number of measurements, the noise level, the number of sparse vectors, and the mutual coherence must scale to guarantee a predefined recovery performance. Theoretically, we show that the mutual coherence of the measurement matrix must decrease proportionally to the noise standard deviation, and that the number of sparse vectors needs to grow proportionally to the noise variance. Finally, we extensively validate the derived analysis through numerical simulations.
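For reference, the baseline SOMP procedure whose noisy behaviour is analysed can be sketched as follows; this is an illustrative implementation assuming an $\ell_2$ aggregation rule across measurement vectors, and it does not reproduce the paper's MIP-based analysis.

```python
import numpy as np

def somp(A, Y, k):
    """Simultaneous Orthogonal Matching Pursuit (SOMP).

    A : (m, n) measurement matrix; Y : (m, L) observations of L sparse vectors
    sharing a common support of size k. Each iteration adds to the support the
    column of A whose correlations with the residual (aggregated in l2 norm
    across the L columns) are largest, then refits the coefficients by least
    squares on the current support.
    """
    m, n = A.shape
    support = []
    R = Y.copy()
    for _ in range(k):
        corr = np.linalg.norm(A.T @ R, axis=1)   # aggregate correlation per atom
        corr[support] = -np.inf                   # do not reselect chosen atoms
        support.append(int(np.argmax(corr)))
        X_s, *_ = np.linalg.lstsq(A[:, support], Y, rcond=None)
        R = Y - A[:, support] @ X_s
    X = np.zeros((n, Y.shape[1]))
    X[support] = X_s
    return X, sorted(support)

# toy example: 3 jointly sparse vectors with a common support of size 4
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
true_support = rng.choice(100, size=4, replace=False)
X_true = np.zeros((100, 3))
X_true[true_support] = rng.normal(size=(4, 3))
Y = A @ X_true + 0.01 * rng.normal(size=(40, 3))
X_hat, S_hat = somp(A, Y, k=4)
print(sorted(true_support), S_hat)
```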
The advent of Generative AI has marked a significant milestone in artificial intelligence, demonstrating remarkable capabilities in generating realistic images, text, and data patterns. However, these advancements come with heightened concerns over data privacy and copyright infringement, primarily due to the reliance on vast datasets for model training. Traditional approaches such as differential privacy, machine unlearning, and data poisoning offer only fragmented solutions to these complex issues. Our paper delves into the multifaceted challenges of privacy and copyright protection within the data lifecycle. We advocate for integrated approaches that combine technical innovation with ethical foresight, holistically addressing these concerns by investigating and devising solutions informed by the lifecycle perspective. This work aims to catalyze a broader discussion and inspire concerted efforts towards data privacy and copyright integrity in Generative AI.