We introduce MCCE: Monte Carlo sampling of valid and realistic Counterfactual Explanations for tabular data, a novel counterfactual explanation method that generates on-manifold, actionable, and valid counterfactuals by modeling the joint distribution of the mutable features given the immutable features and the decision. Unlike other on-manifold methods, which tend to rely on variational autoencoders and impose strict requirements on the prediction model and the data, MCCE handles any type of prediction model and categorical features with more than two levels. MCCE first models the joint distribution of the features and the decision with an autoregressive generative model in which the conditionals are estimated using decision trees. It then samples a large set of observations from this model and, finally, removes the samples that do not satisfy certain criteria. We compare MCCE with a range of state-of-the-art on-manifold counterfactual methods on four well-known data sets and show that MCCE outperforms these methods on all common performance metrics as well as in speed. In particular, including the decision in the modeling process improves the efficiency of the method substantially.
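To make MCCE's sample-then-filter pipeline concrete, here is a minimal Python sketch, assuming a fitted binary classifier `clf`, a pandas DataFrame of label-encoded training features containing a decision column, and illustrative helper names (`fit_autoregressive_trees`, `sample_counterfactuals`, `filter_valid`) that are not from the paper; leaf-mean prediction stands in for the empirical leaf sampling of the full method.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def fit_autoregressive_trees(X, immutable, mutable, decision):
    """Fit one tree per mutable feature, conditioned on the immutable
    features, the decision, and the mutable features that precede it."""
    trees, parents = {}, list(immutable) + [decision]
    for col in mutable:
        tree = DecisionTreeRegressor(min_samples_leaf=20).fit(X[parents], X[col])
        trees[col] = (tree, list(parents))
        parents = parents + [col]
    return trees

def sample_counterfactuals(x, trees, mutable, decision, target=1, n=1000):
    """Draw candidates conditioned on the desired decision; this sketch
    uses each tree's leaf mean instead of sampling within the leaf."""
    cand = pd.DataFrame([x] * n).reset_index(drop=True)
    cand[decision] = target
    for col in mutable:
        tree, parents = trees[col]
        cand[col] = tree.predict(cand[parents])
    return cand

def filter_valid(cand, clf, x, mutable, decision, target=1):
    """Keep candidates that flip the decision; return the closest to x."""
    feats = [c for c in cand.columns if c != decision]
    valid = cand[clf.predict(cand[feats]) == target]
    dist = ((valid[mutable] - x[mutable]) ** 2).sum(axis=1)
    return valid.loc[dist.idxmin()]
```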
This paper presents a novel stochastic optimisation methodology to perform empirical Bayesian inference in semi-blind image deconvolution problems. Given a blurred image and a parametric class of possible operators, the proposed optimisation approach automatically calibrates the parameters of the blur model by maximum marginal likelihood estimation, followed by (non-blind) image deconvolution by maximum-a-posteriori estimation conditionally to the estimated model parameters. In addition to the blur model, the proposed approach also automatically calibrates the noise variance as well as any regularisation parameters. The marginal likelihood of the blur, noise variance, and regularisation parameters is generally computationally intractable, as it requires calculating several integrals over the entire solution space. Our approach addresses this difficulty by using a stochastic approximation proximal gradient optimisation scheme, which iteratively approximates such integrals by using a Moreau-Yosida regularised unadjusted Langevin Markov chain Monte Carlo algorithm. This optimisation strategy can be easily and efficiently applied to any model that is log-concave, using the same gradient and proximal operators that are required to compute the maximum-a-posteriori solution by convex optimisation. We provide convergence guarantees for the proposed optimisation scheme under realistic and easily verifiable conditions and subsequently demonstrate the effectiveness of the approach with a series of deconvolution experiments and comparisons with alternative strategies from the state of the art.
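As an illustration of the inner loop, here is a minimal sketch of one stochastic approximation step paired with a MYULA sampling step, assuming user-supplied callables `grad_f` (gradient of the smooth negative log-likelihood), `prox_g` (proximal operator of the non-smooth regulariser), and `grad_theta` (a stochastic gradient of the log marginal likelihood in the unknown parameters); all three names and the step sizes are illustrative placeholders, not the paper's API.

```python
import numpy as np

def sapg_step(x, theta, grad_f, prox_g, grad_theta, gamma, lam, delta, rng):
    # MYULA update: an unadjusted Langevin step on the Moreau-Yosida
    # smoothed posterior, built from the same gradient and proximal
    # operators used for MAP estimation by convex optimisation.
    drift = -gamma * grad_f(x, theta) - (gamma / lam) * (x - prox_g(x, lam, theta))
    x_new = x + drift + np.sqrt(2 * gamma) * rng.standard_normal(x.shape)
    # Stochastic approximation update of the unknown parameters (blur,
    # noise variance, regularisation weight), driven by the MCMC sample.
    theta_new = theta + delta * grad_theta(x_new, theta)
    return x_new, theta_new
```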
Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known for average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently close to 1). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs but, perhaps surprisingly, that history-dependent (Markovian) policies strictly outperform stationary policies for average optimality in s-rectangular RMDPs. We also study Blackwell optimality for sa-rectangular RMDPs, where we show that {\em approximate} Blackwell optimal policies always exist, although Blackwell optimal policies may not exist. We also provide a sufficient condition for their existence, which encompasses virtually all examples from the literature. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games.
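For reference, the two optimality notions can be stated as follows; this is the standard formulation with illustrative notation, not text from the paper.

```latex
% Average optimality: a policy $\pi^\star$ maximises the worst-case
% long-run average reward over the uncertainty set $\mathcal{P}$,
\mathrm{Avg}(\pi) \;=\; \liminf_{T \to \infty}\, \inf_{P \in \mathcal{P}}\,
\frac{1}{T}\, \mathbb{E}^{\pi}_{P}\!\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right].
% Blackwell optimality: $\pi^\star$ is Blackwell optimal if there exists
% $\bar{\gamma} \in (0,1)$ such that $\pi^\star$ is optimal for the
% $\gamma$-discounted robust return for every $\gamma \in (\bar{\gamma}, 1)$.
```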
With the increasing availability of large-scale datasets, computational power, and tools such as automatic differentiation and expressive neural network architectures, sequential data are now often treated in a data-driven way, with a dynamical model trained directly from observation data. While neural networks are often seen as uninterpretable black-box architectures, they can still benefit from physical priors on the data and from mathematical knowledge. In this paper, we use a neural network architecture which leverages the long-known Koopman operator theory to embed dynamical systems in latent spaces where their dynamics can be described linearly, enabling a number of appealing features. We introduce methods that make it possible to train such a model for long-term continuous reconstruction, even in difficult contexts where the data come as irregularly-sampled time series. The potential for self-supervised learning is also demonstrated, as we show the promising use of trained dynamical models as priors for variational data assimilation techniques, with applications to, e.g., time series interpolation and forecasting.
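A minimal sketch of the general Koopman autoencoder idea, in PyTorch, assuming discrete-time data; the layer sizes, training losses, and class name are illustrative rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class KoopmanAE(nn.Module):
    def __init__(self, state_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))
        # Linear latent dynamics: one matrix advances the latent state by a
        # step; its powers give long-horizon forecasts. For irregularly
        # sampled series, a continuous-time variant would instead apply a
        # matrix exponential exp(dt * A) of a learned generator A.
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)

    def forward(self, x0, horizon):
        z = self.encoder(x0)
        preds = []
        for _ in range(horizon):
            z = self.K(z)                 # linear dynamics in latent space
            preds.append(self.decoder(z))
        return torch.stack(preds, dim=1)  # (batch, horizon, state_dim)

# Training would couple a reconstruction loss, e.g.
# mse(decoder(encoder(x)), x), with a long-term prediction loss,
# e.g. mse(model(x[:, 0], T), x[:, 1:T+1]).
```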
We show that the known list-decoding algorithms for univariate multiplicity and folded Reed-Solomon (FRS) codes can be made to run in nearly-linear time. This yields, to our knowledge, the first family of codes that can be decoded in nearly-linear time even as they approach list-decoding capacity. Univariate multiplicity codes and FRS codes are natural variants of Reed-Solomon codes that were discovered and studied for their applications to list-decoding. It is known that for every $\epsilon > 0$ and rate $R \in (0,1)$, there exist explicit families of these codes that have rate $R$ and can be list-decoded from a $(1-R-\epsilon)$ fraction of errors with constant list size in polynomial time (Guruswami & Wang (IEEE Trans. Inform. Theory, 2013) and Kopparty, Ron-Zewi, Saraf & Wootters (SIAM J. Comput. 2023)). In this work, we present randomized algorithms that perform the above tasks in nearly-linear time. Our algorithms have two main components. The first builds upon the lattice-based approach of Alekhnovich (IEEE Trans. Inf. Theory 2005), who designed a nearly-linear time list-decoding algorithm for Reed-Solomon codes approaching the Johnson radius. As part of the second component, we design nearly-linear time algorithms for two natural algebraic problems. The first algorithm solves linear differential equations of the form $Q\left(x, f(x), \frac{df}{dx}, \dots, \frac{d^m f}{dx^m}\right) \equiv 0$, where $Q$ has the form $Q(x, y_0, \dots, y_m) = \tilde{Q}(x) + \sum_{i=0}^m Q_i(x) \cdot y_i$. The second solves functional equations of the form $Q\left(x, f(x), f(\gamma x), \dots, f(\gamma^m x)\right) \equiv 0$, where $\gamma$ is a high-order field element. These algorithms can be viewed as generalizations of the classical algorithms of Sieveking (Computing 1972) and Kung (Numer. Math. 1974) for computing the modular inverse of a power series, and might be of independent interest.
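As a point of reference for the second component, here is a minimal Python sketch of the classical Sieveking-Kung power series inversion that these solvers generalise: Newton iteration doubles the precision each round. The helper `poly_mul` is a schoolbook placeholder; an FFT-based product is what yields the nearly-linear running time.

```python
def series_inverse(f, n):
    """f: coefficient list [f0, f1, ...] with f0 != 0.
    Returns g with f * g = 1 (mod x^n)."""
    g = [1 / f[0]]                  # exact inverse to precision 1
    k = 1
    while k < n:
        k = min(2 * k, n)
        # Newton step: g <- g * (2 - f*g) mod x^k, which squares the error.
        fg = poly_mul(f, g, k)
        t = [-c for c in fg]
        t[0] += 2
        g = poly_mul(g, t, k)
    return g

def poly_mul(a, b, k):
    """Schoolbook product truncated mod x^k (placeholder for FFT)."""
    out = [0] * k
    for i, ai in enumerate(a[:k]):
        for j, bj in enumerate(b[:k - i]):
            out[i + j] += ai * bj
    return out

# Example: the inverse of 1 + x to 4 terms is 1 - x + x^2 - x^3.
assert series_inverse([1, 1], 4) == [1, -1, 1, -1]
```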
We consider the weak convergence of the Euler-Maruyama approximation for Schr\"odinger-F\"ollmer diffusions, which are solutions of Schr\"odinger bridge problems and can be used for sampling from given distributions. We show that the distribution of the terminal random variable of the time-discretized process converges weakly to the target distribution under mild regularity conditions.
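For concreteness, a minimal sketch of the Euler-Maruyama scheme on $[0,1]$ for an SDE $dX_t = b(X_t, t)\,dt + dW_t$ started at $X_0 = 0$; the drift callable `b` stands in for the Schrödinger-Föllmer drift, which steers the process toward the target distribution at time 1.

```python
import numpy as np

def euler_maruyama(b, dim, n_steps, rng):
    h = 1.0 / n_steps               # uniform step on [0, 1]
    x = np.zeros(dim)               # Schrodinger-Follmer starts at the origin
    for k in range(n_steps):
        t = k * h
        # Drift step plus Gaussian increment of variance h.
        x = x + b(x, t) * h + np.sqrt(h) * rng.standard_normal(dim)
    return x                        # approximate draw from the target

# Usage: samples = [euler_maruyama(b, 2, 200, np.random.default_rng(i))
#                   for i in range(1000)]
```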
The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm, called BiasNorm, that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
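A minimal sketch of the scaling idea behind ScaledAdam, not the full optimizer: the Adam-style step for each tensor is multiplied by that tensor's current RMS magnitude, so every tensor changes by roughly the same relative amount per step. Bias correction and the learnt parameter-scale update are omitted here, and the function name and hyperparameters are illustrative.

```python
import torch

def scaled_adam_step(p, grad, m, v, lr, beta1=0.9, beta2=0.98, eps=1e-8):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    step = m / (v.sqrt() + eps)                          # Adam-style direction
    param_rms = p.norm() / p.numel() ** 0.5              # tensor's current scale
    p.add_(step, alpha=-lr * param_rms.item())           # relative-size update
    return p
```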
C-based interpreters such as CPython make extensive use of C "extension" code, which is opaque to static analysis tools and to faster runtimes with JIT compilers, such as PyPy. Not only are the extensions opaque, but the interface between the dynamic language types and the C types can introduce an impedance mismatch. We hypothesise that frequent calls to C extension code introduce significant overhead that is often unnecessary. We validate this hypothesis by introducing a simple technique, "typed methods", which allows selected C extension functions to have additional metadata attached to them in a backward-compatible way. This additional metadata makes it much easier for a JIT compiler (and, as we show, even an interpreter!) to significantly reduce the call and return overhead. Although we have prototyped typed methods in PyPy, we suspect that the same technique is applicable to a wider variety of language runtimes and that the information could also be consumed by static analysis tooling.
Emotion datasets used for Speech Emotion Recognition (SER) often contain acted or elicited speech, limiting their applicability in real-world scenarios. In this work, we used the Emotional Voice Messages (EMOVOME) database, comprising spontaneous voice messages from conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We created speaker-independent SER models using the eGeMAPS features, transformer-based models, and their combination. We compared the results with reference databases and analyzed the influence of annotators and gender fairness. The pre-trained Unispeech-L model and its combination with eGeMAPS achieved the highest results, with 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. Performance on EMOVOME was lower than on the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in the prediction of emotion categories, while similar results were obtained for valence and arousal. Additionally, EMOVOME outcomes varied with the annotator labels, showing superior results and better fairness when combining expert and non-expert annotations. This study contributes significantly to the evaluation of SER models in real-life situations, advancing the development of applications for analyzing spontaneous voice messages.
Consider a diffusion process $X=(X_t)_{t \in [0,1]}$, observed at discrete times and high frequency, solution of a stochastic differential equation whose drift and diffusion coefficients are assumed to be unknown. In this article, we focus on the nonparametric estimation of the diffusion coefficient. We propose ridge estimators of the square of the diffusion coefficient, built from discrete observations of $X$ and obtained by minimizing a least squares contrast. We prove that the estimators are consistent and derive rates of convergence as the sample size tends to infinity and the discretization step of the time interval $[0,1]$ tends to zero. The theoretical results are complemented by a numerical study on synthetic data.
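For concreteness, a standard form that such a least squares contrast takes, written with illustrative notation rather than the paper's exact definitions.

```latex
% With observations $X_{t_0}, \dots, X_{t_N}$ on a grid of step
% $\Delta = 1/N$, the normalised squared increments
% $(X_{t_{k+1}} - X_{t_k})^2 / \Delta$ concentrate around
% $\sigma^2(X_{t_k})$, so a least squares contrast for
% $h \approx \sigma^2$ takes the form
\gamma_N(h) \;=\; \frac{1}{N} \sum_{k=0}^{N-1}
\left( \frac{(X_{t_{k+1}} - X_{t_k})^2}{\Delta} - h(X_{t_k}) \right)^{\!2},
% minimised over a finite-dimensional function space, with a ridge
% penalty $\kappa \lVert h \rVert^2$ added to stabilise the minimisation.
```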
We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory (BiLSTM) for sentiment analysis. The network takes advantage of external knowledge bases to improve sentiment prediction, using knowledge graph embeddings generated from WordNet. We build our model by combining the two-layered attention network with a supervised model based on Support Vector Regression, using a Multilayer Perceptron network, and evaluate it on the benchmark dataset of SemEval 2017 Task 5. Experimental results show that the proposed model surpasses the top system of SemEval 2017 Task 5, improving on the state of the art by 1.7 and 3.7 points for sub-tracks 1 and 2, respectively.
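A minimal PyTorch sketch of a two-layered attention readout over BiLSTM states; the layer sizes, the way knowledge-graph embeddings are concatenated to word embeddings, and the class name are illustrative guesses rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TwoLayerAttention(nn.Module):
    def __init__(self, word_dim, kg_dim, hidden):
        super().__init__()
        # Word embeddings and WordNet-derived KG embeddings concatenated.
        self.bilstm = nn.LSTM(word_dim + kg_dim, hidden,
                              bidirectional=True, batch_first=True)
        self.attn1 = nn.Linear(2 * hidden, 1)   # first attention layer
        self.attn2 = nn.Linear(2 * hidden, 1)   # second attention layer
        self.out = nn.Linear(2 * hidden, 1)     # sentiment score regressor

    def forward(self, words, kg):  # (B, T, word_dim), (B, T, kg_dim)
        h, _ = self.bilstm(torch.cat([words, kg], dim=-1))
        a1 = torch.softmax(self.attn1(h), dim=1)
        h = a1 * h                               # re-weight states, layer 1
        a2 = torch.softmax(self.attn2(h), dim=1)
        return self.out((a2 * h).sum(dim=1))     # attended sentence score
```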