Far-field speech recognition is a challenging task that conventionally relies on signal-processing beamforming to combat noise and interference. However, performance is usually limited by a heavy reliance on environmental assumptions. In this paper, we propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system, extending the end-to-end speech recognition system to include speech enhancement. The framework is then jointly trained to optimize the final objective of interest. Specifically, factored complex linear projection (fCLP) is adopted to form the neural beamformer. Several pooling strategies for combining look directions are then compared to find the optimal approach. Moreover, source direction information is integrated into the beamforming to explore its usefulness as a prior, which is usually available, especially in multi-modality scenarios. Experiments on different microphone array geometries are conducted to evaluate robustness against variation in microphone spacing. Large in-house databases are used to evaluate the effectiveness of the proposed framework, and the proposed method achieves a 19.26\% improvement over a strong baseline.
We introduce a fine-grained framework for uncertainty quantification of predictive models under distributional shifts. This framework distinguishes the shift in covariate distributions from that in the conditional relationship between the outcome (Y) and the covariates (X). We propose to reweight the training samples to adjust for an identifiable covariate shift while protecting against worst-case conditional distribution shift bounded in an $f$-divergence ball. Based on ideas from conformal inference and distributionally robust learning, we present an algorithm that outputs (approximately) valid and efficient prediction intervals in the presence of distributional shifts. As a use case, we apply the framework to sensitivity analysis of individual treatment effects with hidden confounding. The proposed methods are evaluated in simulation studies and three real data applications, demonstrating superior robustness and efficiency compared with existing benchmarks.
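The reweighting step described above can be illustrated with a weighted conformal quantile. This is a minimal sketch, assuming absolute-residual scores and known likelihood-ratio weights; it covers only the covariate-shift adjustment, not the worst-case $f$-divergence protection, and the function name and arguments are hypothetical rather than taken from the paper.

```python
import math


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Level-(1 - alpha) quantile of the weighted calibration scores.

    Assumption: `weights` are covariate likelihood ratios for the
    calibration points and `test_weight` is the ratio at the test point.
    The test point contributes a point mass at +infinity.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # Calibration mass alone never reaches 1 - alpha: interval is unbounded.
    return math.inf


def prediction_interval(prediction, scores, weights, test_weight, alpha):
    """Symmetric interval around a point prediction (absolute-residual scores)."""
    q = weighted_conformal_quantile(scores, weights, test_weight, alpha)
    return prediction - q, prediction + q
```

With equal weights this reduces to ordinary split conformal prediction; larger weights pull the quantile towards the scores of calibration points that resemble the test covariates.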
Modern regression applications can involve hundreds or thousands of variables, which motivates the use of variable selection methods. Bayesian variable selection defines a posterior distribution on the possible subsets of the variables (usually termed models) to express uncertainty about which variables are strongly linked to the response. This can be used to provide Bayesian model averaged predictions or inference, and to understand the relative importance of different variables. However, there has been little work on meaningful representations of this uncertainty beyond first-order summaries. We introduce Cartesian credible sets to address this gap. The elements of these sets are formed by concatenating sub-models defined on each block of a partition of the variables. Investigating these sub-models allows us to understand whether the models in the Cartesian credible set always, never, or sometimes include a particular variable or group of variables, and provides a useful summary of model uncertainty. We introduce methods to find these sets that emphasize ease of understanding. The potential of the method is illustrated on regression problems with both small and large numbers of variables.
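The construction by concatenation can be sketched as a Cartesian product over blocks. This is a minimal illustration, assuming each block contributes a small list of candidate sub-models given as sets of variable names; the helper names and the toy partition are hypothetical, and the sketch does not address how the credible set is found or what posterior mass it carries.

```python
from itertools import product


def cartesian_models(block_submodels):
    """Form every model obtained by picking one sub-model per block."""
    return [frozenset().union(*combo) for combo in product(*block_submodels)]


def inclusion_status(models, variable):
    """Report whether a variable is always, never, or sometimes included."""
    hits = sum(variable in model for model in models)
    if hits == len(models):
        return "always"
    if hits == 0:
        return "never"
    return "sometimes"


# Hypothetical partition: block 1 holds x1, x2 and block 2 holds x3, x4.
blocks = [
    [{"x1"}, {"x1", "x2"}],  # sub-models on block 1
    [set(), {"x4"}],         # sub-models on block 2
]
models = cartesian_models(blocks)
status = {v: inclusion_status(models, v) for v in ["x1", "x2", "x3", "x4"]}
```

Because the set is a product, the always/never/sometimes status of a variable can be read from its own block alone, which is what makes the summary easy to interpret.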
A major challenge in computed tomography is reconstructing objects from incomplete data. An increasingly popular solution to this problem is to incorporate deep learning models into reconstruction algorithms. This study introduces a novel approach that integrates a Fourier neural operator (FNO) into the filtered backprojection (FBP) reconstruction method, yielding the FNO backprojection (FNO-BP) network. We employ moment conditions for sinogram extrapolation to help the model mitigate artefacts from limited data. Notably, our deep learning architecture maintains a runtime comparable to classical FBP reconstructions, ensuring swift performance during both inference and training. We assess our reconstruction method in the context of the Helsinki Tomography Challenge 2022 and also compare it against regular FBP methods.
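The idea behind moment conditions can be illustrated with the simplest case: for a parallel-beam sinogram, the zeroth-order moment of each projection equals the total mass of the object and therefore does not depend on the angle. The sketch below checks this numerically for an off-centre disc of unit density. It is an assumed toy setup (analytic disc phantom, Riemann-sum integration, hypothetical function names) and does not reproduce the extrapolation scheme used in the study.

```python
import math


def disc_projection(angle, s, radius, cx, cy):
    """Parallel-beam line integral through a unit-density disc."""
    offset = s - (cx * math.cos(angle) + cy * math.sin(angle))
    inside = radius * radius - offset * offset
    return 2.0 * math.sqrt(inside) if inside > 0.0 else 0.0


def zeroth_moments(angles, num_bins=2001, half_width=1.0,
                   radius=0.3, cx=0.2, cy=0.1):
    """Integrate each projection over the detector coordinate s."""
    ds = 2.0 * half_width / (num_bins - 1)
    moments = []
    for angle in angles:
        total = 0.0
        for i in range(num_bins):
            s = -half_width + i * ds
            total += disc_projection(angle, s, radius, cx, cy)
        moments.append(total * ds)
    return moments


# Every angle should give approximately the disc area, pi * radius**2.
moments = zeroth_moments([0.0, 0.5, 1.0, 2.0])
```

A projection at a missing angle must satisfy the same constraint, which is the kind of information a moment-based extrapolation can exploit.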
Local variable selection aims to discover localized effects by assessing the impact of covariates on outcomes within specific regions defined by other covariates. We outline some challenges of local variable selection in the presence of non-linear relationships and model misspecification. Specifically, we highlight a potential drawback of common semi-parametric methods: even slight model misspecification can result in a high rate of false positives. To address these shortcomings, we propose a methodology based on orthogonal cut splines that achieves consistent local variable selection in high-dimensional scenarios. Our approach offers simplicity, handles both continuous and discrete covariates, and provides theory for high-dimensional covariates and model misspecification. We discuss settings with either independent or dependent data. Our proposal allows including adjustment covariates that do not undergo selection, enhancing flexibility in modeling complex scenarios. We illustrate its application in simulation studies with both independent and functional data, as well as with two real datasets. One dataset evaluates salary gaps associated with discrimination factors at different ages, while the other examines the effects of covariates on brain activation over time. The approach is implemented in the R package mombf.
In semantic segmentation, training data are commonly down-sampled because of limited resources, the need to adapt the image size to the model input, or to improve data augmentation. This down-sampling typically employs different strategies for the image data and the annotated labels. This discrepancy leads to mismatches between the down-sampled color and label images. Hence, training performance decreases significantly as the down-sampling factor increases. In this paper, we bring together the down-sampling strategies for the image data and the training labels. To that aim, we propose a novel framework for label down-sampling via soft-labeling that better conserves label information after down-sampling. The soft-labels are thereby fully aligned with the image data, preserving the distribution of the sampled pixels. This proposal also produces reliable annotations for under-represented semantic classes. Altogether, it allows training competitive models at lower resolutions. Experiments show that the proposal outperforms other down-sampling strategies. Moreover, state-of-the-art performance is achieved on reference benchmarks while employing significantly fewer computational resources than leading approaches. This proposal enables competitive research on semantic segmentation under resource constraints.
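One way to picture label down-sampling via soft-labeling is to average one-hot labels over each down-sampling window, so that every low-resolution pixel stores a class distribution instead of a single class. The following is a minimal sketch under that assumption (box averaging, hypothetical function name); the framework proposed in the paper may use a different kernel or alignment.

```python
def soft_downsample(labels, factor, num_classes):
    """Down-sample a 2D label map into per-pixel class distributions.

    Assumption: each output pixel is the average of the one-hot labels
    inside a factor x factor window of the input.
    """
    height, width = len(labels), len(labels[0])
    output = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            counts = [0] * num_classes
            for y in range(top, min(top + factor, height)):
                for x in range(left, min(left + factor, width)):
                    counts[labels[y][x]] += 1
            total = sum(counts)
            row.append([count / total for count in counts])
        output.append(row)
    return output


labels = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]
soft = soft_downsample(labels, factor=2, num_classes=3)
# soft[0][0] == [0.75, 0.25, 0.0]: the minority class 1 is kept
# rather than erased, as it would be by nearest-neighbor sampling.
```

Soft labels of this form can be trained against with a cross-entropy loss on distributions, which is how small or thin classes keep contributing at low resolution.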
Fourth-order variational inequalities are encountered in various scientific and engineering disciplines, including elliptic optimal control problems and plate obstacle problems. In this paper, we consider additive Schwarz methods for solving fourth-order variational inequalities. Based on a unified framework of various finite element methods for fourth-order variational inequalities, we develop one- and two-level additive Schwarz methods. We prove that the two-level method is scalable in the sense that the convergence rate of the method depends on $H/h$ and $H/\delta$ only, where $h$ and $H$ are the typical diameters of an element and a subdomain, respectively, and $\delta$ measures the overlap among the subdomains. This proof relies on a new nonlinear positivity-preserving coarse interpolation operator, the construction of which was previously unknown. To the best of our knowledge, this analysis represents the first investigation into the scalability of the two-level additive Schwarz method for fourth-order variational inequalities. Our theoretical results are verified by numerical experiments.
In a regression model with multiple response variables and multiple explanatory variables, if the difference between the mean vectors of the response variables for different values of the explanatory variables is always in the direction of the first principal eigenvector of the covariance matrix of the response variables, then the model is called a multivariate allometric regression model. This paper studies the estimation of the first principal eigenvector in the multivariate allometric regression model. A class of estimators that includes conventional estimators is proposed, based on weighted sum-of-squares matrices formed from the regression sum-of-squares matrix and the residual sum-of-squares matrix. We establish an upper bound on the mean squared error of the estimators contained in this class and derive the weight value that minimizes the upper bound. Sufficient conditions for the consistency of the estimators are discussed in weak identifiability regimes, under which the difference between the largest and second-largest eigenvalues of the covariance matrix decays asymptotically, and in ``large $p$, large $n$'' regimes, where $p$ is the number of response variables and $n$ is the sample size. Several numerical results are also presented.
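The estimator class can be pictured as taking the leading eigenvector of a weighted combination of the two sum-of-squares matrices. The sketch below assumes a convex combination with a single weight in $[0, 1]$ and uses plain power iteration; the exact weighting in the paper may differ, and the function names are hypothetical.

```python
import math


def weighted_sum(ssr, sse, weight):
    """Assumed form: weight * SSR + (1 - weight) * SSE."""
    n = len(ssr)
    return [[weight * ssr[i][j] + (1.0 - weight) * sse[i][j]
             for j in range(n)] for i in range(n)]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for a symmetric positive semi-definite matrix.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        image = [sum(matrix[i][j] * vector[j] for j in range(n))
                 for i in range(n)]
        norm = math.sqrt(sum(v * v for v in image))
        if norm == 0.0:
            break
        vector = [v / norm for v in image]
    return vector


ssr = [[4.0, 0.0], [0.0, 1.0]]  # toy regression sum-of-squares matrix
sse = [[2.0, 0.0], [0.0, 1.0]]  # toy residual sum-of-squares matrix
estimate = leading_eigenvector(weighted_sum(ssr, sse, weight=0.5))
```

Under this assumed form, a weight of 1 uses only the regression matrix, a weight of 0 uses only the residual matrix, and a weight of 0.5 is proportional to their sum.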
The presence of faulty or underactuated manipulators can disrupt the end-effector formation keeping of a team of manipulators. Based on two-link planar manipulators, we investigate this end-effector formation-keeping problem for mixed fully- and under-actuated manipulators with flexible joints. In this case, the underactuated manipulators can comprise active-passive (AP) manipulators, passive-active (PA) manipulators, or a combination thereof. We propose distributed control laws for the different types of manipulators to achieve and maintain the desired formation shape of the end-effectors. This is achieved by assigning virtual springs to the end-effectors of the fully-actuated manipulators and to the virtual end-effectors of the under-actuated ones. We further study the set of all desired and reachable shapes for the networked manipulators' end-effectors. Finally, we validate our analysis via numerical simulations.
A discretization method with non-matching grids is proposed for the coupled Stokes-Darcy problem that uses a mortar variable at the interface to couple the marker and cell (MAC) method in the Stokes domain with the Raviart-Thomas mixed finite element pair in the Darcy domain. Due to this choice, the method conserves linear momentum and mass locally in the Stokes domain and exhibits local mass conservation in the Darcy domain. The MAC scheme is reformulated as a mixed finite element method on a staggered grid, which allows for the proposed scheme to be analyzed as a mortar mixed finite element method. We show that the discrete system is well-posed and derive a priori error estimates that indicate first order convergence in all variables. The system can be reduced to an interface problem concerning only the mortar variables, leading to a non-overlapping domain decomposition method. Numerical examples are presented to illustrate the theoretical results and the applicability of the method.
Diffusion models have demonstrated remarkable performance in generation tasks. Nevertheless, explaining the diffusion process remains challenging because it consists of a sequence of denoising steps on noisy images that are difficult for experts to interpret. To address this issue, we pose three research questions to interpret the diffusion process from the perspective of the visual concepts generated by the model and the regions the model attends to at each time step. We devise tools for visualizing the diffusion process and answering these research questions, rendering the diffusion process human-understandable. Through experiments with various visual analyses using the tools, we show how the output is progressively generated in the diffusion process by explaining the level of denoising and highlighting relationships to foundational visual concepts at each time step. During training, the diffusion model learns diverse visual concepts corresponding to each time step, enabling it to predict varying levels of visual concepts at different stages. We substantiate our tools using Area Under Cover (AUC) score, correlation quantification, and cross-attention mapping. Our findings provide insights into the diffusion process and pave the way for further research into explainable diffusion mechanisms.