Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference, primarily because the decoder must be evaluated incrementally, token by token. This work proposes a partially AR framework that employs segment-level vectorized beam search to improve the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. The framework first generates an initial hypothesis using greedy CTC decoding and identifies low-confidence tokens based on their output probabilities. The decoder then re-predicts these tokens with segment-level vectorized beam search, processing segments in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13 times faster than AR decoding on the LibriSpeech corpus while preserving high accuracy.
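As a rough illustration of the first stage, the sketch below (hypothetical function names, CTC blank assumed at index 0, illustrative threshold) performs greedy CTC decoding and flags non-blank tokens whose frame-level log-probability falls below a threshold; the flagged positions are the ones the decoder would then re-predict with segment-level vectorized beam search.

```python
import torch

def greedy_ctc_with_confidence(ctc_log_probs: torch.Tensor, threshold: float = -0.7):
    """Collapse a greedy CTC path and flag low-confidence tokens.

    ctc_log_probs: (T, V) frame-level log-probabilities; the CTC blank is
    assumed to sit at index 0, and `threshold` is an illustrative value.
    Returns collapsed token ids and a boolean mask marking tokens whose
    frame log-probability falls below the threshold.
    """
    best_logp, best_ids = ctc_log_probs.max(dim=-1)        # (T,)
    tokens, confidences = [], []
    prev = 0
    for logp, idx in zip(best_logp.tolist(), best_ids.tolist()):
        if idx != 0 and idx != prev:                       # drop blanks, collapse repeats
            tokens.append(idx)
            confidences.append(logp)
        prev = idx
    mask = torch.tensor(confidences) < threshold           # True = re-predict with beam search
    return torch.tensor(tokens), mask
```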
Probabilistic couplings are the foundation for many probabilistic relational program logics and arise when relating random sampling statements across two programs. In relational program logics, this manifests as dedicated coupling rules that, e.g., let us reason as if two sampling statements return the same value. However, this approach fundamentally requires aligning or "synchronizing" the sampling statements of the two programs, which is not always possible. In this paper, we develop Clutch, a higher-order probabilistic relational separation logic that addresses this issue by supporting asynchronous probabilistic couplings. We use Clutch to develop a logical step-indexed logical relation for reasoning about contextual refinement and equivalence of higher-order programs written in a rich language with higher-order local state and impredicative polymorphism. Finally, we demonstrate the usefulness of our approach on a number of case studies. All the results in the paper have been formalized in the Coq proof assistant using the Coquelicot library and the Iris separation logic framework.
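For readers unfamiliar with the underlying notion, the following is the standard textbook definition that coupling rules appeal to (this is not Clutch's own syntax):

```latex
% A coupling of \mu_1 \in \mathcal{D}(A) and \mu_2 \in \mathcal{D}(B) is a
% joint distribution with the given marginals:
\[
  \mu \in \mathcal{D}(A \times B)
  \quad\text{such that}\quad
  \pi_1(\mu) = \mu_1 \ \text{and}\ \pi_2(\mu) = \mu_2 .
\]
% For two samplings from the same uniform distribution on a finite set $S$,
% the diagonal coupling $\mu(v, v') = [v = v'] \cdot 1/|S|$ witnesses that
% we may reason as if both statements return the same value.
```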
Adaptive live video streaming applications use a fixed, predefined bitrate ladder with constant framerates and encoding presets throughout a session. However, selecting optimized framerates and presets for every bitrate ladder representation can enhance perceptual quality, improve computational resource allocation, and thus raise the energy efficiency of streaming. In particular, low framerates for low-bitrate representations reduce compression artifacts and decrease encoding energy consumption. In addition, an optimized preset may lead to improved compression efficiency. To this end, this paper proposes a Content-adaptive Variable Framerate (CVFR) encoding scheme, which offers two modes of operation: ecological (ECO) and high-quality (HQ). CVFR-ECO optimizes for the highest encoding energy savings by predicting the optimized framerate for each representation in the bitrate ladder. CVFR-HQ goes further by predicting each representation's optimized framerate-preset pair using low-complexity discrete cosine transform (DCT) energy-based spatial and temporal features, targeting compression efficiency and sustainable storage. We demonstrate the advantages of CVFR using the x264 open-source video encoder. The results show that CVFR-ECO yields an average PSNR and VMAF increase of 0.02 dB and 2.50 points, respectively, at the same bitrate, compared to encoding at the fastest preset and highest framerate. CVFR-ECO also reduces average encoding and storage energy consumption by 34.54% and 76.24%, respectively, considering a just noticeable difference (JND) of six VMAF points. In comparison, CVFR-HQ yields an average increase in PSNR and VMAF of 2.43 dB and 10.14 points, respectively, at the same bitrate, and reduces average storage energy consumption by 83.18% at the same JND.
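As a concrete illustration of the feature extraction, the sketch below computes block-DCT energies as spatial and temporal descriptors; the block size, weighting, and exact feature definitions CVFR uses are assumptions here.

```python
import numpy as np
from scipy.fft import dctn

def dct_energy_features(frames: np.ndarray, block: int = 32):
    """Illustrative DCT-energy features over luma frames of shape (T, H, W).

    Returns (spatial, temporal): mean AC energy of per-block DCTs, and the
    same energy computed on frame differences. The block size and weighting
    that CVFR actually uses are assumptions here.
    """
    frames = frames.astype(np.float32)
    T, H, W = frames.shape
    H, W = H - H % block, W - W % block                    # crop to whole blocks

    def block_energy(img: np.ndarray) -> float:
        total, n = 0.0, 0
        for y in range(0, H, block):
            for x in range(0, W, block):
                c = dctn(img[y:y + block, x:x + block], norm="ortho")
                c[0, 0] = 0.0                              # discard the DC coefficient
                total += float(np.sum(c * c))
                n += 1
        return total / n

    spatial = float(np.mean([block_energy(frames[t]) for t in range(T)]))
    temporal = float(np.mean([block_energy(frames[t] - frames[t - 1]) for t in range(1, T)]))
    return spatial, temporal
```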
Interpreting EEG signals linked to spoken language presents a complex challenge, given the data's intricate temporal and spatial attributes and its various sources of noise. Denoising diffusion probabilistic models (DDPMs), which have recently gained prominence in diverse areas for their capabilities in representation learning, are explored in our research as a means to address this issue. Using DDPMs in conjunction with a conditional autoencoder, our new approach considerably outperforms traditional machine learning algorithms and established baseline models in accuracy. Our results highlight the potential of DDPMs as a sophisticated computational method for analyzing speech-related EEG signals, which could lead to significant advances in brain-computer interfaces tailored for spoken communication.
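For concreteness, here is a minimal sketch of the standard DDPM noise-prediction loss applied to EEG windows; the model and conditioning signatures are assumptions, since the abstract does not fix them.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, cond, alphas_cumprod):
    """Standard DDPM epsilon-prediction loss on EEG windows x0 of shape (B, C, T).

    `model(x_t, t, cond)` predicts the injected noise; `cond` stands in for
    the conditional autoencoder's latent (its exact form is an assumption).
    """
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward noising step
    return F.mse_loss(model(x_t, t, cond), eps)
```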
We extend PAC-Bayesian theory to generative models and develop generalization bounds based on the Wasserstein distance and the total variation distance. Our first result, on the Wasserstein distance, assumes the instance space is bounded, while our second takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two models. Although our work is mainly theoretical, we perform numerical experiments showing non-vacuous generalization bounds for Wasserstein GANs on synthetic datasets.
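For reference, the two distances the bounds are stated in are standard: the Kantorovich-Rubinstein dual form of the Wasserstein-1 distance and the total variation distance.

```latex
\[
  W_1(\mu, \nu) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
    \left( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x \sim \nu}[f(x)] \right),
  \qquad
  \mathrm{TV}(\mu, \nu) \;=\; \sup_{A} \, |\mu(A) - \nu(A)| .
\]
```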
Structured additive distributional regression models offer a versatile framework for estimating complete conditional distributions by relating all parameters of a parametric distribution to covariates. Although these models efficiently leverage information in vast and intricate data sets, they often result in highly parameterized models with many unknowns. Standard estimation methods, such as Bayesian approaches based on Markov chain Monte Carlo, struggle with such models because of their complexity and computational cost. To overcome these issues, we suggest a fast and scalable alternative based on variational inference. Our approach combines a parsimonious parametric approximation for the posteriors of the regression coefficients with the exact conditional posterior for the hyperparameters. For optimization, we use stochastic gradient ascent combined with an efficient strategy for reducing the variance of the gradient estimators. We provide theoretical properties and investigate global and local annealing to enhance robustness, particularly against data outliers. Our implementation is very general, allowing us to include various functional effects, such as penalized splines and complex tensor product interactions. In a simulation study, we demonstrate the efficacy of our approach in terms of accuracy and computation time. Finally, we present two real-data examples, illustrating the modeling of infectious COVID-19 outbreaks and outlier detection in brain activity.
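A minimal sketch of one such update, assuming a mean-field Gaussian approximation for the regression coefficients and a generic `log_joint` callable (both hypothetical simplifications of the paper's setup):

```python
import torch

def elbo_step(log_joint, mu, log_sigma, opt, n_mc=8):
    """One stochastic gradient-ascent step on a mean-field Gaussian ELBO.

    `log_joint(beta)` evaluates log p(y, beta); mu and log_sigma are leaf
    tensors with requires_grad=True, optimized by `opt`. The paper's exact
    factorization, with closed-form hyperparameter conditionals, is not
    reproduced; reparameterization keeps the gradient variance low.
    """
    eps = torch.randn(n_mc, mu.shape[0])
    beta = mu + log_sigma.exp() * eps                  # reparameterization trick
    entropy = log_sigma.sum()                          # Gaussian entropy, up to a constant
    elbo = torch.stack([log_joint(b) for b in beta]).mean() + entropy
    (-elbo).backward()                                 # ascent on the ELBO
    opt.step()
    opt.zero_grad()
    return float(elbo)
```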
Categorizing source code accurately and efficiently is a challenging problem in real-world programming education platform management. In recent years, model-based approaches utilizing abstract syntax trees (ASTs) have been widely applied to code classification tasks. In this paper, we introduce the Sparse Attention-based neural network for Code Classification (SACC). The approach involves two main steps. In the first step, the source code undergoes syntax parsing and preprocessing; the resulting abstract syntax tree is split into sequences of subtrees, which are encoded by a recursive neural network into a high-dimensional representation. This step simultaneously captures both the logical structure and the lexical-level information contained in the code. In the second step, the encoded sequences of subtrees are fed into a Transformer that incorporates sparse attention mechanisms for classification. Sparse attention substantially reduces the computational cost of self-attention, improving training speed while preserving effectiveness. Our work introduces a sparse attention pattern tailored to the specific needs of code classification, which reduces the influence of redundant information and enhances overall model performance. Finally, we address shortcomings of previous related research, such as incomplete classification labels and small dataset sizes, by annotating the large-scale CodeNet dataset with algorithm-related category labels. Extensive comparative experiments demonstrate the effectiveness and efficiency of SACC on code classification tasks.
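The sparse pattern itself is not spelled out in the abstract; as a stand-in, the sketch below builds a common local-window-plus-global-tokens attention mask of the kind such models typically use.

```python
import torch

def sparse_attention_mask(seq_len: int, window: int = 4, n_global: int = 1):
    """Boolean (seq_len, seq_len) mask where True means "may attend".

    Combines a sliding local window with a few global tokens, a common
    sparse-attention pattern; SACC's exact pattern is an assumption here.
    """
    i = torch.arange(seq_len)
    mask = (i[:, None] - i[None, :]).abs() <= window   # local sliding window
    mask[:n_global, :] = True                          # global tokens attend everywhere
    mask[:, :n_global] = True                          # and are attended by everyone
    return mask
```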
To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based on a batch of experiences. Natural policy gradient methods, including Trust Region Policy Optimization (TRPO), seek to produce monotonic improvement through bounded changes in policy outputs. Proximal Policy Optimization (PPO) is a commonly used, first-order algorithm that instead uses loss clipping to take multiple safe optimization steps per batch of data, replacing TRPO's bound on a single step with regularization over multiple steps. In this work, we find that the performance of PPO on continuous action spaces can be consistently improved through a simple change of objective. In place of PPO's importance sampling objective, we recommend a basic policy gradient, clipped in an equivalent fashion. While both objectives produce biased gradient estimates with respect to the RL objective, they also both display significantly reduced variance compared to the unbiased off-policy policy gradient. Additionally, we show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration. As a result, we empirically observe that COPG produces improved learning compared to PPO in single-task, constrained, and multi-task learning, without adding significant computational cost or complexity. Compared to TRPO, COPG offers comparable or superior performance while retaining the simplicity of a first-order method.
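To make the contrast concrete, here is the standard PPO clipped surrogate next to one plausible reading of the COPG objective, with the log-probability clipped over the same trust region as PPO's ratio; the exact clipping bounds used in the paper are an assumption.

```python
import math
import torch

def ppo_loss(logp, logp_old, adv, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized)."""
    ratio = (logp - logp_old).exp()                        # importance weight
    return torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()

def copg_loss(logp, logp_old, adv, eps=0.2):
    """Clipped-objective policy gradient: a plain log-prob surrogate clipped
    over the same region as PPO's ratio. These bounds are an assumption,
    not necessarily the paper's exact formulation."""
    lo = logp_old + math.log(1 - eps)                      # ratio >= 1 - eps
    hi = logp_old + math.log(1 + eps)                      # ratio <= 1 + eps
    clipped = torch.minimum(torch.maximum(logp, lo), hi)
    return torch.min(logp * adv, clipped * adv).mean()
```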
Registering clothes from 4D scans with vertex-accurate correspondence is challenging, yet important for dynamic appearance modeling and physics parameter estimation from real-world data. Previous methods either rely on texture information, which is not always reliable, or achieve only coarse-level alignment. In this work, we present a novel approach that enables accurate surface registration of texture-less clothes undergoing large deformation. Our key idea is to effectively leverage a shape prior learned from pre-captured clothing using diffusion models. We also propose a multi-stage guidance scheme based on learned functional maps, which stabilizes registration under large-scale deformation, even when the deformations differ significantly from the training data. Using high-fidelity real captured clothes, our experiments show that the proposed diffusion-based approach generalizes better than surface registration with VAE- or PCA-based priors, outperforming both optimization-based and learning-based non-rigid registration methods in interpolation and extrapolation tests.
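The sketch below shows the generic shape of one guided reverse-diffusion update, combining the learned prior score with the gradient of a scan-fitting term; the paper's multi-stage functional-map guidance is not reproduced, and all names here are hypothetical.

```python
import torch

def guided_reverse_step(score_model, x_t, t, scan_loss, guidance=1.0):
    """Schematic guided reverse-diffusion update on a garment shape code x_t.

    `score_model` is the learned clothing-shape prior and `scan_loss`
    measures misalignment to the captured scan; step sizes and the noise
    term of a full sampler are omitted for brevity.
    """
    x_t = x_t.detach().requires_grad_(True)
    grad_fit = torch.autograd.grad(scan_loss(x_t), x_t)[0]   # data-fitting gradient
    score = score_model(x_t, t)                              # prior score
    return (x_t + score - guidance * grad_fit).detach()
```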
Current models for event causality identification (ECI) mainly adopt a supervised framework, which relies heavily on labeled data for training. Unfortunately, current annotated datasets are relatively small and cannot provide sufficient support for models to capture useful indicators from causal statements, especially when handling new, unseen cases. To alleviate this problem, we propose a novel approach, dubbed CauSeRL, which leverages external causal statements for event causality identification. First, we design a self-supervised framework to learn context-specific causal patterns from external causal statements. Then, we adopt a contrastive transfer strategy to incorporate the learned context-specific causal patterns into the target ECI model. Experimental results show that our method significantly outperforms previous methods on EventStoryLine and Causal-TimeBank (+2.0 and +3.4 F1 points, respectively).
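The self-supervised component can be pictured as a contrastive objective over statement embeddings; the sketch below uses a plain InfoNCE loss as a stand-in for CauSeRL's actual objective and transfer strategy.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Contrastive (InfoNCE) loss over a batch of statement embeddings.

    Each anchor is pulled toward its matching external causal statement
    and pushed away from the other in-batch statements; CauSeRL's exact
    objective is an assumption here.
    """
    a = F.normalize(anchor, dim=-1)                 # (B, D)
    p = F.normalize(positive, dim=-1)               # (B, D)
    logits = a @ p.t() / temperature                # (B, B) cosine similarities
    labels = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, labels)
```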
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that readily generalizes all of these methods to space-time. Our objective encourages temporally persistent features within the same video and, despite its simplicity, works surprisingly well across (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results on multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at //github.com/facebookresearch/SlowFast
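In spirit, the shared objective reduces to a positive-pair loss between two clips of the same video; the sketch below uses a plain cosine form as a stand-in for the MoCo/SimCLR/BYOL/SwAV variants studied, with the long timespan expressed as a frame gap (names and defaults are illustrative).

```python
import torch
import torch.nn.functional as F

def temporal_persistency_loss(encoder, video, clip_len=16, max_gap=1800):
    """Positive-pair objective over two clips of the same video.

    video: (T, C, H, W) with T > clip_len; `encoder` maps a (1, clip_len,
    C, H, W) clip to an embedding (signature assumed). Two clips up to
    `max_gap` frames apart (about 60 s at 30 fps) are pulled together.
    """
    T = video.shape[0]
    t1 = torch.randint(0, T - clip_len, (1,)).item()
    lo, hi = max(0, t1 - max_gap), min(T - clip_len, t1 + max_gap)
    t2 = torch.randint(lo, hi, (1,)).item()
    z1 = encoder(video[t1:t1 + clip_len].unsqueeze(0))
    z2 = encoder(video[t2:t2 + clip_len].unsqueeze(0))
    return 1.0 - F.cosine_similarity(z1, z2, dim=-1).mean()
```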