In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets that provide only normalized transcriptions, Libriheavy contains richer information such as punctuation, casing, and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align, and segment the audio in the previously published Librilight to its corresponding texts. Like Librilight, Libriheavy has three training subsets, small, medium, and large, of sizes 500h, 5000h, and 50000h, respectively. We also extract the dev and test evaluation sets from the aligned audio and guarantee that they share no speakers or books with the training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline, which can also be used for other audio alignment tasks.
Unsupervised question answering is a promising yet challenging task, which alleviates the burden of building large-scale annotated data in a new domain. This motivates us to study the unsupervised multiple-choice question answering (MCQA) problem. In this paper, we propose a novel framework designed to generate synthetic MCQA data based solely on contexts from the universal domain, without relying on any form of manual annotation. Possible answers are extracted and used to produce related questions; we then leverage both named entities (NE) and knowledge graphs to discover plausible distractors and form complete synthetic samples. Experiments on multiple MCQA datasets demonstrate the effectiveness of our method.
In this paper, we demonstrate that a new measure of evidence we developed, called the Dempster-Shafer p-value, allows for insights and interpretations that retain most of the structure of the p-value while covering for some of the disadvantages that traditional p-values face. Moreover, we show through classical large-sample bounds and simulations that there exists a close connection between our form of DS hypothesis testing and the classical frequentist testing paradigm. We also demonstrate how our approach gives unique insights into the dimensionality of a hypothesis test, and how it models the effects of adversarial attacks on multinomial data. Finally, we demonstrate how these insights can be used to analyze text data for public health through an analysis of the Population Health Metrics Research Consortium dataset for verbal autopsies.
In this work, the high order accuracy and the well-balanced (WB) properties of some novel continuous interior penalty (CIP) stabilizations for the Shallow Water (SW) equations are investigated. The underlying arbitrary high order numerical framework is given by a Residual Distribution (RD)/continuous Galerkin (CG) finite element method (FEM) setting for the space discretization, coupled with a Deferred Correction (DeC) time integration to obtain a fully explicit scheme. While the introduced CIP stabilizations are all specifically designed to guarantee the exact preservation of the lake at rest steady state, some of them also make use of general structures to tackle the preservation of general steady states whose explicit analytical expression is not known. Several basis functions have been considered in the numerical experiments and, in all cases, the numerical results confirm the high order accuracy and the ability of the novel stabilizations to exactly preserve the lake at rest steady state and to capture small perturbations of such equilibrium. Moreover, some of them, based on the notions of space residual and global flux, have shown very good performance and superconvergence in the context of general steady solutions not known in closed form. Many elements introduced here can be extended to other hyperbolic systems, e.g., to the Euler equations with gravity.
In this paper, we study the problem of sampling from a given probability density function that is known to be smooth and strongly log-concave. We analyze several methods of approximate sampling based on discretizations of the (highly overdamped) Langevin diffusion and establish guarantees on their error measured in the Wasserstein-2 distance. Our guarantees improve or extend the state-of-the-art results in three directions. First, we provide an upper bound on the error of the first-order Langevin Monte Carlo (LMC) algorithm with optimized varying step-size. This result has the advantage of being horizon free (we do not need to know the target precision in advance) and of improving by a logarithmic factor on the corresponding result for the constant step-size. Second, we study the case where accurate evaluations of the gradient of the log-density are unavailable, but approximations of this gradient are accessible. In this situation, we consider both deterministic and stochastic approximations of the gradient and provide an upper bound on the sampling error of the first-order LMC that quantifies the impact of the gradient evaluation inaccuracies. Third, we establish upper bounds for two versions of the second-order LMC, which leverage the Hessian of the log-density. We provide nonasymptotic guarantees on the sampling error of these second-order LMCs. These guarantees reveal that the second-order LMC algorithms improve on the first-order LMC in ill-conditioned settings.
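As a rough illustration of the first-order LMC algorithm discussed in the abstract above, the following minimal sketch applies the standard Euler discretization of the overdamped Langevin diffusion, theta_{k+1} = theta_k - h * grad f(theta_k) + sqrt(2h) * xi_k, to a target density proportional to exp(-f). It uses a constant step-size only; the optimized varying step-size, the inexact-gradient variants, and the second-order schemes from the paper are not reproduced. The names `lmc_sample`, `grad_f`, `mu`, and `precision`, as well as the Gaussian target, are assumptions made for this example.

```python
import numpy as np


def lmc_sample(grad_f, theta0, step, n_steps, rng):
    """Run first-order LMC with a constant step-size (illustrative sketch).

    grad_f  : gradient of the potential f, where the target is prop. to exp(-f)
    theta0  : initial points, shape (n_chains, dim); chains run in parallel
    step    : constant step-size h (hypothetical choice, not the paper's schedule)
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        # Gradient step on f plus Gaussian noise scaled by sqrt(2h).
        theta = theta - step * grad_f(theta) + np.sqrt(2.0 * step) * noise
    return theta


# Assumed target for the example: a 2-D Gaussian with diagonal precision,
# which is smooth and strongly log-concave.
mu = np.array([1.0, -2.0])
precision = np.array([1.0, 4.0])


def grad_f(theta):
    # f(theta) = 0.5 * sum(precision * (theta - mu)^2)
    return (theta - mu) * precision


rng = np.random.default_rng(0)
samples = lmc_sample(grad_f, np.zeros((2000, 2)), step=0.05, n_steps=500, rng=rng)
print(samples.mean(axis=0))  # should be close to mu
```

With a constant step-size the chain targets a slightly biased distribution (here the sample variance is a little larger than the true one), which is the kind of discretization error that bounds in the Wasserstein-2 distance quantify.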
Digital credentials represent a cornerstone of digital identity on the Internet. To achieve privacy, certain functionalities in credentials should be implemented. One is selective disclosure, which allows users to disclose only the claims or attributes they want. This paper presents a novel approach to selective disclosure that combines Merkle hash trees and Boneh-Lynn-Shacham (BLS) signatures. Combining these approaches, we achieve selective disclosure of claims in a single credential and creation of a verifiable presentation containing selectively disclosed claims from multiple credentials signed by different parties. Besides selective disclosure, we enable issuing credentials signed by multiple issuers using this approach.
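To make the Merkle-tree half of the abstract above concrete, here is a minimal sketch of selective disclosure of a single claim: the holder reveals one claim plus a membership proof, and the verifier recomputes the root. It is not the authors' construction. The BLS signature step is omitted entirely (it would need a pairing library), and the claim encoding, the prefix bytes, the duplicate-last-node padding, and all function names are assumptions made for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(claim: str) -> bytes:
    # Prefix bytes separate leaf hashes from inner-node hashes.
    return _h(b"\x00" + claim.encode("utf-8"))


def _parent(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(claims):
    """Return all tree levels, from leaf hashes up to the single root."""
    if not claims:
        raise ValueError("at least one claim is required")
    level = [leaf_hash(c) for c in claims]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by duplicating the last node
        level = [_parent(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs for the leaf at `index`."""
    proof = []
    for level in levels[:-1]:
        sib = index ^ 1
        sibling = level[sib] if sib < len(level) else level[index]
        proof.append((sibling, sib < index))
        index //= 2
    return proof


def verify(claim, proof, root):
    """Recompute the root from one disclosed claim and its proof."""
    h = leaf_hash(claim)
    for sibling, sibling_is_left in proof:
        h = _parent(sibling, h) if sibling_is_left else _parent(h, sibling)
    return h == root


claims = ["name=Alice", "age=30", "country=NL"]  # hypothetical credential claims
levels = build_levels(claims)
root = levels[-1][0]
proof = make_proof(levels, 1)  # disclose only "age=30"
print(verify("age=30", proof, root))
```

In a real credential the leaves would normally be salted, since unsalted hashes of low-entropy claims can be guessed by brute force.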
This paper proposes novel high-order accurate discontinuous Galerkin (DG) schemes for the one- and two-dimensional ten-moment Gaussian closure equations with source terms defined by a known potential function. Our DG schemes exhibit the desirable capability of being well-balanced (WB) for a known hydrostatic equilibrium state while simultaneously preserving positive density and positive-definite anisotropic pressure tensor. The well-balancedness is built on carefully modifying the solution states in the Harten-Lax-van Leer-contact (HLLC) flux, and appropriate reformulation and discretization of the source terms. Our novel modification technique overcomes the difficulties posed by the anisotropic effects, maintains the high-order accuracy, and ensures that the modified solution state remains within the physically admissible state set. Positivity-preserving analyses of our WB DG schemes are conducted by using several key properties of the admissible state set, the HLLC flux and the HLLC solver, as well as the geometric quasilinearization (GQL) approach in [Wu & Shu, SIAM Review, 65: 1031-1073, 2023], which was originally applied to analyze the admissible state set and physical-constraints-preserving schemes for the relativistic magnetohydrodynamics in [Wu & Tang, M3AS, 27: 1871-1928, 2017], to address the difficulties arising from the nonlinear constraints on pressure tensor. Moreover, the proposed WB DG schemes satisfy the weak positivity for the cell averages, implying the use of a scaling limiter to enforce the physical admissibility of the DG solution polynomials at certain points of interest. Extensive numerical experiments are conducted to validate the preservation of equilibrium states, accuracy in capturing small perturbations to such states, robustness in solving problems involving low density or low pressure, and high resolution for both smooth and discontinuous solutions.
We give a priori error estimates of second order in time fully explicit Runge-Kutta discontinuous Galerkin schemes using upwind fluxes to smooth solutions of scalar fractional conservation laws in one space dimension. Under the time step restrictions $\tau\leq c h$ for piecewise linear and $\tau\lesssim h^{4/3}$ for higher order finite elements, we prove a convergence rate for the energy norm $\|\cdot\|_{L^\infty_tL^2_x}+|\cdot|_{L^2_tH^{\lambda/2}_x}$ that is optimal for solutions and flux functions that are smooth enough. Our proof relies on a novel upwind projection of the exact solution.
In this paper I will develop a lambda-term calculus, lambda-2Int, for a bi-intuitionistic logic and discuss its implications for the notions of sense and denotation of derivations in a bilateralist setting. To this end, I will use the Curry-Howard correspondence, which has been well-established between the simply typed lambda-calculus and natural deduction systems for intuitionistic logic, and apply it to a bilateralist proof system displaying two derivability relations, one for proving and one for refuting. The basis will be the natural deduction system of Wansing's bi-intuitionistic logic 2Int, which I will turn into a term-annotated form. This requires a type theory that extends to a two-sorted typed lambda-calculus. I will present such a term-annotated proof system for 2Int and prove a Dualization Theorem relating proofs and refutations in this system. On the basis of these formal results I will argue that this gives us interesting insights into questions about sense and denotation as well as synonymy and identity of proofs from a bilateralist point of view.
In this paper, we develop a new type of Runge--Kutta (RK) discontinuous Galerkin (DG) method for solving hyperbolic conservation laws. Compared with the original RKDG method, the new method features improved compactness and allows simple boundary treatment. The key idea is to hybridize two different spatial operators in an explicit RK scheme, utilizing local projected derivatives for inner RK stages and the usual DG spatial discretization for the final stage only. Limiters are applied only at the final stage for the control of spurious oscillations. We also explore the connections between our method and Lax--Wendroff DG schemes and ADER-DG schemes. Numerical examples are given to confirm that the new RKDG method is as accurate as the original RKDG method, while being more compact, for problems including two-dimensional Euler equations for compressible gas dynamics.
We propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or a maximum a posteriori criterion using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments demonstrate the benefits of our approach.