We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). We relax the problem of testing MCAR to the problem of testing the compatibility of a sequence of covariance matrices, motivated by the fact that this procedure remains feasible when the dimension grows with the sample size. Tests of compatibility can be used to test the feasibility of positive semi-definite matrix completion problems with noisy observations, and thus our results may be of independent interest. Our first contributions are to define a natural measure of the incompatibility of a sequence of correlation matrices, which can be characterised as the optimal value of a Semi-definite Programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By studying the concentration properties of the natural plug-in estimator of this measure, we introduce novel hypothesis tests that we prove have power against all distributions with incompatible covariance matrices. The choice of critical values for our tests relies on a new concentration inequality for the Pearson sample correlation matrix, which may be of wider interest. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy tailed.
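As a rough illustration of the relaxed problem, the sketch below phrases compatibility of correlation matrices observed on overlapping subsets of variables as a small semi-definite program: the smallest entrywise perturbation under which a single full correlation matrix explains all of them. This is only a hedged sketch of the general idea, not the paper's incompatibility measure or test statistic; the helper `incompatibility` and the toy example are hypothetical.

```python
# Illustration only: not the paper's incompatibility measure or test statistic.
import numpy as np
import cvxpy as cp

def incompatibility(subsets, corrs, d):
    """Smallest eps for which some full d x d correlation matrix R (PSD, unit
    diagonal) matches every observed correlation matrix up to entrywise error eps."""
    R = cp.Variable((d, d), symmetric=True)
    eps = cp.Variable(nonneg=True)
    constraints = [R >> 0, cp.diag(R) == 1]
    for S, C in zip(subsets, corrs):
        for a, i in enumerate(S):
            for b, j in enumerate(S):
                constraints.append(cp.abs(R[i, j] - C[a, b]) <= eps)
    cp.Problem(cp.Minimize(eps), constraints).solve()
    return eps.value

# Three missingness patterns observing pairs of a 3-variable model; these pairwise
# correlations cannot come from any single valid correlation matrix, so eps > 0.
C01 = np.array([[1.0, 0.9], [0.9, 1.0]])
C12 = np.array([[1.0, 0.9], [0.9, 1.0]])
C02 = np.array([[1.0, -0.9], [-0.9, 1.0]])
print(incompatibility([[0, 1], [1, 2], [0, 2]], [C01, C12, C02], d=3))
```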
We prove explicit uniform two-sided bounds for the phase functions of Bessel functions and of their derivatives. As a consequence, we obtain new enclosures for the zeros of Bessel functions and their derivatives in terms of inverse values of some elementary functions. These bounds are valid, with a few exceptions, for all zeros and all Bessel functions with non-negative indices. We provide numerical evidence showing that our bounds either improve or closely match the best previously known ones.
Digital credentials represent a cornerstone of digital identity on the Internet. Achieving privacy requires that credentials support certain functionalities, one of which is selective disclosure: it allows users to reveal only the claims or attributes they choose. This paper presents a novel approach to selective disclosure that combines Merkle hash trees and Boneh-Lynn-Shacham (BLS) signatures. By combining these primitives, we achieve selective disclosure of claims in a single credential as well as the creation of a verifiable presentation containing selectively disclosed claims from multiple credentials signed by different parties. Beyond selective disclosure, the same construction enables issuing credentials signed by multiple issuers.
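To make the selective-disclosure component concrete, here is a minimal, self-contained sketch of disclosing a subset of claims committed to in a Merkle hash tree. The BLS signature over the root, which the paper combines with this structure, is deliberately omitted; the helper names (`build_tree`, `auth_path`, `verify`) and the example claims are illustrative, not the paper's API.

```python
# Minimal sketch of Merkle-tree-based selective disclosure of claims.
# In practice the issuer would additionally sign `root` (e.g. with BLS) and the
# verifier would check that signature; that step is omitted here.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return the list of levels, from the leaf hashes up to the root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:            # duplicate the last node on odd levels
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def auth_path(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))   # (hash, sibling is on the left)
        index //= 2
    return path

def verify(leaf, path, root):
    node = leaf
    for sibling, is_left in path:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root

claims = [b"name=Alice", b"dob=1990-01-01", b"degree=MSc", b"nationality=NL"]
levels = build_tree([h(c) for c in claims])
root = levels[-1][0]

# The holder discloses only claim 2, together with its authentication path.
proof = auth_path(levels, 2)
print(verify(h(claims[2]), proof, root))   # True
```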
We propose a fast probabilistic framework for identifying differential equations governing the dynamics of observed data. We recast the SINDy method within a Bayesian framework and use Gaussian approximations for the prior and likelihood to speed up computation. The resulting method, Bayesian-SINDy, not only quantifies uncertainty in the estimated parameters but is also more robust at recovering the correct model from limited and noisy data. Using both synthetic and real-life examples, such as lynx-hare population dynamics, we demonstrate the effectiveness of the new framework in learning the correct model equations and compare its computational and data efficiency with existing methods. Because Bayesian-SINDy can quickly assimilate data and is robust against noise, it is particularly suitable for biological data and real-time system identification in control. Its probabilistic framework also enables the calculation of information entropy, laying the foundation for an active learning strategy.
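The sketch below illustrates the general flavour of combining a SINDy-style candidate library with a Gaussian prior and likelihood, which yields a closed-form posterior over coefficients that can then be pruned for sparsity. It is a toy illustration under assumed library terms and hyperparameters, not the Bayesian-SINDy algorithm itself.

```python
# Toy sketch: Gaussian-prior / Gaussian-likelihood regression on a SINDy-style
# library, followed by uncertainty-aware pruning. Not the authors' exact method.
import numpy as np

def library(x):
    """Candidate terms for a 1D state: [1, x, x^2, x^3]."""
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def gaussian_posterior(Theta, dxdt, noise_var=1e-2, prior_var=10.0):
    """Posterior mean and covariance of w, with dxdt ~ N(Theta w, noise_var I)
    and prior w ~ N(0, prior_var I)."""
    d = Theta.shape[1]
    precision = Theta.T @ Theta / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Theta.T @ dxdt / noise_var
    return mean, cov

# Synthetic data from dx/dt = -2x + 0.5x^3 with noisy derivative estimates.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
dxdt = -2 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)

mean, cov = gaussian_posterior(library(x), dxdt)

# Keep only terms whose posterior mean is large relative to its uncertainty.
keep = np.abs(mean) > 2 * np.sqrt(np.diag(cov))
print(dict(zip(["1", "x", "x^2", "x^3"], np.where(keep, mean, 0.0).round(3))))
```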
Knowing which countries contribute the most to pushing the boundaries of knowledge in science and technology has social and political importance. However, common citation metrics do not adequately measure this contribution. Measuring it requires more stringent metrics appropriate for the highly influential breakthrough papers that push the boundaries of knowledge, which are very highly cited but very rare. Here I used the recently described Rk index, specifically designed to address this issue. I applied this index to 25 countries and the EU across 10 key research topics, five technological and five biomedical, analyzing domestic and internationally collaborative papers separately. In technological topics, the Rk indices of domestic papers show that, overall, the USA, China, and the EU are the leaders; other countries are clearly behind. The USA is notably ahead of China, and the EU is far behind China. The same approach applied to biomedical topics shows an overwhelming dominance of the USA and that the EU is ahead of China. The analysis of internationally collaborative papers further demonstrates the dominance of the USA. These results conflict with current country rankings based on less stringent indicators.
In this paper I will develop a lambda-term calculus, lambda-2Int, for a bi-intuitionistic logic and discuss its implications for the notions of sense and denotation of derivations in a bilateralist setting. Thus, I will take the Curry-Howard correspondence, which is well established between the simply typed lambda-calculus and natural deduction systems for intuitionistic logic, and apply it to a bilateralist proof system displaying two derivability relations, one for proving and one for refuting. The basis will be the natural deduction system of Wansing's bi-intuitionistic logic 2Int, which I will turn into a term-annotated form. To this end, a type theory that extends to a two-sorted typed lambda-calculus is needed. I will present such a term-annotated proof system for 2Int and prove a Dualization Theorem relating proofs and refutations in this system. On the basis of these formal results, I will argue that this gives us interesting insights into questions about sense and denotation as well as synonymy and identity of proofs from a bilateralist point of view.
We propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or a maximum a posteriori criterion using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments demonstrate the benefits of our approach.
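For illustration, the following sketch shows the structure of annealed Langevin sampling driven by a score function, which is the sampling mechanism described above. The trained graph-neural-network score of the graph prior is replaced by a toy Gaussian stand-in, and the helper names and hyperparameters are hypothetical, not the paper's.

```python
# Sketch of annealed Langevin dynamics with a plug-in score function.
# In the paper the score of the graph prior comes from a trained GNN; here a toy
# Gaussian score is used purely to show the sampler's structure.
import numpy as np

def annealed_langevin(score_fn, x0, sigmas, steps_per_level=100, eps=1e-3, rng=None):
    """Run Langevin dynamics at a decreasing sequence of noise levels `sigmas`."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for sigma in sigmas:                          # anneal from large to small noise
        alpha = eps * (sigma / sigmas[-1]) ** 2   # step size scaled per noise level
        for _ in range(steps_per_level):
            grad = score_fn(x, sigma)             # score of the smoothed target
            x = x + 0.5 * alpha * grad + np.sqrt(alpha) * rng.standard_normal(x.shape)
    return x

# Toy stand-in: score of an N(mu, I) target smoothed with noise level sigma.
mu = np.array([1.0, -1.0])
toy_score = lambda x, sigma: -(x - mu) / (1.0 + sigma**2)

sigmas = np.geomspace(1.0, 0.01, num=10)
print(annealed_langevin(toy_score, x0=np.zeros(2), sigmas=sigmas))   # lands near mu
```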
3D stacked technology has emerged as an effective mechanism to overcome the physical limits and communication delays found in 2D integration. However, 3D technology also presents several drawbacks that prevent its smooth application. Two of the major concerns are heat removal and power density distribution. In our work, we propose a novel 3D thermal-aware floorplanner that includes: (1) an effective thermal-aware process with three different evolutionary algorithms that tackle the soft-computing problem of optimizing the placement of functional units and through-silicon vias, as well as the smooth inclusion of active cooling systems and new design strategies; (2) an approximate thermal model inside the optimization loop; (3) an optimizer for active cooling (liquid channels); and (4) a novel technique based on air-channel placement designed to isolate thermal domains. The experimental work is conducted for a realistic many-core single-chip architecture based on the Niagara design. Results show promising improvements in the thermal and reliability metrics, as well as optimal scaling capabilities to target future many-core systems.
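As a rough illustration of the optimization structure only (not the paper's floorplanner, chromosome encoding, or thermal model), the toy sketch below evolves placements of functional units on a small stacked grid against a crude thermal proxy that penalizes hot units placed close together, including vertically.

```python
# Toy evolutionary loop for 3D placement against a crude thermal proxy.
# All parameters (grid size, powers, population size) are hypothetical.
import random

LAYERS, ROWS, COLS = 2, 4, 4
POWER = [3.0, 3.0, 2.5, 2.0, 1.5, 1.0, 1.0, 0.5]      # per-unit power (illustrative)
SLOTS = [(l, r, c) for l in range(LAYERS) for r in range(ROWS) for c in range(COLS)]

def thermal_proxy(placement):
    """Penalize hot units placed close together (including stacked layers)."""
    cost = 0.0
    for i, a in enumerate(placement):
        for j, b in enumerate(placement):
            if i < j:
                dist = sum(abs(x - y) for x, y in zip(a, b))   # 3D Manhattan distance
                cost += POWER[i] * POWER[j] / (1 + dist)
    return cost

def mutate(placement):
    """Swap the slots of two randomly chosen units."""
    p = placement[:]
    i, j = random.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]
    return p

random.seed(1)
population = [random.sample(SLOTS, len(POWER)) for _ in range(30)]
for _ in range(200):                                   # (mu + lambda)-style selection
    population += [mutate(random.choice(population)) for _ in range(30)]
    population = sorted(population, key=thermal_proxy)[:30]

print(round(thermal_proxy(population[0]), 2), population[0])
```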
We introduce the concept of Automated Causal Discovery (AutoCD), defined as any system that aims to fully automate the application of causal discovery and causal reasoning methods. AutoCD's goal is to deliver all the causal information that an expert human analyst would and to answer a user's causal queries. We describe the architecture of such a platform and illustrate its performance on synthetic data sets. As a case study, we apply it to temporal telecommunication data. The system is general and can be applied to a plethora of causal discovery problems.
Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
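A minimal sketch of the TBAL workflow described above: pick a confidence threshold at which the validation accuracy meets a target, then machine-label only the unlabeled points clearing that threshold. The target accuracy, the toy data, and the helper names are illustrative assumptions, not the paper's exact procedure or guarantees.

```python
# Toy sketch of threshold-based auto-labeling (TBAL); illustrative only.
import numpy as np

def find_threshold(val_conf, val_correct, target_acc=0.95):
    """Smallest confidence threshold whose above-threshold validation accuracy
    reaches `target_acc`; returns None if no threshold qualifies."""
    for t in np.sort(np.unique(val_conf)):
        mask = val_conf >= t
        if mask.any() and val_correct[mask].mean() >= target_acc:
            return t
    return None

# Human-labeled validation set: model confidences and prediction correctness.
rng = np.random.default_rng(0)
val_conf = rng.uniform(0.5, 1.0, size=500)
val_correct = rng.random(500) < val_conf          # toy: accuracy tracks confidence

t = find_threshold(val_conf, val_correct)
print("threshold:", t)

# Machine-label the unlabeled pool only where confidence clears the threshold.
pool_conf = rng.uniform(0.5, 1.0, size=10_000)
auto_labeled = pool_conf >= t
print("fraction auto-labeled:", auto_labeled.mean().round(3))
```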
Genome assembly is a prominent problem in bioinformatics, which reconstructs a source string from a set of its overlapping substrings. Classically, genome assembly computes the source string efficiently using assembly graphs built from this set of substrings, with a tradeoff between scalability and information loss. The scalable de Bruijn graphs come at the price of losing crucial overlap information, whereas the complete overlap information is stored in overlap graphs at the cost of quadratic space. Hierarchical overlap graphs (HOG) [IPL20] overcome these limitations, avoiding information loss while using only linear space. After a series of suboptimal improvements, Khan and Park et al. simultaneously presented two optimal algorithms [CPM2021], of which only the former was seemingly practical. We empirically analyze all the practical algorithms for computing HOG, where the optimal algorithm [CPM2021] outperforms the previous algorithms as expected, though at the expense of extra memory; however, it relies on a non-intuitive approach and non-trivial data structures. We present arguably the most intuitive algorithm, using only elementary arrays, which is also optimal. Empirically, our algorithm performs even better in both time and memory than all the other algorithms, highlighting its significance in both theory and practice. We further explore the applications of hierarchical overlap graphs to answer various forms of suffix-prefix queries on a set of strings. Loukides et al. [CPM2023] recently presented state-of-the-art algorithms for these queries; however, those algorithms require complex black-box data structures and are seemingly impractical. Our algorithms, despite not matching the state-of-the-art bounds theoretically, answer the different queries in 0.01 to 100 milliseconds on a dataset of around a billion characters.
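For readers unfamiliar with the underlying queries, the sketch below computes the basic suffix-prefix overlap between two strings with a KMP failure function. It only illustrates what an overlap graph stores and what such queries ask; it is not one of the HOG-based algorithms or data structures evaluated in the paper.

```python
# Longest suffix of `s` that is a prefix of `t`, via the KMP failure (border)
# function of t + separator + s. Illustration of the basic suffix-prefix query only.
def longest_suffix_prefix(s: str, t: str) -> int:
    combined = t + "\x00" + s          # separator assumed absent from both strings
    fail = [0] * len(combined)
    for i in range(1, len(combined)):
        k = fail[i - 1]
        while k > 0 and combined[i] != combined[k]:
            k = fail[k - 1]
        if combined[i] == combined[k]:
            k += 1
        fail[i] = k
    return fail[-1]                    # border of combined = overlap length

reads = ["AGGTC", "GTCCA", "CCAGG"]
for s in reads:
    for t in reads:
        if s != t:
            print(s, "->", t, longest_suffix_prefix(s, t))
```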