In an era where external data sources and computational capabilities far exceed statistical agencies' own resources, agencies face the renewed challenge of protecting the confidentiality of underlying microdata when publishing statistics in very granular form, and of ensuring that these granular data are used for statistical purposes only. Conventional statistical disclosure limitation methods are too fragile to address this new challenge. This article discusses the deployment of a differential privacy framework for the 2020 US Census that was customized to protect confidentiality, particularly for the most detailed geographic and demographic categories, and to deliver controlled accuracy across the full geographic hierarchy.
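For orientation only (this is not the algorithm deployed for the 2020 Census, which is considerably more elaborate), the basic primitive behind any such framework is the injection of calibrated noise into published counts. The following minimal Python sketch illustrates an $\varepsilon$-differentially private release of a toy histogram via the textbook Laplace mechanism; all names and parameter values are our own.

```python
import numpy as np

def laplace_counts(counts, epsilon, sensitivity=1.0):
    """Release a noisy histogram satisfying epsilon-differential privacy.

    Adds i.i.d. Laplace(sensitivity / epsilon) noise to each cell count.
    """
    rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=counts.shape)

# Toy block-level population counts (hypothetical numbers).
true_counts = np.array([120.0, 45.0, 3.0, 0.0])
noisy = laplace_counts(true_counts, epsilon=1.0)
print(noisy)  # post-processing (rounding, non-negativity) preserves the DP guarantee
```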
We provide a comprehensive elaboration of the theoretical foundations of variable instantiation, or grounding, in Answer Set Programming (ASP). Building on the semantics of ASP's modeling language, we introduce a formal characterization of grounding algorithms in terms of (fixed point) operators. A major role is played by dedicated well-founded operators, whose associated models provide semantic guidance for delineating the result of grounding along with on-the-fly simplifications. We address an expressive class of logic programs that incorporates recursive aggregates and thus matches the scope of existing ASP modeling languages. This is accompanied by a plain algorithmic framework detailing the grounding of recursive aggregates. The given algorithms correspond essentially to the ones used in the ASP grounder gringo.
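For readers unfamiliar with grounding, the following Python sketch illustrates naive bottom-up instantiation of a positive program as a fixed point computation. It is only a toy stand-in for the operator-based framework and the gringo algorithms developed in the paper: the example program, the predicate encoding, and all helper names are ours, and aggregates and simplifications are omitted.

```python
# Positive rules: (head, body) with atoms (predicate, args); uppercase args are variables.
# Example program:  path(X,Y) :- edge(X,Y).   path(X,Z) :- edge(X,Y), path(Y,Z).
facts = {("edge", ("a", "b")), ("edge", ("b", "c"))}
rules = [
    (("path", ("X", "Y")), [("edge", ("X", "Y"))]),
    (("path", ("X", "Z")), [("edge", ("X", "Y")), ("path", ("Y", "Z"))]),
]

def match(atom, fact, subst):
    """Extend a substitution so the rule atom matches a ground fact, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    s = dict(subst)
    for a, f in zip(args, fargs):
        if a[0].isupper():            # variable: bind or check consistency
            if s.get(a, f) != f:
                return None
            s[a] = f
        elif a != f:                  # constant mismatch
            return None
    return s

def ground(facts, rules):
    """Naive bottom-up grounding: iterate rule instantiation to a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            substs = [{}]
            for atom in body:         # join each body atom against derived atoms
                substs = [s2 for s in substs for f in derived
                          if (s2 := match(atom, f, s)) is not None]
            for s in substs:
                inst = (head[0], tuple(s.get(a, a) for a in head[1]))
                if inst not in derived:
                    derived.add(inst)
                    changed = True
    return derived

print(sorted(ground(facts, rules)))   # edges plus path(a,b), path(b,c), path(a,c)
```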
It is a well-known fact that there is no complete and discrete invariant on the collection of all multiparameter persistence modules. Nonetheless, many invariants have been proposed in the literature to study multiparameter persistence modules, though each loses some amount of information. One such invariant is the generalized rank invariant, which is known to be complete on the class of interval decomposable persistence modules under mild assumptions on the indexing poset $P$. There is often a trade-off: the stronger an invariant is, the more expensive it is to compute in practice. The generalized rank invariant on its own is difficult to compute, whereas the standard rank invariant is readily computable through software implementations such as RIVET. By restricting the domain of the generalized rank invariant, we can interpolate between these two and induce new invariants, and this family exhibits the aforementioned trade-off. This work studies the tension between computational efficiency and discriminating power that arises when restricting the domain of the generalized rank invariant. We characterize when such restrictions are complete invariants in the setting where $P$ is finite, and furthermore show that these restricted generalized rank invariants are stable.
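For orientation, we recall the standard definition (notation ours, following the literature): for a persistence module $M \colon P \to \mathrm{vec}$ and an interval $I \subseteq P$ (a connected, convex subposet), the generalized rank invariant assigns
\[
\mathrm{rk}(M)(I) \;=\; \mathrm{rank}\!\left( \varprojlim_{I} M \;\longrightarrow\; \varinjlim_{I} M \right),
\]
the rank of the canonical map from the limit to the colimit of $M$ restricted to $I$. Restricting to segments $[p,q] = \{r \in P : p \le r \le q\}$ recovers the standard rank invariant $\mathrm{rank}\big(M(p) \to M(q)\big)$, which is why shrinking the domain of allowed intervals interpolates between the two.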
The idea that social media platforms like Twitter are inhabited by vast numbers of social bots has become widely accepted in recent years. Social bots are assumed to be automated social media accounts operated by malicious actors with the goal of manipulating public opinion. They are credited with the ability to produce content autonomously and to interact with human users. Social bot activity has been reported in many different political contexts, including the U.S. presidential elections and discussions about migration, climate change, and COVID-19. However, the relevant publications either use crude and questionable heuristics to discriminate between supposed social bots and humans or, in the vast majority of cases, fully rely on the output of automatic bot detection tools, most commonly Botometer. In this paper, we point out a fundamental theoretical flaw in the widely used study design for estimating the prevalence of social bots. Furthermore, we empirically investigate the validity of peer-reviewed Botometer-based studies by closely and systematically inspecting hundreds of accounts that had been counted as social bots. We were unable to find a single social bot. Instead, we found mostly accounts undoubtedly operated by human users, the vast majority of them using Twitter in an inconspicuous and unremarkable fashion without the slightest trace of automation. We conclude that studies claiming to investigate the prevalence, properties, or influence of social bots based on Botometer have, in reality, just investigated false positives and artifacts of this approach.
The ergodic decomposition theorem is a cornerstone result of dynamical systems and ergodic theory. It states that every invariant measure on a dynamical system is a mixture of ergodic ones. Here we formulate and prove the theorem in terms of string diagrams, using the formalism of Markov categories. We recover the usual measure-theoretic statement by instantiating our result in the category of stochastic kernels. Along the way we give a conceptual treatment of several concepts in the theory of deterministic and stochastic dynamical systems. In particular:
- ergodic measures appear very naturally as particular cones of deterministic morphisms (in the sense of Markov categories);
- the invariant $\sigma$-algebra of a dynamical system can be seen as a colimit in the category of Markov kernels.
In line with other uses of category theory, once the necessary structures are in place, our proof of the main theorem is much simpler than traditional approaches. In particular, it does not use any quantitative limiting arguments, and it does not rely on the cardinality of the group or monoid indexing the dynamics. We hope that this result paves the way for further applications of category theory to dynamical systems, ergodic theory, and information theory.
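For context, determinism here has a purely diagrammatic meaning (standard in the Markov category literature; notation ours): a morphism $f \colon X \to Y$ is deterministic when it commutes with the copy map,
\[
\mathrm{copy}_Y \circ f \;=\; (f \otimes f) \circ \mathrm{copy}_X .
\]
In the category of Markov kernels this condition says $f(A \cap B \mid x) = f(A \mid x)\, f(B \mid x)$, which forces $f(\,\cdot \mid x)$ to be a zero-one measure (a point mass on standard Borel spaces).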
As the central nerve of the intelligent vehicle control system, the in-vehicle network bus is crucial to the security of vehicle driving. The Controller Area Network (CAN bus) protocol is one of the most widely adopted standards for in-vehicle networks. However, the CAN bus lacks security mechanisms by design, leaving it vulnerable to various attacks. To enhance the security of in-vehicle networks and promote research in this area, this study comprehensively compared fully-supervised and semi-supervised machine learning methods for CAN message anomaly detection, based on a large-scale CAN network traffic dataset with extracted valuable features. Both traditional machine learning models (including single classifiers and ensemble models) and neural-network-based deep learning models were evaluated. Furthermore, this study proposed a deep-autoencoder-based semi-supervised learning method for CAN message anomaly detection and verified its superiority over other semi-supervised methods. Extensive experiments show that the fully-supervised methods generally outperform semi-supervised ones, as they use more information as input. Notably, the developed XGBoost-based model obtained state-of-the-art performance, with the best accuracy (98.65%), precision (0.9853), and ROC AUC (0.9585), beating other methods reported in the literature.
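The abstract does not fix a particular architecture, so the following is only a minimal PyTorch sketch of the semi-supervised idea: train an autoencoder on normal CAN traffic only and flag frames with large reconstruction error. The layer sizes, threshold, and all names are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal autoencoder: trained on *normal* CAN frames only (semi-supervised);
# frames whose reconstruction error exceeds a threshold are flagged as anomalies.
class AutoEncoder(nn.Module):
    def __init__(self, n_features, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(model, normal_frames, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(normal_frames), normal_frames)
        loss.backward()
        opt.step()
    return model

def detect(model, frames, threshold):
    """Per-frame reconstruction error; True marks a suspected anomaly."""
    with torch.no_grad():
        err = ((model(frames) - frames) ** 2).mean(dim=1)
    return err > threshold

# Toy usage with random stand-ins for featurized CAN messages.
normal = torch.randn(1024, 16)          # features extracted from benign traffic
model = train(AutoEncoder(n_features=16), normal)
flags = detect(model, torch.randn(32, 16), threshold=1.0)
```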
The flow-driven spectral chaos (FSC) method is a recently developed approach for tracking and quantifying uncertainties in the long-time response of stochastic dynamical systems using the spectral approach. The method uses a novel concept called 'enriched stochastic flow maps' as a means to construct an evolving finite-dimensional random function space that is both accurate and computationally efficient in time. In this paper, we present a multi-element version of the FSC method (the ME-FSC method for short) to tackle, in particular, dynamical systems that are inherently discontinuous over the probability space. In ME-FSC, the random domain is partitioned into several elements, and the problem is then solved separately on each random element using the FSC method. Subsequently, the results are aggregated to compute the probability moments of interest using the law of total probability. To demonstrate the effectiveness of the ME-FSC method in dealing with discontinuities and long-time integration of stochastic dynamical systems, four representative numerical examples are presented in this paper, including the Van der Pol oscillator problem and the Kraichnan-Orszag three-mode problem. Results show that the ME-FSC method is capable of solving problems that have strong nonlinear dependencies over the probability space, both reliably and at low computational cost.
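The FSC solver itself is beyond a short sketch, but the aggregation step described above is easy to illustrate. The following Python sketch combines hypothetical per-element statistics via the law of total probability; the function and variable names are ours, and in practice the per-element moments would come from the FSC solves on each random element.

```python
import numpy as np

def aggregate_moments(element_probs, element_means, element_second_moments):
    """Combine per-element statistics via the law of total probability.

    element_probs[i]          = P(xi in element i)
    element_means[i]          = E[u | element i]    (from a per-element solve)
    element_second_moments[i] = E[u^2 | element i]
    """
    p = np.asarray(element_probs)
    m1 = np.sum(p * np.asarray(element_means))        # E[u]
    m2 = np.sum(p * np.asarray(element_second_moments))
    return m1, m2 - m1 ** 2                           # mean, variance

# Toy check: xi ~ U(0,1) split into two elements, u = xi (exact: mean 1/2, var 1/12).
probs = [0.5, 0.5]
means = [0.25, 0.75]                                  # E[xi | element]
second = [1/48 + 0.25**2, 1/48 + 0.75**2]             # Var of U on a half-interval is 1/48
m1, var = aggregate_moments(probs, means, second)
print(m1, var)                                        # 0.5, 0.0833...
```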
The aim of this research is to identify an efficient model to describe the fluctuations around the trend of the soil temperatures monitored in the volcanic caldera of the Campi Flegrei area in Naples (Italy). The study focuses on temperature data from this area over a seven-year period. The research first identifies the deterministic component of the model, given by the seasonal trend of the temperatures, which is obtained through a regression method adapted to the time series. Subsequently, the stochastic component of the time series is tested for consistency with a fractional Brownian motion (fBm). An estimator based on the periodogram of the data is used to establish that the series follows an fBm rather than a fractional Gaussian noise. An estimate of the Hurst exponent $H$ of the process is also obtained. Finally, an inference test based on the detrended moving average of the data is adopted to assess the hypothesis that the time series follows a suitably estimated fBm.
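As an illustration of the periodogram-based approach (a standard log-periodogram regression, not necessarily the authors' exact estimator), the following Python sketch estimates $H$ from the low-frequency slope of the periodogram of an increment series; the bandwidth fraction and all names are our own choices.

```python
import numpy as np

def hurst_periodogram(x, frac=0.1):
    """Estimate the Hurst exponent H from the low-frequency periodogram slope.

    Treats x as fractional Gaussian noise, whose spectral density behaves like
    S(f) ~ c * f^(1 - 2H) as f -> 0; a log-log regression over the lowest
    `frac` of frequencies gives slope beta = 1 - 2H, i.e., H = (1 - beta) / 2.
    """
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    freqs = np.fft.rfftfreq(n)[1:]                   # drop the zero frequency
    power = np.abs(np.fft.rfft(x))[1:] ** 2 / n      # periodogram ordinates
    k = max(int(frac * len(freqs)), 10)              # low-frequency band only
    beta = np.polyfit(np.log(freqs[:k]), np.log(power[:k]), 1)[0]
    return (1.0 - beta) / 2.0

# Toy usage: for an fBm path, estimate H on the increments np.diff(path).
rng = np.random.default_rng(0)
print(hurst_periodogram(rng.standard_normal(4096)))  # white noise: H close to 0.5
```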
The canonical technique for nonlinear modeling of spatial and other point-referenced data is known as kriging in the geostatistics literature, and as Gaussian process (GP) regression in the surrogate modeling and machine learning communities. Kriging and GPs share many similarities, but there are also some important differences. One is that GPs impose a process on the data-generating mechanism that can be used to automate kernel/variogram inference, thus removing the human from the loop in a conventional semivariogram analysis. The GP framework also suggests a probabilistically valid means of scaling to handle a large corpus of training data, i.e., an alternative to so-called ordinary kriging. Finally, recent GP implementations are tailored to make the most of modern computing architectures such as multi-core workstations and multi-node supercomputers. Ultimately, we use this discussion as a springboard for an empirics-based advocacy of state-of-the-art GP technology in the geospatial modeling of a large corpus of borehole data involved in mining for gold and other minerals. Our out-of-sample validation exercise quantifies how GP methods (as implemented by open source libraries) can be more economical (requiring fewer human and compute resources), more accurate, and better at uncertainty quantification than kriging-based alternatives. Once in the GP framework, several possible extensions benefit from a fully generative modeling apparatus. In particular, we showcase a simple imputation scheme that copes with left-censoring of small measurements, a common feature in borehole assays.
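As a small-scale illustration of the automated kernel inference mentioned above (not the large-scale implementations advocated in the paper), the following Python sketch fits a GP with a Matérn kernel to hypothetical borehole-style data using scikit-learn; the data and hyperparameter choices are invented for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Hypothetical borehole-style data: 3-D coordinates -> assay value.
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 3))              # easting, northing, depth
y = np.sin(X[:, 0] / 20) + 0.05 * rng.standard_normal(200)

# Matern kernel plus a learned noise term; hyperparameters are fit by
# maximizing the marginal likelihood, replacing manual variogram fitting.
kernel = Matern(length_scale=10.0, nu=2.5) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_new = rng.uniform(0, 100, size=(5, 3))
mean, sd = gp.predict(X_new, return_std=True)       # predictions with uncertainty
print(mean, sd)
```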
In 1954, Alston S. Householder published Principles of Numerical Analysis, one of the first modern treatments of matrix computation, which favored a (block) LU decomposition: the factorization of a matrix into the product of lower and upper triangular matrices. Matrix decomposition has since become a core technology in machine learning, largely due to the development of the backpropagation algorithm for fitting neural networks. The sole aim of this survey is to give a self-contained introduction to concepts and mathematical tools in numerical linear algebra and matrix analysis, in order to seamlessly introduce matrix decomposition techniques and their applications in subsequent sections. However, we cannot cover all the useful and interesting results concerning matrix decomposition, and given the limited scope of this discussion we omit, e.g., the separate analysis of the Euclidean space, Hermitian space, Hilbert space, and results in the complex domain. We refer the reader to the linear algebra literature for a more detailed introduction to these related fields.
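To make the opening example concrete, here is a minimal Python sketch of a plain Doolittle LU factorization; it omits pivoting and the block variants a full treatment would cover, and assumes all leading principal minors are nonzero.

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU factorization without pivoting: A = L @ U.

    L is unit lower triangular, U is upper triangular. Assumes nonzero
    pivots (no pivoting, for clarity only).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:] - L[k, :k] @ U[:k, k:]
        L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]
    return L, U

A = np.array([[4.0, 3.0], [6.0, 3.0]])
L, U = lu_decompose(A)
print(np.allclose(L @ U, A))  # True
```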
As data are increasingly stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models faces efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol designs have been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL and a unique taxonomy covering: 1) threat models; 2) poisoning attacks against robustness and the corresponding defenses; 3) inference attacks against privacy and the corresponding defenses, we provide an accessible review of this important topic. We highlight the intuitions, key techniques, and fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
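For readers new to FL, the following Python sketch illustrates the widely used federated averaging (FedAvg) protocol that most surveyed attacks and defenses target: clients train locally on their own silos and a server averages the resulting models, so raw data never leaves a client. The linear-regression clients and all names are toy assumptions of ours.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on linear least squares."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fed_avg(clients, w, rounds=10):
    """Federated averaging: the server aggregates local models, weighted by
    local dataset size; only model parameters are exchanged, never raw data."""
    for _ in range(rounds):
        sizes = np.array([len(y) for _, y in clients])
        updates = [local_update(w, X, y) for X, y in clients]
        w = np.average(updates, axis=0, weights=sizes)
    return w

# Toy usage: three client silos drawn from the same underlying linear model.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for n in (50, 80, 30):
    X = rng.standard_normal((n, 2))
    clients.append((X, X @ true_w + 0.01 * rng.standard_normal(n)))
w = fed_avg(clients, w=np.zeros(2))
print(w)  # close to [1.0, -2.0]
```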