In this paper we delve into the historical evolution of data as a fundamental element in communication and knowledge transmission. The paper traces the stages of knowledge dissemination from oral traditions to the digital era, highlighting the significance of languages and cultural diversity in this progression. It also explores the impact of digital technologies on memory, communication, and cultural preservation, emphasizing the need to promote a culture of the digital (rather than a digital culture) in Africa and beyond. Additionally, it discusses the challenges and opportunities presented by data biases in AI development, underscoring the importance of creating diverse datasets for equitable representation. We advocate for investing in data as a crucial raw material for fostering digital literacy, economic development, and, above all, cultural preservation in the digital age.
Extracting who says what to whom is a crucial part of analyzing human communication in today's abundance of data, such as online news articles. Yet the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, Creative Commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema that enables a variety of downstream uses. The annotations not only specify who said what but also how, in which context, and to whom, and they define the type of quotation. We specify our annotation schema, describe the creation of the dataset, and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset, and outline use cases in downstream tasks.
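To make the annotated dimensions concrete, a record might look roughly as follows; this is a purely hypothetical illustration built from the dimensions named above (speaker, cue, context, addressee, quotation type), not the dataset's actual field names or serialization format.

```python
# Hypothetical sketch of one annotation; field names and values are illustrative only
# and do not reflect the dataset's real schema or file format.
example_annotation = {
    "speaker": "the minister",        # who said it
    "cue": "erklärte",                # how it was said (reporting verb)
    "addressee": "the press",         # to whom it was said
    "content": "...",                 # the quoted span (what was said)
    "context": "sentence surrounding the quotation",
    "quotation_type": "indirect",     # e.g. direct, indirect, or mixed
}
```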
Nonlinear complexity is an important measure for assessing the randomness of sequences. In this paper we investigate how circular shifts affect the nonlinear complexity of finite-length binary sequences and then reveal a more explicit relation between the nonlinear complexities of finite-length binary sequences and those of their corresponding periodic sequences. Based on this relation, we propose two algorithms that can generate all periodic binary sequences with any prescribed nonlinear complexity.
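As a rough illustration of the quantities involved (a brute-force sketch under our own conventions, not the algorithms proposed in the paper), the following code computes the nonlinear (maximum-order) complexity of a finite binary sequence and evaluates it over all circular shifts:

```python
def nonlinear_complexity(s):
    """Brute-force nonlinear (maximum-order) complexity of a finite binary sequence:
    the smallest M such that every length-M window of s is always followed by the
    same symbol, i.e. s is generated by some feedback shift register of length M.
    (Edge-case conventions are ours.)"""
    n = len(s)
    for M in range(n):
        seen = {}  # length-M window -> symbol that follows it
        consistent = True
        for i in range(n - M):
            window = tuple(s[i:i + M])
            nxt = s[i + M]
            if window in seen and seen[window] != nxt:
                consistent = False
                break
            seen[window] = nxt
        if consistent:
            return M
    return n


def circular_shift(s, k):
    """Left circular shift of s by k positions."""
    k %= len(s)
    return s[k:] + s[:k]


seq = [0, 1, 1, 0, 1, 0, 0, 1]
print([nonlinear_complexity(circular_shift(seq, k)) for k in range(len(seq))])
```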
In this work, we present an extensive analysis of clock synchronization algorithms, with a specific focus on message complexity. We begin by introducing fundamental concepts in clock synchronization, such as the Byzantine generals problem, as well as clock accuracy, precision, skew, offset, timestamping, and clock drift estimation. We then describe the concept of logical clocks and discuss their implementation in distributed systems, highlighting their significance and the various approaches taken. The paper then examines four prominent algorithms: Lamport's Algorithm, the Ricart-Agrawala Algorithm, the Vector Clocks Algorithm, and Cristian's Algorithm. Special attention is given to the analysis of message complexity, providing insights into the efficiency of each algorithm. Finally, we compare the message complexities of the discussed algorithms.
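As a minimal illustration of the logical-clock idea behind Lamport's Algorithm (a standard textbook sketch, not the paper's message-complexity analysis), each process keeps a counter that it increments on local and send events and merges on receive:

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Any internal event advances the logical clock.
        self.time += 1
        return self.time

    def send(self):
        # Tick, then attach the timestamp to the outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge rule: take the max of local and received clocks, then tick.
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
t = p.send()      # p sends a message carrying timestamp t
q.receive(t)      # q's clock jumps past t, preserving happened-before ordering
```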
In this paper, we introduce a novel communication-assisted sensing (CAS) framework that explores the potential coordination gains offered by the integrated sensing and communication (ISAC) technique. The CAS system endows users with beyond-line-of-sight sensing capabilities, supported by a dual-functional base station that enables simultaneous sensing and communication. To delve into the system's fundamental limits, we characterize the information-theoretic framework of the CAS system in terms of rate-distortion theory. We reveal the achievable overall distortion between the target's state and its reconstruction at the end-user, referred to as the sensing quality of service, in a special case where the distortion metric is separable across the sensing and communication processes. As a case study, we employ a typical application to demonstrate distortion minimization under the ISAC signaling strategy, showcasing the potential of CAS in enhancing sensing capabilities.
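For reference, the classical rate-distortion function that such an analysis builds on is the standard one below (our notation; the paper's CAS-specific distortion decomposition is not reproduced here):

```latex
% Classical rate-distortion function for a source Z reconstructed as \hat{Z}
% (standard definition; notation ours, not the paper's CAS-specific result).
R(D) \;=\; \min_{p(\hat{z}\mid z)\,:\,\mathbb{E}[d(Z,\hat{Z})]\le D} I(Z;\hat{Z})
```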
In this paper we study a class of weighted estimands, which we define as parameters that can be expressed as weighted averages of the underlying heterogeneous treatment effects. The popular ordinary least squares (OLS), two-stage least squares (2SLS), and two-way fixed effects (TWFE) estimands are all special cases within our framework. Our focus is on answering two questions concerning weighted estimands. First, under what conditions can they be interpreted as the average treatment effect for some (possibly latent) subpopulation? Second, when these conditions are satisfied, what is the upper bound on the size of that subpopulation, either in absolute terms or relative to a target population of interest? We argue that this upper bound provides a valuable diagnostic for empirical research. When a given weighted estimand corresponds to the average treatment effect for a small subset of the population of interest, we say its internal validity is low. Our paper develops practical tools to quantify the internal validity of weighted estimands.
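Schematically, and in our own notation rather than the paper's, a weighted estimand of this kind can be written as a weighted average of conditional average treatment effects:

```latex
% Schematic weighted estimand (notation ours): tau(x) is the conditional average
% treatment effect and w(x) the estimator-implied weights.
\beta \;=\; \frac{\mathbb{E}\!\left[ w(X)\,\tau(X) \right]}{\mathbb{E}\!\left[ w(X) \right]},
\qquad \tau(x) \;=\; \mathbb{E}\!\left[ Y(1) - Y(0) \mid X = x \right].
```

When the weights are nonnegative, such a weighted estimand equals the average treatment effect for the subpopulation obtained by reweighting with w(X), which is one way an estimand of this form can be interpreted as the average treatment effect for a (possibly latent) subpopulation.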
In this paper, we introduce a novel centrality measure to evaluate shock propagation on financial networks, capturing a notion of contagion and systemic risk contributions. In comparison to many popular centrality metrics (e.g., eigenvector centrality), which provide only a relative centrality between nodes, our proposed measure is on an absolute scale, permitting comparisons of contagion risk over time. In addition, we provide a statistical validation method for when the network is estimated from data, as is done in practice. This statistical test allows us to reliably assess the computed centrality values. We validate our methodology on simulated data and conduct empirical case studies using financial data. We find that our proposed centrality measure increases significantly during times of financial distress and is able to provide insights into the (market-implied) risk levels of different firms and sectors.
Using tools developed in a recent work by Shen and the second author, in this paper we carry out an in-depth study of the average decoding error probability of the random matrix ensemble over the erasure channel under three decoding principles, namely unambiguous decoding, maximum likelihood decoding, and list decoding. We obtain explicit formulas for the average decoding error probabilities of the random matrix ensemble under these three decoding principles and compute the error exponents. Moreover, for unambiguous decoding, we compute the variance of the decoding error probability of the random matrix ensemble and the error exponent of the variance, which together imply a strong concentration result: roughly speaking, the ratio of the decoding error probability of a random code in the ensemble to the average decoding error probability of the ensemble converges to 1 with high probability as the code length goes to infinity.
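In symbols, the concentration statement at the end of this summary reads roughly as follows, with P_e(C) denoting the decoding error probability of a random code C of length n drawn from the ensemble (notation ours):

```latex
% Concentration of the decoding error probability around the ensemble average
% (notation ours; a restatement of the claim above, not a new result).
\frac{P_e(C)}{\mathbb{E}_C\!\left[ P_e(C) \right]} \;\xrightarrow{\ \mathbb{P}\ }\; 1
\qquad \text{as } n \to \infty.
```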
Addressing the critical challenge of ensuring data integrity in decentralized systems, this paper delves into the underexplored area of data falsification probabilities within Merkle Trees, which are pivotal in blockchain and Internet of Things (IoT) technologies. Despite their widespread use, a comprehensive understanding of the probabilistic aspects of data security in these structures remains a gap in current research. Our study aims to bridge this gap by developing a theoretical framework to calculate the probability of data falsification, taking into account various scenarios based on Merkle path length and hash length. The research progresses from the derivation of an exact formula for the falsification probability to an approximation suitable for cases with sufficiently large hash lengths. Empirical experiments validate the theoretical models through simulations with diverse hash lengths and Merkle path lengths. The findings reveal that the falsification probability decreases with increasing hash length and is inversely related to Merkle path length. A numerical analysis quantifies the discrepancy between the exact and approximate probabilities, underscoring the conditions for the effective application of the approximation. This work offers crucial insights into optimizing Merkle Tree structures for bolstering security in blockchain and IoT systems, achieving a balance between computational efficiency and data integrity.
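As a back-of-the-envelope sketch only (an assumed idealized model, not the paper's derivation): if each of the d digests along a Merkle path behaves like an independent, ideal h-bit hash, a forgery that must reproduce all of them succeeds with probability about 2^(-h*d), which shrinks both as the hash length grows and as the path lengthens, matching the qualitative trends reported above.

```python
def falsification_probability_toy(hash_bits: int, path_length: int) -> float:
    """Toy model (an assumption, not the paper's exact formula): a successful
    forgery must independently match each of `path_length` ideal
    `hash_bits`-bit digests along the Merkle path."""
    return 2.0 ** (-hash_bits * path_length)


# The exponent -hash_bits * path_length already exhibits both trends:
# the probability falls as the hash gets longer and as the path gets longer.
print(falsification_probability_toy(hash_bits=8, path_length=2))   # 2**-16
print(falsification_probability_toy(hash_bits=16, path_length=2))  # 2**-32
print(falsification_probability_toy(hash_bits=16, path_length=4))  # 2**-64
```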
This work considers the question of how convenient access to copious data impacts our ability to learn causal effects and relations. In what ways is learning causality in the era of big data different from -- or the same as -- the traditional one? To answer this question, this survey provides a comprehensive and structured review of both traditional and frontier methods in learning causality and relations along with the connections between causality and machine learning. This work points out on a case-by-case basis how big data facilitates, complicates, or motivates each approach.
Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This survey provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to a number of applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.