With the increasing adoption of digitization, more and more historical visualizations created hundreds of years ago are accessible in digital libraries online. It provides a unique opportunity for visualization and history research. Meanwhile, there is no large-scale digital collection dedicated to historical visualizations. The visualizations are scattered in various collections, which hinders retrieval. In this study, we curate the first large-scale dataset dedicated to historical visualizations. Our dataset comprises 13K historical visualization images with corresponding processed metadata from seven digital libraries. In curating the dataset, we propose a workflow to scrape and process heterogeneous metadata. We develop a semi-automatic labeling approach to distinguish visualizations from other artifacts. Our dataset can be accessed with OldVisOnline, a system we have built to browse and label historical visualizations. We discuss our vision of usage scenarios and research opportunities with our dataset, such as textual criticism for historical visualizations. Drawing upon our experience, we summarize recommendations for future efforts to improve our dataset.
We study a sender-receiver model where the receiver can commit to a decision rule before the sender determines the information policy. The decision rule can depend on the signal structure and the signal realization that the sender adopts. This framework captures applications where a decision-maker (the receiver) solicit advice from an interested party (sender). In these applications, the receiver faces uncertainty regarding the sender's preferences and the set of feasible signal structures. Consequently, we adopt a unified robust analysis framework that includes max-min utility, min-max regret, and min-max approximation ratio as special cases. We show that it is optimal for the receiver to sacrifice ex-post optimality to perfectly align the sender's incentive. The optimal decision rule is a quota rule, i.e., the decision rule maximizes the receiver's ex-ante payoff subject to the constraint that the marginal distribution over actions adheres to a consistent quota, regardless of the sender's chosen signal structure.
Neural collapse provides an elegant mathematical characterization of learned last layer representations (a.k.a. features) and classifier weights in deep classification models. Such results not only provide insights but also motivate new techniques for improving practical deep models. However, most of the existing empirical and theoretical studies in neural collapse focus on the case that the number of classes is small relative to the dimension of the feature space. This paper extends neural collapse to cases where the number of classes are much larger than the dimension of feature space, which broadly occur for language models, retrieval systems, and face recognition applications. We show that the features and classifier exhibit a generalized neural collapse phenomenon, where the minimum one-vs-rest margins is maximized.We provide empirical study to verify the occurrence of generalized neural collapse in practical deep neural networks. Moreover, we provide theoretical study to show that the generalized neural collapse provably occurs under unconstrained feature model with spherical constraint, under certain technical conditions on feature dimension and number of classes.
Submarine cables constitute the backbone of the Internet. However, these critical infrastructure components are vulnerable to several natural and man-made threats, and during failures, are difficult to repair in their remote oceanic environments. In spite of their crucial role, we have a limited understanding of the impact of submarine cable failures on global connectivity, particularly on the higher layers of the Internet. In this paper, we present Nautilus, a framework for cross-layer cartography of submarine cables and IP links. Using a corpus of public datasets and Internet cartographic techniques, Nautilus identifies IP links that are likely traversing submarine cables and maps them to one or more potential cables. Nautilus also gives each IP to cable assignment a prediction score that reflects the confidence in the mapping. Nautilus generates a mapping for 3.05 million and 1.43 million IPv4 and IPv6 links respectively, covering 91% of all active cables. In the absence of ground truth data, we validate Nautilus mapping using three techniques: analyzing past cable failures, using targeted traceroute measurements, and comparing with public network maps of two operators.
As the complexity and scale of modern computer networks continue to increase, there has emerged an urgent need for precise traffic analysis, which plays a pivotal role in cutting-edge wireless connectivity technologies. This study focuses on leveraging Machine Learning methodologies to create an advanced network traffic classification system. We introduce a novel data-driven approach that excels in identifying various network service types in real-time, by analyzing patterns within the network traffic. Our method organizes similar kinds of network traffic into distinct categories, referred to as network services, based on latency requirement. Furthermore, it decomposes the network traffic stream into multiple, smaller traffic flows, with each flow uniquely carrying a specific service. Our ML models are trained on a dataset comprised of labeled examples representing different network service types collected on various Wi-Fi network conditions. Upon evaluation, our system demonstrates a remarkable accuracy in distinguishing the network services. These results emphasize the substantial promise of integrating Artificial Intelligence in wireless technologies. Such an approach encourages more efficient energy consumption, enhances Quality of Service assurance, and optimizes the allocation of network resources, thus laying a solid groundwork for the development of advanced intelligent networks.
The overwhelming volume of data generated and indexed by search engines poses a significant challenge in retrieving documents from the index efficiently and effectively. Even with a well-crafted query, several relevant documents often get buried among a multitude of competing documents, resulting in reduced accessibility or `findability' of the desired document. Consequently, it is crucial to develop a robust methodology for assessing this dimension of Information Retrieval (IR) system performance. While previous studies have focused on measuring document accessibility disregarding user queries and document relevance, there exists no metric to quantify the findability of a document within a given IR system without resorting to manual labor. This paper aims to address this gap by defining and deriving a metric to evaluate the findability of documents as perceived by end-users. Through experiments, we demonstrate the varying impact of different retrieval models and collections on the findability of documents. Furthermore, we establish the findability measure as an independent metric distinct from retrievability, an accessibility measure introduced in prior literature.
Differential privacy (DP), as a rigorous mathematical definition quantifying privacy leakage, has become a well-accepted standard for privacy protection. Combined with powerful machine learning techniques, differentially private machine learning (DPML) is increasingly important. As the most classic DPML algorithm, DP-SGD incurs a significant loss of utility, which hinders DPML's deployment in practice. Many studies have recently proposed improved algorithms based on DP-SGD to mitigate utility loss. However, these studies are isolated and cannot comprehensively measure the performance of improvements proposed in algorithms. More importantly, there is a lack of comprehensive research to compare improvements in these DPML algorithms across utility, defensive capabilities, and generalizability. We fill this gap by performing a holistic measurement of improved DPML algorithms on utility and defense capability against membership inference attacks (MIAs) on image classification tasks. We first present a taxonomy of where improvements are located in the machine learning life cycle. Based on our taxonomy, we jointly perform an extensive measurement study of the improved DPML algorithms. We also cover state-of-the-art label differential privacy (Label DP) algorithms in the evaluation. According to our empirical results, DP can effectively defend against MIAs, and sensitivity-bounding techniques such as per-sample gradient clipping play an important role in defense. We also explore some improvements that can maintain model utility and defend against MIAs more effectively. Experiments show that Label DP algorithms achieve less utility loss but are fragile to MIAs. To support our evaluation, we implement a modular re-usable software, DPMLBench, which enables sensitive data owners to deploy DPML algorithms and serves as a benchmark tool for researchers and practitioners.
Overparametrization often helps improve the generalization performance. This paper presents a dual view of overparametrization suggesting that downsampling may also help generalize. Focusing on the proportional regime $m\asymp n \asymp p$, where $m$ represents the sketching size, $n$ is the sample size, and $p$ is the feature dimensionality, we investigate two out-of-sample prediction risks of the sketched ridgeless least square estimator. Our findings challenge conventional beliefs by showing that downsampling does not always harm generalization but can actually improve it in certain cases. We identify the optimal sketching size that minimizes out-of-sample prediction risks and demonstrate that the optimally sketched estimator exhibits stabler risk curves, eliminating the peaks of those for the full-sample estimator. To facilitate practical implementation, we propose an empirical procedure to determine the optimal sketching size. Finally, we extend our analysis to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
The advent of large language models marks a revolutionary breakthrough in artificial intelligence. With the unprecedented scale of training and model parameters, the capability of large language models has been dramatically improved, leading to human-like performances in understanding, language synthesizing, and common-sense reasoning, etc. Such a major leap-forward in general AI capacity will change the pattern of how personalization is conducted. For one thing, it will reform the way of interaction between humans and personalization systems. Instead of being a passive medium of information filtering, large language models present the foundation for active user engagement. On top of such a new foundation, user requests can be proactively explored, and user's required information can be delivered in a natural and explainable way. For another thing, it will also considerably expand the scope of personalization, making it grow from the sole function of collecting personalized information to the compound function of providing personalized services. By leveraging large language models as general-purpose interface, the personalization systems may compile user requests into plans, calls the functions of external tools to execute the plans, and integrate the tools' outputs to complete the end-to-end personalization tasks. Today, large language models are still being developed, whereas the application in personalization is largely unexplored. Therefore, we consider it to be the right time to review the challenges in personalization and the opportunities to address them with LLMs. In particular, we dedicate this perspective paper to the discussion of the following aspects: the development and challenges for the existing personalization system, the newly emerged capabilities of large language models, and the potential ways of making use of large language models for personalization.
An effective and efficient architecture performance evaluation scheme is essential for the success of Neural Architecture Search (NAS). To save computational cost, most of existing NAS algorithms often train and evaluate intermediate neural architectures on a small proxy dataset with limited training epochs. But it is difficult to expect an accurate performance estimation of an architecture in such a coarse evaluation way. This paper advocates a new neural architecture evaluation scheme, which aims to determine which architecture would perform better instead of accurately predict the absolute architecture performance. Therefore, we propose a \textbf{relativistic} architecture performance predictor in NAS (ReNAS). We encode neural architectures into feature tensors, and further refining the representations with the predictor. The proposed relativistic performance predictor can be deployed in discrete searching methods to search for the desired architectures without additional evaluation. Experimental results on NAS-Bench-101 dataset suggests that, sampling 424 ($0.1\%$ of the entire search space) neural architectures and their corresponding validation performance is already enough for learning an accurate architecture performance predictor. The accuracies of our searched neural architectures on NAS-Bench-101 and NAS-Bench-201 datasets are higher than that of the state-of-the-art methods and show the priority of the proposed method.
Recent years have witnessed the enormous success of low-dimensional vector space representations of knowledge graphs to predict missing facts or find erroneous ones. Currently, however, it is not yet well-understood how ontological knowledge, e.g. given as a set of (existential) rules, can be embedded in a principled way. To address this shortcoming, in this paper we introduce a framework based on convex regions, which can faithfully incorporate ontological knowledge into the vector space embedding. Our technical contribution is two-fold. First, we show that some of the most popular existing embedding approaches are not capable of modelling even very simple types of rules. Second, we show that our framework can represent ontologies that are expressed using so-called quasi-chained existential rules in an exact way, such that any set of facts which is induced using that vector space embedding is logically consistent and deductively closed with respect to the input ontology.