As web archives' holdings grow, archivists subdivide them into collections so they are easier to understand and manage. In this work, we review the collection structures of eight web archive platforms: Archive-It, Conifer, the Croatian Web Archive (HAW), the Internet Archive's user account web archives, the Library of Congress (LC), PANDORA, Trove, and the UK Web Archive (UKWA). We note a plethora of different approaches to web archive collection structures. Some web archive collections support sub-collections and some permit embargoes. Curatorial decisions may be attributed to a single organization or to many. Archived web pages are known by many names: mementos, copies, captures, or snapshots. Some platforms restrict a memento to a single collection and others allow mementos to cross collections. Knowledge of collection structures has implications for many different applications and users. Visitors need to understand how to navigate collections. Future archivists need to understand what options are available for designing collections. Platform designers need to know what possibilities exist. The developers of tools that consume collections need to understand collection structures so they can meet the needs of their users.
Applications (apps) of the Digital Sharing Economy (DSE), such as Uber, Airbnb, and TaskRabbit, have become a main enabler of economic growth and shared prosperity in modern-day societies. However, the complex exchange of goods, services, and data that takes place over these apps frequently puts their end-users' privacy at risk. Privacy policies of DSE apps are provided to disclose how private user data is collected and handled. In reality, however, such policies are verbose and difficult to understand, leaving DSE users vulnerable to privacy-intrusive practices. To address these concerns, in this paper, we propose an automated approach for annotating privacy policies in the DSE market. Our approach identifies data collection claims in these policies and maps them to the quality features of their apps. Visual and textual annotations are then used to further explain and justify these claims. The proposed approach is evaluated with 18 DSE app users. The results show that annotating privacy policies can significantly enhance their comprehensibility to the average DSE user. Our findings are intended to help DSE app developers draft more comprehensible privacy policies and to help their end-users make more informed decisions in one of the fastest-growing software ecosystems in the world.
With the rapid development of Internet technology, people have access to an ever-growing variety of web page resources. At the same time, the rapid progress of deep learning is often inseparable from these huge Web data resources, and NLP techniques such as web page data extraction are an important part of processing them. Current web page text extraction techniques mainly rely on a single heuristic function or strategy, and most require thresholds to be determined manually. As the number and variety of web resources grow rapidly, a single strategy still struggles to extract the main text of different kinds of pages. This paper proposes a web page text extraction algorithm based on multi-feature fusion. According to the textual characteristics of web resources, DOM nodes are used as the extraction unit; multiple statistical features are designed for each node, along with higher-order features derived from heuristic strategies. The method trains a small neural network that takes the multiple features of a DOM node as input and predicts whether the node contains body text, making full use of different statistical cues and extraction strategies and adapting to more types of pages. Experimental results show that the method extracts web page text effectively and avoids the need to determine thresholds manually.
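A minimal sketch of this kind of multi-feature, per-node classifier, assuming BeautifulSoup for DOM parsing and scikit-learn for the small network; the specific features (text length, link-text ratio, punctuation count, tag density) are illustrative stand-ins for the statistical and heuristic features described above, not the paper's exact feature set:

```python
from bs4 import BeautifulSoup
from sklearn.neural_network import MLPClassifier
import numpy as np

def node_features(node):
    """Illustrative statistical features for one DOM node."""
    text = node.get_text(" ", strip=True)
    link_text = " ".join(a.get_text(" ", strip=True) for a in node.find_all("a"))
    n_tags = len(node.find_all(True))
    return [
        len(text),                           # raw text length
        len(link_text) / (len(text) + 1),    # link-text ratio (boilerplate cue)
        text.count(",") + text.count("."),   # punctuation count (prose cue)
        n_tags / (len(text) + 1),            # tag density
    ]

def candidate_nodes(html):
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(["p", "div", "article", "section"])

# The "small neural network" fusing the features; trained on labelled nodes, e.g.
# clf.fit(X_train, y_train) with y_train[i] = 1 if node i belongs to the main text.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500)

def extract_text(html, model):
    nodes = candidate_nodes(html)
    X = np.array([node_features(n) for n in nodes])
    keep = model.predict(X) == 1             # learned decision, no manual threshold
    return "\n".join(n.get_text(" ", strip=True) for n, k in zip(nodes, keep) if k)
```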
Bayesian Networks (BNs) have become increasingly popular over the last few decades as a tool for reasoning under uncertainty in fields as diverse as medicine, biology, epidemiology, economics, and the social sciences. This is especially true in real-world settings where we seek to answer complex questions based on hypothetical evidence in order to determine actions for intervention. However, determining the graphical structure of a BN remains a major challenge, especially when modelling a problem under causal assumptions. Solutions to this problem include the automated discovery of BN graphs from data, constructing them from expert knowledge, or a combination of the two. This paper provides a comprehensive review of combinatorial algorithms proposed for learning BN structure from data, describing 74 algorithms including prototypical, well-established, and state-of-the-art approaches. The basic approach of each algorithm is described in consistent terms, and the similarities and differences between them are highlighted. Methods of evaluating algorithms and their comparative performance are discussed, including the consistency of claims made in the literature. Approaches for dealing with data noise in real-world datasets and for incorporating expert knowledge into the learning process are also covered.
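As an illustration of the prototypical score-based family covered by such reviews, the sketch below implements greedy hill-climbing over edge additions with a BIC score for discrete data; the restriction to additions only (no deletions or reversals) and all variable names are simplifications for brevity:

```python
import numpy as np
from itertools import permutations

def bic_family(data, child, parents):
    """BIC contribution of one node given its parent set (discrete data)."""
    n = len(data)
    child_vals = np.unique(data[:, child])
    if parents:
        _, idx = np.unique(data[:, parents], axis=0, return_inverse=True)
        idx = idx.reshape(-1)
    else:
        idx = np.zeros(n, dtype=int)
    q = int(idx.max()) + 1                    # number of parent configurations
    log_lik = 0.0
    for j in range(q):
        rows = data[idx == j]
        for v in child_vals:
            n_jv = np.sum(rows[:, child] == v)
            if n_jv > 0:
                log_lik += n_jv * np.log(n_jv / len(rows))
    n_params = q * (len(child_vals) - 1)
    return log_lik - 0.5 * np.log(n) * n_params

def creates_cycle(edges, new_edge):
    """Would adding new_edge = (u, v) introduce a directed cycle?"""
    u, v = new_edge
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
    stack, seen = [v], set()
    while stack:                              # search for a path v -> ... -> u
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, ()))
    return False

def hill_climb(data):
    """Greedy edge additions: repeatedly add the single edge with the best BIC gain."""
    d = data.shape[1]
    edges, parents = set(), {i: [] for i in range(d)}
    while True:
        best = None
        for u, v in permutations(range(d), 2):
            if (u, v) in edges or creates_cycle(edges, (u, v)):
                continue
            gain = bic_family(data, v, parents[v] + [u]) - bic_family(data, v, parents[v])
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, u, v)
        if best is None:
            return edges                      # no edge improves the score any further
        _, u, v = best
        edges.add((u, v))
        parents[v].append(u)

# Usage: edges = hill_climb(discrete_data) with discrete_data of shape (n_samples, n_vars).
```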
Trusted execution environments (TEEs) are quickly rising in popularity as they make it possible to run workloads in the cloud without having to trust cloud service providers, by offering additional hardware-assisted security guarantees. One key mechanism for server-grade TEEs is main memory encryption, as it not only prevents system-level attackers from reading the TEE's content, but also provides protection against physical, off-chip attackers. The recent Cipherleaks attacks show that the memory encryption of AMD SEV-SNP, and potentially of other TEEs, is vulnerable to a new kind of attack, dubbed the ciphertext side-channel. The ciphertext side-channel leaks secret data from TEE-protected implementations by analyzing ciphertext patterns that arise from deterministic memory encryption. It cannot be mitigated by current best practices like data-oblivious constant-time code. As these ciphertext leakages are inherent to deterministic memory encryption, a hardware fix on existing systems is unlikely. Thus, in this paper, we present a software-based, drop-in solution that can harden existing binaries such that they can be safely executed under TEEs vulnerable to ciphertext side-channels. We combine taint tracking with both static and dynamic binary instrumentation to find sensitive memory locations, and we prevent the leakage by masking secret data before it is written to memory. This way, although the memory encryption remains deterministic, we destroy any secret-dependent patterns in encrypted memory. We show that our proof-of-concept implementation can protect constant-time EdDSA and ECDSA implementations against ciphertext side-channels.
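A conceptual sketch of the masking idea, in Python rather than at the binary level: a keyed, address-tweaked hash stands in for deterministic memory encryption, and `masked_store` is a hypothetical helper showing how XOR-ing a secret with a fresh random mask before every store removes repeated-ciphertext patterns:

```python
import hmac, hashlib, os

KEY = os.urandom(32)          # stands in for the CPU's memory-encryption key

def deterministic_encrypt(address: int, plaintext: bytes) -> bytes:
    """Toy model of deterministic, address-tweaked memory encryption."""
    return hmac.new(KEY, address.to_bytes(8, "little") + plaintext, hashlib.sha256).digest()

secret = b"\x2a" * 16

# Unprotected: the same secret at the same address always yields the same ciphertext,
# so an attacker watching ciphertexts learns when values repeat (the side-channel).
assert deterministic_encrypt(0x1000, secret) == deterministic_encrypt(0x1000, secret)

def masked_store(address: int, plaintext: bytes):
    """Hypothetical instrumented store: mask with fresh randomness before it hits memory."""
    mask = os.urandom(len(plaintext))
    masked = bytes(p ^ m for p, m in zip(plaintext, mask))
    return deterministic_encrypt(address, masked), mask   # mask is kept to unmask on load

(c1, _mask1), (c2, _mask2) = masked_store(0x1000, secret), masked_store(0x1000, secret)
assert c1 != c2   # same secret, different ciphertexts: the repeating pattern is gone
```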
The metaverse has rekindled human beings' desire to further break space-time barriers by fusing the virtual and real worlds. However, security and privacy threats hinder us from building a utopia. A metaverse embraces various techniques, while at the same time inheriting their pitfalls and thus exposing large attack surfaces. Blockchain, proposed in 2008, is regarded as a key building block of metaverses: it enables transparent and trusted computing environments using tamper-resistant decentralized ledgers. Currently, blockchain supports Decentralized Finance (DeFi) and Non-Fungible Tokens (NFTs) for metaverses. However, the power of blockchain has not been sufficiently exploited. In this article, we propose a novel trustless architecture for a blockchain-enabled metaverse, aiming to provide efficient resource integration and allocation by consolidating hardware and software components. To realize our design objectives, we provide an On-Demand Trusted Computing Environment (OTCE) technique based on local trust evaluation. Specifically, the architecture adopts a hypergraph to represent a metaverse, in which each hyperedge links a group of users with a certain relationship. The trust level of each user group can then be evaluated using graph analytics techniques. Based on its trust value, each group can determine its security plan on demand, free from interference by irrelevant nodes. Besides, OTCEs enable large-scale and flexible application environments (sandboxes) while preserving strong security guarantees.
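A minimal sketch of the hypergraph representation and group-level trust evaluation described above; the user names, trust scores, and the weakest-link aggregation rule are all illustrative assumptions, since the article leaves the concrete graph-analytics metric open:

```python
# Per-user trust scores (illustrative values).
user_trust = {"alice": 0.9, "bob": 0.7, "carol": 0.4, "dave": 0.8}

# Hypergraph: each hyperedge links the group of users sharing one relationship.
hyperedges = {
    "guild_chat":  {"alice", "bob", "carol"},
    "asset_trade": {"alice", "dave"},
}

def group_trust(edge):
    """Trust level of the group linked by one hyperedge (weakest-link rule, illustrative)."""
    return min(user_trust[u] for u in hyperedges[edge])

def security_plan(edge, threshold=0.6):
    """Each group chooses its on-demand plan from its own trust level, independent of other nodes."""
    return "lightweight sandbox" if group_trust(edge) >= threshold else "hardened OTCE"

for edge in hyperedges:
    print(edge, round(group_trust(edge), 2), security_plan(edge))
```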
In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time that is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse, and often better, in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.
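A minimal sketch of the linear-Gaussian baseline setting referenced above: the classical learner regresses the final state directly on the initial one, while the LuPI learner uses the intermediate time steps (privileged, training-time-only data) to fit one least-squares map per step and composes them at prediction time; the simulated dynamics and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 200, 5, 4                        # training samples, state dimension, horizon

# Simulated linear-Gaussian dynamics; X[t] has shape (n, d).
A_true = rng.normal(scale=0.5, size=(T, d, d))
X = [rng.normal(size=(n, d))]
for t in range(T):
    X.append(X[t] @ A_true[t].T + 0.1 * rng.normal(size=(n, d)))

def lstsq(inp, out):
    """Ordinary least-squares map inp -> out."""
    return np.linalg.lstsq(inp, out, rcond=None)[0]

# Classical learner: only (X[0], X[T]) pairs are available at training time.
B_direct = lstsq(X[0], X[T])

# LuPI learner: the intermediate steps X[1..T-1] are privileged training-time data;
# fit one map per step and compose them, since privileged data is absent at test time.
B_steps = [lstsq(X[t], X[t + 1]) for t in range(T)]

def predict_lupi(x0):
    for B in B_steps:
        x0 = x0 @ B
    return x0

x0_test = rng.normal(size=(1000, d))       # at deployment both learners see only x0
pred_classical, pred_lupi = x0_test @ B_direct, predict_lupi(x0_test)
```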
Advances in artificial intelligence often stem from the development of new environments that abstract real-world situations into a form where research can be done conveniently. This paper contributes such an environment based on ideas inspired by elementary Microeconomics. Agents learn to produce resources in a spatially complex world, trade them with one another, and consume those that they prefer. We show that the emergent production, consumption, and pricing behaviors respond to environmental conditions in the directions predicted by supply and demand shifts in Microeconomics. We also demonstrate settings where the agents' emergent prices for goods vary over space, reflecting the local abundance of goods. After the price disparities emerge, some agents then discover a niche of transporting goods between regions with different prevailing prices -- a profitable strategy because they can buy goods where they are cheap and sell them where they are expensive. Finally, in a series of ablation experiments, we investigate how choices in the environmental rewards, bartering actions, agent architecture, and ability to consume tradable goods can either aid or inhibit the emergence of this economic behavior. This work is part of the environment development branch of a research program that aims to build human-like artificial general intelligence through multi-agent interactions in simulated societies. By exploring which environment features are needed for the basic phenomena of elementary microeconomics to emerge automatically from learning, we arrive at an environment that differs from those studied in prior multi-agent reinforcement learning work along several dimensions. For example, the model incorporates heterogeneous tastes and physical abilities, and agents negotiate with one another as a grounded form of communication.
Graph machine learning has been extensively studied in both academia and industry. However, as the literature on graph learning booms with a vast number of emerging methods and techniques, it becomes increasingly difficult to manually design the optimal machine learning algorithm for different graph-related tasks. To tackle this challenge, automated graph machine learning, which aims to discover the best hyper-parameter and neural architecture configuration for different graph tasks/data without manual design, is attracting increasing attention from the research community. In this paper, we extensively discuss automated graph machine learning approaches, covering hyper-parameter optimization (HPO) and neural architecture search (NAS) for graph machine learning. We briefly overview existing libraries designed for either graph machine learning or automated machine learning, and then introduce in depth AutoGL, our dedicated and the world's first open-source library for automated graph machine learning. Last but not least, we share our insights on future research directions for automated graph machine learning. This paper is the first systematic and comprehensive discussion of approaches, libraries, and directions for automated graph machine learning.
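A minimal sketch of the HPO side of automated graph machine learning: random search over a small GNN configuration space. The function `train_and_evaluate_gnn` is a hypothetical stand-in for any trainer that returns a validation score for a graph task; it does not correspond to a specific AutoGL API:

```python
import random

search_space = {                      # illustrative GNN configuration space
    "hidden_dim":    [16, 64, 128],
    "num_layers":    [2, 3, 4],
    "dropout":       [0.2, 0.5],
    "learning_rate": [1e-3, 5e-3, 1e-2],
}

def sample_config():
    return {name: random.choice(choices) for name, choices in search_space.items()}

def random_search(train_and_evaluate_gnn, n_trials=20):
    """Return the sampled configuration with the best validation score."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config()
        score = train_and_evaluate_gnn(**cfg)   # user-supplied trainer for the graph task
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```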
The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be alleviated if we could partially predict a network's trained accuracy from its initial state. In this work, we examine the overlap of activations between datapoints in untrained networks and motivate how this can give a measure that is usefully indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training, in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101, NAS-Bench-201, NATS-Bench, and Network Design Spaces. Our approach can be readily combined with more expensive search methods; we examine a simple adaptation of regularised evolutionary search. Code for reproducing our experiments is available at https://github.com/BayesWatch/nas-without-training.
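A sketch of this training-free measure following the scoring rule described in the NASWOT paper: record each datapoint's binary ReLU activation pattern for one minibatch, build the kernel whose entries count the units on which two datapoints agree, and take its log-determinant as the score; the hook placement and the toy network in the usage example are illustrative:

```python
import torch
import torch.nn as nn

def naswot_score(network: nn.Module, minibatch: torch.Tensor) -> float:
    """Log-determinant of the ReLU activation-overlap kernel for one untrained network."""
    codes = []

    def hook(_module, _inputs, output):
        # 1 where the ReLU fires, 0 where it does not, flattened per datapoint
        codes.append((output > 0).flatten(1).float())

    handles = [m.register_forward_hook(hook)
               for m in network.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        network(minibatch)
    for h in handles:
        h.remove()

    c = torch.cat(codes, dim=1)               # (batch, total ReLU units), binary codes
    # K[i, j] = number of units on which datapoints i and j agree
    K = c @ c.t() + (1 - c) @ (1 - c).t()
    return torch.linalg.slogdet(K).logabsdet.item()

# Usage: score a small untrained CNN on one random CIFAR-sized minibatch.
net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                    nn.Flatten(), nn.Linear(16 * 30 * 30, 10))
print(naswot_score(net, torch.randn(32, 3, 32, 32)))
```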
Most of the internet today is composed of digital media, including videos and images. With pixels becoming the currency in which most transactions happen on the internet, it is becoming increasingly important to have a way of browsing through this ocean of information with relative ease. YouTube has 400 hours of video uploaded every minute, and many millions of images are browsed on Instagram, Facebook, etc. Inspired by recent advances in deep learning and the success it has achieved on problems such as image captioning, machine translation, word2vec, and skip-thoughts, we present DeepSeek, a natural-language-processing-based deep learning model that allows users to enter a description of the kind of images they want to search for; in response, the system retrieves all the images that semantically and contextually relate to the query. Two approaches are described in the following sections.