Many machine learning problems require dataset valuation, i.e. quantifying the incremental gain, with respect to some relevant pre-defined utility, of adding an individual dataset to others. As seminal examples, dataset valuation has been leveraged in collaborative and federated learning to create incentives for data sharing across several data owners. The Shapley value has recently been proposed as a principled tool to achieve this goal due to its formal axiomatic justification. Since its computation often requires exponential time, standard approximation strategies based on Monte Carlo integration have been considered. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit the knowledge about the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation of the Shapley value, referred to as discrete uniform Shapley (DU-Shapley), which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevance of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and show that DU-Shapley tends towards the Shapley value when the number of data owners is large. The benefits of the proposed framework are finally illustrated on several dataset valuation benchmarks. DU-Shapley outperforms other Shapley value approximations, even when the number of data owners is small.
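To make the setting concrete, the sketch below contrasts a standard permutation-sampling Monte Carlo estimator of the Shapley value with a coalition-size-uniform sampler, which reflects the "expectation under a discrete uniform distribution" viewpoint. The `utility` function, the toy owners, and the sampling choices are illustrative assumptions; this is not the exact DU-Shapley estimator nor the paper's implementation.

```python
import random

def mc_shapley(utility, owners, num_samples=1000, seed=0):
    """Permutation-sampling Monte Carlo estimator of Shapley values.

    `utility` maps a frozenset of data owners to a real number, e.g. the
    validation score of a model trained on the pooled datasets (assumption).
    """
    rng = random.Random(seed)
    values = {i: 0.0 for i in owners}
    for _ in range(num_samples):
        perm = list(owners)
        rng.shuffle(perm)
        coalition, prev = frozenset(), utility(frozenset())
        for i in perm:
            coalition = coalition | {i}
            curr = utility(coalition)
            values[i] += (curr - prev) / num_samples
            prev = curr
    return values

def size_uniform_shapley(utility, owners, num_samples=1000, seed=0):
    """Estimator that draws a coalition size k ~ Uniform{0, ..., n-1}, then a
    uniform coalition of that size -- the discrete-uniform viewpoint.
    (Illustrative only; not the exact DU-Shapley estimator.)
    """
    rng = random.Random(seed)
    owners = list(owners)
    n = len(owners)
    values = {}
    for i in owners:
        rest = [j for j in owners if j != i]
        total = 0.0
        for _ in range(num_samples):
            k = rng.randrange(n)                       # coalition size
            coalition = frozenset(rng.sample(rest, k))
            total += utility(coalition | {i}) - utility(coalition)
        values[i] = total / num_samples
    return values

# Toy utility: the value of a coalition is the total size of its datasets.
sizes = {"A": 100, "B": 300, "C": 600}
u = lambda S: sum(sizes[i] for i in S)
print(mc_shapley(u, sizes.keys(), 200))
print(size_uniform_shapley(u, sizes.keys(), 200))
```

For this additive toy utility each owner's Shapley value equals its own dataset size, so both estimators should return values close to 100, 300, and 600.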
The development of Machine Learning is attracting growing interest from the general public, and in recent years numerous press articles have questioned its objectivity: racism, sexism, and other biases. Driven by the growing attention of regulators to the ethical use of data in insurance, the actuarial community must rethink pricing and risk selection practices to achieve fairer insurance. Equity is a philosophical concept whose many definitions differ across jurisdictions and influence one another without, so far, reaching consensus. In Europe, the Charter of Fundamental Rights defines guidelines on discrimination, and the use of sensitive personal data in algorithms is regulated. While simply removing the protected variables prevents so-called `direct' discrimination, models can still `indirectly' discriminate between individuals through latent interactions between variables, which improve performance (and therefore the quantification of risk, the segmentation of prices, and so on). After introducing the key concepts related to discrimination, we illustrate the complexity of quantifying them. We then propose an innovative method, not previously found in the literature, to reduce the risk of indirect discrimination using concepts from linear algebra. This technique is illustrated on a concrete case of risk selection in life insurance, demonstrating its ease of use and its promising performance.
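As a rough illustration of the kind of linear-algebra step such a method might rely on, the sketch below projects the rating features onto the orthogonal complement of the (centered) protected attributes, removing their linear correlation with the sensitive data before a model is fitted. This is a generic debiasing technique under our own assumptions, not necessarily the construction proposed in the paper; all variable names are hypothetical.

```python
import numpy as np

def remove_linear_dependence(X, S):
    """Project the feature matrix X onto the orthogonal complement of the
    column space of the centered sensitive attributes S, so that the
    resulting features are linearly uncorrelated with S.
    (Generic linear-algebra debiasing step; the paper's exact method may differ.)
    """
    Xc = X - X.mean(axis=0)
    Sc = S - S.mean(axis=0)
    # Least-squares regression of X on S, then subtract the explained part.
    beta, *_ = np.linalg.lstsq(Sc, Xc, rcond=None)
    return Xc - Sc @ beta

# Hypothetical usage: X holds rating factors, S the protected attribute(s);
# the debiased features would feed the downstream pricing / risk-selection model.
rng = np.random.default_rng(0)
S = rng.integers(0, 2, size=(500, 1)).astype(float)     # binary protected attribute
X = rng.normal(size=(500, 4)) + 0.8 * S                 # features correlated with S
X_debiased = remove_linear_dependence(X, S)
print(np.abs(np.corrcoef(X_debiased.T, S.T)[-1, :-1]))  # near-zero correlations with S
```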
Randomized controlled trials (RCTs) are a cornerstone of comparative effectiveness research because they remove the confounding bias present in observational studies. However, RCTs are typically much smaller than observational studies because of financial and ethical considerations. It is therefore of great interest to be able to incorporate plentiful observational data into the analysis of smaller RCTs. Previous estimators developed for this purpose rely on unrealistic additional assumptions, without which the added data can bias the effect estimate. Recent work proposed an alternative method (prognostic adjustment) that imposes no additional assumptions and increases efficiency in the analysis of RCTs. The idea is to use the observational data to learn a prognostic model: a regression of the outcome onto the covariates. The predictions from this model, generated from the RCT subjects' baseline variables, are used as a covariate in a linear model. In this work, we extend this framework to inference with nonparametric efficient estimators in trial analysis. Using simulations, we find that this approach provides greater power (i.e., smaller standard errors) than estimators without prognostic adjustment, especially when the trial is small. We also find that the method is robust to observed or unobserved shifts between the observational and trial populations and does not introduce bias. Lastly, we showcase this estimator leveraging real-world historical data on a randomized blood transfusion study of trauma patients.
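The sketch below illustrates the prognostic-adjustment workflow on simulated data: fit a prognostic model on a large historical dataset, then include its predictions as a covariate when estimating the treatment effect in a small trial. For simplicity it uses a linear working model rather than the nonparametric efficient estimators studied in the paper; all variable names and data-generating choices are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
coefs = np.array([1.0, -0.5, 0.3, 0.0, 0.2])

# 1) Large observational (historical) dataset: covariates W_obs, outcome Y_obs.
n_obs = 5000
W_obs = rng.normal(size=(n_obs, 5))
Y_obs = W_obs @ coefs + rng.normal(size=n_obs)

# 2) Fit the prognostic model m(W) ~ E[Y | W] on the observational data.
prognostic = GradientBoostingRegressor().fit(W_obs, Y_obs)

# 3) Small trial: covariates W, randomized treatment A, outcome Y.
n_rct = 200
W = rng.normal(size=(n_rct, 5))
A = rng.integers(0, 2, size=n_rct)
tau = 0.5                                   # true treatment effect (simulation only)
Y = W @ coefs + tau * A + rng.normal(size=n_rct)

# 4) Use the prognostic score m(W) as an additional covariate when
#    estimating the treatment effect in the trial.
m_hat = prognostic.predict(W)
design = np.column_stack([A, m_hat])
fit = LinearRegression().fit(design, Y)
print("estimated treatment effect:", fit.coef_[0])
```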
Conditional Monte Carlo or pre-integration is a powerful tool for reducing variance and improving the regularity of integrands when using Monte Carlo and quasi-Monte Carlo (QMC) methods. To select the variable to pre-integrate, one must consider both the variable's importance and the tractability of the conditional expectation. For integrals over a Gaussian distribution, any linear combination of variables can potentially be pre-integrated. Liu and Owen (2022) propose to select the linear combination based on an active subspace decomposition of the integrand. However, pre-integrating the selected direction might be intractable. In this work, we address this issue by finding the active subspace subject to constraints such that pre-integration can be easily carried out. The proposed algorithm also provides a computationally efficient alternative to dimension reduction for pre-integrated functions. The method is applied to examples from computational finance, density estimation, and computational chemistry, and is shown to achieve smaller errors than previous methods.
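The toy sketch below shows pre-integration on a deliberately simple Gaussian integrand, $f(x) = \max(a^\top x, 0)$ with $x \sim N(0, I_d)$: the first coordinate is integrated out in closed form and scrambled Sobol' QMC handles the remaining coordinates. The direction is fixed by hand purely so that the conditional expectation is tractable, which is exactly the constraint the paper imposes on the active-subspace selection; the constrained decomposition itself is not reproduced here.

```python
import numpy as np
from scipy.stats import norm, qmc

# Toy integrand f(x) = max(a^T x, 0), x ~ N(0, I_d).  Pre-integrating the first
# coordinate: conditionally on the remaining coordinates y, the payoff is
# max(m + s z, 0) with m = sum_{j>=2} a_j y_j, s = |a_1| and z ~ N(0, 1),
# whose expectation m*Phi(m/s) + s*phi(m/s) is available in closed form.

d = 8
rng = np.random.default_rng(3)
a = rng.normal(size=d)
s = abs(a[0])                               # std. dev. along the pre-integrated direction

def pre_integrated(y):
    """Closed-form conditional expectation given the remaining coordinates."""
    m = y @ a[1:]
    return m * norm.cdf(m / s) + s * norm.pdf(m / s)

sobol = qmc.Sobol(d - 1, scramble=True, seed=3)
y = norm.ppf(sobol.random(2 ** 12))         # QMC points mapped to N(0, I_{d-1})
print("pre-integrated QMC estimate:", pre_integrated(y).mean())
print("exact value                :", np.linalg.norm(a) / np.sqrt(2 * np.pi))
```

Note that pre-integration removes the kink of the max function, so the remaining integrand is smooth and well suited to QMC.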
In Federated Learning (FL), a number of clients or devices collaborate to train a model without sharing their data. Models are optimized locally at each client and then communicated to a central hub for aggregation. While FL is an appealing decentralized training paradigm, heterogeneity among the data of different clients can cause local optimization to drift away from the global objective. To estimate and thereby remove this drift, variance reduction techniques have recently been incorporated into FL optimization. However, these approaches estimate the clients' drift inaccurately and ultimately fail to remove it properly. In this work, we propose an adaptive algorithm that accurately estimates drift across clients. Compared with previous works, our approach requires less storage and communication bandwidth as well as lower compute costs. Additionally, our proposed method induces stability by constraining the norm of the client-drift estimates, making it more practical for large-scale FL. Experimental results demonstrate that the proposed algorithm converges significantly faster and achieves higher accuracy than the baselines across various FL benchmarks.
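For context, the sketch below implements a generic drift-corrected local update in the spirit of control-variate methods such as SCAFFOLD, which is the kind of variance-reduction baseline the abstract refers to; the adaptive, norm-constrained drift estimator proposed in the paper is not reproduced, and all names and hyperparameters are illustrative.

```python
import numpy as np

def local_update(x_global, grad_fn, c_global, c_local, lr=0.1, local_steps=5):
    """One client's round: SGD steps corrected by (c_global - c_local)."""
    x = x_global.copy()
    for _ in range(local_steps):
        g = grad_fn(x)                         # gradient on the client's data
        x -= lr * (g - c_local + c_global)     # remove the estimated client drift
    # Refresh the client's drift estimate from this round's displacement.
    c_local_new = c_local - c_global + (x_global - x) / (lr * local_steps)
    return x, c_local_new

def server_round(x_global, clients, c_global, c_locals, lr=0.1, local_steps=5):
    """Aggregate client models and drift estimates (full participation)."""
    xs, new_cs = [], []
    for grad_fn, c_local in zip(clients, c_locals):
        x_i, c_i = local_update(x_global, grad_fn, c_global, c_local, lr, local_steps)
        xs.append(x_i)
        new_cs.append(c_i)
    x_global = np.mean(xs, axis=0)
    c_global = c_global + np.mean([cn - co for cn, co in zip(new_cs, c_locals)], axis=0)
    return x_global, c_global, new_cs

# Hypothetical usage: heterogeneous quadratic objectives f_i(x) = ||x - b_i||^2 / 2.
rng = np.random.default_rng(0)
targets = [rng.normal(size=3) for _ in range(4)]
clients = [lambda x, b=b: x - b for b in targets]       # exact gradients, for illustration
x, c = np.zeros(3), np.zeros(3)
cs = [np.zeros(3) for _ in clients]
for _ in range(50):
    x, c, cs = server_round(x, clients, c, cs)
print("global model:", x, " optimum:", np.mean(targets, axis=0))
```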
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present the COCO Multi-Modal Reasoning Dataset (COCO-MMRD), a novel dataset that encompasses an extensive collection of open-ended questions, rationales, and answers derived from the large-scale object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, introducing a more challenging problem that effectively assesses the reasoning capability of CoT models. Through comprehensive evaluations and detailed analyses, we provide valuable insights and propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders. Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering novel perspectives for advancing multimodal reasoning.
The advent of Web 3.0, underpinned by blockchain technologies, promises to transform the internet's landscape by empowering individuals with decentralized control over their data. However, this evolution brings unique security challenges that need to be addressed. This paper explores these complexities, focusing on enhancing privacy and anonymous auditing within blockchain structures. We present a blockchain-based architecture for Web 3.0, providing a clear perspective on its workflow and security mechanisms. We propose a security protocol for Web 3.0 systems that employs privacy-preserving techniques and anonymous auditing during runtime. Key components of our solution include the integration of privacy-enhancing techniques and the utilization of Tor for anonymous auditing. We discuss related work and propose a framework that meets these new security requirements. Lastly, we evaluate our model and compare it to existing methods. This research contributes to the foundational understanding of Web 3.0's secure structure and offers a pathway towards secure and privacy-preserving digital interactions in this novel internet landscape.
Knowledge graphs represent facts about real-world entities. Most of these facts are defined as positive statements. Negative statements are scarce but highly relevant under the open-world assumption. Furthermore, they have been demonstrated to improve the performance of several applications, notably in the biomedical domain. However, no benchmark dataset supports the evaluation of methods that consider these negative statements. We present a collection of datasets for three relation prediction tasks - protein-protein interaction prediction, gene-disease association prediction, and disease prediction - that aim to circumvent the difficulties in building benchmarks for knowledge graphs with negative statements. These datasets include data from two successful biomedical ontologies, the Gene Ontology and the Human Phenotype Ontology, enriched with negative statements. We also generate knowledge graph embeddings for each dataset with two popular path-based methods and evaluate their performance on each task. The results show that negative statements can improve the performance of knowledge graph embeddings.
Federated learning (FL) has been developed as a promising framework to leverage the resources of edge devices, enhance customers' privacy, comply with regulations, and reduce development costs. Although many methods and applications have been developed for FL, several critical challenges for practical FL systems remain unaddressed. This paper provides an outlook on FL development, organized into five emerging directions: algorithm foundation, personalization, hardware and security constraints, lifelong learning, and nonstandard data. Our unique perspectives are backed by practical observations from large-scale federated systems for edge devices.
Reasoning with knowledge expressed in natural language and Knowledge Bases (KBs) is a major challenge for Artificial Intelligence, with applications in machine reading, dialogue, and question answering. General neural architectures that jointly learn representations and transformations of text are very data-inefficient, and it is hard to analyse their reasoning process. These issues are addressed by end-to-end differentiable reasoning systems such as Neural Theorem Provers (NTPs), although they can only be used with small-scale symbolic KBs. In this paper, we first propose Greedy NTPs (GNTPs), an extension of NTPs that addresses their complexity and scalability limitations, thus making them applicable to real-world datasets. This result is achieved by dynamically constructing the computation graph of NTPs and including only the most promising proof paths during inference, yielding models that are orders of magnitude more efficient. Then, we propose a novel approach for jointly reasoning over KBs and textual mentions, by embedding logical facts and natural language sentences in a shared embedding space. We show that GNTPs perform on par with NTPs at a fraction of their cost, while achieving competitive link prediction results on large datasets, providing explanations for predictions, and inducing interpretable models. Source code, datasets, and supplementary material are available online at //github.com/uclnlp/gntp.
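A minimal sketch of the "keep only the most promising proof paths" idea is given below: rather than softly unifying a goal against every fact in the KB, retrieve the top-k facts by embedding similarity and restrict unification to those. The embedding scheme, scoring function, and toy KB are our own illustrative assumptions, not the GNTP implementation.

```python
import numpy as np

def embed(atom, symbol_emb):
    """Concatenate the embeddings of an atom's predicate and arguments."""
    return np.concatenate([symbol_emb[s] for s in atom])

def topk_facts(goal, facts, symbol_emb, k=3):
    """Return the k facts with the highest soft unification score against the goal."""
    g = embed(goal, symbol_emb)
    scores = [np.exp(-np.linalg.norm(g - embed(f, symbol_emb)) ** 2) for f in facts]
    order = np.argsort(scores)[::-1][:k]
    return [(facts[i], scores[i]) for i in order]

# Hypothetical toy KB with 2-dimensional symbol embeddings.
rng = np.random.default_rng(0)
symbols = ["locatedIn", "capitalOf", "paris", "france", "lyon", "berlin", "germany"]
symbol_emb = {s: rng.normal(size=2) for s in symbols}
facts = [("locatedIn", "paris", "france"),
         ("locatedIn", "lyon", "france"),
         ("capitalOf", "berlin", "germany")]
goal = ("capitalOf", "paris", "france")
for fact, score in topk_facts(goal, facts, symbol_emb, k=2):
    print(fact, round(score, 3))
```

In a differentiable prover the retained scores would feed further proof steps, but restricting each step to the k nearest facts is what keeps the computation graph small.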
Efficiently retrieving object instances from cluttered scenes requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build an index that focuses on the most relevant regions. However, due to the lack of bounding-box annotations for objects of interest in retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill this void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes $94k$ images with manually curated boxes from $15k$ unique landmarks. Then, we demonstrate how a landmark detector trained on our new dataset can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially at no additional memory cost, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.