We use decision theory to compare variants of differential privacy from the perspective of prospective study participants. We posit the existence of a preference ordering on the set of potential consequences that study participants can incur, which enables the analysis of individual utility functions. Drawing upon the theory of measurement, we argue that changes in expected utilities should be measured via the classic Euclidean metric. We then consider the question of which privacy guarantees would be more appealing for individuals under different decision settings. Through our analysis, we found that the nature of the potential participant's utility function, along with the specific values of $\epsilon$ and $\delta$, can greatly alter which privacy guarantees are preferable.
This study employs machine learning models to predict the failure of Peer-to-Peer (P2P) lending platforms, specifically in China. By employing the filter method and wrapper method with forward selection and backward elimination, we establish a rigorous and practical procedure that ensures the robustness and importance of variables in predicting platform failures. The research identifies a set of robust variables that consistently appear in the feature subsets across different selection methods and models, suggesting their reliability and relevance in predicting platform failures. The study highlights that reducing the number of variables in the feature subset leads to an increase in the false acceptance rate while the performance metrics remain stable, with an AUC value of approximately 0.96 and an F1 score of around 0.88. The findings of this research provide significant practical implications for regulatory authorities and investors operating in the Chinese P2P lending industry.
An initial procedure in text-as-data applications is text preprocessing. One of the typical steps, which can substantially facilitate computations, consists in removing infrequent words believed to provide limited information about the corpus. Despite popularity of vocabulary pruning, not many guidelines on how to implement it are available in the literature. The aim of the paper is to fill this gap by examining the effects of removing infrequent words for the quality of topics estimated using Latent Dirichlet Allocation. The analysis is based on Monte Carlo experiments taking into account different criteria for infrequent terms removal and various evaluation metrics. The results indicate that pruning is beneficial and that the share of vocabulary which might be eliminated can be quite considerable.
The allure of aesthetic appeal in images captivates our senses, yet the underlying intricacies of aesthetic preferences remain elusive. In this study, we pioneer a novel perspective by utilizing machine learning models that focus on aesthetic attributes known to influence preferences. Through a data mining approach, our models process these attributes as inputs to predict the aesthetic scores of images. Moreover, to delve deeper and obtain interpretable explanations regarding the factors driving aesthetic preferences, we utilize the popular Explainable AI (XAI) technique known as SHapley Additive exPlanations (SHAP). Our methodology involves employing various machine learning models, including Random Forest, XGBoost, Support Vector Regression, and Multilayer Perceptron, to compare their performances in accurately predicting aesthetic scores, and consistently observing results in conjunction with SHAP. We conduct experiments on three image aesthetic benchmarks, providing insights into the roles of attributes and their interactions. Ultimately, our study aims to shed light on the complex nature of aesthetic preferences in images through machine learning and provides a deeper understanding of the attributes that influence aesthetic judgements.
ChatGPT and other Generative Artificial Intelligence (GAI) models tend to inherit and even amplify prevailing societal biases as they are trained on large amounts of existing data. Given the increasing usage of ChatGPT and other GAI by students, faculty members, and staff in higher education institutions (HEIs), there is an urgent need to examine the ethical issues involved such as its potential biases. In this scoping review, we clarify the ways in which biases related to GAI in higher education settings have been discussed in recent academic publications and identify what type of potential biases are commonly reported in this body of literature. We searched for academic articles written in English, Chinese, and Japanese across four main databases concerned with GAI usage in higher education and bias. Our findings show that while there is an awareness of potential biases around large language models (LLMs) and GAI, the majority of articles touch on ``bias'' at a relatively superficial level. Few identify what types of bias may occur under what circumstances. Neither do they discuss the possible implications for the higher education, staff, faculty members, or students. There is a notable lack of empirical work at this point, and we call for higher education researchers and AI experts to conduct more research in this area.
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.
Microservices are increasingly used in modern applications, leading to a growing need for effective service composition solutions. However, we argue that traditional API-centric composition mechanisms (e.g., RPC, REST, and Pub/Sub) hamper the modularity of microservices. These mechanisms introduce rigid code-level coupling, scatter composition logic, and hinder visibility into cross-service data exchanges. Ultimately, these limitations complicate the maintenance and evolution of microservice-based applications. In response, we propose a rethinking of service composition and present Knactor, a new data-centric composition framework to restore the modularity that microservices were intended to offer. Knactor decouples service composition from service development, allowing composition to be implemented as explicit data exchanges among multiple services. Our initial case study suggests that Knactor simplifies service composition and creates new opportunities for optimizations.
In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.
Understanding causality helps to structure interventions to achieve specific goals and enables predictions under interventions. With the growing importance of learning causal relationships, causal discovery tasks have transitioned from using traditional methods to infer potential causal structures from observational data to the field of pattern recognition involved in deep learning. The rapid accumulation of massive data promotes the emergence of causal search methods with brilliant scalability. Existing summaries of causal discovery methods mainly focus on traditional methods based on constraints, scores and FCMs, there is a lack of perfect sorting and elaboration for deep learning-based methods, also lacking some considers and exploration of causal discovery methods from the perspective of variable paradigms. Therefore, we divide the possible causal discovery tasks into three types according to the variable paradigm and give the definitions of the three tasks respectively, define and instantiate the relevant datasets for each task and the final causal model constructed at the same time, then reviews the main existing causal discovery methods for different tasks. Finally, we propose some roadmaps from different perspectives for the current research gaps in the field of causal discovery and point out future research directions.
In contrast to batch learning where all training data is available at once, continual learning represents a family of methods that accumulate knowledge and learn continuously with data available in sequential order. Similar to the human learning process with the ability of learning, fusing, and accumulating new knowledge coming at different time steps, continual learning is considered to have high practical significance. Hence, continual learning has been studied in various artificial intelligence tasks. In this paper, we present a comprehensive review of the recent progress of continual learning in computer vision. In particular, the works are grouped by their representative techniques, including regularization, knowledge distillation, memory, generative replay, parameter isolation, and a combination of the above techniques. For each category of these techniques, both its characteristics and applications in computer vision are presented. At the end of this overview, several subareas, where continuous knowledge accumulation is potentially helpful while continual learning has not been well studied, are discussed.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.