Gradient clipping is a popular modification to standard (stochastic) gradient descent that, at every iteration, limits the gradient norm to a given value $c > 0$. It is widely used, for example, to stabilize the training of deep learning models (Goodfellow et al., 2016) or to enforce differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$, and we show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, and (ii) in the stochastic setting, convergence to the true optimum cannot be guaranteed under the standard noise assumption, even with arbitrarily small step-sizes. We give matching upper and lower bounds for the convergence of the gradient norm when running clipped SGD, and we illustrate these results with experiments.
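To make the mechanism concrete, here is a minimal sketch of clipped gradient descent with threshold $c > 0$; the toy objective, step-size, and names are illustrative stand-ins, not the paper's experimental setup.

```python
import numpy as np

def clip(g, c):
    """Rescale g so that its norm is at most c (a no-op when already within c)."""
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def clipped_gd(grad, x0, c, lr=0.1, steps=100):
    """Gradient descent with per-iteration norm clipping."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x -= lr * clip(grad(x), c)
    return x

# Toy example: minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
x_star = clipped_gd(lambda x: x, x0=[5.0, -3.0], c=1.0)
```

For clipped SGD, `grad` would instead return a stochastic gradient estimate, which is the setting of the paper's lower bound.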
When the assumption of homoskedasticity is violated, least squares estimators of the variance become inefficient, and statistical inference conducted with invalid standard errors leads to misleading rejection rates. Despite a vast cross-sectional literature on the downward bias of robust standard errors, the problem is not extensively covered in the panel data framework. We investigate the consequences for statistical inference of the simultaneous presence of a small sample size, heteroskedasticity, and data points that exhibit extreme values in the covariates ('good leverage points'). Focusing on one-way linear panel data models, we examine the asymptotic and finite sample properties of a battery of heteroskedasticity-consistent (HC) estimators using Monte Carlo simulations. We also propose a hybrid estimator of the variance-covariance matrix. Results show that conventional standard errors are always dominated by more conservative estimators of the variance, especially in small samples. In addition, all types of HC standard errors perform excellently in terms of the size and power of tests under homoskedasticity.
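For reference, a hedged NumPy sketch of the classical cross-sectional HC0-HC3 sandwich estimators is given below; the panel-data variants studied in the paper build on these forms, and all variable names here are illustrative.

```python
import numpy as np

def hc_covariances(X, y):
    """HC0-HC3 sandwich estimators of the OLS coefficient covariance."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ (X.T @ y))            # OLS residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverage values h_ii

    def sandwich(w):  # (X'X)^{-1} X' diag(w) X (X'X)^{-1}
        return XtX_inv @ (X.T * w) @ X @ XtX_inv

    return {
        "HC0": sandwich(e**2),
        "HC1": sandwich(e**2) * n / (n - k),
        "HC2": sandwich(e**2 / (1 - h)),
        "HC3": sandwich(e**2 / (1 - h) ** 2),  # most conservative in small samples
    }
```

HC3's squared leverage correction is why it behaves conservatively in the presence of good leverage points, which is precisely the small-sample regime the paper focuses on.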
Constructing a universal moral code for artificial intelligence (AI) is difficult or even impossible, given that different human cultures have different definitions of morality and different societal norms. We therefore argue that the value system of an AI should be culturally attuned: just as a child raised in a particular culture learns the specific values and norms of that culture, we propose that an AI agent operating in a particular human community should acquire that community's moral, ethical, and cultural codes. How AI systems might acquire such codes from human observation and interaction has remained an open question. Here, we propose using inverse reinforcement learning (IRL) as a method for AI agents to acquire a culturally attuned value system implicitly. We test our approach using an experimental paradigm in which AI agents use IRL to learn different reward functions, which govern the agents' moral values, by observing the behavior of different cultural groups in an online virtual world requiring real-time decision making. We show that an AI agent learning from the average behavior of a particular cultural group can acquire altruistic characteristics reflective of that group's behavior, and that this learned value system can generalize to new scenarios requiring altruistic judgments. Our results provide, to our knowledge, the first demonstration that AI agents could potentially be endowed with the ability to continually learn their values and norms from observing and interacting with humans, thereby becoming attuned to the culture they are operating in.
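As a schematic of how IRL can recover a value system from observed behavior, here is a minimal feature-matching sketch with a linear reward $r(s) = w \cdot \phi(s)$; the feature map, trajectory format, and update rule are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def feature_expectations(trajectories, phi, dim, gamma=0.95):
    """Average discounted feature counts over observed trajectories (lists of states)."""
    mu = np.zeros(dim)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += gamma**t * phi(s)
    return mu / len(trajectories)

def update_reward_weights(w, mu_expert, mu_agent, lr=0.1):
    """Nudge the reward weights so the agent's features match the observed group's."""
    w = w + lr * (mu_expert - mu_agent)
    return w / (np.linalg.norm(w) + 1e-8)  # keep weights bounded

# e.g., phi = lambda s: np.eye(4)[s] for an environment with 4 discrete states
```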
We study the problem of screening in decision-making processes under uncertainty, focusing on the impact of adding an additional screening stage, commonly known as a 'gatekeeper.' While our primary analysis is rooted in the context of job market hiring, the principles and findings are broadly applicable to areas such as educational admissions, healthcare patient selection, and financial loan approvals. The gatekeeper's role is to assess applicants' suitability before significant investments are made. Our study reveals that while gatekeepers are designed to streamline the selection process by filtering out candidates who are less likely to succeed, they can sometimes inadvertently affect the candidates' own decision-making process. We explore the conditions under which the introduction of a gatekeeper can enhance or impede the efficiency of these processes. Additionally, we consider how adjusting gatekeeping strategies might affect the accuracy of selection decisions. Our research also extends to scenarios where gatekeeping is influenced by historical biases, particularly in competitive settings like hiring. We find that candidates confronted with a statistically biased gatekeeping process are more likely to withdraw from applying, thereby perpetuating those historical biases. The study suggests that measures such as affirmative action can be effective in addressing these biases. While centered on hiring, the insights and methodologies from our study have significant implications for a wide range of fields where screening and gatekeeping are integral.
Reconfigurable intelligent surfaces (RISs) are rapidly gaining prominence in the realm of fifth generation (5G)-Advanced and, predominantly, sixth generation (6G) mobile networks, offering a revolutionary approach to optimizing wireless communications. This article delves into the intricate world of RIS technology, exploring its diverse hardware architectures and the resulting versatile operating modes. These include RISs with signal reception and processing units, sensors, amplification units, transmissive capability, multiple stacked components, and dynamic metasurface antennas. Furthermore, we shed light on emerging RIS applications, such as index and reflection modulation, non-coherent modulation, next generation multiple access, integrated sensing and communications (ISAC), energy harvesting, as well as aerial and vehicular networks. These applications are set to transform the way we connect wirelessly in the upcoming 6G era. Finally, we review recent experimental RIS setups and present various open problems of the overviewed RIS hardware architectures and their applications. From enhancing network coverage to enabling new communication paradigms, RIS-empowered connectivity is poised to play a pivotal role in shaping the future of wireless networking. This article unveils the underlying principles and potential impacts of RISs, focusing on cutting-edge developments of this physical-layer smart connectivity technology.
Mixture of experts (MoE) is a popular technique for improving the capacity of large models with conditionally-activated parallel neural network modules (experts). Due to its remarkable scaling performance with sparse computation, it is widely used in modern Large Language Models (LLMs) and Large Vision Models (LVMs). However, serving such large models on edge devices is challenging due to memory constraints. Typical solutions like memory swapping or weight pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient, continuous serving of MoE-based large models under tunable memory budgets. The main idea of SwapMoE is to keep a small, dynamic set of important experts, namely Virtual Experts, in main memory for inference, while seamlessly maintaining the mapping from Virtual Experts to the actual experts. We use a profiling-guided planner to allocate resources for SwapMoE so as to fully utilize the memory budget and bandwidth, and an importance-aware scheduler to efficiently identify, update, and use the Virtual Experts for accurate inference. To evaluate SwapMoE, we conduct experiments on multiple edge devices with state-of-the-art MoE-based Large Language Models and Large Vision Models. The results demonstrate the remarkable performance of SwapMoE under various memory constraints. Specifically, SwapMoE can run large MoE models under tight memory budgets with latency similar to that of pruned compact models, while achieving significantly higher accuracy.
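A hypothetical sketch of the Virtual Experts idea follows: keep only the most important experts resident in memory, evict or load experts as importance shifts, and fall back to a resident expert when a routed expert has been swapped out. The names (`ExpertCache`, `load_expert`) are illustrative assumptions, not SwapMoE's actual API.

```python
import heapq

class ExpertCache:
    """Keeps a budget-bounded set of 'Virtual Experts' resident in memory."""

    def __init__(self, load_expert, budget):
        self.load_expert = load_expert  # loads expert weights from disk/flash
        self.budget = budget            # max number of experts kept in memory
        self.resident = {}              # expert_id -> weights

    def update(self, importance):
        """Keep the `budget` most important experts resident (swap in/out)."""
        keep = set(heapq.nlargest(self.budget, importance, key=importance.get))
        for eid in list(self.resident):
            if eid not in keep:
                del self.resident[eid]                      # evict
        for eid in keep:
            if eid not in self.resident:
                self.resident[eid] = self.load_expert(eid)  # load

    def route(self, expert_id):
        """Serve the requested expert, or a crude resident fallback for illustration."""
        if expert_id in self.resident:
            return self.resident[expert_id]
        return next(iter(self.resident.values()))  # assumes update() ran first
```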
Modern neural recording techniques allow neuroscientists to obtain the spiking activity of multiple neurons from different brain regions over long time periods, which calls for new statistical methods for understanding the structure of such large-scale data. In this paper, we develop a bi-clustering method to cluster neural spiking activity spatially and temporally, according to its low-dimensional latent structure. The spatial (neuron) clusters are defined by the latent trajectories within each neural population, while the temporal (state) clusters are defined by (populationally) synchronous local linear dynamics shared across different periods. To flexibly extract the bi-clustering structure, we build the model non-parametrically and develop an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the posterior distributions of the model parameters. Validating the proposed MCMC algorithm through simulations, we find that the method successfully recovers unknown parameters and true bi-clustering structures. We then apply the proposed bi-clustering method to multi-regional neural recordings under different experimental settings, where we find that simultaneously considering latent trajectories and spatial-temporal clustering structures yields more accurate and interpretable results. Overall, the proposed method provides scientific insights for large-scale (count) time series with extended recording periods, and it can potentially have applications beyond neuroscience.
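To indicate the shape of such a sampler, a schematic Gibbs sweep over the two label sets is sketched below; `log_lik` is a placeholder for the paper's latent-trajectory likelihood, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_label(log_weights):
    """Draw a cluster label from unnormalized log probabilities."""
    p = np.exp(log_weights - log_weights.max())
    return rng.choice(len(p), p=p / p.sum())

def gibbs_sweep(z_neuron, z_state, K, S, log_lik):
    """One sweep: resample spatial (neuron) labels, then temporal (state) labels."""
    for i in range(len(z_neuron)):
        lw = np.array([log_lik("neuron", i, k, z_neuron, z_state) for k in range(K)])
        z_neuron[i] = sample_label(lw)
    for t in range(len(z_state)):
        lw = np.array([log_lik("state", t, s, z_neuron, z_state) for s in range(S)])
        z_state[t] = sample_label(lw)
    return z_neuron, z_state
```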
Temporal characteristics are prominently evident in a substantial volume of knowledge, which underscores the pivotal role of Temporal Knowledge Graphs (TKGs) in both academia and industry. However, TKGs often suffer from incompleteness for three main reasons: the continuous emergence of new knowledge, the weakness of algorithms for extracting structured information from unstructured data, and the lack of information in the source dataset. Thus, the task of Temporal Knowledge Graph Completion (TKGC) has attracted increasing attention, aiming to predict missing items based on the available information. In this paper, we provide a comprehensive review of TKGC methods and their details. Specifically, this paper consists of three main components: 1) Background, which covers the preliminaries of TKGC methods, the loss functions required for training, and the datasets and evaluation protocol; 2) Interpolation, which estimates and predicts missing elements or sets of elements from the relevant available information, and which further categorizes related TKGC methods by how they process temporal information; 3) Extrapolation, which typically focuses on continuous TKGs and predicts future events, and which classifies all extrapolation methods by the algorithms they utilize. We further pinpoint the challenges and discuss future research directions for TKGC.
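As a concrete instance of the interpolation family, TTransE extends TransE's translation score by treating the timestamp embedding as an extra translation vector; the random embeddings below stand in for trained ones.

```python
import numpy as np

def ttranse_score(h, r, t, tau):
    """TTransE plausibility score: f(h, r, t, tau) = -||h + r + tau - t||."""
    return -np.linalg.norm(h + r + tau - t)

dim = 32
h, r, t, tau = (np.random.randn(dim) for _ in range(4))
print(ttranse_score(h, r, t, tau))  # higher (less negative) = more plausible
```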
Conventional entity typing approaches are based on independent classification paradigms, which makes it difficult for them to recognize inter-dependent, long-tailed, and fine-grained entity types. In this paper, we argue that the extrinsic and intrinsic dependencies implicitly entailed between labels can provide critical knowledge to tackle the above challenges. To this end, we propose the \emph{Label Reasoning Network (LRN)}, which sequentially reasons over fine-grained entity labels by discovering and exploiting the label dependency knowledge entailed in the data. Specifically, LRN utilizes an auto-regressive network to conduct deductive reasoning and a bipartite attribute graph to conduct inductive reasoning between labels, which can effectively model, learn, and reason about complex label dependencies in a sequence-to-set, end-to-end manner. Experiments show that LRN achieves state-of-the-art performance on standard ultra-fine-grained entity typing benchmarks and can also effectively resolve the long-tail label problem.
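A toy sketch of sequence-to-set label decoding in the spirit of an auto-regressive label reasoner is shown below; the stand-in scoring function is an assumption for illustration, not LRN's network.

```python
import numpy as np

def decode_labels(score, end_id, max_len=8):
    """Greedily emit labels one at a time, each conditioned on labels so far."""
    predicted = []
    for _ in range(max_len):
        logits = score(predicted)       # conditions on previously emitted labels
        logits[predicted] = -np.inf     # output is a set: forbid repeats
        nxt = int(np.argmax(logits))
        if nxt == end_id:               # stop token ends the label sequence
            break
        predicted.append(nxt)
    return set(predicted)
```

Because each label is conditioned on those already emitted, dependencies such as "athlete entails person" can be exploited during decoding, which is the intuition behind sequence-to-set prediction.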
Deep generative modelling is a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of which makes trade-offs involving run-time, sample diversity, and architectural restrictions. In particular, this compendium covers energy-based models, variational autoencoders, generative adversarial networks, autoregressive models, and normalizing flows, in addition to numerous hybrid approaches. These techniques are drawn together under a single cohesive framework, compared and contrasted to explain the premises behind each, while current state-of-the-art advances and implementations are reviewed.
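As a representative example of the premises behind one such family, the variational autoencoder maximizes the evidence lower bound (ELBO) on the data log-likelihood,

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big),$$

trading exact likelihood evaluation for tractable training, whereas, for instance, normalizing flows keep likelihoods exact at the cost of architectural restrictions.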
Although measuring held-out accuracy has been the primary approach to evaluating generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models focus either on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests and found almost three times as many bugs as users without it.
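A minimal sketch of a CheckList-style Minimum Functionality Test (MFT) for negation in sentiment analysis follows; the template and `model` interface are illustrative, not the released CheckList tool's API.

```python
from itertools import product

# Template-based test generation: fill placeholders combinatorially.
template = "The {noun} was not {pos_adj}."
fills = {"noun": ["food", "service", "flight"],
         "pos_adj": ["good", "great", "amazing"]}

cases = [template.format(noun=n, pos_adj=a)
         for n, a in product(fills["noun"], fills["pos_adj"])]

def run_mft(model, cases, expected="negative"):
    """Every negated-positive sentence should be classified as negative."""
    failures = [c for c in cases if model(c) != expected]
    print(f"Failure rate: {len(failures)}/{len(cases)}")
    return failures
```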