When the assumption of homoskedasticity is violated, the conventional least squares estimator of the variance is no longer valid, and statistical inference conducted with the resulting standard errors yields misleading rejection rates. Despite a vast cross-sectional literature on the downward bias of robust standard errors, the problem has received little attention in the panel data framework. We investigate the consequences for statistical inference of the simultaneous presence of small sample sizes, heteroskedasticity, and data points with extreme values in the covariates ('good leverage points'). Focusing on one-way linear panel data models, we examine the asymptotic and finite sample properties of a battery of heteroskedasticity-consistent (HC) estimators using Monte Carlo simulations. We also propose a hybrid estimator of the variance-covariance matrix. Results show that conventional standard errors are always dominated by more conservative estimators of the variance, especially in small samples. In addition, all types of HC standard errors perform very well in terms of the size and power of tests under homoskedasticity.
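To make the comparison above concrete, the following minimal sketch (not the authors' code; the design, sample size, and variance function are illustrative assumptions) simulates a small heteroskedastic sample containing good leverage points and contrasts conventional standard errors with the HC0 and HC3 sandwich estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small heteroskedastic sample with two high-leverage regressor values.
n = 50
x = np.concatenate([rng.normal(0, 1, n - 2), [6.0, 7.0]])   # two 'good leverage points'
X = np.column_stack([np.ones(n), x])
e = rng.normal(0, 1 + 0.5 * np.abs(x))                       # error variance rises with |x|
y = 1.0 + 0.5 * x + e

# Ordinary least squares.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)                   # leverage values h_i

# Conventional (homoskedastic) variance estimator.
s2 = resid @ resid / (n - X.shape[1])
V_conv = s2 * XtX_inv

# Sandwich estimators: (X'X)^{-1} X' diag(w_i * e_i^2) X (X'X)^{-1}.
def sandwich(weights):
    meat = (X * (weights * resid**2)[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv

V_hc0 = sandwich(np.ones(n))                # HC0: w_i = 1
V_hc3 = sandwich(1.0 / (1.0 - h) ** 2)      # HC3: w_i = 1 / (1 - h_i)^2

print("conventional SE:", np.sqrt(np.diag(V_conv)))
print("HC0 SE:         ", np.sqrt(np.diag(V_hc0)))
print("HC3 SE:         ", np.sqrt(np.diag(V_hc3)))
```

In this kind of design the HC3 standard errors are typically the most conservative, which is the behavior the simulations above investigate in the panel setting.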
The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their own LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks in one comprehensive assessment, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of functionality and security. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, limited adaptability, implementation inconsistencies, prompt engineering complexity, lack of evaluator diversity, and the overlooking of cultural and ideological norms. Our discussion emphasized the urgent need for standardized methodologies, regulatory certainty, and ethical guidelines in light of Artificial Intelligence (AI) advancements, including advocating for an evolution from static benchmarks to dynamic behavioral profiling in order to accurately capture LLMs' complex behaviors and potential risks. Our study highlighted the necessity of a paradigm shift in LLM evaluation methodologies, underlining the importance of collaborative efforts toward the development of universally accepted benchmarks and the better integration of AI systems into society.
Due to various sources of uncertainty, emergent behavior, and ongoing changes, the reliability of many socio-technical systems depends on an iterative and collaborative process in which organizations (1) analyze and learn from system failures, and then (2) co-evolve both the technical and human parts of their systems based on what they learn. Many organizations have defined processes for learning from failure, often involving postmortem analyses conducted after any system failure that is judged to be sufficiently severe. Despite established processes and tool support, our preliminary research and professional experience suggest that it is not straightforward to take what was learned from a failure and successfully improve the reliability of the socio-technical system. To better understand this collaborative process and the associated challenges, we are conducting a study of how teams learn from failure. We are gathering incident reports from multiple organizations and conducting interviews with engineers and managers with relevant experience. Our analytic interest is in what teams learn as they reflect on failures, the learning processes involved, and how they use what is learned. Our data collection and analysis are not yet complete, but we have so far analyzed 13 incident reports and seven interviews. In this short paper we (1) present our preliminary findings and (2) outline our broader research plans.
We present some new results on the comparison of the minimum or maximum order statistic from a random number of non-identical random variables. Under this non-identical set-up, and under certain conditions, we prove that the random minimum (maximum) of one system dominates that of the other in the hazard rate (reversed hazard rate) order. Further, we prove a variation diminishing property (Karlin [8]) for all possible restrictions in order to derive the new results.
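For reference, the stochastic orders invoked above are the standard ones; the following display is only a reminder, with notation assumed here rather than taken from the paper, for random variables $X$ and $Y$ with distribution functions $F$ and $G$, densities $f$ and $g$, and survival functions $\bar F$ and $\bar G$ (so $r_X = f/\bar F$, $r_Y = g/\bar G$ are hazard rates and $\tilde r_X = f/F$, $\tilde r_Y = g/G$ are reversed hazard rates):
\[
X \le_{\mathrm{hr}} Y \iff \frac{\bar G(t)}{\bar F(t)} \ \text{is nondecreasing in } t \iff r_X(t) \ge r_Y(t) \ \text{for all } t,
\]
\[
X \le_{\mathrm{rh}} Y \iff \frac{G(t)}{F(t)} \ \text{is nondecreasing in } t \iff \tilde r_X(t) \le \tilde r_Y(t) \ \text{for all } t.
\]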
We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse, and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as the dot-product self-attention, positional encoding, and feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads; these insights also provide natural suggestions for alternative architectures.
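As a reference point for the components named above, the sketch below gives framework-free NumPy versions of single-head dot-product self-attention, sinusoidal positional encoding, and the position-wise feed-forward layer; it only illustrates the standard operations and is independent of the paper's approximation-rate analysis (all function names and shapes are illustrative).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    """Standard sinusoidal positional encoding, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention on a sequence X of shape (L, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise two-layer feed-forward block with ReLU."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2
```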
Unlike ordinal-utility matching markets, which are well developed from the viewpoint of both theory and practice, cardinal-utility matching markets have been left in a quandary by recent insights from a computer science perspective. The celebrated pricing-based mechanism for one-sided cardinal-utility matching markets due to Hylland and Zeckhauser (HZ), which had long eluded efficient algorithms, was finally shown to be PPAD-complete. This led us to ask: is there an alternative, polynomial-time mechanism for one-sided cardinal-utility matching markets that achieves the desirable properties of HZ, i.e.\ (ex-ante) envy-freeness (EF) and Pareto-optimality (PO)? In this paper we show: 1. The problem of finding an EF+PO lottery in a one-sided cardinal-utility matching market is PPAD-complete. 2. A $(2 + \epsilon)$-approximately envy-free and (exactly) Pareto-optimal lottery can be found in polynomial time using Nash bargaining. We also present several results on two-sided cardinal-utility matching markets, including the non-existence of EF+PO lotteries as well as the existence of justified-envy-free and weakly Pareto-optimal lotteries.
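To fix notation (our own hedged reading of the standard one-sided setup, not necessarily the paper's exact formulation): with $n$ agents and $n$ goods, agent $i$ has utility $u_{ij} \ge 0$ for good $j$, a lottery is a doubly stochastic matrix $x = (x_{ij})$, and $u_i(x_i) = \sum_j u_{ij} x_{ij}$. Ex-ante envy-freeness and one natural Nash-bargaining convex program over the assignment polytope (with disagreement utilities taken to be zero) then read:
\[
\text{EF:}\quad u_i(x_i) \;\ge\; u_i(x_k) \quad \text{for all agents } i, k,
\]
\[
\text{Nash bargaining:}\quad \max_{x \ge 0} \; \sum_i \log u_i(x_i) \quad \text{s.t.} \quad \sum_i x_{ij} = 1 \ \ \forall j, \qquad \sum_j x_{ij} = 1 \ \ \forall i.
\]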
Bridging the gap between diffusion models and human preferences is crucial for their integration into practical generative workflows. While optimizing downstream reward models has emerged as a promising alignment strategy, concerns arise regarding the risk of excessive optimization with learned reward models, which can compromise ground-truth performance. In this work, we confront the reward overoptimization problem in diffusion model alignment through the lenses of both inductive and primacy biases. We first identify the divergence of current methods from the temporal inductive bias inherent in the multi-step denoising process of diffusion models as a potential source of overoptimization. Then, surprisingly, we discover that dormant neurons in our critic model act as a regularizer against overoptimization, whereas active neurons reflect primacy bias in this setting. Motivated by these observations, we propose Temporal Diffusion Policy Optimization with critic active neuron Reset (TDPO-R), a policy gradient algorithm that exploits the temporal inductive bias of intermediate timesteps, along with a novel reset strategy that targets active neurons to counteract the primacy bias. Empirical results demonstrate the superior efficacy of our algorithms in mitigating reward overoptimization.
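As a purely illustrative reading of the reset idea, a neuron-level reset on a critic might look like the PyTorch sketch below; the critic architecture, the activity score, the reset fraction, and all names are hypothetical assumptions, not the TDPO-R implementation.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer critic; sizes are illustrative only.
critic = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))

@torch.no_grad()
def activity_scores(hidden_acts):
    """Per-neuron activity: mean |activation| normalized by the layer average,
    in the spirit of dormant-neuron analyses (an assumed score, not the paper's)."""
    mean_abs = hidden_acts.abs().mean(dim=0)
    return mean_abs / (mean_abs.mean() + 1e-8)

@torch.no_grad()
def reset_active_neurons(fc_in, fc_out, scores, top_frac=0.1):
    """Re-initialize incoming weights and zero outgoing weights of the most
    active hidden neurons, so their contribution must be relearned."""
    k = max(1, int(top_frac * scores.numel()))
    idx = torch.topk(scores, k).indices
    fresh = nn.Linear(fc_in.in_features, fc_in.out_features)  # fresh initialization
    fc_in.weight[idx] = fresh.weight[idx]
    fc_in.bias[idx] = fresh.bias[idx]
    fc_out.weight[:, idx] = 0.0

# Usage: score hidden neurons on a batch of states, then reset the most active ones.
states = torch.randn(512, 64)
hidden = torch.relu(critic[0](states))
reset_active_neurons(critic[0], critic[2], activity_scores(hidden))
```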
Fervent calls for more robust governance of the harms associated with artificial intelligence (AI) are leading to the adoption around the world of what regulatory scholars have called a management-based approach to regulation. Recent initiatives in the United States and Europe, as well as the adoption of major self-regulatory standards by the International Organization for Standardization, share a core management-based paradigm. These management-based initiatives seek to motivate an increase in human oversight of how AI tools are trained and developed. Refinements and systematization of human-guided training techniques will thus be needed to fit within this emerging management-based regulatory paradigm. If taken seriously, human-guided training can alleviate some of the technical and ethical pressures on AI, boosting AI performance with human intuition as well as better addressing the needs for fairness and effective explainability. In this paper, we discuss the connection between the emerging management-based regulatory frameworks governing AI and the need for human oversight during training. We broadly cover some of the technical components involved in human-guided training and then argue that the kinds of high-stakes use cases for AI that appear to be of most concern to regulators should lean more on human-guided training than on data-only training. We hope to foster a discussion between legal scholars and computer scientists on how to govern a domain of technology that is vast, heterogeneous, and dynamic in its applications and risks.
In this study, the main objective is to develop an algorithm capable of identifying and delineating tumor regions in breast ultrasound (BUS) and mammographic images. The technique employs two advanced deep learning architectures for tumor segmentation, namely U-Net and the pretrained Segment Anything Model (SAM). The U-Net model is specifically designed for medical image segmentation and leverages its deep convolutional encoder-decoder framework to extract meaningful features from input images. On the other hand, the pretrained SAM architecture relies on an attention-based mechanism to capture spatial dependencies and generate segmentation results. Evaluation is conducted on a diverse dataset containing annotated tumor regions in BUS and mammographic images, covering both benign and malignant tumors. This dataset enables a comprehensive assessment of the algorithm's performance across different tumor types. Results demonstrate that the U-Net model outperforms the pretrained SAM architecture in accurately identifying and segmenting tumor regions in both BUS and mammographic images. The U-Net exhibits superior performance in challenging cases involving irregular shapes, indistinct boundaries, and high tumor heterogeneity. In contrast, the pretrained SAM architecture exhibits limitations in accurately identifying tumor areas, particularly for malignant tumors and objects with weak boundaries or complex shapes. These findings highlight the importance of selecting deep learning architectures appropriately tailored to medical image segmentation. The U-Net model shows its potential as a robust and accurate tool for tumor detection, while the pretrained SAM architecture requires further improvements to enhance its segmentation performance.
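For readers less familiar with the encoder-decoder idea the comparison rests on, here is a deliberately minimal U-Net-style sketch in PyTorch (one downsampling stage, one skip connection; the layer sizes and names are illustrative assumptions and far smaller than any model actually used in the study).

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection, illustrating the U-Net idea only."""
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base, 1, 1)               # per-pixel tumor logit

    def forward(self, x):
        e = self.enc(x)                                  # high-resolution features
        b = self.bottleneck(self.down(e))                # coarse contextual features
        d = self.dec(torch.cat([self.up(b), e], dim=1))  # skip connection restores detail
        return self.head(d)

# Usage: a single 128x128 grayscale image yields a 128x128 map of mask logits.
mask_logits = TinyUNet()(torch.randn(1, 1, 128, 128))
```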
A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in Natural Language Processing (NLP), which has traditionally placed more emphasis on predictive tasks. This distinction is beginning to fade, with an emerging area of interdisciplinary research at the convergence of causal inference and language processing. Still, research on causality in NLP remains scattered across domains without unified definitions, benchmark datasets and clear articulations of the remaining challenges. In this survey, we consolidate research across academic areas and situate it in the broader NLP landscape. We introduce the statistical challenge of estimating causal effects, encompassing settings where text is used as an outcome, treatment, or as a means to address confounding. In addition, we explore potential uses of causal inference to improve the performance, robustness, fairness, and interpretability of NLP models. We thus provide a unified overview of causal inference for the computational linguistics community.
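As a one-line reminder of the statistical target discussed above (standard notation assumed here, not taken from the survey): with binary treatment $T$, outcome $Y$, potential outcomes $Y(0), Y(1)$, and a text-derived adjustment variable $Z$, the average treatment effect and its adjustment-based identification read
\[
\tau \;=\; \mathbb{E}[Y(1) - Y(0)]
\;=\; \mathbb{E}_{Z}\!\bigl[\,\mathbb{E}[Y \mid T = 1, Z] \;-\; \mathbb{E}[Y \mid T = 0, Z]\,\bigr],
\]
which requires unconfoundedness given $Z$ and overlap; when the text itself plays the role of treatment or outcome, the estimand is analogous but the measurement and identification challenges differ.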
This work considers the question of how convenient access to copious data impacts our ability to learn causal effects and relations. In what ways is learning causality in the era of big data different from -- or the same as -- the traditional one? To answer this question, this survey provides a comprehensive and structured review of both traditional and frontier methods in learning causality and relations along with the connections between causality and machine learning. This work points out on a case-by-case basis how big data facilitates, complicates, or motivates each approach.