The adoption of generative Artificial Intelligence (GAI) in organizational settings calls into question workers' roles, and relatedly, the implications for their long-term skill development and domain expertise. In our qualitative study in the software engineering domain, we build on the theoretical lenses of occupational identity and self-determination theory to understand how and why software engineers make sense of GAI for their work. We find that engineers' sense-making is contingent on domain expertise, as juniors and seniors felt their needs for competence, autonomy, and relatedness to be differently impacted by GAI. We shed light on the importance of the individual's role in preserving tacit domain knowledge as engineers engaged in sense-making that protected their occupational identity. We illustrate how organizations play an active role in shaping workers' sense-making process and propose design guidelines on how organizations and system designers can facilitate the impact of technological change on workers' occupational identity.
State-of-the-art Multiple Object Tracking (MOT) approaches have shown remarkable performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear weather scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of trackers against these challenging conditions remains underexplored. To address this gap, we introduce physics-based volumetric fog simulation method for arbitrary MOT datasets, utilizing frame-by-frame monocular depth estimation and a fog formation optical model. We enhance our simulation by rendering both homogeneous and heterogeneous fog and propose to use the dark channel prior method to estimate atmospheric light, showing promising results even in night and indoor scenes. We present the leading benchmark MOTChallenge (third release) augmented with fog (smoke for indoor scenes) of various intensities and conduct a comprehensive evaluation of MOT methods, revealing their limitations under fog and fog-like challenges.
We explore some of the existing research in HCI around technology for older adults and examine the role of LLMs in enhancing it. We also discuss the digital divide and emphasize the need for inclusive technology design. At the same time, we also surface concerns regarding privacy, security, and the accuracy of information provided by LLMs, alongside the importance of user-centered design to make technology accessible and effective for the elderly. We show the transformative possibilities of LLM-supported interactions at the intersection of aging, technology, and human-computer interaction, advocating for further research and development in this area.
In recent years, more and more researchers have reflected on the undervaluation of emotion in data visualization and highlighted the importance of considering human emotion in visualization design. Meanwhile, an increasing number of studies have been conducted to explore emotion-related factors. However, so far, this research area is still in its early stages and faces a set of challenges, such as the unclear definition of key concepts, the insufficient justification of why emotion is important in visualization design, and the lack of characterization of the design space of affective visualization design. To address these challenges, first, we conducted a literature review and identified three research lines that examined both emotion and data visualization. We clarified the differences between these research lines and kept 109 papers that studied or discussed how data visualization communicates and influences emotion. Then, we coded the 109 papers in terms of how they justified the legitimacy of considering emotion in visualization design (i.e., why emotion is important) and identified five argumentative perspectives. Based on these papers, we also identified 61 projects that practiced affective visualization design. We coded these design projects in three dimensions, including design fields (where), design tasks (what), and design methods (how), to explore the design space of affective visualization design.
The growing usage of Large Language Models (LLMs) highlights the demands and challenges in scalable LLM inference systems, affecting deployment and development processes. On the deployment side, there is a lack of comprehensive analysis on the conditions under which a particular scheduler performs better or worse, with performance varying substantially across different schedulers, hardware, models, and workloads. Manually testing each configuration on GPUs can be prohibitively expensive. On the development side, unpredictable performance and unknown upper limits can lead to inconclusive trial-and-error processes, consuming resources on ideas that end up ineffective. To address these challenges, we introduce INFERMAX, an analytical framework that uses inference cost models to compare various schedulers, including an optimal scheduler formulated as a constraint satisfaction problem (CSP) to establish an upper bound on performance. Our framework offers in-depth analysis and raises essential questions, challenging assumptions and exploring opportunities for more efficient scheduling. Notably, our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemptions at all. We believe our methods and insights will facilitate the cost-effective deployment and development of scalable, efficient inference systems and pave the way for cost-based scheduling.
Automatic prompt engineering aims to enhance the generation quality of large language models (LLMs). Recent works utilize feedbacks generated from erroneous cases to guide the prompt optimization. During inference, they may further retrieve several semantically-related exemplars and concatenate them to the optimized prompts to improve the performance. However, those works only utilize the feedback at the current step, ignoring historical and unseleccted feedbacks which are potentially beneficial. Moreover, the selection of exemplars only considers the general semantic relationship and may not be optimal in terms of task performance and matching with the optimized prompt. In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. Specifically, we design an exemplar-guided reflection mechanism where the feedback generation is additionally guided by the generated exemplars. We further build two kinds of memory to fully utilize the historical feedback information and support more effective exemplar retrieval. Empirical evaluations show our method surpasses previous state-of-the-arts with less optimization steps, i.e., improving F1 score by 10.1 on LIAR dataset, and reducing half of the optimization steps on ProTeGi.
As Large Language Models (LLMs) continue to advance in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become crucial for real-world applications. While existing benchmarks assess various LLM capabilities, they rarely focus on instruction-following in long-context scenarios or stability on different inputs. In response, we introduce the Long-context Instruction-Following Benchmark (LIFBench), a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, supported by 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment framework that provides precise, automated scoring of complex LLM responses without relying on LLM-assisted evaluations or human judgments. This approach facilitates a comprehensive analysis of model performance and stability across various perspectives. We conduct extensive experiments on 20 notable LLMs across six length intervals, analyzing their instruction-following capabilities and stability. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex, long-context settings, providing insights that can inform future LLM development.
This paper sheds light on the limitations of Large Language Models (LLMs) by rigorously evaluating their ability to process masked text. We introduce two novel tasks: MskQA, measuring reasoning on masked question-answering datasets like RealtimeQA, and MskCal, assessing numerical reasoning on masked arithmetic problems.Testing GPT-4o and 4o-mini reveals that while LLMs exhibit some resilience to masked text, their performance is highly contingent on masking rates and semantic cues. Specifically, "solid masking," where semantic clues are entirely absent, leads to a significant performance drop compared to "partial lifting," where some semantic information is retained, indicating LLMs' reliance on surface-level patterns. Interestingly, GPT-4o consistently outperforms 4o-mini, particularly in MskCal, demonstrating a greater ability to handle numerical reasoning with masked text. This underscores the crucial role of semantic cues in the reasoning process of LLMs. Our study illuminates the interplay between background knowledge and reasoning ability in masked text processing, paving the way for a deeper understanding of LLM capabilities and limitations, and highlighting the need for more robust evaluation methods to accurately assess their true comprehension abilities.
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (eg. sentiment classification, span-prediction based question answering or machine translation). However, it builds upon the assumption that the data distribution is stationary, ie. that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models as demonstrated on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviate catastrophic forgetting issues during adaptation.
This work considers the question of how convenient access to copious data impacts our ability to learn causal effects and relations. In what ways is learning causality in the era of big data different from -- or the same as -- the traditional one? To answer this question, this survey provides a comprehensive and structured review of both traditional and frontier methods in learning causality and relations along with the connections between causality and machine learning. This work points out on a case-by-case basis how big data facilitates, complicates, or motivates each approach.