Conversational recommender systems (CRS) involve both recommendation and dialogue tasks, which makes their evaluation a unique challenge. Although past research has analyzed, through user studies, various factors that may affect user satisfaction with CRS interactions, few evaluation metrics for CRS have been proposed. Recent studies have shown that LLMs can align with human preferences, and several LLM-based text-quality evaluation measures have been introduced. However, the application of LLMs to CRS evaluation remains limited. To address this research gap and advance the development of user-centric conversational recommender systems, this study proposes an automated LLM-based CRS evaluation framework that builds upon existing research in human-computer interaction and psychology. The framework evaluates CRS along four dimensions: dialogue behavior, language expression, recommendation items, and response content. We use this framework to evaluate four different conversational recommender systems.
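As a rough illustration of such a four-dimension scoring loop, the sketch below rates one dialogue per dimension with an LLM judge. It assumes an arbitrary chat-completion wrapper `call_llm(prompt) -> str`; the rubric wording, the 1-5 scale, and the JSON reply format are our assumptions, not the paper's protocol.

```python
# Hypothetical four-dimension LLM scoring loop (rubric and scale assumed).
import json

DIMENSIONS = ["dialogue behavior", "language expression",
              "recommendation items", "response content"]

RUBRIC = (
    "You are evaluating a conversational recommender system.\n"
    "Rate the following dialogue on the dimension '{dim}' with an integer\n"
    "score from 1 (poor) to 5 (excellent). Reply with JSON:\n"
    '{{"score": <int>, "reason": "<str>"}}\n\n'
    "Dialogue:\n{dialogue}"
)

def evaluate_dialogue(dialogue: str, call_llm) -> dict:
    """Score one CRS dialogue on all four dimensions with an LLM judge."""
    scores = {}
    for dim in DIMENSIONS:
        reply = call_llm(RUBRIC.format(dim=dim, dialogue=dialogue))
        scores[dim] = json.loads(reply)["score"]
    return scores
```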
Hybrid storage systems (HSS) combine multiple storage devices with diverse characteristics to achieve high performance and capacity at low cost. The performance of an HSS highly depends on the effectiveness of two key policies: (1) the data-placement policy, which determines the best-fit storage device for incoming data, and (2) the data-migration policy, which rearranges stored data across the devices to sustain high HSS performance. Prior works focus on improving only data placement or only data migration, which leads to sub-optimal HSS performance; no prior work optimizes both policies together. Our goal is to design a holistic data-management technique for HSS that optimizes both the data-placement and data-migration policies to fully exploit the potential of an HSS. We propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique that employs two lightweight autonomous RL agents, a data-placement agent and a data-migration agent, which adapt their policies to the current workload and HSS configuration and coordinate with each other to improve overall HSS performance. We evaluate Harmonia on a real HSS with up to four heterogeneous storage devices. Our evaluation using 17 data-intensive workloads on a performance-optimized (cost-optimized) HSS with two storage devices shows that, on average, Harmonia (1) outperforms the best-performing prior approach by 49.5% (31.7%) and (2) bridges the performance gap between the best-performing prior work and Oracle by 64.2% (64.3%). On an HSS with three (four) devices, Harmonia outperforms the best-performing prior work by 37.0% (42.0%). Harmonia's performance benefits come with low inference latency (240 ns) and small storage overhead (206 KiB for both RL agents combined). We plan to open-source Harmonia's implementation to aid future research on HSS.
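The coordination pattern, two lightweight agents each learning from device feedback, can be sketched as below. This is a toy epsilon-greedy stand-in under our own assumptions about the environment interface (`hss.write`, `hss.migrate`) and reward (negative latency); it shows only the coordination loop, not Harmonia's actual agents or reward design.

```python
# Toy two-agent coordination loop (environment interface assumed).
import random

class TinyAgent:
    """Bandit-style epsilon-greedy stand-in for one lightweight RL agent."""
    def __init__(self, n_actions, eps=0.1, lr=0.1):
        self.q = [0.0] * n_actions
        self.eps, self.lr = eps, lr

    def act(self):
        if random.random() < self.eps:
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=self.q.__getitem__)

    def learn(self, action, reward):
        self.q[action] += self.lr * (reward - self.q[action])

def run(hss, requests, n_devices, migrate_every=100):
    """hss is an assumed environment exposing write(req, dev) -> latency
    and migrate(action) -> latency; both agents learn from latency."""
    placement = TinyAgent(n_devices)   # action: which device gets the write
    migration = TinyAgent(2)           # action: 0 = keep in place, 1 = rebalance
    for i, req in enumerate(requests):
        dev = placement.act()
        placement.learn(dev, reward=-hss.write(req, dev))
        if i % migrate_every == 0:     # periodic, coordinated migration decision
            act = migration.act()
            migration.learn(act, reward=-hss.migrate(act))
```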
This work introduces a novel decentralized framework for interpreting federated learning (FL) and, consequently, for correcting the biases introduced by arbitrary client participation and data heterogeneity, two typical traits of practical FL. Specifically, we first reformulate the core processes of FedAvg (client participation, local updating, and model aggregation) as stochastic matrix multiplications. This reformulation allows us to interpret FedAvg as a decentralized algorithm. Leveraging the decentralized optimization framework, we provide a concise analysis that quantifies the impact of arbitrary client participation and data heterogeneity on FedAvg's convergence point. This insight motivates Federated Optimization with Exact Convergence via Push-pull Strategy (FOCUS), a novel algorithm, inspired by this decentralized viewpoint, that eliminates these biases and achieves exact convergence without requiring the bounded-heterogeneity assumption. Furthermore, we theoretically prove that FOCUS exhibits linear convergence (exponential decay) for both strongly convex functions and non-convex functions satisfying the Polyak-Łojasiewicz condition, regardless of the arbitrary nature of client participation.
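To make the matrix view concrete, here is a toy NumPy illustration (our construction, not the paper's notation) of one FedAvg aggregation step written as multiplication by a participation-dependent row-stochastic matrix.

```python
# One FedAvg round as a stochastic matrix multiplication (toy example).
import numpy as np

n, d = 4, 3                          # clients, model dimension
X = np.random.randn(n, d)            # row i = client i's local model
participating = np.array([1, 0, 1, 1], dtype=float)

w_row = participating / participating.sum()   # average over participants
W = np.tile(w_row, (n, 1))           # row-stochastic aggregation matrix
X_next = W @ X                       # broadcast the participant average

assert np.allclose(X_next, X_next[0])         # consensus after aggregation
# Rounds compose as products of such random matrices; if participation
# probabilities differ across clients, the long-run average over-weights
# frequent participants -- the bias the decentralized analysis quantifies.
```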
Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems and requires robust object motion estimation and instance segmentation. However, traditional methods often treat these as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes SemanticFlow, a multi-task framework that simultaneously predicts scene flow and instance segmentation for full-resolution point clouds. The novelty of this work is threefold: 1) a coarse-to-fine multi-task prediction scheme, in which an initial coarse segmentation of static backgrounds and dynamic objects provides contextual information for refining motion and semantic information through a shared feature-processing module; 2) a set of loss functions that improve scene flow estimation and instance segmentation while enforcing spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) a self-supervised learning scheme that uses the coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior instance segmentation accuracy, scene flow estimation, and computational efficiency, and establishing a new benchmark for self-supervised methods in dynamic scene understanding.
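As an illustration of the self-supervised idea in point 3), the sketch below fits a rigid transform (the standard Kabsch algorithm) to one coarse-segmented object and penalizes predicted flow that deviates from it; the function names and the exact loss form are our assumptions, not the paper's.

```python
# Rigidity-based self-supervision sketch (Kabsch fit per segmented object).
import numpy as np

def rigid_fit(p, q):
    """Least-squares rigid transform (R, t) mapping points p -> q."""
    pc, qc = p - p.mean(0), q - q.mean(0)
    U, _, Vt = np.linalg.svd(pc.T @ qc)
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ S @ U.T
    t = q.mean(0) - R @ p.mean(0)
    return R, t

def rigidity_loss(points, flow, mask):
    """Penalize flow that disagrees with the best rigid motion of one
    coarse-segmented object (mask selects its points)."""
    p = points[mask]
    q = p + flow[mask]                 # predicted next-frame positions
    R, t = rigid_fit(p, q)
    return np.mean(np.linalg.norm((p @ R.T + t) - q, axis=1))
```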
Multimodal autoregressive (AR) models, based on next-token prediction and transformer architectures, have demonstrated remarkable capabilities in various multimodal tasks, including text-to-image (T2I) generation. Although these models perform strongly on general T2I tasks, our research reveals that they initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, which leverages diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion-model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.
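Schematically, such a weak-to-strong recipe reads as below; the `.finetune`/`.sample` interfaces are placeholders for whatever subject-tuning (e.g., DreamBooth-style) and AR fine-tuning pipelines are actually used, and the sampling budget is an assumption.

```python
# Schematic proxy-tuning loop (interfaces and sample count assumed).
def proxy_tune(ar_model, diffusion_model, subject_images, prompts, n_samples=200):
    """Fine-tune a diffusion 'supervisor' on the subject, sample synthetic
    supervision from it, then fine-tune the AR 'student' on those samples."""
    diffusion_model.finetune(subject_images)               # weak supervisor
    synthetic = [(p, diffusion_model.sample(p)) for p in prompts[:n_samples]]
    ar_model.finetune(synthetic)                           # strong student
    return ar_model
```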
Deep models in industrial applications, such as deep recommendation systems, rely on thousands of features for accurate predictions. While new features are introduced to capture evolving user behavior, outdated or redundant features often remain, significantly increasing storage and computational costs. To address this issue, feature selection methods are widely adopted to identify and remove less important features. However, existing approaches face two major challenges: (1) they often require complex hyperparameter (HP) tuning, making them difficult to employ in practice, and (2) they fail to produce well-separated feature importance scores, which complicates straightforward feature removal. Moreover, the impact of removing unimportant features can only be evaluated by retraining the model, a time-consuming and resource-intensive process that severely hinders efficient feature selection. To solve these challenges, we propose Shuffle-Gate, a novel feature selection approach. In particular, it shuffles all feature values across instances simultaneously and uses a gating mechanism that allows the model to dynamically learn the weights for combining the original and shuffled inputs. Notably, it generates well-separated feature importance scores and estimates post-removal performance without retraining the model, while introducing only a single HP. Experiments on four public datasets show that our approach outperforms state-of-the-art methods in selecting the top half of the feature set for model retraining. Moreover, it has been successfully integrated into the daily iteration of Bilibili's search models across various scenarios, where it significantly reduces feature set size and computational resource usage while maintaining comparable performance.
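A minimal PyTorch reading of the shuffle-and-gate layer might look as follows; the exact parameterization and initialization are our assumptions. Features whose gates collapse toward zero under a sparsity penalty on g (plausibly the single hyperparameter mentioned above) become removal candidates, and the gated mixture fed downstream approximates post-removal performance without retraining.

```python
# Sketch of a shuffle-and-gate layer (our reading of the abstract).
import torch
import torch.nn as nn

class ShuffleGate(nn.Module):
    """Each feature f gets a learnable gate g_f in (0, 1); the layer feeds
    g_f * x_f + (1 - g_f) * shuffled(x_f) downstream."""
    def __init__(self, num_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):                            # x: (batch, num_features)
        idx = torch.argsort(torch.rand_like(x), dim=0)   # per-column permutation
        shuffled = torch.gather(x, 0, idx).detach()      # break feature-label link
        g = torch.sigmoid(self.logits)                   # per-feature gate
        return g * x + (1 - g) * shuffled

    def importance(self):
        """Gate values double as feature importance scores."""
        return torch.sigmoid(self.logits).detach()
```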
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach that directly integrates the audio-visual speech representations encoded by the AV-Romanizer into the LLM, achieved by fine-tuning the adapter and the LLM with our proposed multi-task learning scheme. To capture a wide spectrum of phonetic and linguistic diversity, we also introduce the Multilingual Audio-Visual Romanized Corpus (MARC), consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during AV-Romanizer training.
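The cascaded variant can be sketched as a two-stage pipeline; the prompt wording and the interfaces below are illustrative placeholders, not the paper's.

```python
# Sketch of the cascaded romanize-then-de-romanize pipeline (interfaces assumed).
def cascaded_zero_avsr(video, audio, target_lang, av_romanizer, call_llm):
    """Stage 1: language-agnostic Roman transcript from audio-visual input.
    Stage 2: an LLM converts it into the target language's graphemes."""
    roman = av_romanizer.transcribe(video, audio)
    prompt = (f"Convert this romanized {target_lang} speech transcript into "
              f"{target_lang} script, fixing obvious romanization errors:\n"
              f"{roman}")
    return call_llm(prompt)
```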
O-RAN has brought deployment flexibility and intelligent RAN control to mobile operators through its disaggregated, modular architecture and open interfaces. However, this disaggregation introduces complexities in system integration and network management, as components are often sourced from different vendors. In addition, operators relying on open-source and virtualized components deployed on commodity hardware require additional resilience, as O-RAN deployments face the risk of failures at multiple levels: infrastructure, platform, and RAN. To address these challenges, this paper proposes FALCON, a fault prediction framework for O-RAN that leverages infrastructure-, platform-, and RAN-level telemetry to predict faults in virtualized O-RAN deployments. By aggregating and analyzing metrics from components at different levels using AI/ML models, FALCON enables proactive fault management, providing operators with actionable insights for timely preventive measures. Using a Random Forest classifier, FALCON outperforms two other classifiers on the predicted telemetry, achieving an average accuracy and F1-score above 98%.
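The classification stage maps naturally onto a standard scikit-learn pipeline, sketched below; the file name, the `fault_label` column, and the hyperparameters are our assumptions, not FALCON's actual configuration.

```python
# Sketch of a Random Forest fault classifier on multi-level telemetry.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Assumed CSV of infrastructure-, platform-, and RAN-level metrics per window.
telemetry = pd.read_csv("oran_telemetry.csv")
X = telemetry.drop(columns=["fault_label"])
y = telemetry["fault_label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```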
Uplink integrated sensing and communication (ISAC) systems have recently emerged as a promising research direction, enabling simultaneous uplink signal detection and target sensing. In this paper, we propose flexible projection (FP)-type receivers that unify projection-type and successive interference cancellation (SIC)-type receivers through a flexible tradeoff factor that adapts to dynamically changing uplink ISAC scenarios. The FP-type receivers address the joint signal detection and target response estimation problem in two coordinated phases: 1) communication signal detection using a reconstructed signal whose composition is controlled by the tradeoff factor, followed by 2) target response estimation performed by subtracting the detected communication signal from the received signal. With adjustable tradeoff factors, the FP-type receivers can balance the enhancement of the signal-to-interference-plus-noise ratio (SINR) against the reduction of correlation in the reconstructed signal for communication signal detection. The pairwise error probabilities (PEPs) are analyzed for both maximum likelihood (ML) and zero-forcing (ZF) detectors, revealing that the optimal tradeoff factor should be determined by the adopted detection algorithm and the relative power of the sensing and communication (S&C) signals. A homotopy optimization framework is first applied to the FP-type receivers with a fixed tradeoff factor. This framework is then extended to develop dynamic FP (DFP)-type receivers, which iteratively adjust the tradeoff factor for improved algorithm performance and environmental adaptability. Subsequently, two extensions are explored to further enhance the receivers' performance: parallel DFP (PDFP)-type receivers and a block-structured receiver design. Finally, the effectiveness of the proposed receiver designs is verified via simulations.
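Under one plausible reading of the tradeoff factor rho (rho = 1 recovers a projection-type receiver that removes the sensing subspace; rho = 0 leaves the received signal untouched), the two phases might be prototyped as follows. The signal model and the BPSK hard decision are our simplifications, not the paper's formulation.

```python
# Toy two-phase flexible-projection receiver (signal model assumed).
import numpy as np

def fp_receiver(y, H_c, A_s, rho):
    """y: received signal; H_c: communication channel matrix;
    A_s: basis of the sensing-signal subspace; rho: tradeoff factor."""
    P = A_s @ np.linalg.pinv(A_s)          # projector onto sensing subspace
    r = (np.eye(len(y)) - rho * P) @ y     # phase 1: reconstructed signal
    x_hat = np.sign((np.linalg.pinv(H_c) @ r).real)   # ZF + BPSK hard decision
    target = y - H_c @ x_hat               # phase 2: residual for sensing
    return x_hat, target
```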
Fully homomorphic encryption (FHE) enables statistical processing and machine learning on a cloud server while protecting data, including sensitive information collected by single-board computers (SBCs). Among FHE schemes, TFHE supports homomorphic NAND gates and, unlike other FHE schemes, can therefore evaluate arbitrary operations, such as minimum, maximum, and comparison. However, TFHE relies on Torus Learning With Errors (TLWE) encryption, which encrypts one bit at a time, resulting in less efficient encryption and larger ciphertexts than other schemes. In addition, SBCs have few hardware accelerators compared with servers, making it difficult to apply the same optimizations as on servers. In this study, we propose TFHE-SBC, a novel SBC-specific design that accelerates client-side TFHE operations and achieves communication and energy efficiency. Experimental results show that TFHE-SBC encryption is up to 2486 times faster, communication efficiency is up to 512 times higher, and energy efficiency is 12 to 2004 times greater than the state of the art.
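To see why TLWE makes ciphertexts large, note that each encrypted bit carries an n-dimensional torus mask plus one body element. Below is a toy floating-point illustration; real implementations use fixed-point integers and carefully chosen parameters, and the n and sigma here are assumptions of typical magnitude only.

```python
# Toy TLWE bit encryption over the real torus [0, 1) (illustrative only).
import numpy as np

n, sigma = 630, 2**-15             # assumed typical-order TFHE parameters
rng = np.random.default_rng()
s = rng.integers(0, 2, n)          # binary secret key

def encrypt(bit):
    mu = 0.125 if bit else -0.125          # encode the bit as mu in {-1/8, 1/8}
    a = rng.random(n)                      # uniform torus mask (n elements!)
    b = (a @ s + mu + rng.normal(0, sigma)) % 1.0
    return a, b                            # one bit -> n + 1 torus elements

def decrypt(ct):
    a, b = ct
    phase = (b - a @ s) % 1.0              # = mu + noise (mod 1)
    return phase < 0.5                     # 1/8 decodes to True, 7/8 to False

assert decrypt(encrypt(True)) and not decrypt(encrypt(False))
```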
This work revisits optimal response-adaptive designs from a type-I error rate perspective, highlighting when and by how much these allocations exacerbate type-I error rate inflation, an issue previously undocumented. We explore a range of approaches from the literature that can be applied to reduce this inflation; however, we find that none of them gives a robust solution to the problem. To address this, we derive two optimal proportions, incorporating the more robust score test (instead of the Wald test) with finite-sample estimators (instead of the unknown true values) in the formulation of the optimization problem. One proportion optimizes statistical power; the other minimizes the total number of failures in a trial while maintaining a predefined power level. Through simulations based on an early-phase and a confirmatory trial, we provide practical insight into how these new optimal-proportion designs can offer substantial patient-outcome advantages while controlling the type-I error rate. While we focus on binary outcomes, the framework offers insights that naturally extend to other outcome types, multi-armed trials, and alternative measures of interest.
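For orientation, two classical target proportions of this kind are Neyman allocation (maximizing power of the Wald test) and RSIHR allocation (minimizing expected failures at a fixed power level); the paper's new proportions replace these with score-test-based, finite-sample versions, which are not reproduced here.

```python
# Classical optimal allocation targets for two-arm binary-outcome trials
# (p_a, p_b are the arms' success probabilities; returns arm A's fraction).
import math

def neyman(p_a, p_b):
    """Power-optimal (Wald test) allocation: proportional to sqrt(p * q)."""
    sa, sb = math.sqrt(p_a * (1 - p_a)), math.sqrt(p_b * (1 - p_b))
    return sa / (sa + sb)

def rsihr(p_a, p_b):
    """Failure-minimizing allocation: proportional to sqrt(p)."""
    ra, rb = math.sqrt(p_a), math.sqrt(p_b)
    return ra / (ra + rb)

print(neyman(0.7, 0.5), rsihr(0.7, 0.5))
```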