Human digital twin (HDT) is an emerging paradigm that bridges physical twins (PTs) with powerful virtual twins (VTs) for assisting complex task executions in human-centric services. In this paper, we study a two-timescale online optimization for building HDT under an end-edge-cloud collaborative framework. As a unique feature of HDT, we consider that PTs' corresponding VTs are deployed on edge servers, consisting of not only generic models placed by downloading experiential knowledge from the cloud but also customized models updated by collecting personalized data from end devices. To maximize task execution accuracy under stringent energy and delay constraints, and taking into account HDT's inherent mobility and status variation uncertainties, we jointly and dynamically optimize VTs' construction and PTs' task offloading, along with communication and computation resource allocations. Observing that the decision variables are asynchronous with different triggers, we propose a novel two-timescale accuracy-aware online optimization approach (TACO). Specifically, TACO utilizes an improved Lyapunov method to decompose the problem into multiple per-slot subproblems, and then leverages piecewise McCormick envelopes and block coordinate descent based algorithms to address the two timescales alternately. Theoretical analyses and simulations show that the proposed approach achieves asymptotic optimality with polynomial-time complexity, and demonstrate its superiority over counterparts.
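The drift-plus-penalty machinery behind such Lyapunov decompositions is compact enough to sketch. A minimal Python toy (our own illustration, not TACO itself; the two-action set, accuracy values, and energy budget below are invented): a virtual queue tracks the backlog of long-term energy-constraint violation, and each slot greedily maximizes V*accuracy - Q*energy.

```python
import numpy as np

def virtual_queue_update(Q, energy_used, energy_budget):
    """Q(t+1) = max(Q(t) + e(t) - e_avg, 0): backlog of energy-constraint violation."""
    return max(Q + energy_used - energy_budget, 0.0)

def drift_plus_penalty_choice(Q, accuracies, energies, V):
    """Pick the per-slot action maximizing V*accuracy - Q*energy,
    trading instantaneous task accuracy against the long-term energy constraint."""
    scores = V * np.asarray(accuracies, float) - Q * np.asarray(energies, float)
    return int(np.argmax(scores))

# toy run: each slot offers a cheap local action and a costly edge action
rng = np.random.default_rng(0)
Q, V, budget = 0.0, 5.0, 1.0
for _ in range(200):
    acc = rng.uniform(0.5, 1.0, size=2)   # per-action task-execution accuracy
    eng = np.array([0.5, 2.0])            # per-action energy cost
    a = drift_plus_penalty_choice(Q, acc, eng, V)
    Q = virtual_queue_update(Q, eng[a], budget)
```

A larger V favors accuracy at the cost of a longer (but still stable) virtual queue, which is the usual O(1/V) vs. O(V) trade-off in such schemes.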
Owing to the reflection gain and double path loss featured by intelligent reflecting surface (IRS) channels, handover (HO) locations become irregular and the signal strength fluctuates sharply with variations in IRS connections during HO. As a result, the risk of HO failures (HOFs) is exacerbated, and HO parameters require reconfiguration. However, existing HO models assume only a monotonic negative-exponential path loss and therefore cannot yield sound HO parameters. This paper proposes a discrete-time model that explicitly tracks the HO process under variations in IRS connections, where IRS connections and the HO process are discretized into finite states by measurement intervals, and transitions between states are modeled as stochastic processes. Specifically, to capture signal fluctuations during HO, the IRS connection state-dependent distributions of the user-IRS distance are modified by the correlation between measurement intervals. In addition, states of the HO process are formed with Time-to-Trigger and HO margin, whose transition probabilities are integrated over all IRS connection states. Trigger location distributions and the probabilities of HO, HOF, and ping-pong (PP) are obtained by tracing user HO states. Results show that IRSs mitigate PPs by 48% but exacerbate HOFs by 90% under regular parameters. Optimal parameters are then identified that keep the probabilities of HOF and PP both below 0.1%.
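The Time-to-Trigger/HO-margin logic that the discrete-time model tracks can be sketched as a simple counter over measurement intervals. A hedged toy (our own simplification, not the paper's stochastic model; the RSRP traces and parameter values are invented): the counter increments while an A3-like condition holds and resets otherwise, so sharp IRS-induced signal fluctuations keep resetting it and delay the HO.

```python
def ho_trigger_index(serving_dbm, target_dbm, hom_db=3.0, ttt_intervals=4):
    """Walk measurement intervals; return the interval index at which the HO
    fires, or -1 if it never does. The TTT counter increments while the
    A3-like condition (target RSRP > serving RSRP + HOM) holds, else resets."""
    counter = 0
    for k, (s, t) in enumerate(zip(serving_dbm, target_dbm)):
        if t > s + hom_db:
            counter += 1
            if counter >= ttt_intervals:
                return k
        else:
            counter = 0
    return -1
```

With a smoothly ramping target the HO fires once TTT expires; with a target that dips every few intervals (as when IRS connections change) the counter never survives long enough and no HO is triggered, which is one mechanism behind late HOs and HOFs.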
Reconfigurable intelligent surface (RIS) is a novel metamaterial-based device that can form a smart radio environment by dynamically altering the reflection directions of impinging electromagnetic waves. In the prior literature, the inter-RIS links, which also contribute to the performance of the whole system, are usually neglected when multiple RISs are deployed. In this paper, we investigate a general double-RIS assisted multiple-input multiple-output (MIMO) wireless communication system under spatially correlated non-line-of-sight propagation channels, where cooperation between the double RISs is also considered. The design objective is to maximize the achievable ergodic rate based on full statistical channel state information (CSI). Specifically, we first present a closed-form asymptotic expression for the achievable ergodic rate by utilizing the replica method from statistical physics. Then, a full statistical CSI-enabled optimal design is proposed, which avoids the high pilot training overhead of instantaneous CSI-enabled designs. To further reduce the signal processing overhead and the complexity of practical realization, a common-phase scheme is proposed to design the double RISs. Simulation results show that the derived asymptotic ergodic rate is quite accurate even for small-sized antenna arrays, and that the proposed optimization algorithm achieves substantial gains at low overhead and complexity. Furthermore, the cooperative double-RIS assisted MIMO framework is shown to achieve superior ergodic rate performance and high communication reliability under harsh propagation environments.
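For readers who want a numerical baseline for the cascaded double-RIS rate, a Monte Carlo sketch of the achievable ergodic rate under i.i.d. Rayleigh NLoS segments and random reflection phases is straightforward (our own toy, not the paper's replica-method expression: it ignores spatial correlation, direct links, and phase optimization, and the 1/sqrt(N1*N2) cascade normalization is an assumption).

```python
import numpy as np

rng = np.random.default_rng(1)

def crand(m, n):
    """i.i.d. CN(0, 1) matrix modeling one Rayleigh NLoS channel segment."""
    return (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)

def ergodic_rate(n_tx, n_rx, n1, n2, snr, trials=300):
    """Monte-Carlo estimate of E[log2 det(I + (snr/n_tx) H H^H)] for the
    cascaded double-RIS link H = G3 Phi2 D Phi1 G1 with random phases."""
    rates = []
    for _ in range(trials):
        G1, D, G3 = crand(n1, n_tx), crand(n2, n1), crand(n_rx, n2)
        phi1 = np.exp(1j * rng.uniform(0, 2 * np.pi, n1))  # RIS 1 reflection phases
        phi2 = np.exp(1j * rng.uniform(0, 2 * np.pi, n2))  # RIS 2 reflection phases
        H = (G3 * phi2) @ ((D * phi1) @ G1) / np.sqrt(n1 * n2)
        gram = np.eye(n_rx) + (snr / n_tx) * H @ H.conj().T
        _, logabsdet = np.linalg.slogdet(gram)             # numerically stable log-det
        rates.append(logabsdet / np.log(2))
    return float(np.mean(rates))
```

Multiplying `G3 * phi2` broadcasts the phase vector across columns, i.e., it implements `G3 @ diag(phi2)` without forming the diagonal matrix. Against such a random-phase baseline, an optimized common-phase design would show its gain.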
We present an extensive, in-depth analysis of the eye tracking capabilities of the Meta Quest Pro virtual reality headset using a dataset of eye movement recordings collected from 78 participants. In addition to presenting classical signal quality metrics (spatial accuracy, spatial precision, and linearity) in ideal settings, we also study the impact of background luminance and headset slippage on device performance. We additionally present a user-centered analysis of eye tracking signal quality, where we highlight the potential differences in user experience as a function of device performance. This work contributes to a growing understanding of eye tracking signal quality in virtual reality headsets, where the performance of applications such as gaze-based interaction, foveated rendering, and social gaze is directly dependent on the quality of the eye tracking signal.
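The classical signal-quality metrics named above have standard, compact definitions. A minimal sketch (our own formulation under a small-angle planar approximation, not the paper's analysis pipeline): accuracy as mean angular offset from the target, precision as sample-to-sample RMS dispersion, with gaze and target samples given in degrees of visual angle.

```python
import numpy as np

def spatial_accuracy(gaze_deg, target_deg):
    """Mean angular offset (deg) between gaze samples and the true target,
    both as (N, 2) arrays of horizontal/vertical visual angle."""
    diff = np.asarray(gaze_deg, float) - np.asarray(target_deg, float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def spatial_precision_rms(gaze_deg):
    """RMS of sample-to-sample angular displacement (deg) during a fixation:
    lower values mean a steadier, less noisy signal."""
    d = np.diff(np.asarray(gaze_deg, float), axis=0)
    return float(np.sqrt(np.mean(np.sum(d ** 2, axis=1))))
```

Accuracy captures systematic bias (e.g., calibration drift after slippage), while RMS precision captures noise; the two can degrade independently, which is why both are reported.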
We present a framework for learning visually-guided quadruped locomotion by integrating exteroceptive sensing and central pattern generators (CPGs), i.e., systems of coupled oscillators, into the deep reinforcement learning (DRL) framework. Through both exteroceptive and proprioceptive sensing, the agent learns to coordinate rhythmic behavior among the different oscillators to track velocity commands, while at the same time overriding these commands to avoid collisions with the environment. We investigate several open robotics and neuroscience questions: 1) What is the role of explicit interoscillator couplings, and can such couplings improve sim-to-real transfer for navigation robustness? 2) What are the effects of using a memory-enabled vs. a memory-free policy network with respect to robustness, energy efficiency, and tracking performance in sim-to-real navigation tasks? 3) How do animals manage to tolerate high sensorimotor delays, yet still produce smooth and robust gaits? To answer these questions, we train our perceptive locomotion policies in simulation and perform sim-to-real transfers to the Unitree Go1 quadruped, where we observe robust navigation in a variety of scenarios. Our results show that the CPG, explicit interoscillator couplings, and memory-enabled policy representations are all beneficial for energy efficiency, robustness to noise and sensory delays of 90 ms, and tracking performance for successful sim-to-real transfer in navigation tasks. Video results can be found at //youtu.be/wpsbSMzIwgM.
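The coupled-oscillator core of a CPG can be sketched in a few lines. A minimal Kuramoto-style toy (our own illustration, not the paper's controller: real gait generation additionally uses amplitude dynamics and phase-offset biases in the couplings, which we omit): four leg oscillators with identical frequency and all-to-all diffusive coupling synchronize from perturbed initial phases.

```python
import numpy as np

def cpg_step(phases, freqs, K, dt=0.01):
    """One Euler step of diffusively coupled phase oscillators:
    dtheta_i/dt = 2*pi*f_i + sum_j K[i, j] * sin(theta_j - theta_i)."""
    theta = np.asarray(phases, float)
    # pairwise phase-difference matrix: entry [i, j] = theta_j - theta_i
    coupling = (np.asarray(K, float) * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    return theta + dt * (2 * np.pi * np.asarray(freqs, float) + coupling)

# four leg oscillators, perturbed initial phases, all-to-all coupling
phases = np.array([0.0, 0.2, -0.1, 0.05])
freqs = np.full(4, 1.5)                  # Hz, common intrinsic frequency
K = 1.0 * (np.ones((4, 4)) - np.eye(4))  # explicit interoscillator couplings
for _ in range(2000):
    phases = cpg_step(phases, freqs, K)
```

Replacing `sin(theta_j - theta_i)` with `sin(theta_j - theta_i - phi_ij)` locks the oscillators to fixed offsets `phi_ij`, which is how gaits such as trot or walk are encoded; here the couplings simply pull the perturbed phases back into synchrony, the property that aids robustness to noise and delays.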
Tangible interfaces in mixed reality (MR) environments allow for intuitive data interactions. Tangible cubes, with their rich interaction affordances, high maneuverability, and stable structure, are particularly well-suited for exploring multi-dimensional data types. However, the design potential of these cubes remains underexplored. This study introduces a design space for tangible cubes in MR, focusing on interaction space, visualization space, sizes, and multiplicity. Using spatio-temporal data, we explored the interaction affordances of these cubes in a workshop (N=24). We identified unique interactions like rotating, tapping, and stacking, which are linked to augmented reality (AR) visualization commands. Integrating user-identified interactions, we created a design space for tangible-cube interactions and visualization. A prototype visualizing global health spending with small cubes was developed and evaluated, supporting both individual and combined cube manipulation. This research deepens our understanding of tangible interaction in MR, offering insights for future design and application in diverse data contexts.
This paper studies a multiplayer reach-avoid differential game in the presence of general polygonal obstacles that block the players' motions. The pursuers cooperate to protect a convex region from the evaders, who try to reach the region. We propose a multiplayer onsite and close-to-goal (MOCG) pursuit strategy that can compute and achieve an increasing lower bound on the number of evaders guaranteed to be defeated. This pursuit strategy fuses the subgame outcomes of multiple pursuers against one evader with hierarchical optimal task allocation in a receding-horizon manner. To determine the qualitative subgame outcomes, i.e., which player wins the game, we construct three pursuit winning regions and strategies under which the pursuers are guaranteed to win against the evader, regardless of the unknown evader strategy. First, we utilize expanded Apollonius circles and propose the onsite pursuit winning, which achieves capture in finite time. Second, we introduce convex goal-covering polygons (GCPs) and propose the close-to-goal pursuit winning for pursuers whose visibility region contains the whole protected region, where the goal-visible property is preserved afterwards. Third, we employ Euclidean shortest paths (ESPs) and construct a pursuit winning region and strategy for the non-goal-visible pursuers, who are first steered along ESPs to positions with goal visibility. In each horizon, the hierarchical optimal task allocation maximizes the number of defeated evaders and consists of four sequential matchings: capture, enhanced, non-dominated, and closest matchings. Numerical examples are presented to illustrate the results.
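The expanded Apollonius circles used for onsite pursuit winning build on the classical Apollonius circle, which is simple to compute. A sketch of the classical (unexpanded) construction for a single pursuer-evader pair with simple motion and no obstacles (our own illustration, not the paper's expanded variant): for speed ratio gamma = v_evader / v_pursuer < 1, the circle bounds the set of points the evader can reach no later than the pursuer.

```python
import numpy as np

def apollonius_circle(pursuer, evader, gamma):
    """Center and radius of the Apollonius circle for speed ratio
    gamma = v_evader / v_pursuer < 1: the boundary of the evader's
    dominance region under straight-line motion."""
    p = np.asarray(pursuer, float)
    e = np.asarray(evader, float)
    center = (e - gamma ** 2 * p) / (1 - gamma ** 2)
    radius = gamma * np.linalg.norm(e - p) / (1 - gamma ** 2)
    return center, radius

def evader_reaches_first(x, pursuer, evader, gamma):
    """True iff the evader reaches point x strictly before the pursuer
    (both moving in straight lines at their maximum speeds)."""
    x = np.asarray(x, float)
    t_evader = np.linalg.norm(x - np.asarray(evader, float)) / gamma
    t_pursuer = np.linalg.norm(x - np.asarray(pursuer, float))
    return t_evader < t_pursuer
```

If the protected region is disjoint from this disk, the pursuer can guarantee capture before the evader scores, which is the intuition the onsite winning region generalizes to multiple pursuers and obstacles.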
Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequence modeling. Yet, BN-related issues are hardly studied for long video understanding, despite the ubiquitous use of BN in CNNs (Convolutional Neural Networks) for feature extraction. Especially in surgical workflow analysis, where the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we analyze pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. However, using BN-free backbones, even simple CNN-LSTMs beat the state of the art on three surgical workflow benchmarks by utilizing adequate end-to-end training strategies which maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. By reproducing results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code is available at: \url{//gitlab.com/nct_tso_public/pitfalls_bn}
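The batch dependence at the root of these pitfalls is easy to demonstrate. A toy NumPy sketch (our own illustration, not the paper's experiments): the same input row is normalized differently depending on which other samples share its batch, which is precisely the channel through which temporally related frames in a batch can leak information, e.g., the 'cheating' effect in anticipation.

```python
import numpy as np

def batchnorm_train(x, eps=1e-5):
    """Training-mode batch normalization over the batch axis (no learned
    affine): each feature is standardized by the *current* batch statistics."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
sample = rng.standard_normal((1, 4))    # one fixed input sample
mates_a = rng.standard_normal((7, 4))   # two different sets of batch-mates
mates_b = rng.standard_normal((7, 4))

out_a = batchnorm_train(np.vstack([sample, mates_a]))[0]
out_b = batchnorm_train(np.vstack([sample, mates_b]))[0]
# identical input row, different batch-mates -> different normalized outputs
```

At inference time BN instead uses fixed running statistics, so outputs no longer depend on batch-mates; this train/test asymmetry is another reason batch composition during end-to-end training matters.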
While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (\emph{e.g.,} social network analysis and recommender systems), computer vision (\emph{e.g.,} object detection and point cloud learning), and natural language processing (\emph{e.g.,} relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, \emph{i.e.,} 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
With the extremely rapid advances in remote sensing (RS) technology, a great quantity of Earth observation (EO) data featuring considerable and complicated heterogeneity is now readily available, offering researchers an opportunity to tackle current geoscience applications in a fresh way. With the joint utilization of EO data, much research on multimodal RS data fusion has made tremendous progress in recent years; yet these traditional algorithms inevitably meet a performance bottleneck due to their limited ability to comprehensively analyse and interpret such strongly heterogeneous data. This non-negligible limitation further arouses an intense demand for an alternative tool with powerful processing competence. Deep learning (DL), as a cutting-edge technology, has achieved remarkable breakthroughs in numerous computer vision tasks owing to its impressive ability in data representation and reconstruction. Naturally, it has been successfully applied to the field of multimodal RS data fusion, yielding great improvements over traditional methods. This survey aims to present a systematic overview of DL-based multimodal RS data fusion. More specifically, some essential knowledge about this topic is first given. Subsequently, a literature survey is conducted to analyse the trends of the field. Some prevalent sub-fields of multimodal RS data fusion are then reviewed in terms of the to-be-fused data modalities, i.e., spatiospectral, spatiotemporal, light detection and ranging-optical, synthetic aperture radar-optical, and RS-Geospatial Big Data fusion. Furthermore, we collect and summarize valuable resources to support the development of multimodal RS data fusion. Finally, the remaining challenges and potential future directions are highlighted.