In this work, we present a new dataset and a computational strategy for a digital coach that aims to guide users in practicing the protocols of self-attachment therapy. Our framework augments a rule-based conversational agent with a deep-learning classifier for identifying the underlying emotion in a user's text response, as well as a deep-learning assisted retrieval method for producing novel, fluent and empathetic utterances. We also craft a set of human-like personas that users can choose to interact with. Our goal is to achieve a high level of engagement during virtual therapy sessions. We evaluate the effectiveness of our framework in a non-clinical trial with N=16 participants, all of whom have had at least four interactions with the agent over the course of five days. We find that our platform is consistently rated higher for empathy, user engagement and usefulness than the simple rule-based framework. Finally, we provide guidelines to further improve the design and performance of the application, in accordance with the feedback received.
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity, and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a ``foreign language'' with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion-crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at //github.com/zh460045050/V2L-Tokenizer.
In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamic of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t signal length and meanwhile maintain the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of different networks, namely test on signals that are much longer than training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method will be open-sourced in //github.com/Audio-WestlakeU/NBSS.
In this study, we address the challenge of constructing continuous three-dimensional (3D) models that accurately represent uncertain surfaces, derived from noisy and incomplete LiDAR scanning data. Building upon our prior work, which utilized the Gaussian Process (GP) and Gaussian Mixture Model (GMM) for structured building models, we introduce a more generalized approach tailored for complex surfaces in urban scenes, where four-dimensional (4D) GMM Regression and GP with derivative observations are applied. A Hierarchical GMM (HGMM) is employed to optimize the number of GMM components and speed up the GMM training. With the prior map obtained from HGMM, GP inference is followed for the refinement of the final map. Our approach models the implicit surface of the geo-object and enables the inference of the regions that are not completely covered by measurements. The integration of GMM and GP yields well-calibrated uncertainty estimates alongside the surface model, enhancing both accuracy and reliability. The proposed method is evaluated on the real data collected by a mobile mapping system. Compared to the performance in mapping accuracy and uncertainty quantification of other methods such as Gaussian Process Implicit Surface map (GPIS) and log-Gaussian Process Implicit Surface map (Log-GPIS), the proposed method achieves lower RMSEs, higher log-likelihood values and fewer computational costs for the evaluated datasets.
In this work, we study the classical problem of verifying programs with respect to formal specifications given in the linear temporal logic (LTL). LTL is a rich and expressive logic that can specify important properties of programs. This includes, but is not limited to, termination, safety, liveness, progress, recurrence and reach-avoid properties. We first present novel sound and complete witnesses for LTL verification over imperative programs. Our witnesses are applicable to both universal (all runs) and existential (some run) settings. We then consider polynomial arithmetic programs, i.e. programs in which every assignment and guard consists only of polynomial expressions, with specifications as LTL formulas in which atomic propositions are polynomial constraints. For this setting, we provide an efficient algorithm to automatically synthesize such LTL witnesses. Our synthesis procedure is both sound and semi-complete, i.e. complete for any fixed polynomial degree in the templates. In other words, we provide the first template-based approach for polynomial programs that can handle any LTL formula as its specification. Our approach has termination guarantees with sub-exponential time complexity and generalizes and unifies previous methods for termination, safety and reachability, since they are expressible in LTL. Finally, we present experimental results demonstrating the effectiveness of our approach and that it can handle programs which were beyond the reach of previous state-of-the-art tools.
In this paper, we investigate the millimeter-wave (mmWave) near-field beam training problem to find the correct beam direction. In order to address the high complexity and low identification accuracy of existing beam training techniques, we propose an efficient hashing multi-arm beam (HMB) training scheme for the near-field scenario. Specifically, we first design a set of sparse bases based on the polar domain sparsity of the near-field channel. Then, the random hash functions are chosen to construct the near-field multi-arm beam training codebook. Each multi-arm beam codeword is scanned in a time slot until all the predefined codewords are traversed. Finally, the soft decision and voting methods are applied to distinguish the signal from different base stations and obtain correctly aligned beams. Simulation results show that our proposed near-field HMB training method can reduce the beam training overhead to the logarithmic level, and achieve 96.4% identification accuracy of exhaustive beam training. Moreover, we also verify applicability under the far-field scenario.
This work addresses the task of modeling spatiotemporal traffic patterns directly from overhead imagery, which we refer to as image-driven traffic modeling. We extend this line of work and introduce a multi-modal, multi-task transformer-based segmentation architecture that can be used to create dense city-scale traffic models. Our approach includes a geo-temporal positional encoding module for integrating geo-temporal context and a probabilistic objective function for estimating traffic speeds that naturally models temporal variations. We evaluate our method extensively using the Dynamic Traffic Speeds (DTS) benchmark dataset and significantly improve the state-of-the-art. Finally, we introduce the DTS++ dataset to support mobility-related location adaptation experiments.
In this work, we address the design of tracking controllers that drive a mechanical system's state asymptotically towards a reference trajectory. Motivated by aerospace and robotics applications, we consider fully-actuated systems evolving on the broad class of homogeneous spaces (encompassing all vector spaces, Lie groups, and spheres of any dimension). In this setting, the transitive action of a Lie group on the configuration manifold enables an intrinsic description of the tracking error as an element of the state space, even in the absence of a group structure on the configuration manifold itself (e.g., for $\mathbb{S}^2$). Such an error state facilitates the design of a generalized control policy depending smoothly on state and time that drives this geometric tracking error to a designated origin from almost every initial condition, thereby guaranteeing almost global convergence to the reference trajectory. Moreover, the proposed controller simplifies naturally when specialized to a Lie group or the $n$-sphere. In summary, we propose a unified, intrinsic controller guaranteeing almost global asymptotic trajectory tracking for fully-actuated mechanical systems evolving on a broader class of manifolds. We apply the method to an axisymmetric satellite and an omnidirectional aerial robot.
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at //github.com/KaiyangZhou/CoOp.
In this paper, we propose a deep reinforcement learning framework called GCOMB to learn algorithms that can solve combinatorial problems over large graphs. GCOMB mimics the greedy algorithm in the original problem and incrementally constructs a solution. The proposed framework utilizes Graph Convolutional Network (GCN) to generate node embeddings that predicts the potential nodes in the solution set from the entire node set. These embeddings enable an efficient training process to learn the greedy policy via Q-learning. Through extensive evaluation on several real and synthetic datasets containing up to a million nodes, we establish that GCOMB is up to 41% better than the state of the art, up to seven times faster than the greedy algorithm, robust and scalable to large dynamic networks.
Learning with limited data is a key challenge for visual recognition. Few-shot learning methods address this challenge by learning an instance embedding function from seen classes and apply the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned optimally discriminative with respect to the unseen classes, where discerning among them is the target task. In this paper, we propose a novel approach to adapt the embedding model to the target classification task, yielding embeddings that are task-specific and are discriminative. To this end, we employ a type of self-attention mechanism called Transformer to transform the embeddings from task-agnostic to task-specific by focusing on relating instances from the test instances to the training instances in both seen and unseen classes. Our approach also extends to both transductive and generalized few-shot classification, two important settings that have essential use cases. We verify the effectiveness of our model on two standard benchmark few-shot classification datasets --- MiniImageNet and CUB, where our approach demonstrates state-of-the-art empirical performance.