Robotic devices hold promise for aiding patients in orthopedic rehabilitation. However, current robotic-assisted physiotherapy methods struggle including biomechanical metrics in their control algorithms, crucial for safe and effective therapy. This paper introduces BATON, a Biomechanics-Aware Trajectory Optimization approach to robotic Navigation of human musculoskeletal loads. The method integrates a high-fidelity musculoskeletal model of the human shoulder into real-time control of robot-patient interaction during rotator cuff tendon rehabilitation. We extract skeletal dynamics and tendon loading information from an OpenSim shoulder model to solve an optimal control problem, generating strain-minimizing trajectories. Trajectories were realized on a healthy subject by an impedance-controlled robot while estimating the state of the subject's shoulder. Target poses were prescribed to design personalized rehabilitation across a wide range of shoulder motion avoiding high-strain areas. BATON was designed with real-time capabilities, enabling continuous trajectory replanning to address unforeseen variations in tendon strain, such as those from changing muscle activation of the subject.
3D transesophageal echocardiography (3DTEE), is the recommended method for diagnosing mitral regurgitation (MR). 3DTEE provides a high-quality 3D image of the mitral valve (MV), allowing for precise segmentation and measurement of the regurgitant valve anatomy. However, manual TEE segmentations are time-consuming and prone to intra-operator variability, affecting the reliability of the measurements. To address this, we developed a fully automated pipeline using a 3D convolutional neural network (CNN) to segment MV substructures (annulus, anterior leaflet, and posterior leaflet) and quantify MV anatomy. The 3D CNN, based on a multi-decoder residual U-Net architecture, was trained and tested on a dataset comprising 100 3DTEE images with corresponding segmentations. Within the pipeline, a custom algorithm refines the CNN-based segmentations and extracts MV models, from which anatomical landmarks and features are quantified. The accuracy of the proposed method was assessed using Dice score and mean surface distance (MSD) against ground truth segmentations, and the extracted anatomical parameters were compared against a semiautomated commercial software TomTec Image Arena. The trained 3D CNN achieved an average Dice score of 0.79 and MSD of 0.47 mm for the combined segmentation of the annulus, anterior and posterior leaflet. The proposed CNN architecture outperformed a baseline residual U-Net architecture in MV substructure segmentation, and the refinement of the predicted annulus segmentation improved MSD by 8.36%. The annular and leaflet linear measurements differed by less than 7.94 mm and 3.67 mm, respectively, compared to the 3D measurements obtained with TomTec Image Arena. The proposed pipeline was faster than the commercial software, with a modeling time of 12.54 s and a quantification time of 54.42 s.
Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
In analog neuromorphic chips, designers can embed computing primitives in the intrinsic physical properties of devices and circuits, heavily reducing device count and energy consumption, and enabling high parallelism, because all devices are computing simultaneously. Neural network parameters can be stored in local analog non-volatile memories (NVMs), saving the energy required to move data between memory and logic. However, the main drawback of analog sub-threshold electronic circuits is their dramatic temperature sensitivity. In this paper, we demonstrate that a temperature compensation mechanism can be devised to solve this problem. We have designed and fabricated a chip implementing a two-layer analog neural network trained to classify low-resolution images of handwritten digits with a low-cost single-poly complementary metal-oxide-semiconductor (CMOS) process, using unconventional analog NVMs for weight storage. We demonstrate a temperature-resilient analog neuromorphic chip for image recognition operating between 10$^{\circ}$C and 60$^{\circ}$C without loss of classification accuracy, within 2\% of the corresponding software-based neural network in the whole temperature range.
With the rising imaging resolution of handheld devices, existing multi-exposure image fusion algorithms struggle to generate a high dynamic range image with ultra-high resolution in real-time. Apart from that, there is a trend to design a manageable and editable algorithm as the different needs of real application scenarios. To tackle these issues, we introduce 3D LUT technology, which can enhance images with ultra-high-definition (UHD) resolution in real time on resource-constrained devices. However, since the fusion of information from multiple images with different exposure rates is uncertain, and this uncertainty significantly trials the generalization power of the 3D LUT grid. To address this issue and ensure a robust learning space for the model, we propose using a teacher-student network to model the uncertainty on the 3D LUT grid.Furthermore, we provide an editable mode for the multi-exposure image fusion algorithm by using the implicit representation function to match the requirements in different scenarios. Extensive experiments demonstrate that our proposed method is highly competitive in efficiency and accuracy.
Accurate human activity and trajectory prediction are crucial for ensuring safe and reliable human-robot interactions in dynamic environments, such as industrial settings, with mobile robots. Datasets with fine-grained action labels for moving people in industrial environments with mobile robots are scarce, as most existing datasets focus on social navigation in public spaces. This paper introduces the TH\"OR-MAGNI Act dataset, a substantial extension of the TH\"OR-MAGNI dataset, which captures participant movements alongside robots in diverse semantic and spatial contexts. TH\"OR-MAGNI Act provides 8.3 hours of manually labeled participant actions derived from egocentric videos recorded via eye-tracking glasses. These actions, aligned with the provided TH\"OR-MAGNI motion cues, follow a long-tailed distribution with diversified acceleration, velocity, and navigation distance profiles. We demonstrate the utility of TH\"OR-MAGNI Act for two tasks: action-conditioned trajectory prediction and joint action and trajectory prediction. We propose two efficient transformer-based models that outperform the baselines to address these tasks. These results underscore the potential of TH\"OR-MAGNI Act to develop predictive models for enhanced human-robot interaction in complex environments.
Adversarial attacks pose significant challenges in 3D object recognition, especially in scenarios involving multi-view analysis where objects can be observed from varying angles. This paper introduces View-Invariant Adversarial Perturbations (VIAP), a novel method for crafting robust adversarial examples that remain effective across multiple viewpoints. Unlike traditional methods, VIAP enables targeted attacks capable of manipulating recognition systems to classify objects as specific, pre-determined labels, all while using a single universal perturbation. Leveraging a dataset of 1,210 images across 121 diverse rendered 3D objects, we demonstrate the effectiveness of VIAP in both targeted and untargeted settings. Our untargeted perturbations successfully generate a singular adversarial noise robust to 3D transformations, while targeted attacks achieve exceptional results, with top-1 accuracies exceeding 95% across various epsilon values. These findings highlight VIAPs potential for real-world applications, such as testing the robustness of 3D recognition systems. The proposed method sets a new benchmark for view-invariant adversarial robustness, advancing the field of adversarial machine learning for 3D object recognition.
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
Object tracking is challenging as target objects often undergo drastic appearance changes over time. Recently, adaptive correlation filters have been successfully applied to object tracking. However, tracking algorithms relying on highly adaptive correlation filters are prone to drift due to noisy updates. Moreover, as these algorithms do not maintain long-term memory of target appearance, they cannot recover from tracking failures caused by heavy occlusion or target disappearance in the camera view. In this paper, we propose to learn multiple adaptive correlation filters with both long-term and short-term memory of target appearance for robust object tracking. First, we learn a kernelized correlation filter with an aggressive learning rate for locating target objects precisely. We take into account the appropriate size of surrounding context and the feature representations. Second, we learn a correlation filter over a feature pyramid centered at the estimated target position for predicting scale changes. Third, we learn a complementary correlation filter with a conservative learning rate to maintain long-term memory of target appearance. We use the output responses of this long-term filter to determine if tracking failure occurs. In the case of tracking failures, we apply an incrementally learned detector to recover the target position in a sliding window fashion. Extensive experimental results on large-scale benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods in terms of efficiency, accuracy, and robustness.
In this paper, we propose the joint learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on the use of either model exist (e.g., for the task of image captioning), training such existing network architectures typically require pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that the prediction error would not propagate and thus affect the performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interests with varying sizes without the prior knowledge of particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.