The ability to rapidly test a variety of algorithms for an arbitrary state estimation task is valuable in the prototyping phase of navigation systems. Lie group theory is now mainstream in the robotics community, and hence estimation prototyping tools should allow state definitions that belong to manifolds. A new package, called navlie, provides a framework that allows a user to model a large class of problems by implementing a set of classes complying with a generic interface. Once accomplished, navlie provides a variety of on-manifold estimation algorithms that can run directly on these classes. The package also provides a built-in library of common models, as well as many useful utilities. The open-source project can be found at //github.com/decargroup/navlie.
We investigate the problem of multimodal search of target modality, where the task involves enhancing a query in a specific target modality by integrating information from auxiliary modalities. The goal is to retrieve relevant objects whose contents in the target modality match the specified multimodal query. The paper first introduces two baseline approaches that integrate techniques from the Database, Information Retrieval, and Computer Vision communities. These baselines either merge the results of separate vector searches for each modality or perform a single-channel vector search by fusing all modalities. However, both baselines have limitations in terms of efficiency and accuracy as they fail to adequately consider the varying importance of fusing information across modalities. To overcome these limitations, the paper proposes a novel framework, called MUST. Our framework employs a hybrid fusion mechanism, combining different modalities at multiple stages. Notably, we leverage vector weight learning to determine the importance of each modality, thereby enhancing the accuracy of joint similarity measurement. Additionally, the proposed framework utilizes a fused proximity graph index, enabling efficient joint search for multimodal queries. MUST offers several other advantageous properties, including pluggable design to integrate any advanced embedding techniques, user flexibility to customize weight preferences, and modularized index construction. Extensive experiments on real-world datasets demonstrate the superiority of MUST over the baselines in terms of both search accuracy and efficiency. Our framework achieves over 10x faster search times while attaining an average of 93% higher accuracy. Furthermore, MUST exhibits scalability to datasets containing more than 10 million data elements.
Semi-supervised learning has attracted much attention due to its less dependence on acquiring abundant annotations from experts compared to fully supervised methods, which is especially important for medical image segmentation which typically requires intensive pixel/voxel-wise labeling by domain experts. Although semi-supervised methods can improve the performance by utilizing unlabeled data, there are still gaps between fully supervised methods under extremely limited annotation scenarios. In this paper, we propose a simple yet efficient strategy to explore the usage of the Segment Anything Model (SAM) for enhancing semi-supervised medical image segmentation. Concretely, the segmentation model trained with domain knowledge provides information for localization and generating input prompts to the SAM. Then the generated pseudo-labels of SAM are utilized as additional supervision to assist in the learning procedure of the semi-supervised framework. Experimental results demonstrate that SAM's assistance significantly enhances the performance of existing semi-supervised frameworks, especially when only one or a few labeled images are available.
To assess the quality of a probabilistic prediction for stochastic dynamical systems (SDSs), scoring rules assign a numerical score based on the predictive distribution and the measured state. In this paper, we propose an $\epsilon$-logarithm score that generalizes the celebrated logarithm score by considering a neighborhood with radius $\epsilon$. To begin with, we prove that the $\epsilon$-logarithm score is proper (the expected score is optimized when the predictive distribution meets the ground truth) based on discrete approximations. Then, we characterize the probabilistic predictability of an SDS by the optimal expected score and approximate it with an error of scale $\mathcal{O}(\epsilon)$. The approximation quantitatively shows how the system predictability is jointly determined by the neighborhood radius, the differential entropies of process noises, and the system dimension. In addition to the expected score, we also analyze the asymptotic behaviors of the score on individual trajectories. Specifically, we prove that the score on a trajectory will converge to the probabilistic predictability when the process noises are independent and identically distributed. Moreover, the convergence speed against the trajectory length $T$ is of scale $\mathcal{O}(T^{-\frac{1}{2}})$ in the sense of probability. Finally, we apply the predictability analysis to design unpredictable SDSs. Numerical examples are given to elaborate the results.
We present IntrinsicAvatar, a novel approach to recovering the intrinsic properties of clothed human avatars including geometry, albedo, material, and environment lighting from only monocular videos. Recent advancements in human-based neural rendering have enabled high-quality geometry and appearance reconstruction of clothed humans from just monocular videos. However, these methods bake intrinsic properties such as albedo, material, and environment lighting into a single entangled neural representation. On the other hand, only a handful of works tackle the problem of estimating geometry and disentangled appearance properties of clothed humans from monocular videos. They usually achieve limited quality and disentanglement due to approximations of secondary shading effects via learned MLPs. In this work, we propose to model secondary shading effects explicitly via Monte-Carlo ray tracing. We model the rendering process of clothed humans as a volumetric scattering process, and combine ray tracing with body articulation. Our approach can recover high-quality geometry, albedo, material, and lighting properties of clothed humans from a single monocular video, without requiring supervised pre-training using ground truth materials. Furthermore, since we explicitly model the volumetric scattering process and ray tracing, our model naturally generalizes to novel poses, enabling animation of the reconstructed avatar in novel lighting conditions.
The guesswork refers to the distribution of the minimum number of trials needed to guess a realization of a random variable accurately. In this study, a non-trivial generalization of the guesswork called guessing cost (also referred to as cost of guessing) is introduced, and an optimal strategy for finding the $\rho$-th moment of guessing cost is provided for a random variable defined on a finite set whereby each choice is associated with a positive finite cost value (unit cost corresponds to the original guesswork). Moreover, we drive asymptotically tight upper and lower bounds on the logarithm of guessing cost moments. Similar to previous studies on the guesswork, established bounds on the moments of guessing cost quantify the accumulated cost of guesses required for correctly identifying the unknown choice and are expressed in terms of R\'enyi's entropy. Moreover, new random variables are introduced to establish connections between the guessing cost and the guesswork, leading to induced strategies. Establishing this implicit connection helped us obtain improved bounds for the non-asymptotic region. As a consequence, we establish the guessing cost exponent in terms of R\'enyi entropy rate on the moments of the guessing cost using the optimal strategy by considering a sequence of independent random variables with different cost distributions. Finally, with slight modifications to the original problem, these results are shown to be applicable for bounding the overall repair bandwidth for distributed data storage systems backed up by base stations and protected by bipartite graph codes.
In fairness audits, a standard objective is to detect whether a given algorithm performs substantially differently between subgroups. Properly powering the statistical analysis of such audits is crucial for obtaining informative fairness assessments, as it ensures a high probability of detecting unfairness when it exists. However, limited guidance is available on the amount of data necessary for a fairness audit, lacking directly applicable results concerning commonly used fairness metrics. Additionally, the consideration of unequal subgroup sample sizes is also missing. In this tutorial, we address these issues by providing guidance on how to determine the required subgroup sample sizes to maximize the statistical power of hypothesis tests for detecting unfairness. Our findings are applicable to audits of binary classification models and multiple fairness metrics derived as summaries of the confusion matrix. Furthermore, we discuss other aspects of audit study designs that can increase the reliability of audit results.
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
Estimating the parameters of ordinary differential equations (ODEs) is of fundamental importance in many scientific applications. While ODEs are typically approximated with deterministic algorithms, new research on probabilistic solvers indicates that they produce more reliable parameter estimates by better accounting for numerical errors. However, many ODE systems are highly sensitive to their parameter values. This produces deep local maxima in the likelihood function -- a problem which existing probabilistic solvers have yet to resolve. Here we present a novel probabilistic ODE likelihood approximation, DALTON, which can dramatically reduce parameter sensitivity by learning from noisy ODE measurements in a data-adaptive manner. Our approximation scales linearly in both ODE variables and time discretization points, and is applicable to ODEs with both partially-unobserved components and non-Gaussian measurement models. Several examples demonstrate that DALTON produces more accurate parameter estimates via numerical optimization than existing probabilistic ODE solvers, and even in some cases than the exact ODE likelihood itself.
Deep learning-based algorithms have seen a massive popularity in different areas of remote sensing image analysis over the past decade. Recently, transformers-based architectures, originally introduced in natural language processing, have pervaded computer vision field where the self-attention mechanism has been utilized as a replacement to the popular convolution operator for capturing long-range dependencies. Inspired by recent advances in computer vision, remote sensing community has also witnessed an increased exploration of vision transformers for a diverse set of tasks. Although a number of surveys have focused on transformers in computer vision in general, to the best of our knowledge we are the first to present a systematic review of recent advances based on transformers in remote sensing. Our survey covers more than 60 recent transformers-based methods for different remote sensing problems in sub-areas of remote sensing: very high-resolution (VHR), hyperspectral (HSI) and synthetic aperture radar (SAR) imagery. We conclude the survey by discussing different challenges and open issues of transformers in remote sensing. Additionally, we intend to frequently update and maintain the latest transformers in remote sensing papers with their respective code at: //github.com/VIROBO-15/Transformer-in-Remote-Sensing
The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches, and eventually to increase the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.