Cross-view geolocalization, a supplement or replacement for GPS, localizes an agent within a search area by matching ground-view images to overhead images. Significant progress has been made assuming a panoramic ground camera. Panoramic cameras' high complexity and cost make non-panoramic cameras more widely applicable, but also more challenging since they yield less scene overlap between ground and overhead images. This paper presents Restricted FOV Wide-Area Geolocalization (ReWAG), a cross-view geolocalization approach that combines a neural network and particle filter to globally localize a mobile agent with only odometry and a non-panoramic camera. ReWAG creates pose-aware embeddings and provides a strategy to incorporate particle pose into the Siamese network, improving localization accuracy by a factor of 100 compared to a vision transformer baseline. This extended work also presents ReWAG*, which improves upon ReWAG's generalization ability in previously unseen environments. ReWAG* repeatedly converges accurately on a dataset of images we have collected in Boston with a 72 degree field of view (FOV) camera, a location and FOV that ReWAG* was not trained on.
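The particle-filter side of such a pipeline can be sketched as follows. This is a minimal illustration, not ReWAG itself: the `similarity` callable is a hypothetical stand-in for the score a Siamese network would assign between a ground image and the overhead patch at a particle's pose, and the 2D state, noise level, and function names are assumptions.

```python
import math
import random


def particle_filter_step(particles, odom, similarity, motion_noise, rng):
    """One predict/update/resample cycle over 2D position particles."""
    # Predict: apply the odometry increment plus Gaussian noise to each particle.
    moved = [
        (x + odom[0] + rng.gauss(0.0, motion_noise),
         y + odom[1] + rng.gauss(0.0, motion_noise))
        for x, y in particles
    ]
    # Update: weight each particle by its (hypothetical) similarity score.
    weights = [similarity(p) for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # fall back to uniform weights
    # Resample: draw particles in proportion to their weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def run_demo(seed=0, steps=10, n_particles=500):
    """Localize an agent moving right by 1 unit per step in a 20 x 20 area."""
    rng = random.Random(seed)
    true_pos = (2.0, 3.0)
    particles = [(rng.uniform(0, 20), rng.uniform(0, 20)) for _ in range(n_particles)]
    for _ in range(steps):
        true_pos = (true_pos[0] + 1.0, true_pos[1])
        # Assumed stand-in score: Gaussian falloff with distance to the true pose.
        score = lambda p, t=true_pos: math.exp(
            -((p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2) / 2.0
        )
        particles = particle_filter_step(particles, (1.0, 0.0), score, 0.1, rng)
    estimate = (
        sum(p[0] for p in particles) / len(particles),
        sum(p[1] for p in particles) / len(particles),
    )
    return estimate, true_pos
```

Calling `run_demo()` returns the mean particle position and the true position; in a real system the score would come from comparing image embeddings rather than from the ground-truth pose.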
We introduce a novel modeling approach for time series imputation and forecasting, tailored to address the challenges often encountered in real-world data, such as irregular samples, missing data, or unaligned measurements from multiple sensors. Our method relies on a continuous-time-dependent model of the series' evolution dynamics. It leverages adaptations of conditional, implicit neural representations for sequential data. A modulation mechanism, driven by a meta-learning algorithm, allows adaptation to unseen samples and extrapolation beyond observed time-windows for long-term predictions. The model provides a highly flexible and unified framework for imputation and forecasting tasks across a wide range of challenging scenarios. It achieves state-of-the-art performance on classical benchmarks and outperforms alternative time-continuous models.
Local fields, and fields complete with respect to a discrete valuation, are essential objects in commutative algebra, with applications to number theory and algebraic geometry. We formalize in Lean the basic theory of discretely valued fields. In particular, we prove that the unit ball with respect to a discrete valuation on a field is a discrete valuation ring and, conversely, that the adic valuation on the field of fractions of a discrete valuation ring is discrete. We define finite extensions of valuations and of discrete valuation rings, and prove some global-to-local results. Building on this general theory, we formalize the abstract definition and some fundamental properties of local fields. As an application, we show that finite extensions of the field $\mathbb{Q}_p$ of $p$-adic numbers and of the field $\mathbb{F}_p(\!(X)\!)$ of Laurent series over $\mathbb{F}_p$ are local fields.
Non-orthogonal multiple access (NOMA) is a promising transmission scheme employed at the physical layer to improve spectral efficiency. In this paper, we develop a novel cross-layer approach by employing NOMA at the physical layer and instantly decodable network coding (IDNC) at the network layer in downlink cellular networks. Following this approach, two IDNC packets are selected for each transmission, with one designed for all receivers and the other designed only for the strong receivers that can employ successive interference cancellation (SIC). The IDNC packet selection, the transmission rate adaptation for the two IDNC packets, and the NOMA power allocation are jointly considered to improve the throughput of the network. Given the intractability of the problem, we decouple it into two subproblems: the IDNC scheduling, which jointly selects the IDNC packets and the transmission rates for a given NOMA power allocation, and the NOMA power allocation for a given IDNC scheduling. The IDNC scheduling can be reduced to a maximum weight clique problem, and two heuristic algorithms, maximum weight vertex (MWV) search and maximum weight path based maximum weight vertex (MWP-MWV) search, are developed to solve the first subproblem. An iterative function evaluation (IFE) approach is proposed to solve the second subproblem. Simulation results are presented to demonstrate the throughput gain of the proposed approach over existing solutions.
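To make the maximum weight clique formulation concrete, here is a minimal sketch of a generic greedy heuristic that repeatedly picks the heaviest vertex still compatible with the clique built so far. It is not the paper's MWV or MWP-MWV search, whose weight definitions are not given in the abstract; the graph, the weights, and the function name are illustrative assumptions.

```python
def greedy_max_weight_clique(weights, adjacency):
    """Build a clique greedily by always adding the heaviest compatible vertex.

    weights:   dict mapping vertex -> non-negative weight
    adjacency: dict mapping vertex -> set of neighbouring vertices
    """
    candidates = set(weights)
    clique = []
    while candidates:
        # Heaviest remaining candidate; ties are broken by vertex label.
        v = max(candidates, key=lambda u: (weights[u], u))
        clique.append(v)
        # Only neighbours of v can still extend the clique.
        candidates = (candidates & adjacency[v]) - {v}
    return clique


# Hypothetical toy graph: triangle a-b-c plus vertex d attached to a.
weights = {"a": 3, "b": 2, "c": 2, "d": 5}
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
clique = greedy_max_weight_clique(weights, adjacency)
total_weight = sum(weights[v] for v in clique)
```

On this toy graph the heuristic selects `d` and then `a`, for a total weight of 8. Being greedy, it carries no optimality guarantee in general.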
Text-To-Image (TTI) models, exemplified by DALL-E and StableDiffusion, have recently gained prominence for their remarkable zero-shot capabilities in generating images guided by textual prompts. Language, as a conduit of culture, plays a pivotal role in these models' multilingual capabilities, which in turn shape their cultural agency. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. We propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual-Question-Answer (VQA) model, and human assessments, to discern TTI cultural perceptions. To facilitate our research, we introduce the CulText2I dataset, derived from four diverse TTI models and spanning ten languages. Our experiments reveal insights into these models' cultural awareness, cultural distinctions, and the unlocking of cultural features, releasing the potential for cross-cultural applications.
We consider relational semantics (R-models) for the Lambek calculus extended with intersection and explicit constants for zero and unit. For its variant without constants and with a restriction that disallows empty antecedents, Andreka and Mikulas (1994) prove strong completeness. We show that strong completeness fails without this restriction but, on the other hand, prove weak completeness for a non-standard interpretation of constants. For the standard interpretation, even weak completeness fails. The weak completeness result extends to an infinitary setting, for so-called iterative divisions (Kleene star under division). We also prove strong completeness results for product-free fragments.
The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the base of the logarithm. This concentration property restricts the ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on the Boas-Bellman inequality in inner product spaces and involves establishing approximate orthogonality of the discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating a decreasing success rate in predicting the parity bit as the group order increases.
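The learning target can be made concrete with a small sketch that tabulates the parity bit of the discrete logarithm in a cyclic group of prime order. The choice of group (the order-11 subgroup of the integers modulo 23 generated by 2) and the function name are illustrative assumptions, not taken from the paper.

```python
def parity_bit_dataset(p, q, g):
    """Map each element h = g^x mod p (0 <= x < q) to the parity bit of x.

    Assumes q is prime and g generates the subgroup of order q modulo p.
    """
    if g % p == 1 or pow(g, q, p) != 1:
        raise ValueError("g must be a non-identity element of order q modulo p")
    # Each element of the subgroup appears once, labeled with x mod 2.
    return {pow(g, x, p): x & 1 for x in range(q)}


# Hypothetical toy group: 2 has order 11 modulo 23 (2**11 = 2048 = 89 * 23 + 1).
data = parity_bit_dataset(p=23, q=11, g=2)
```

Here `data[8]` is 1 because 8 = 2^3 has an odd exponent, while `data[4]` is 0. A learner in this setting would see pairs (h, parity bit) and have to predict the bit for unseen h.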
The uniform one-dimensional fragment of first-order logic was introduced a few years ago as a generalization of the two-variable fragment of first-order logic to contexts involving relations of arity greater than two. Quantifiers in this logic are used in blocks, each block consisting only of existential quantifiers or only of universal quantifiers. In this paper we consider the possibility of mixing quantifiers in blocks. We identify a non-trivial variation of the logic with mixed blocks of quantifiers which retains some good properties of the two-variable fragment and of the uniform one-dimensional fragment: it has the finite (exponential) model property and hence a decidable, NExpTime-complete satisfiability problem.
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e., the black-box setting). A variety of methods have been proposed in the literature for this task, but they have serious shortcomings such as a lack of realistic outputs and strong requirements for the data set and accessibility of the face recognition model. By analyzing the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity both qualitatively and quantitatively, and our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process.
Learning-based multi-view stereo (MVS) methods rely heavily on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., a Transformer. Albeit useful, these techniques introduce heavy computation overheads for MVS, since each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point attends only to the corresponding pair of epipolar lines. Our idea takes inspiration from classic epipolar geometry, which shows that one point with different depth hypotheses will be projected to the epipolar line on the other view. This constraint reduces the 2D search space to the epipolar line in stereo matching. Similarly, this suggests that matching in MVS amounts to distinguishing a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmarks with high efficiency. Code is available at //github.com/TQTQliu/ET-MVSNet.
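The epipolar property invoked above can be checked numerically with a minimal sketch. It assumes identity intrinsics and identity relative rotation, so that a point in view-1 coordinates maps to view 2 by adding a translation `t`; the function names and the numeric values are illustrative and unrelated to the ET-MVSNet code.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def epipolar_line(pixel, t):
    """Epipolar line in view 2 for a view-1 pixel: l = t x x_h (here E = [t]_x)."""
    return cross(t, (pixel[0], pixel[1], 1.0))


def project_to_view2(pixel, depth, t):
    """Back-project a view-1 pixel at the given depth and project it into view 2."""
    point = (depth * pixel[0] + t[0], depth * pixel[1] + t[1], depth + t[2])
    return (point[0] / point[2], point[1] / point[2], 1.0)


# Assumed toy configuration: one pixel, one translation, several depth hypotheses.
t = (0.5, 0.1, 0.2)
pixel = (0.3, -0.2)
line = epipolar_line(pixel, t)
residuals = [abs(dot(line, project_to_view2(pixel, d, t))) for d in (1.0, 2.0, 5.0, 10.0)]
```

Every entry of `residuals` is zero up to floating-point error, meaning that all depth hypotheses for the same pixel land on one line in the other view.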
A mobile manipulator often finds itself in an application where it needs to take a close-up view before performing a manipulation task. Naming this a coupled active perception and manipulation (CAPM) problem, we model the uncertainty in the perception process and devise a key state/task planning approach that considers reachability conditions as task constraints of both perception and manipulation tasks for the mobile platform. By minimizing the expected energy usage in the body key state planning while satisfying task constraints, our algorithm achieves the best balance between the task success rate and energy usage. We have implemented the algorithm and tested it in both simulation and physical experiments. The results confirm that our algorithm has lower energy consumption than a two-stage decoupled approach, while still maintaining a success rate of 100\% for the task.