Out-of-distribution (OOD) detection discerns OOD data where the predictor cannot make valid predictions as in-distribution (ID) data, thereby increasing the reliability of open-world classification. However, it is typically hard to collect real out-of-distribution (OOD) data for training a predictor capable of discerning ID and OOD patterns. This obstacle gives rise to data generation-based learning methods, synthesizing OOD data via data generators for predictor training without requiring any real OOD data. Related methods typically pre-train a generator on ID data and adopt various selection procedures to find those data likely to be the OOD cases. However, generated data may still coincide with ID semantics, i.e., mistaken OOD generation remains, confusing the predictor between ID and OOD data. To this end, we suggest that generated data (with mistaken OOD generation) can be used to devise an auxiliary OOD detection task to facilitate real OOD detection. Specifically, we can ensure that learning from such an auxiliary task is beneficial if the ID and the OOD parts have disjoint supports, with the help of a well-designed training procedure for the predictor. Accordingly, we propose a powerful data generation-based learning method named Auxiliary Task-based OOD Learning (ATOL) that can relieve the mistaken OOD generation. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts.
Text-based person search aims to simultaneously localize and identify the target person based on query text from uncropped scene images, which can be regarded as the unified task of person detection and text-based person retrieval task. In this work, we propose a large-scale benchmark dataset named PRW-TPS-CN based on the widely used person search dataset PRW. Our dataset contains 47,102 sentences, which means there is quite more information than existing dataset. These texts precisely describe the person images from top to bottom, which in line with the natural description order. We also provide both Chinese and English descriptions in our dataset for more comprehensive evaluation. These characteristics make our dataset more applicable. To alleviate the inconsistency between person detection and text-based person retrieval, we take advantage of the rich texts in PRW-TPS-CN dataset. We propose to aggregate multiple texts as text prototypes to maintain the prominent text features of a person, which can better reflect the whole character of a person. The overall prototypes lead to generating the image attention map to eliminate the detection misalignment causing the decrease of text-based person retrieval. Thus, the inconsistency between person detection and text-based person retrieval is largely alleviated. We conduct extensive experiments on the PRW-TPS-CN dataset. The experimental results show the PRW-TPS-CN dataset's effectiveness and the state-of-the-art performance of our approach.
We explore some connections between association schemes and the analyses of the semidefinite programming (SDP) based convex relaxations of combinatorial optimization problems in the Lov\'{a}sz--Schrijver lift-and-project hierarchy. Our analysis of the relaxations of the stable set polytope leads to bounds on the clique and stability numbers of some regular graphs reminiscent of classical bounds by Delsarte and Hoffman, as well as the notion of deeply vertex-transitive graphs -- highly symmetric graphs that we show arise naturally from some association schemes. We also study relaxations of the hypergraph matching problem, and determine exactly or provide bounds on the lift-and-project ranks of these relaxations. Our proofs for these results also inspire the study of the general hypermatching pseudo-scheme, which is an association scheme except it is generally non-commutative. We then illustrate the usefulness of obtaining commutative subschemes from non-commutative pseudo-schemes via contraction in this context.
A posteriori reduced-order models, e.g. proper orthogonal decomposition, are essential to affordably tackle realistic parametric problems. They rely on a trustful training set, that is a family of full-order solutions (snapshots) representative of all possible outcomes of the parametric problem. Having such a rich collection of snapshots is not, in many cases, computationally viable. A strategy for data augmentation, designed for parametric laminar incompressible flows, is proposed to enrich poorly populated training sets. The goal is to include in the new, artificial snapshots emerging features, not present in the original basis, that do enhance the quality of the reduced-order solution. The methodologies devised are based on exploiting basic physical principles, such as mass and momentum conservation, to devise physically-relevant, artificial snapshots at a fraction of the cost of additional full-order solutions. Interestingly, the numerical results show that the ideas exploiting only mass conservation (i.e., incompressibility) are not producing significant added value with respect to the standard linear combinations of snapshots. Conversely, accounting for the linearized momentum balance via the Oseen equation does improve the quality of the resulting approximation and therefore is an effective data augmentation strategy in the framework of viscous incompressible laminar flows.
Robotic systems are complex cyber-physical systems (CPS) commonly equipped with multiple sensors and effectors. Recent simulation methods enable the Digital Twin (DT) concept realisation. However, DT employment in robotic system development, e.g. in-development testing, is unclear. During the system development, its parts evolve from simulated mockups to physical parts which run software deployed on the actual hardware. Therefore, a design tool and a flexible development procedure ensuring the integrity of the simulated and physical parts are required. We aim to maximise the integration between a CPS's simulated and physical parts in various setups. The better integration, the better simulation-based testing coverage of the physical part (hardware and software). We propose a Domain Specification Language (DSL) based on Systems Modeling Language (SysML) that we refer to as SPSysML (Simulation-Physical System Modeling Language). SPSysML defines the taxonomy of a Simulation-Physical System (SPSys), being a CPS consisting of at least a physical or simulated part. In particular, the simulated ones can be DTs. We propose a SPSys Development Procedure (SPSysDP) that enables the maximisation of the simulation-physical integrity of SPSys by evaluating the proposed factors. SPSysDP is used to develop a complex robotic system for the INCARE project. In subsequent iterations of SPSysDP, the simulation-physical integrity of the system is maximised. As a result, the system model consists of fewer components, and a greater fraction of the system components are shared between various system setups. We implement and test the system with popular frameworks, Robot Operating System (ROS) and Gazebo simulator. SPSysML with SPSysDP enables the design of SPSys (including DT and CPS), multi-setup system development featuring maximised integrity between simulation and physical parts in its setups.
Most categorical models for dependent types have traditionally been heavily set based: contexts form a category, and for each we have a set of types in said context -- and for each type a set of terms of said type. This is the case for categories with families, categories with attributes, and natural models; in particular, all of them can be traced back to certain discrete Grothendieck fibrations. We extend this intuition to the case of general, non necessarily discrete, fibrations, so that over a given context one has not only a set but a category of types. We argue that the added structure can be attributed to a notion of subtyping that shares many features with that of coercive subtyping, in the sense that it is the product of thinking about subtyping as an abbreviation mechanism: we say that a given type $A'$ is a subtype of $A$ if there is a unique coercion from $A'$ to $A$. Whenever we need a term of type $A$, then, it suffices to have a term of type $A'$, which we can `plug-in' into $A$. For this version of subtyping we provide rules, coherences, and explicit models, and we compare and contrast it to coercive subtyping as introduced by Z. Luo and others. We conclude by suggesting how the tools we present can be employed in finding appropriate rules relating subtyping and certain type constructors.
Deep neural networks (DNNs) often fail silently with over-confident predictions on out-of-distribution (OOD) samples, posing risks in real-world deployments. Existing techniques predominantly emphasize either the feature representation space or the gradient norms computed with respect to DNN parameters, yet they overlook the intricate gradient distribution and the topology of classification regions. To address this gap, we introduce GRadient-aware Out-Of-Distribution detection in interpolated manifolds (GROOD), a novel framework that relies on the discriminative power of gradient space to distinguish between in-distribution (ID) and OOD samples. To build this space, GROOD relies on class prototypes together with a prototype that specifically captures OOD characteristics. Uniquely, our approach incorporates a targeted mix-up operation at an early intermediate layer of the DNN to refine the separation of gradient spaces between ID and OOD samples. We quantify OOD detection efficacy using the distance to the nearest neighbor gradients derived from the training set, yielding a robust OOD score. Experimental evaluations substantiate that the introduction of targeted input mix-upamplifies the separation between ID and OOD in the gradient space, yielding impressive results across diverse datasets. Notably, when benchmarked against ImageNet-1k, GROOD surpasses the established robustness of state-of-the-art baselines. Through this work, we establish the utility of leveraging gradient spaces and class prototypes for enhanced OOD detection for DNN in image classification.
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing. However, fully extracting the target mask with limited user inputs remains challenging. We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement to enhance segmentation quality with fewer user inputs. Regarding the last segmentation result as the initial mask, an iterative refinement process is commonly employed to continually enhance the initial mask. Nevertheless, conventional techniques suffer from sensitivity to the variance in the initial mask. To circumvent this problem, our proposed method incorporates a mask matching algorithm for ensuring consistent inferences from different types of initial masks. We also introduce a target-aware zooming algorithm to preserve object information during downsampling, balancing efficiency and accuracy. Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
We explore the concept of separating systems of vertex sets of graphs. A separating system of a set $X$ is a collection of subsets of $X$ such that for any pair of distinct elements in $X$, there exists a set in the separating system that contains exactly one of the two elements. A separating system of the vertex set of a graph $G$ is called a vertex-separating path (tree) system of $G$ if the elements of the separating system are paths (trees) in the graph $G$. In this paper, we focus on the size of the smallest vertex-separating path (tree) system for different types of graphs, including trees, grids, and maximal outerplanar graphs.
Most object recognition approaches predominantly focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotations and therefore is labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions into the traditional framework. We show the recognition backbone can be substantially enhanced for more robust representation learning, without any cost of extra annotation and inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among the instances in the same category. We then design a spatial context learning module for modeling the internal structures of the object, through predicting the relative positions within the extent. These two modules can be easily plugged into any backbone networks during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: //github.com/JDAI-CV/LIO.
Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question and answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of explanations synthesized by our method. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.