Out-of-distribution (OOD) detection empowers the model trained on the closed image set to identify unknown data in the open world. Though many prior techniques have yielded considerable improvements in this research direction, two crucial obstacles still remain. Firstly, a unified perspective has yet to be presented to view the developed arts with individual designs, which is vital for providing insights into future work. Secondly, we expect sufficient natural OOD supervision to promote the generation of compact boundaries between the in-distribution (ID) and OOD data without collecting explicit OOD samples. To tackle these issues, we propose a general probabilistic framework to interpret many existing methods and an OOD-data-free model, namely \textbf{S}elf-supervised \textbf{S}ampling for \textbf{O}OD \textbf{D}etection (SSOD). SSOD efficiently exploits natural OOD signals from the ID data based on the local property of convolution. With these supervisions, it jointly optimizes the OOD detection and conventional ID classification in an end-to-end manner. Extensive experiments reveal that SSOD establishes competitive state-of-the-art performance on many large-scale benchmarks, outperforming the best previous method by a large margin, \eg, reporting \textbf{-6.28\%} FPR95 and \textbf{+0.77\%} AUROC on ImageNet, \textbf{-19.01\%} FPR95 and \textbf{+3.04\%} AUROC on CIFAR-10, and top-ranked performance on hard OOD datasets, \ie, ImageNet-O and OpenImage-O.
Out-of-distribution detection is a crucial technique for deploying machine learning models in the real world to handle the unseen scenarios. In this paper, we propose a simple but effective Neural Activation Prior (NAP) for out-of-distribution detection (OOD). Our neural activation prior is based on a key observation that, for a channel before the global pooling layer of a fully trained neural network, the probability of a few of its neurons being activated with a larger response by an in-distribution (ID) sample is significantly higher than that by an OOD sample. An intuitive explanation is each channel in a model fully trained on ID dataset would play a role in detecting a certain pattern in the samples within the ID dataset, and a few neurons can be activated with a large response when the pattern is detected in an input sample. Thus, a new scoring function based on this prior is proposed to highlight the role of these strongly activated neurons in OOD detection. This approach is plug-and-play and does not lead to any performance degradation on in-distribution data classification and requires no extra training or statistics from training or external datasets. Notice that previous methods primarily rely on post-global-pooling features of the neural networks, while the within-channel distribution information we leverage would be discarded by the global pooling operator. Consequently, our method is orthogonal to existing approaches and can be effectively combined with them in various applications. Experimental results show that our method achieves the state-of-the-art performance on CIFAR-10, CIFAR-100 and ImageNet datasets, which demonstrates the power of the proposed prior.
The anchor-document data derived from web graphs offers a wealth of paired information for training dense retrieval models in an unsupervised manner. However, the presence of inherent noise invariably compromises the robustness of training dense retrieval models, consequently hurting the performance. In this paper, we introduce WebDRO, an efficient approach for clustering the web graph data and optimizing group weights to enhance the robustness of the pretraining process of dense retrieval models on web graphs. Initially, we build an embedding model for clustering anchor-document pairs. Specifically, we contrastively train the embedding model for link prediction, which guides the embedding model in capturing the inherent document features behind the web graph links. Subsequently, we employ the group distributional robust optimization to recalibrate the weights across different clusters of anchor-document pairs during training dense retrieval models, directing the model to assign higher weights to clusters with higher loss and focus more on worst-case scenarios. Our experiments conducted on MS MARCO and BEIR demonstrate that our method can effectively improve retrieval performance in unsupervised training settings. Further analysis confirms the stability and validity of group weights learned by WebDRO. All codes will be released via GitHub.
Two-dimensional digital image correlation (2D-DIC) is a widely used optical technique to measure displacement and strain during asphalt concrete (AC) testing. An accurate 2-D DIC measurement can only be achieved when the camera's principal axis is perpendicular to the planar specimen surface. However, this requirement may not be met during testing due to device constraints. This paper proposes a simple and reliable method to correct errors induced by non-perpendicularity. The method is based on image feature matching and rectification. No additional equipment is needed. A theoretical error analysis was conducted to quantify the effect of a non-perpendicular camera alignment on measurement accuracy. The proposed method was validated numerically using synthetic images and experimentally in an AC fracture test. It achieved relatively high accuracy, even under considerable camera rotation angle and large deformation. As a pre-processing technique, the proposed method showed promising performance in assisting the recently developed CrackPropNet for automated crack propagation measurement under a non-perpendicular camera alignment.
Monocular 3D detection (M3D) aims for precise 3D object localization from a single-view image which usually involves labor-intensive annotation of 3D detection boxes. Weakly supervised M3D has recently been studied to obviate the 3D annotation process by leveraging many existing 2D annotations, but it often requires extra training data such as LiDAR point clouds or multi-view images which greatly degrades its applicability and usability in various applications. We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D with a single-view image exclusively without any 3D annotations or other training data. One key design in SKD-WM3D is a self-knowledge distillation framework, which transforms image features into 3D-like representations by fusing depth information and effectively mitigates the inherent depth ambiguity in monocular scenarios with little computational overhead in inference. In addition, we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy which facilitate knowledge acquisition and knowledge transfer, respectively. Extensive experiments show that SKD-WM3D surpasses the state-of-the-art clearly and is even on par with many fully supervised methods.
The properties of the generalized Waring distribution defined on the non negative integers are reviewed. Formulas for its moments and its mode are given. A construction as a mixture of negative binomial distributions is also presented. Then we turn to the Petersen model for estimating the population size $N$ in a two-way capture recapture experiment. We construct a Bayesian model for $N$ by combining a Waring prior with the hypergeometric distribution for the number of units caught twice in the experiment. Credible intervals for $N$ are obtained using quantiles of the posterior, a generalized Waring distribution. The standard confidence interval for the population size constructed using the asymptotic variance of Petersen estimator and .5 logit transformed interval are shown to be special cases of the generalized Waring credible interval. The true coverage of this interval is shown to be bigger than or equal to its nominal converage in small populations, regardless of the capture probabilities. In addition, its length is substantially smaller than that of the .5 logit transformed interval. Thus the proposed generalized Waring credible interval appears to be the best way to quantify the uncertainty of the Petersen estimator for populations size.
With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR consists of two steps. The first step involves ranking instruction pairs using a scoring model that is well aligned with expert preferences (achieving an accuracy of 84.25%). The second step involves preserving dataset diversity through a clustering process.In our experiment, CaR selected a subset containing only 1.96% of Alpaca's IT data, yet the underlying AlpaCaR model trained on this subset outperforms Alpaca by an average of 32.1% in GPT-4 evaluations. Furthermore, our method utilizes small models (355M parameters) and requires only 11.2% of the monetary cost compared to existing methods, making it easily deployable in industrial scenarios.
Simulation-based inference (SBI) is constantly in search of more expressive algorithms for accurately inferring the parameters of complex models from noisy data. We present consistency models for neural posterior estimation (CMPE), a new free-form conditional sampler for scalable, fast, and amortized SBI with generative neural networks. CMPE combines the advantages of normalizing flows and flow matching methods into a single generative architecture: It essentially distills a continuous probability flow and enables rapid few-shot inference with an unconstrained architecture that can be tailored to the structure of the estimation problem. Our empirical evaluation demonstrates that CMPE not only outperforms current state-of-the-art algorithms on three hard low-dimensional problems but also achieves competitive performance in a high-dimensional Bayesian denoising experiment and in estimating a computationally demanding multi-scale model of tumor spheroid growth.
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogue, and visual perception models are used to autonomously navigate neighborhood streets. In response to these developments, new paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning. These paradigms leverage the existence of ever-larger datasets curated for multimodal, multitask, and generalist interaction. Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems that can interact effectively across a diverse range of applications such as dialogue, autonomous driving, healthcare, education, and robotics. In this manuscript, we examine the scope of foundation models for decision making, and provide conceptual tools and technical background for understanding the problem space and exploring new research directions. We review recent approaches that ground foundation models in practical decision making applications through a variety of methods such as prompting, conditional generative modeling, planning, optimal control, and reinforcement learning, and discuss common challenges and open problems in the field.
The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this effect, our paper proposes a novel attention based feature embedding that captures both entity and relation features in any given entity's neighborhood. Additionally, we also encapsulate relation clusters and multihop relations in our model. Our empirical study offers insights into the efficacy of our attention based model and we show marked performance gains in comparison to state of the art methods on all datasets.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.