For an infinite class of finite graphs of unbounded size, we define a limit object, called the wide limit, relative to a computationally restricted class of functions. The properties of the wide limit then reflect how a computationally restricted viewer "sees" a generic instance from the class. The construction uses arithmetic forcing with random variables [10]. We prove sufficient conditions for universal and existential sentences to be valid in the limit, provide several examples, and prove that such a limit object can be expanded to a model of weak arithmetic. We then take the wide limit of all finite pointed paths to obtain a model of arithmetic in which the problem OntoWeakPigeon is total but Leaf (the complete problem for $\textbf{PPA}$) is not. In our setting, this logical separation of the oracle classes of total NP search problems implies that Leaf is not reducible to OntoWeakPigeon even if some errors are allowed in the reductions.
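For orientation, the two search problems can be stated schematically (our paraphrase; the paper's precise formalization in the language of arithmetic differs):
\[
\textsc{Leaf}:\ \text{given a graph of maximum degree } 2 \text{ and a vertex of degree } 1,\ \text{find another vertex of degree } 1;
\]
\[
\textsc{WeakPigeon}:\ \text{given } f\colon [2n] \to [n],\ \text{find } x \neq y \text{ with } f(x) = f(y).
\]
OntoWeakPigeon additionally equips $f$ with data certifying surjectivity; we omit its exact witness conditions here.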
We propose a data-driven approach to explicitly learn the progressive encoding of a continuous source that is successively decoded with increasing levels of quality and with the aid of correlated side information. This setup corresponds to the successive refinement of the Wyner-Ziv coding problem. Assuming ideal Slepian-Wolf coding, our approach employs recurrent neural networks (RNNs) to learn layered encoders and decoders for the quadratic Gaussian case. The models are trained by minimizing a variational bound on the rate-distortion function of the successively refined Wyner-Ziv coding problem. We demonstrate that the RNNs can explicitly retrieve layered binning solutions akin to scalable nested quantization. Moreover, the rate-distortion performance of the scheme is on par with that of the corresponding monolithic Wyner-Ziv coding approach and is close to the rate-distortion bound.
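As a toy illustration of the layered setup (a hypothetical architecture; the paper's RNN design, binning mechanism, and variational rate term are not reproduced here), a PyTorch sketch in which each refinement layer emits one extra code symbol and the decoder combines it with the side information:

```python
# Minimal sketch of layered (successive-refinement) encoding with decoder
# side information. Architecture and loss weights are illustrative; the
# crude quadratic rate proxy stands in for the paper's variational bound.
import torch
import torch.nn as nn

class LayeredWZ(nn.Module):
    def __init__(self, dim=16, layers=3):
        super().__init__()
        self.dim, self.layers = dim, layers
        self.enc = nn.GRUCell(1, dim)      # encoder RNN over refinement layers
        self.to_code = nn.Linear(dim, 1)   # one code symbol per layer
        self.dec = nn.GRUCell(2, dim)      # decoder sees code + side info
        self.to_xhat = nn.Linear(dim, 1)

    def forward(self, x, y):
        he = x.new_zeros(x.size(0), self.dim)
        hd = x.new_zeros(x.size(0), self.dim)
        loss = 0.0
        for _ in range(self.layers):
            he = self.enc(x, he)
            z = torch.tanh(self.to_code(he))          # soft "bin" in [-1, 1]
            hd = self.dec(torch.cat([z, y], dim=1), hd)
            x_hat = self.to_xhat(hd)
            # distortion + a crude rate proxy per layer
            loss = loss + ((x - x_hat) ** 2).mean() + 0.01 * (z ** 2).mean()
        return loss

# Quadratic Gaussian toy: y is a noisy view of x, available only at the decoder.
x = torch.randn(64, 1)
y = x + 0.5 * torch.randn(64, 1)
LayeredWZ()(x, y).backward()
```

Each pass through the loop adds one refinement layer, so truncating the loop early corresponds to decoding at a coarser quality level.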
We extend classical methods of computational complexity to the setting of distributed computing, where they are sometimes more effective than in their original context. Our focus is on distributed decision in the LOCAL model, where multiple networked computers communicate via synchronous message-passing to collectively answer a question about their network topology. Rather unusually, we impose two orthogonal constraints on the running time of this model: the number of communication rounds is bounded by a constant, and the number of computation steps of each computer is polynomially bounded by the size of its local input and the messages it receives. By letting two players take turns assigning certificates to all computers in the network, we obtain a generalization of the polynomial hierarchy (and hence of the complexity classes $\mathbf{P}$ and $\mathbf{NP}$). We then extend some key results of complexity theory to this setting, in particular the Cook-Levin theorem (which identifies Boolean satisfiability as a complete problem for $\mathbf{NP}$), and Fagin's theorem (which characterizes $\mathbf{NP}$ as the problems expressible in existential second-order logic). The original results can be recovered as the special case where the network consists of a single computer. But perhaps more surprisingly, the task of separating complexity classes becomes easier in the general case: we can show that our hierarchy is infinite, while it remains notoriously open whether the same is true in the case of a single computer. (By contrast, a collapse of our hierarchy would have implied a collapse of the polynomial hierarchy.) As an application, we propose quantifier alternation as a new approach to measuring the locality of problems in distributed computing.
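Schematically (our notation, suppressing the details of the encoding), a language $L$ sits at level $\Sigma_k$ of this hierarchy if
\[
G \in L \iff \exists c_1\, \forall c_2\, \cdots\, Q_k c_k\ :\ \text{every node of } G \text{ accepts } (c_1, \dots, c_k),
\]
where each $c_i$ assigns a certificate to every node, and acceptance is decided within a constant number of communication rounds and polynomially many local computation steps.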
Pre-trained generative models such as BART and T5 have gained prominence as a preferred method for text generation in various natural language processing tasks, including abstractive long-form question answering (QA) and summarization. However, the potential of generative models in extractive QA tasks, where discriminative models are commonly employed, remains largely unexplored. Discriminative models often encounter challenges associated with label sparsity, particularly when only a small portion of the context contains the answer; the challenge is even more pronounced for multi-span answers. In this work, we introduce a novel approach that harnesses pre-trained generative models for extractive QA tasks by generating the indexes of the context tokens or sentences that form part of the answer. Through comprehensive evaluations on multiple extractive QA datasets, including MultiSpanQA, BioASQ, MASHQA, and WikiQA, we demonstrate the superior performance of our proposed approach compared to existing state-of-the-art models.
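Concretely, the index-generation idea can be sketched as follows, using Hugging Face Transformers with BART (the input/target formats here are illustrative, not the paper's exact scheme):

```python
# Sketch: cast extractive QA as index generation with a seq2seq model.
# Context sentences are numbered in the input, and the target is the list
# of sentence indexes that form the (possibly multi-span) answer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

question = "Which organs are affected?"
sentences = ["The liver is affected.", "Symptoms vary.", "The kidneys are affected."]
source = question + " " + " ".join(f"[{i}] {s}" for i, s in enumerate(sentences))
target = "0 2"  # illustrative gold labels: indexes of the answer sentences

batch = tok(source, return_tensors="pt")
labels = tok(text_target=target, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss  # fine-tune by minimizing this loss
```

Because the model emits short index strings rather than long spans, the sparse-label problem of token-level discriminative tagging is sidestepped.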
We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, or boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully fine-tuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: //github.com/microsoft/SoM.
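A minimal sketch of the marking step (the mask source and prompt wording are illustrative; the released SoM code handles mark placement and styles more carefully):

```python
# Overlay numeric marks on segmentation masks, then send the marked image
# plus a question to an LMM. Masks are assumed to come from SEEM/SAM.
import numpy as np
import cv2

def overlay_marks(image, masks):
    """Draw each mask's index near its center of mass."""
    marked = image.copy()
    for i, mask in enumerate(masks):  # masks: list of HxW boolean arrays
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())
        cv2.putText(marked, str(i + 1), (cx, cy),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    return marked

# marked = overlay_marks(image, masks_from_seem_or_sam)
# Then prompt the LMM with the marked image, e.g.:
# "Which mark is on the object the person is holding? Answer with its number."
```

The model can then ground its answer by naming a mark, which maps back to a pixel-accurate region.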
Data depth is an efficient tool for robustly summarizing the distribution of functional data and detecting potential magnitude and shape outliers. Commonly used functional data depth notions, such as the modified band depth and extremal depth, are computed from the pointwise depths of each observed curve. However, these techniques reduce the pointwise depths to a single depth value per curve, which may not be sufficient to characterize the distribution of the functional data or to detect potential outliers. This paper presents an innovative approach that makes fuller use of the pointwise depths. We propose using the pointwise depth distribution for magnitude outlier visualization and the correlation between pairwise pointwise depths for shape outlier detection. Furthermore, a bootstrap-based testing procedure for this correlation is introduced to test whether any shape outlier is present. The proposed univariate methods are then extended to bivariate functional data. The performance of the proposed methods is examined and compared with conventional outlier detection techniques through intensive simulation studies. In addition, the developed methods are applied to simulated solar energy datasets from a photovoltaic system. The results reveal that the proposed method offers superior detection performance over conventional techniques. These findings will benefit engineers and practitioners in monitoring photovoltaic systems by detecting previously unnoticed anomalies and outliers.
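The two ingredients can be illustrated with a simple rank-based pointwise depth (our choice for the sketch; the paper's depth notion and test statistic may differ):

```python
# Pointwise depth per time point, then: (a) the depth distribution of a curve
# flags magnitude outliers; (b) correlation between depth trajectories flags
# shape outliers. Illustrative only.
import numpy as np

def pointwise_depth(curves):
    """curves: (n_curves, n_timepoints). One-dimensional Tukey depth per t."""
    n = curves.shape[0]
    ranks = curves.argsort(axis=0).argsort(axis=0)  # 0..n-1 within each column
    below = (ranks + 1) / n                         # empirical P(X <= x)
    return np.minimum(below, 1 - ranks / n)         # min(P(X<=x), P(X>=x))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
X[0] += 3.0                                         # plant a magnitude outlier
D = pointwise_depth(X)
# (a) The whole depth distribution of curve 0 sits low.
print(D[0].mean(), D[1:].mean())
# (b) Correlation of a curve's depths against the others' average.
print(np.corrcoef(D[0], D[1:].mean(axis=0))[0, 1])
```

Retaining the full vector of pointwise depths, rather than one scalar per curve, is what enables both the visualization and the bootstrap test described above.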
We develop a new class of spatial voting models for binary preference data that can accommodate both monotonic and non-monotonic response functions and are more flexible than the alternative "unfolding" models previously introduced in the literature. We then use these models to estimate the revealed preferences of legislators in the U.S. House of Representatives and justices on the U.S. Supreme Court. The results from these applications indicate that the new models provide superior complexity-adjusted performance over various alternatives, and that the additional flexibility yields preference estimates that more closely match the perceived ideological positions of legislators and justices.
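To fix intuition, contrast a textbook monotonic response with a squared-distance "unfolding" response (illustrative forms only; the models proposed here are more flexible than both):
\[
\Pr(y_{ij} = 1) = \Phi(\alpha_j + \beta_j \theta_i)
\qquad \text{vs.} \qquad
\Pr(y_{ij} = 1) = \Phi\!\big(\alpha_j - (\theta_i - \gamma_j)^2\big),
\]
where $\theta_i$ is voter $i$'s ideal point. The first form is monotonic in $\theta_i$, while the second peaks near the item location $\gamma_j$, so voters far to either side of the item tend to vote against it.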
Modern detection transformers (DETRs) use a set of object queries to predict a list of bounding boxes, sort them by their classification confidence scores, and select the top-ranked predictions as the final detection results for the given input image. A highly performant object detector requires accurate ranking of the bounding box predictions. In DETR-based detectors, the top-ranked bounding boxes suffer from less accurate localization due to the misalignment between classification scores and localization accuracy, impeding the construction of high-quality detectors. In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs, collectively called Rank-DETR. Our key contributions include: (i) a rank-oriented architecture design that promotes positive predictions and suppresses negative ones to ensure lower false positive rates, and (ii) a rank-oriented loss function and matching cost design that prioritizes predictions with more accurate localization during ranking, boosting AP under high IoU thresholds. We apply our method to improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong COCO object detection results with different backbones such as ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our approach. Code is available at \url{//github.com/LeapLabTHU/Rank-DETR}.
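As a sketch of the rank-oriented idea, one can form an IoU-aware quality score so that well-localized boxes are prioritized during matching and ranking (an illustrative cost in the spirit of the approach, not Rank-DETR's exact formulation):

```python
# Illustrative rank-oriented matching cost: weight the classification term
# by localization quality so that boxes with high IoU rank ahead of boxes
# with high class score but poor localization.
import torch

def rank_oriented_cost(cls_prob, iou, alpha=0.25):
    """cls_prob, iou: (num_queries,) tensors in [0, 1]. Lower cost = better."""
    quality = cls_prob ** alpha * iou ** (1 - alpha)  # IoU-aware score
    return 1.0 - quality

cls_prob = torch.tensor([0.9, 0.8, 0.6])
iou = torch.tensor([0.5, 0.95, 0.9])
print(rank_oriented_cost(cls_prob, iou))  # the well-localized boxes win
```

Under such a cost, the confident but poorly localized first box no longer dominates the ranking, which is exactly the misalignment the abstract describes.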
Sequential recommendation is an emerging topic that has attracted increasing attention due to its practical significance. Models based on deep learning and attention mechanisms have achieved good performance in sequential recommendation. Recently, generative models based on the Variational Autoencoder (VAE) have shown unique advantages in collaborative filtering. In particular, the sequential VAE, a recurrent version of the VAE, can effectively capture temporal dependencies among the items in a user's sequence and perform sequential recommendation. However, VAE-based models suffer from a common limitation: the representational ability of the learned approximate posterior distribution is limited, resulting in lower-quality generated samples. This is especially true when generating sequences. To address this problem, we propose a novel method called Adversarial and Contrastive Variational Autoencoder (ACVAE) for sequential recommendation. Specifically, we first introduce adversarial training for sequence generation under the Adversarial Variational Bayes (AVB) framework, which enables our model to generate high-quality latent variables. We then employ a contrastive loss, whose minimization allows the latent variables to capture more personalized and salient characteristics. In addition, when encoding the sequence, we apply a recurrent and convolutional structure to capture both global and local relationships in the sequence. Finally, we conduct extensive experiments on four real-world datasets. The experimental results show that our proposed ACVAE model outperforms other state-of-the-art methods.
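The contrastive component can be sketched as an InfoNCE-style loss on the latent variables (our simplification; the adversarial AVB component and ACVAE's exact objective are not reproduced here):

```python
# Pull two latent "views" of the same user's sequence together and push
# other users' latents apart, encouraging personalized, salient latents.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.1):
    """z1, z2: (batch, dim) latents for two views of the same user sequences."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau               # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(32, 64), torch.randn(32, 64)
print(contrastive_loss(z1, z2))
```

Minimizing this loss makes each user's latent distinguishable from the rest of the batch, which is the "personalized and salient characteristics" effect described above.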
Most object recognition approaches predominantly focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotation and is therefore labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) by incorporating self-supervision into the traditional framework. We show that the recognition backbone can be substantially enhanced for more robust representation learning, at no cost in extra annotation or inference speed. Specifically, we first propose an object-extent learning module that localizes the object according to the visual patterns shared among instances of the same category. We then design a spatial context learning module that models the internal structure of the object by predicting relative positions within the extent. These two modules can be easily plugged into any backbone network during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm generalizes well to other tasks such as object detection and segmentation (MS COCO). Project page: //github.com/JDAI-CV/LIO.
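The spatial context idea can be caricatured as a self-supervised relative-position task on feature-map cells (layer sizes and sampling here are hypothetical; LIO's actual modules are more elaborate):

```python
# Toy auxiliary head: from pairs of feature-map cells, predict their relative
# (dy, dx) offset. The loss is used only during training and dropped at
# inference, matching the plug-in/detach property described above.
import torch
import torch.nn as nn

class RelPosHead(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * channels, channels),
                                 nn.ReLU(), nn.Linear(channels, 2))

    def forward(self, feat):                 # feat: (B, C, H, W) backbone features
        b, c, h, w = feat.shape
        f = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        i = torch.randint(h * w, (b,))       # sample a random cell pair per image
        j = torch.randint(h * w, (b,))
        pred = self.mlp(torch.cat([f[torch.arange(b), i],
                                   f[torch.arange(b), j]], dim=1))
        target = torch.stack([(i // w - j // w).float(),
                              (i % w - j % w).float()], dim=1)
        return ((pred - target) ** 2).mean()

loss = RelPosHead(16)(torch.randn(4, 16, 7, 7))
```

Solving this task forces the backbone features to encode where each part sits inside the object, without any extra annotation.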
Image segmentation is an important component of many image understanding systems. It aims to group pixels in a spatially and perceptually coherent manner. Typically, these algorithms have a collection of parameters that control the degree of over-segmentation produced. Properly selecting such parameters to match human-like perceptual grouping remains a challenge. In this work, we exploit the diversity of segments produced by different choices of parameters: we scan the segmentation parameter space and generate a collection of image segmentation hypotheses (from highly over-segmented to under-segmented). These are fed into a cost minimization framework that produces the final segmentation by selecting segments that (1) better describe the natural contours of the image and (2) are more stable and persistent across the segmentation hypotheses. We compare our algorithm's performance with state-of-the-art algorithms, showing that we achieve improved results. We also show that our framework is robust to the choice of segmentation kernel that produces the initial set of hypotheses.
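The hypothesis-generation step can be sketched with an off-the-shelf segmenter whose scale parameter is scanned (the subsequent cost-minimization selection over segments is the paper's contribution and is not shown):

```python
# Scan a segmentation parameter to collect hypotheses ranging from highly
# over-segmented to under-segmented, using Felzenszwalb's method as the
# (interchangeable) segmentation kernel.
from skimage import data, segmentation

image = data.astronaut()
scales = (10, 50, 100, 300, 1000)
hypotheses = [segmentation.felzenszwalb(image, scale=s) for s in scales]
for s, labels in zip(scales, hypotheses):
    print(f"scale={s}: {labels.max() + 1} segments")  # fewer segments as scale grows
```

Segments that persist across many of these label maps are exactly the "stable" candidates the cost minimization framework is designed to prefer.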