In a world filled with data, it is expected for a nation to take decisions informed by data. However, countries need to first collect and publish such data in a way meaningful for both citizens and policy makers. A good thematic classification could be instrumental in helping users navigate and find the right resources on a rich data repository as the one collected by Colombia's National Administrative Department of Statistics (DANE). The Visual Analytics Framework is a methodology for conducting visual analysis developed by T. Munzner et al. [T. Munzner, Visualization Analysis and Design, A K Peters Visualization Series, 1, 2014] that could help with this task. This paper presents a case study applying such framework conducted to help the DANE better visualize their data repository, and present a more understandable classification of it. It describes three main analysis tasks identified, the proposed solutions and the collection of insights generated from them.
Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a new, more practical approach enabling simultaneous counting of multiple object categories using an open vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging point prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions and heralding a new era in object counting technology.
Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors is difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train a RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice-versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
The current state-of-the-art in artificial intelligence is impressive, especially in terms of mastery of language, but not so much in terms of mathematical reasoning. What could be missing? Can we learn something useful about that gap from how the brains of mathematicians go about their craft? This essay builds on the idea that current deep learning mostly succeeds at system 1 abilities -- which correspond to our intuition and habitual behaviors -- but still lacks something important regarding system 2 abilities -- which include reasoning and robust uncertainty estimation. It takes an information-theoretical posture to ask questions about what constitutes an interesting mathematical statement, which could guide future work in crafting an AI mathematician. The focus is not on proving a given theorem but on discovering new and interesting conjectures. The central hypothesis is that a desirable body of theorems better summarizes the set of all provable statements, for example by having a small description length while at the same time being close (in terms of number of derivation steps) to many provable statements.
We often rely on censuses of triangulations to guide our intuition in $3$-manifold topology. However, this can lead to misplaced faith in conjectures if the smallest counterexamples are too large to appear in our census. Since the number of triangulations increases super-exponentially with size, there is no way to expand a census beyond relatively small triangulations; the current census only goes up to $10$ tetrahedra. Here, we show that it is feasible to search for large and hard-to-find counterexamples by using heuristics to selectively (rather than exhaustively) enumerate triangulations. We use this idea to find counterexamples to three conjectures which ask, for certain $3$-manifolds, whether one-vertex triangulations always have a "distinctive" edge that would allow us to recognise the $3$-manifold.
Large Language Models (LLM) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system is still under-explored. To solve this problem, we propose ByteComposer, an agent framework emulating a human's creative pipeline in four separate steps : "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". This framework seamlessly blends the interactive and knowledge-understanding features of LLMs with existing symbolic music generation models, thereby achieving a melody composition agent comparable to human creators. We conduct extensive experiments on GPT4 and several open-source large language models, which substantiate our framework's effectiveness. Furthermore, professional music composers were engaged in multi-dimensional evaluations, the final results demonstrated that across various facets of music composition, ByteComposer agent attains the level of a novice melody composer.
Graph Convolution Networks (GCNs) manifest great potential in recommendation. This is attributed to their capability on learning good user and item embeddings by exploiting the collaborative signals from the high-order neighbors. Like other GCN models, the GCN based recommendation models also suffer from the notorious over-smoothing problem - when stacking more layers, node embeddings become more similar and eventually indistinguishable, resulted in performance degradation. The recently proposed LightGCN and LR-GCN alleviate this problem to some extent, however, we argue that they overlook an important factor for the over-smoothing problem in recommendation, that is, high-order neighboring users with no common interests of a user can be also involved in the user's embedding learning in the graph convolution operation. As a result, the multi-layer graph convolution will make users with dissimilar interests have similar embeddings. In this paper, we propose a novel Interest-aware Message-Passing GCN (IMP-GCN) recommendation model, which performs high-order graph convolution inside subgraphs. The subgraph consists of users with similar interests and their interacted items. To form the subgraphs, we design an unsupervised subgraph generation module, which can effectively identify users with common interests by exploiting both user feature and graph structure. To this end, our model can avoid propagating negative information from high-order neighbors into embedding learning. Experimental results on three large-scale benchmark datasets show that our model can gain performance improvement by stacking more layers and outperform the state-of-the-art GCN-based recommendation models significantly.
Object detectors usually achieve promising results with the supervision of complete instance annotations. However, their performance is far from satisfactory with sparse instance annotations. Most existing methods for sparsely annotated object detection either re-weight the loss of hard negative samples or convert the unlabeled instances into ignored regions to reduce the interference of false negatives. We argue that these strategies are insufficient since they can at most alleviate the negative effect caused by missing annotations. In this paper, we propose a simple but effective mechanism, called Co-mining, for sparsely annotated object detection. In our Co-mining, two branches of a Siamese network predict the pseudo-label sets for each other. To enhance multi-view learning and better mine unlabeled instances, the original image and corresponding augmented image are used as the inputs of two branches of the Siamese network, respectively. Co-mining can serve as a general training mechanism applied to most of modern object detectors. Experiments are performed on MS COCO dataset with three different sparsely annotated settings using two typical frameworks: anchor-based detector RetinaNet and anchor-free detector FCOS. Experimental results show that our Co-mining with RetinaNet achieves 1.4%~2.1% improvements compared with different baselines and surpasses existing methods under the same sparsely annotated setting.
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
With the rise of knowledge graph (KG), question answering over knowledge base (KBQA) has attracted increasing attention in recent years. Despite much research has been conducted on this topic, it is still challenging to apply KBQA technology in industry because business knowledge and real-world questions can be rather complicated. In this paper, we present AliMe-KBQA, a bold attempt to apply KBQA in the E-commerce customer service field. To handle real knowledge and questions, we extend the classic "subject-predicate-object (SPO)" structure with property hierarchy, key-value structure and compound value type (CVT), and enhance traditional KBQA with constraints recognition and reasoning ability. We launch AliMe-KBQA in the Marketing Promotion scenario for merchants during the "Double 11" period in 2018 and other such promotional events afterwards. Online results suggest that AliMe-KBQA is not only able to gain better resolution and improve customer satisfaction, but also becomes the preferred knowledge management method by business knowledge staffs since it offers a more convenient and efficient management experience.
Events are happening in real-world and real-time, which can be planned and organized occasions involving multiple people and objects. Social media platforms publish a lot of text messages containing public events with comprehensive topics. However, mining social events is challenging due to the heterogeneous event elements in texts and explicit and implicit social network structures. In this paper, we design an event meta-schema to characterize the semantic relatedness of social events and build an event-based heterogeneous information network (HIN) integrating information from external knowledge base, and propose a novel Pair-wise Popularity Graph Convolutional Network (PP-GCN) based fine-grained social event categorization model. We propose a Knowledgeable meta-paths Instances based social Event Similarity (KIES) between events and build a weighted adjacent matrix as input to the PP-GCN model. Comprehensive experiments on real data collections are conducted to compare various social event detection and clustering tasks. Experimental results demonstrate that our proposed framework outperforms other alternative social event categorization techniques.