Single-point annotation in visual tasks, with the goal of minimizing labelling costs, is becoming increasingly prominent in research. Recently, visual foundation models, such as Segment Anything (SAM), have gained widespread usage due to their robust zero-shot capabilities and exceptional annotation performance. However, SAM's class-agnostic output and high confidence in local segmentation introduce 'semantic ambiguity', posing a challenge for precise category-specific segmentation. In this paper, we introduce a cost-effective category-specific segmenter using SAM. To tackle this challenge, we have devised a Semantic-Aware Instance Segmentation Network (SAPNet) that integrates Multiple Instance Learning (MIL) with matching capability and SAM with point prompts. SAPNet strategically selects the most representative mask proposals generated by SAM to supervise segmentation, with a specific focus on object category information. Moreover, we introduce the Point Distance Guidance and Box Mining Strategy to mitigate inherent challenges: 'group' and 'local' issues in weakly supervised segmentation. These strategies serve to further enhance the overall segmentation performance. The experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed SAPNet, emphasizing its semantic matching capabilities and its potential to advance point-prompted instance segmentation. The code will be made publicly available.
Sensor network virtualization enables the possibility of sharing common physical resources to multiple stakeholder applications. This paper focuses on addressing the dynamic adaptation of already assigned virtual sensor network resources to respond to time varying application demands. We propose an optimization framework that dynamically allocate applications into sensor nodes while accounting for the characteristics and limitations of the wireless sensor environment. It takes also into account the additional energy consumption related to activating new nodes and/or moving already active applications. Different objective functions related to the available energy in the nodes are analyzed. The proposed framework is evaluated by simulation considering realistic parameters from actual sensor nodes and deployed applications to assess the efficiency of the proposals.
Graph Signal Processing (GSP) based recommendation algorithms have recently attracted lots of attention due to its high efficiency. However, these methods failed to consider the importance of various interactions that reflect unique user/item characteristics and failed to utilize user and item high-order neighborhood information to model user preference, thus leading to sub-optimal performance. To address the above issues, we propose a frequency-aware graph signal processing method (FaGSP) for collaborative filtering. Firstly, we design a Cascaded Filter Module, consisting of an ideal high-pass filter and an ideal low-pass filter that work in a successive manner, to capture both unique and common user/item characteristics to more accurately model user preference. Then, we devise a Parallel Filter Module, consisting of two low-pass filters that can easily capture the hierarchy of neighborhood, to fully utilize high-order neighborhood information of users/items for more accurate user preference modeling. Finally, we combine these two modules via a linear model to further improve recommendation accuracy. Extensive experiments on six public datasets demonstrate the superiority of our method from the perspectives of prediction accuracy and training efficiency compared with state-of-the-art GCN-based recommendation methods and GSP-based recommendation methods.
Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models. DiffUse reduces the required amount of preference annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.
In data analysis, there continues to be a need for interpretable dimensionality reduction methods whereby instrinic meaning associated with the data is retained in the reduced space. Standard approaches such as Principal Component Analysis (PCA) and the Singular Value Decomposition (SVD) fail at this task. A popular alternative is the CUR decomposition. In an SVD-like manner, the CUR decomposition approximates a matrix $A \in \mathbb{R}^{m \times n}$ as $A \approx CUR$, where $C$ and $R$ are matrices whose columns and rows are selected from the original matrix \cite{goreinov1997theory}, \cite{mahoney2009cur}. The difficulty in constructing a CUR decomposition is in determining which columns and rows to select when forming $C$ and $R$. Current column/row selection algorithms, particularly those that rely on an SVD, become infeasible as the size of the data becomes large \cite{dong2021simpler}. We address this problem by reducing the column/row selection problem to a collection of smaller sub-problems. The basic idea is to first partition the rows/columns of a matrix, and then apply an existing selection algorithm on each piece; for illustration purposes we use the Discrete Empirical Interpolation Method (\textsf{DEIM}) \cite{sorensen2016deim}. For the first task, we consider two existing algorithms that construct a Voronoi Tessellation (VT) of the rows and columns of a given matrix. We then extend these methods to automatically adapt to the data. The result is four data-driven row/column selection methods that are well-suited for parallelization, and compatible with nearly any existing column/row selection strategy. Theory and numerical examples show the design to be competitive with the original \textsf{DEIM} routine.
Image-level weakly supervised semantic segmentation (WSSS) is a fundamental yet challenging computer vision task facilitating scene understanding and automatic driving. Most existing methods resort to classification-based Class Activation Maps (CAMs) to play as the initial pseudo labels, which tend to focus on the discriminative image regions and lack customized characteristics for the segmentation task. To alleviate this issue, we propose a novel activation modulation and recalibration (AMR) scheme, which leverages a spotlight branch and a compensation branch to obtain weighted CAMs that can provide recalibration supervision and task-specific concepts. Specifically, an attention modulation module (AMM) is employed to rearrange the distribution of feature importance from the channel-spatial sequential perspective, which helps to explicitly model channel-wise interdependencies and spatial encodings to adaptively modulate segmentation-oriented activation responses. Furthermore, we introduce a cross pseudo supervision for dual branches, which can be regarded as a semantic similar regularization to mutually refine two branches. Extensive experiments show that AMR establishes a new state-of-the-art performance on the PASCAL VOC 2012 dataset, surpassing not only current methods trained with the image-level of supervision but also some methods relying on stronger supervision, such as saliency label. Experiments also reveal that our scheme is plug-and-play and can be incorporated with other approaches to boost their performance.
Learning node embeddings that capture a node's position within the broader graph structure is crucial for many prediction tasks on graphs. However, existing Graph Neural Network (GNN) architectures have limited power in capturing the position/location of a given node with respect to all other nodes of the graph. Here we propose Position-aware Graph Neural Networks (P-GNNs), a new class of GNNs for computing position-aware node embeddings. P-GNN first samples sets of anchor nodes, computes the distance of a given target node to each anchor-set,and then learns a non-linear distance-weighted aggregation scheme over the anchor-sets. This way P-GNNs can capture positions/locations of nodes with respect to the anchor nodes. P-GNNs have several advantages: they are inductive, scalable,and can incorporate node feature information. We apply P-GNNs to multiple prediction tasks including link prediction and community detection. We show that P-GNNs consistently outperform state of the art GNNs, with up to 66% improvement in terms of the ROC AUC score.
The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this effect, our paper proposes a novel attention based feature embedding that captures both entity and relation features in any given entity's neighborhood. Additionally, we also encapsulate relation clusters and multihop relations in our model. Our empirical study offers insights into the efficacy of our attention based model and we show marked performance gains in comparison to state of the art methods on all datasets.
Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis, thereby allowing manual manipulation in predicting the final answer.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.