Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions. This indicates that there exists a strong correlation between the visual and textual domains. In addition, text-image discriminative models such as CLIP excel in image labelling from text prompts, thanks to the rich and diverse information available from open concepts. In this paper, we leverage these technical advances to solve a challenging problem in computer vision: camouflaged instance segmentation. Specifically, we propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations. Such cross-domain representations are desirable in segmenting camouflaged objects where visual cues are subtle to distinguish the objects from the background, especially in segmenting novel objects which are not seen in training. We also develop technically supportive components to effectively fuse cross-domain features and engage relevant features towards respective foreground objects. We validate our method and compare it with existing ones on several benchmark datasets of camouflaged instance segmentation and generic open-vocabulary instance segmentation. Experimental results confirm the advances of our method over existing ones. We will publish our code and pre-trained models to support future research.

Related content

MoDELS · Continuity · State space · Sensors · Datasets

February 15, 2024

Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling

Raunaq Bhirangi,Chenyu Wang,Venkatesh Pattabiraman,Carmel Majidi,Abhinav Gupta,Tess Hellebrekers,Lerrel Pinto

Reasoning from sequences of raw sensory data is a ubiquitous problem across fields ranging from medical devices to robotics. These problems often involve using long sequences of raw sensor data (e.g. magnetometers, piezoresistors) to predict sequences of desirable physical quantities (e.g. force, inertial measurements). While classical approaches are powerful for locally-linear prediction problems, they often fall short when using real-world sensors. These sensors are typically non-linear, are affected by extraneous variables (e.g. vibration), and exhibit data-dependent drift. For many problems, the prediction task is exacerbated by small labeled datasets since obtaining ground-truth labels requires expensive equipment. In this work, we present Hierarchical State-Space Models (HiSS), a conceptually simple, new technique for continuous sequential prediction. HiSS stacks structured state-space models on top of each other to create a temporal hierarchy. Across six real-world sensor datasets, from tactile-based state prediction to accelerometer-based inertial measurement, HiSS outperforms state-of-the-art sequence models such as causal Transformers, LSTMs, S4, and Mamba by at least 23% on MSE. Our experiments further indicate that HiSS demonstrates efficient scaling to smaller datasets and is compatible with existing data-filtering techniques. Code, datasets and videos can be found on //hiss-csp.github.io.

GPT-4V · MoDELS · Extensibility · Identifiable · Pairwise

February 15, 2024

Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing

Yong Cao,Wenyan Li,Jiaang Li,Yifei Yuan,Antonia Karamolegkou,Daniel Hershcovich

from arxiv, work in progress

Pretrained large Vision-Language models have drawn considerable interest in recent years due to their remarkable performance. Despite considerable efforts to assess these models from diverse perspectives, the extent of visual cultural awareness in the state-of-the-art GPT-4V model remains unexplored. To tackle this gap, we extensively probed GPT-4V using the MaRVL benchmark dataset, aiming to investigate its capabilities and limitations in visual understanding with a focus on cultural aspects. Specifically, we introduced three visual related tasks, i.e. caption classification, pairwise captioning, and culture tag selection, to systematically delve into fine-grained visual cultural evaluation. Experimental results indicate that GPT-4V excels at identifying cultural concepts but still exhibits weaker performance in low-resource languages, such as Tamil and Swahili. Notably, through human evaluation, GPT-4V proves to be more culturally relevant in image captioning tasks than the original MaRVL human annotations, suggesting a promising solution for future visual cultural benchmark construction.

Inference · Correlation coefficient · Channel · Bayesian inference · Sparsity

February 15, 2024

Bayesian Inference on Brain-Computer Interfaces via GLASS

Bangyao Zhao,Jane E. Huggins,Jian Kang

from arxiv, 32 pages, 5 figures

Brain-computer interfaces (BCIs), particularly the P300 BCI, facilitate direct communication between the brain and computers. The fundamental statistical problem in P300 BCIs lies in classifying target and non-target stimuli based on electroencephalogram (EEG) signals. However, the low signal-to-noise ratio (SNR) and complex spatial/temporal correlations of EEG signals present challenges in modeling and computation, especially for individuals with severe physical disabilities, who are the primary users of BCIs. To address these challenges, we introduce a novel Gaussian Latent channel model with Sparse time-varying effects (GLASS) under a fully Bayesian framework. GLASS is built upon a constrained multinomial logistic regression particularly designed for the imbalanced target and non-target stimuli. The novel latent channel decomposition efficiently alleviates strong spatial correlations between EEG channels, while the soft-thresholded Gaussian process (STGP) prior ensures sparse and smooth time-varying effects. We demonstrate that GLASS substantially improves BCI performance in participants with amyotrophic lateral sclerosis (ALS) and identifies important EEG channels (PO8, Oz, PO7, and Pz) in parietal and occipital regions that align with existing literature. For broader accessibility, we develop an efficient gradient-based variational inference (GBVI) algorithm for posterior computation and provide a user-friendly Python module available at //github.com/BangyaoZhao/GLASS.
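
The sparsity mechanism mentioned in the abstract above can be illustrated with the standard soft-thresholding operator, which shrinks values toward zero and sets small ones exactly to zero. This is only a minimal sketch: the `latent` curve is a hypothetical stand-in for a Gaussian process draw, the threshold 0.3 is an arbitrary choice, and none of the names come from the GLASS Python module.

```python
def soft_threshold(x: float, lam: float) -> float:
    """Shrink x toward zero by lam; any x with |x| <= lam becomes exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


# Hypothetical smooth latent curve over time (stand-in for a Gaussian process draw).
latent = [0.05, 0.2, 0.6, 1.0, 0.6, 0.2, 0.05, -0.1]

# Thresholding keeps the smooth bump in the middle and zeroes out the small tails,
# which is the "sparse and smooth" behaviour described in the abstract.
effect = [soft_threshold(v, 0.3) for v in latent]
print(effect)
```

Applying the operator to a smooth curve keeps the result smooth where it is non-zero, while time points with weak signal contribute nothing.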

Continuity · Performer · Learning · Online · INFORMS

February 14, 2024

Memory-Efficient Continual Learning Object Segmentation for Long Video

Amir Nazemi,Mohammad Javad Shafiee,Zahra Gharaee,Paul Fieguth

Recent state-of-the-art semi-supervised Video Object Segmentation (VOS) methods have shown significant improvements in target object segmentation accuracy when information from preceding frames is used in segmenting the current frame. In particular, such memory-based approaches can help a model to more effectively handle appearance changes (representation drift) or occlusions. Ideally, for maximum performance, Online VOS methods would need all or most of the preceding frames (or their extracted information) to be stored in memory and be used for online learning in later frames. Such a solution is not feasible for long videos, as the required memory size grows without bound, and such methods can fail when memory is limited and a target object experiences repeated representation drifts throughout a video. We propose two novel techniques to reduce the memory requirement of Online VOS methods while improving modeling accuracy and generalization on long videos. Motivated by the success of continual learning techniques in preserving previously-learned knowledge, here we propose Gated-Regularizer Continual Learning (GRCL), which improves the performance of any Online VOS subject to limited memory, and a Reconstruction-based Memory Selection Continual Learning (RMSCL), which empowers Online VOS methods to efficiently benefit from stored information in memory. We also analyze the performance of a hybrid combination of the two proposed methods. Experimental results show that the proposed methods are able to improve the performance of Online VOS models by more than 8%, with improved robustness on long-video datasets while maintaining comparable performance on short-video datasets such as DAVIS16, DAVIS17, and YouTube-VOS18.

Networking · Extensibility · Performer · INFORMS · state-of-the-art

February 14, 2024

FD-Vision Mamba for Endoscopic Exposure Correction

Zhuoran Zheng,Jun Zhang

from arxiv, arXiv admin note: substantial text overlap with arXiv:2402.04139

In endoscopic imaging, the recorded images are prone to exposure abnormalities, so maintaining high-quality images is important to assist healthcare professionals in making decisions. To overcome this issue, we design a frequency-domain based network, called FD-Vision Mamba (FDVM-Net), which achieves high-quality image exposure correction by reconstructing the frequency domain of endoscopic images. Specifically, inspired by the State Space Sequence Models (SSMs), we develop a C-SSM block that integrates the local feature extraction ability of the convolutional layer with the ability of the SSM to capture long-range dependencies. A two-path network is built using C-SSM as the basic function cell, and these two paths deal with the phase and amplitude information of the image, respectively. Finally, a degraded endoscopic image is reconstructed by FDVM-Net to obtain a high-quality clear image. Extensive experimental results demonstrate that our method achieves state-of-the-art results in terms of speed and accuracy, and it is noteworthy that our method can enhance endoscopic images of arbitrary resolution. The code is available at //github.com/zzr-idam/FDVM-Net.
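
The split into phase and amplitude that the two network paths operate on can be illustrated with a plain Fourier transform. The sketch below is not FDVM-Net: it only shows, on a tiny hypothetical grayscale image, that a spectrum separates into amplitude and phase and that recombining them recovers the image. A naive DFT is used to keep the example dependency-free; an FFT routine would normally be used instead.

```python
import cmath


def dft2(img):
    """Naive 2D discrete Fourier transform of a list-of-lists image."""
    h, w = len(img), len(img[0])
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)] for u in range(h)]


def idft2(spec):
    """Naive inverse 2D discrete Fourier transform."""
    h, w = len(spec), len(spec[0])
    return [[sum(spec[u][v] * cmath.exp(2j * cmath.pi * (u * y / h + v * x / w))
                 for u in range(h) for v in range(w)) / (h * w)
             for x in range(w)] for y in range(h)]


# Hypothetical 3x4 grayscale image with intensities in [0, 1].
image = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.6, 0.7, 0.8],
         [0.9, 0.8, 0.7, 0.6]]

spectrum = dft2(image)
amplitude = [[abs(c) for c in row] for row in spectrum]        # magnitude per frequency
phase = [[cmath.phase(c) for c in row] for row in spectrum]    # angle per frequency

# Recombine amplitude and phase into a complex spectrum and invert it.
recombined = [[cmath.rect(a, p) for a, p in zip(a_row, p_row)]
              for a_row, p_row in zip(amplitude, phase)]
restored = [[c.real for c in row] for row in idft2(recombined)]
```

In this decomposition the zero-frequency amplitude equals the sum of all pixel intensities, which is one reason amplitude is a natural place to look for exposure information.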

Invariant · Loop · Rank · Automator · Notability

February 12, 2024

Ranking LLM-Generated Loop Invariants for Program Verification

Saikat Chakraborty,Shuvendu K. Lahiri,Sarah Fakhoury,Madanlal Musuvathi,Akash Lal,Aseem Rastogi,Aditya Senthilnathan,Rahul Sharma,Nikhil Swamy

from arxiv, Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-findings 2023)

Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a re-ranking approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available at //github.com/microsoft/NeuralInvariantRanker.
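
Why a better ranking reduces verifier calls can be seen in a toy setting. In the sketch below, the candidate invariants, the `scores` dictionary, and the `is_inductive` check are all hypothetical stand-ins (for LLM samples, the learned ranker, and a real program verifier respectively); nothing here reflects the actual implementation in the repository.

```python
def calls_until_verified(candidates, verifier):
    """Try candidates in order; return how many verifier calls were needed, or None."""
    for calls, candidate in enumerate(candidates, start=1):
        if verifier(candidate):
            return calls
    return None


def is_inductive(invariant):
    """Stand-in for a program verifier: accepts exactly one known-good invariant."""
    return invariant == "x == 2 * i"


# Hypothetical LLM samples, in the order they were generated.
candidates = ["i >= 0", "x == i", "x == 2 * i", "i <= n"]

# Hypothetical ranker scores (higher means more likely to be a correct invariant).
scores = {"i >= 0": 0.31, "x == i": 0.12, "x == 2 * i": 0.88, "i <= n": 0.40}

ranked = sorted(candidates, key=scores.get, reverse=True)

print(calls_until_verified(candidates, is_inductive))  # generation order
print(calls_until_verified(ranked, is_inductive))      # ranked order
```

The saving depends entirely on how high the ranker places a correct candidate, so a poorly calibrated ranker could also make things worse.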

Med-PaLM 2 · Performer · Language modeling · MoDELS · Question answering

May 16, 2023

Towards Expert-Level Medical Question Answering with Large Language Models

Karan Singhal,Tao Tu,Juraj Gottweis,Rory Sayres,Ellery Wulczyn,Le Hou,Kevin Clark,Stephen Pfohl,Heather Cole-Lewis,Darlene Neal,Mike Schaekermann,Amy Wang,Mohamed Amin,Sami Lachgar,Philip Mansfield,Sushant Prakash,Bradley Green,Ewa Dominowska,Blaise Aguera y Arcas,Nenad Tomasev,Yun Liu,Renee Wong,Christopher Semturs,S. Sara Mahdavi,Joelle Barral,Dale Webster,Greg S. Corrado,Yossi Matias,Shekoofeh Azizi,Alan Karthikesalingam,Vivek Natarajan

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.

Graph · MoDELS · Continuity · GPU · Hidden layer

June 7, 2020

Principal Neighbourhood Aggregation for Graph Nets

Gabriele Corso,Luca Cavalleri,Dominique Beaini,Pietro Liò,Petar Veličković

Graph Neural Networks (GNNs) have been shown to be effective models for different predictive tasks on graph-structured data. Recent work on their expressive power has focused on isomorphism tasks and countable feature spaces. We extend this theoretical framework to include continuous features - which occur regularly in real-world input domains and within the hidden layers of GNNs - and we demonstrate the requirement for multiple aggregation functions in this context. Accordingly, we propose Principal Neighbourhood Aggregation (PNA), a novel architecture combining multiple aggregators with degree-scalers (which generalize the sum aggregator). Finally, we compare the capacity of different models to capture and exploit the graph structure via a novel benchmark containing multiple tasks taken from classical graph theory, alongside existing benchmarks from real-world domains, all of which demonstrate the strength of our model. With this work, we hope to steer some of the GNN research towards new aggregation methods which we believe are essential in the search for powerful and robust models.
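
The idea of combining several aggregators with degree-based scalers can be sketched for a single node with scalar neighbour features. The abstract does not list the components, so the specific choices below are assumptions: mean, max, min and standard deviation as aggregators, and identity, amplification and attenuation scalers of the form (log(d + 1) / delta) raised to the power 0, 1 and -1, where `delta` is taken to be a normalisation constant such as the average of log(d + 1) over training nodes. A real layer would operate on feature vectors and feed the result through a learned transformation.

```python
import math
from statistics import mean, pstdev


def pna_aggregate(neighbour_feats, delta):
    """Combine 4 aggregators with 3 degree scalers into 12 values for one node."""
    degree = len(neighbour_feats)
    aggregates = [
        mean(neighbour_feats),
        max(neighbour_feats),
        min(neighbour_feats),
        pstdev(neighbour_feats),  # population standard deviation
    ]
    scale = math.log(degree + 1) / delta
    scalers = [1.0, scale, 1.0 / scale]  # identity, amplification, attenuation
    return [s * a for s in scalers for a in aggregates]


# Hypothetical node with three neighbours; delta assumed to come from training data.
features = [1.0, 2.0, 3.0]
out = pna_aggregate(features, delta=math.log(4))
print(len(out), out)
```

Because the scalers depend on the node degree, two nodes whose neighbours have the same mean, max, min and spread can still receive different messages if their neighbourhood sizes differ.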

Graph attention network · Sentiment classification · Graph · Networking · Attention mechanism

September 5, 2019

Syntax-Aware Aspect Level Sentiment Classification with Graph Attention Networks

Binxuan Huang,Kathleen M. Carley

from arxiv, Accepted by EMNLP 2019

Aspect level sentiment classification aims to identify the sentiment expressed towards an aspect given a context sentence. Previous neural network based methods largely ignore the syntax structure in one sentence. In this paper, we propose a novel target-dependent graph attention network (TD-GAT) for aspect level sentiment classification, which explicitly utilizes the dependency relationship among words. Using the dependency graph, it propagates sentiment features directly from the syntactic context of an aspect target. In our experiments, we show our method outperforms multiple baselines with GloVe embeddings. We also demonstrate that using BERT representations further substantially boosts the performance.

Attribute space · Diversity · Pair · MoDELS · Training data

August 2, 2018

Diverse Image-to-Image Translation via Disentangled Representations

Hsin-Ying Lee,Hung-Yu Tseng,Jia-Bin Huang,Maneesh Kumar Singh,Ming-Hsuan Yang

from arxiv, ECCV 2018 (Oral). Project page: //vllab.ucmerced.edu/hylee/DRIT/ Code: //github.com/HsinYingLee/DRIT/

Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.