Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadows for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset, we create a large-scale dataset called RdSOBA using rendering techniques. Moreover, we design a two-stage network named DMASNet with decomposed mask prediction and attentive shadow filling. Specifically, in the first stage, we decompose shadow mask prediction into box prediction and shape prediction. In the second stage, we attend to reference background shadow pixels to fill the foreground shadow. Extensive experiments demonstrate that our DMASNet achieves better visual effects and generalizes well to real composite images.
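As an illustration of the two-stage design described above, the sketch below chains box prediction, shape prediction, and attentive shadow filling; all module names (`box_net`, `shape_net`, `fill_net`) are hypothetical placeholders rather than DMASNet's actual components.

```python
import torch

def predict_foreground_shadow(composite, fg_mask, box_net, shape_net, fill_net):
    """Illustrative two-stage pipeline following the abstract; module names are
    hypothetical placeholders, not the paper's actual architecture."""
    box = box_net(composite, fg_mask)                 # stage 1a: where the shadow should go
    shadow_mask = shape_net(composite, fg_mask, box)  # stage 1b: shadow shape inside the box
    shadowed = fill_net(composite, shadow_mask)       # stage 2: fill values by attending to
    return shadowed, shadow_mask                      #          background shadow pixels
```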
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. However, they still struggle to synthesize images of the diverse scenarios or styles that the original pretrained models can produce. To address this, we propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model. We devise a novel training objective for T2I diffusion models that minimally fine-tunes the pretrained model to achieve consistency. Our method, dubbed \emph{Direct Consistency Optimization}, is as simple as the regular diffusion loss, while significantly enhancing the compositionality of personalized T2I models. Our approach also induces a new sampling method that controls the tradeoff between image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using a comprehensive caption for reference images to further enhance image-text alignment. We show the efficacy of the proposed method on T2I personalization for subject, style, or both. In particular, our method results in a superior Pareto frontier to the baselines. Generated examples and code are available on our project page (//dco-t2i.github.io/).
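To illustrate the general idea of fine-tuning toward the reference images while penalizing deviation from the pretrained model, here is a minimal PyTorch sketch. The regularized loss and the helpers (`q_sample`, `num_timesteps`) are assumptions for illustration only; the exact DCO objective in the paper differs.

```python
import torch
import torch.nn.functional as F

def consistency_regularized_loss(model, frozen_model, x0, cond, lam=1.0):
    """Illustrative loss: denoising loss on reference images plus a penalty on
    deviation from the frozen pretrained model (not the exact DCO objective)."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, model.num_timesteps, (x0.size(0),), device=x0.device)
    xt = model.q_sample(x0, t, noise)            # forward diffusion (assumed helper)
    eps_theta = model(xt, t, cond)               # fine-tuned model's noise prediction
    with torch.no_grad():
        eps_ref = frozen_model(xt, t, cond)      # frozen pretrained prediction
    recon = F.mse_loss(eps_theta, noise)         # consistency to reference images
    deviation = F.mse_loss(eps_theta, eps_ref)   # stay close to the pretrained model
    return recon + lam * deviation
```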
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99\% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on LVLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of LVLMs in understanding multipanel images, we analyze the potential causes of LVLMs' performance and offer insights for enhancement using the synthetic data. Code and data are released at //sites.google.com/view/multipanelvqa/home.
Over the past few years, several microring resonator (MRR)-based analog photonic architectures have been proposed to accelerate general matrix-matrix multiplications (GEMMs), which are found in abundance in deep learning workloads. These architectures have grown dramatically in popularity because they offer exceptional throughput and energy efficiency compared to their electronic counterparts. However, such architectures, due to their traditional realization on the silicon-on-insulator (SOI) material platform, face two shortcomings. First, the high index contrast of the SOI platform incurs high scattering losses, which mandates the provisioning of high optical input power. Second, SOI waveguides are susceptible to two-photon absorption (TPA), which can incur substantial optical signal losses at moderate-to-high signal fan-in. These shortcomings have severely detrimental effects on the achievable parallelism, throughput, and energy efficiency of SOI MRR-based GEMM accelerators. To address these shortcomings, we present a novel Silicon Nitride (SiN)-based photonic GEMM accelerator called SiNPhAR. The SiNPhAR architecture employs SiN-based active and passive devices to implement analog GEMM functions. Since the SiN material exhibits lower index contrast and no TPA, the optical signal losses in our SiNPhAR architecture are very low. This advantage significantly enhances the achievable processing parallelism, throughput, and energy efficiency of the SiNPhAR architecture compared to SOI-based photonic GEMM accelerators from prior work. We quantify and compare these benefits of the SiNPhAR architecture via a cross-layer evaluation for a benchmark workload comprising four modern deep neural network models. From the system-level performance analysis, SiNPhAR demonstrates at least 1.7x higher throughput (FPS) and at least 2.8x better energy efficiency (FPS/W) than prior SOI-based GEMM accelerators.
Semantic segmentation plays a crucial role in various computer vision applications, yet its efficacy is often hindered by the lack of high-quality labeled data. To address this challenge, a common strategy is to leverage models trained on data from different populations, such as publicly available datasets. This approach, however, leads to the distribution shift problem, resulting in reduced performance on the population of interest. In scenarios where model errors can have significant consequences, selective prediction methods offer a means to mitigate risks and reduce reliance on expert supervision. This paper investigates selective prediction for semantic segmentation in low-resource settings, focusing on post-hoc confidence estimators applied to pre-trained models operating under distribution shift. We propose a novel image-level confidence measure tailored for semantic segmentation and demonstrate its effectiveness through experiments on three medical imaging tasks. Our findings show that post-hoc confidence estimators offer a cost-effective approach to reducing the impacts of distribution shift.
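As a minimal illustration of an image-level confidence measure built on top of a pre-trained segmentation model, the sketch below aggregates per-pixel softmax confidences into a single score. The quantile-based aggregation is an assumption for illustration, not necessarily the measure proposed in the paper.

```python
import numpy as np

def image_level_confidence(softmax_probs, quantile=0.1):
    """Illustrative image-level confidence for selective prediction: aggregate
    per-pixel maximum softmax probabilities (input shape (C, H, W)) into one
    score; a low quantile lets a few very uncertain regions lower the score.
    Hypothetical sketch, not the paper's exact measure."""
    pixel_conf = softmax_probs.max(axis=0)    # (H, W): max class probability per pixel
    return np.quantile(pixel_conf, quantile)  # conservative image-level score
```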
Many successful methods to learn dynamical systems from data have recently been introduced. However, ensuring that the inferred dynamics preserve known constraints, such as conservation laws or restrictions on the allowed system states, remains challenging. We propose stabilized neural differential equations (SNDEs), a method to enforce arbitrary manifold constraints for neural differential equations. Our approach is based on a stabilization term that, when added to the original dynamics, renders the constraint manifold provably asymptotically stable. Due to its simplicity, our method is compatible with all common neural differential equation (NDE) models and broadly applicable. In extensive empirical evaluations, we demonstrate that SNDEs outperform existing methods while broadening the types of constraints that can be incorporated into NDE training.
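The following is a minimal sketch of one common way to realize such a stabilization term for an explicit constraint $g(x) = 0$: a correction that pulls trajectories back toward the constraint manifold is added to the learned vector field. The pseudo-inverse form below is an illustrative assumption and may differ from the exact SNDE construction in the paper.

```python
import torch

def stabilized_dynamics(f_theta, g, x, t, gamma=1.0):
    """Illustrative stabilized vector field: drives trajectories back toward the
    constraint manifold {x : g(x) = 0}. Here x has shape (d,) and g returns a
    1-D tensor of constraint violations of shape (m,). Hypothetical sketch; the
    paper's stabilization term may differ."""
    gx = g(x)                                      # constraint violation, shape (m,)
    J = torch.autograd.functional.jacobian(g, x)   # constraint Jacobian, shape (m, d)
    correction = torch.linalg.pinv(J) @ gx         # pull the state toward g(x) = 0
    return f_theta(t, x) - gamma * correction
```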
In this work, we present some new results for compressed sensing and phase retrieval. For compressed sensing, it is shown that if the unknown $n$-dimensional vector can be expressed as a linear combination of $s$ unknown Vandermonde vectors (with Fourier vectors as a special case) and the measurement matrix is a Vandermonde matrix, exact recovery of the vector with $2s$ measurements and $O(\mathrm{poly}(s))$ complexity is possible when $n \geq 2s$. Based on this result, a new class of measurement matrices is presented from which it is possible to recover $s$-sparse $n$-dimensional vectors for $n \geq 2s$ with as few as $2s$ measurements and with a recovery algorithm of $O(\mathrm{poly}(s))$ complexity. In the second part of the work, these results are extended to the challenging problem of phase retrieval. The most significant discovery in this direction is that if the unknown $n$-dimensional vector is composed of $s$ frequencies with at least one being non-harmonic, $n \geq 4s - 1$, and at least $8s-3$ Fourier measurements are taken, then, remarkably, only two vectors can produce the observed measurement values, and they are easily obtainable from each other. The two vectors can be found by an algorithm with only $O(\mathrm{poly}(s))$ complexity. An immediate application of the new result is the construction of a measurement matrix from which it is possible to recover almost all $s$-sparse $n$-dimensional signals (up to a global phase) from $O(s)$ magnitude-only measurements with $O(\mathrm{poly}(s))$ recovery complexity when $n \geq 4s - 1$.
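For intuition on why $2s$ Vandermonde (e.g., Fourier) measurements can suffice, recall the classical Prony-type formulation below; this is an illustrative recollection of standard theory, with notation ($y_k$, $c_j$, $z_j$) introduced here rather than taken from the paper, whose actual algorithm may differ.
\[
  y_k \;=\; \sum_{j=1}^{s} c_j\, z_j^{\,k}, \qquad k = 0, 1, \dots, 2s-1 .
\]
Each measurement $y_k$ is a sample of a sum of $s$ exponential modes: the nodes $z_j$ are the roots of a degree-$s$ annihilating polynomial whose coefficients satisfy a linear system in the $y_k$, and the amplitudes $c_j$ then follow from a Vandermonde solve, all within $O(\mathrm{poly}(s))$ operations.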
Sequential recommendation aims to leverage users' historical behaviors to predict their next interaction. Existing works have not yet addressed two main challenges in sequential recommendation. First, user behaviors in rich historical sequences are often implicit and noisy preference signals and cannot sufficiently reflect users' actual preferences. In addition, users' dynamic preferences often change rapidly over time, making it difficult to capture user patterns in their historical sequences. In this work, we propose a graph neural network model called SURGE (short for SeqUential Recommendation with Graph neural nEtworks) to address these two issues. Specifically, SURGE integrates different types of preferences in long-term user behaviors into clusters in the graph by re-constructing loose item sequences into tight item-item interest graphs based on metric learning. This helps explicitly distinguish users' core interests by forming dense clusters in the interest graph. Then, we perform cluster-aware and query-aware graph convolutional propagation and graph pooling on the constructed graph, which dynamically fuses and extracts users' currently activated core interests from noisy user behavior sequences. We conduct extensive experiments on both public and proprietary industrial datasets. Experimental results demonstrate significant performance gains of our proposed method compared to state-of-the-art methods. Further studies on sequence length confirm that our method can model long behavioral sequences effectively and efficiently.
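A minimal sketch of the metric-learning-based graph construction described above: connect the items in a behavior sequence whose learned embeddings are most similar, so that related preferences form dense clusters. The similarity measure and sparsification rule here are illustrative assumptions, not SURGE's exact design.

```python
import torch
import torch.nn.functional as F

def build_interest_graph(item_embeddings, keep_ratio=0.5):
    """Illustrative construction of a tight item-item interest graph from a loose
    behavior sequence (item_embeddings has shape (n, d)). Hypothetical sketch of
    a metric-learning-style graph construction, not SURGE's exact module."""
    sim = F.cosine_similarity(item_embeddings.unsqueeze(1),
                              item_embeddings.unsqueeze(0), dim=-1)  # (n, n) similarities
    threshold = torch.quantile(sim.flatten(), 1.0 - keep_ratio)      # keep the strongest edges
    adjacency = (sim >= threshold).float()                           # sparsified interest graph
    return adjacency
```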
Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely on supervised learning and require large-scale labeled event datasets, which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a Few-Shot Learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits a Dynamic Memory Network (DMN) to not only learn better prototypes for event types but also produce more robust sentence encodings for event mentions. Unlike vanilla prototypical networks, which compute event prototypes by simple averaging and consume event mentions only once, our model is more robust and can distill contextual information from event mentions multiple times thanks to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only handles sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.
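To contrast with vanilla averaging, the sketch below shows how a prototype might be refined by re-attending to the support event mentions over several hops, in the spirit of a dynamic memory; this is an illustrative assumption rather than the exact DMB-PN module.

```python
import torch
import torch.nn.functional as F

def multihop_prototype(support_encodings, num_hops=3):
    """Illustrative multi-hop prototype refinement: instead of one averaging pass,
    re-attend to the support event mentions (shape (n, d)) several times.
    Hypothetical sketch, not the exact DMB-PN mechanism."""
    proto = support_encodings.mean(dim=0)                    # vanilla prototype
    for _ in range(num_hops):
        attn = F.softmax(support_encodings @ proto, dim=0)   # attention over mentions
        proto = (attn.unsqueeze(-1) * support_encodings).sum(dim=0)
    return proto
```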
Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e., an event proposal model and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description on the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposed event within the encoded features. This masking network converts the event proposal into a differentiable mask, which ensures consistency between the proposal and the caption during training. In addition, our model employs a self-attention mechanism, which enables the use of an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, where it achieves METEOR scores of 10.12 and 6.58, respectively.
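Below is a minimal sketch of how a predicted proposal could be converted into a differentiable mask over the encoded frames, so gradients from the captioning loss can flow back into proposal prediction. The soft-boundary construction is an illustrative assumption; the paper's masking network is a learned module.

```python
import torch

def differentiable_event_mask(center, length, num_frames, sharpness=10.0):
    """Illustrative conversion of an event proposal (normalized center and length)
    into a soft, differentiable mask over encoded frames, so the captioning
    decoder attends only to the proposed segment. Hypothetical sketch."""
    pos = torch.linspace(0.0, 1.0, num_frames)           # normalized frame positions
    start, end = center - length / 2, center + length / 2
    # product of two sigmoids: close to 1 inside [start, end], close to 0 outside
    mask = torch.sigmoid(sharpness * (pos - start)) * torch.sigmoid(sharpness * (end - pos))
    return mask
```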
Inspired by recent developments in artificial satellites, remote sensing images have attracted extensive attention. Recently, noticeable progress has been made in scene classification and target detection. However, it is still not clear how to describe the content of a remote sensing image with accurate and concise sentences. In this paper, we investigate how to describe remote sensing images with accurate and flexible sentences. First, some annotation instructions are presented to better describe remote sensing images, considering their special characteristics. Second, in order to exhaustively exploit the contents of remote sensing images, a large-scale aerial image data set is constructed for remote sensing image captioning. Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing image captioning. Extensive experiments on the proposed data set demonstrate that the content of a remote sensing image can be comprehensively described by the generated language descriptions. The data set is available at //github.com/2051/RSICD_optimal