Generative Neural Radiance Fields (GNeRF) based 3D-aware GANs have demonstrated remarkable capabilities in generating high-quality images while maintaining strong 3D consistency. Notably, significant advancements have been made in the domain of face generation. However, most existing models prioritize view consistency over disentanglement, resulting in limited semantic/attribute control during generation. To address this limitation, we propose a conditional GNeRF model that incorporates specific attribute labels as input to enhance the controllability and disentanglement abilities of 3D-aware generative models. Our approach builds upon a pre-trained 3D-aware face model, and we introduce a Training as Init and Optimizing for Tuning (TRIOT) method that trains a conditional normalizing flow module to enable facial attribute editing and then optimizes the latent vector to further improve attribute-editing precision. Our extensive experiments demonstrate that our model produces high-quality edits with superior view consistency while preserving non-target regions. Code is available at //github.com/zhangqianhui/TT-GNeRF.
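A minimal sketch of the two-stage "training as init, optimizing for tuning" idea described above, using placeholder modules `G` (generator), `C` (attribute classifier), and `F` (conditional editor) that stand in for the actual pre-trained 3D-aware model and normalizing flow; this illustrates the flow of the two stages, not the authors' implementation.

```python
import torch, torch.nn as nn

latent_dim, attr_dim = 512, 8
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64))            # stand-in for the 3D-aware generator
C = nn.Sequential(nn.Linear(3 * 64 * 64, attr_dim))              # stand-in for an attribute classifier
F = nn.Sequential(nn.Linear(latent_dim + attr_dim, latent_dim))  # stand-in for the conditional flow module

# Stage 1 (training as init): learn to map (latent, target attributes) -> edited latent.
opt_f = torch.optim.Adam(F.parameters(), lr=1e-4)
for _ in range(10):                                               # toy loop
    w = torch.randn(4, latent_dim)
    a = torch.randint(0, 2, (4, attr_dim)).float()
    w_edit = F(torch.cat([w, a], dim=-1))
    loss = nn.functional.binary_cross_entropy_with_logits(C(G(w_edit)), a) \
           + (w_edit - w).pow(2).mean()                           # keep non-target content close
    opt_f.zero_grad(); loss.backward(); opt_f.step()

# Stage 2 (optimizing for tuning): refine the edited latent per image for precision.
w, a = torch.randn(1, latent_dim), torch.ones(1, attr_dim)
w_edit = F(torch.cat([w, a], dim=-1)).detach().requires_grad_(True)
opt_w = torch.optim.Adam([w_edit], lr=1e-2)
for _ in range(10):
    loss = nn.functional.binary_cross_entropy_with_logits(C(G(w_edit)), a) \
           + (w_edit - w).pow(2).mean()
    opt_w.zero_grad(); loss.backward(); opt_w.step()
```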
This paper presents GPFC, a novel Graphics Processing Unit (GPU) Phase Folding and Convolutional Neural Network (CNN) system to detect exoplanets using the transit method. We devise a fast folding algorithm parallelized on a GPU to amplify low signal-to-noise ratio transit signals, allowing a search at high precision and speed. A CNN trained on two million synthetic light curves reports a score indicating the likelihood of a planetary signal at each period. GPFC improves on speed by three orders of magnitude over the predominant Box-fitting Least Squares (BLS) method. Our simulation results show that GPFC achieves 97% training accuracy, a higher true-positive rate than BLS at the same false-positive rate, and higher precision at the same recall. GPFC recovers 100% of known ultra-short-period planets in a blind search of Kepler light curves. These results highlight the promise of GPFC as an alternative to the traditional BLS algorithm for finding new transiting exoplanets in data taken with Kepler and other space transit missions such as K2, TESS, and the future PLATO and Earth 2.0.
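A minimal sketch of phase folding, the core operation the GPU kernel parallelizes: fold a light curve at a trial period, average the flux per phase bin so the periodic dip adds up coherently, and scan a grid of trial periods. This is a NumPy toy with made-up `time`/`flux` arrays, not the GPFC kernel or its period grid.

```python
import numpy as np

def phase_fold(time, flux, period, n_bins=256):
    """Fold a light curve at a trial period and average flux per phase bin."""
    phase = (time % period) / period                   # phase in [0, 1)
    bins = np.minimum((phase * n_bins).astype(int), n_bins - 1)
    folded = np.zeros(n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    np.add.at(folded, bins, flux)
    return folded / np.maximum(counts, 1)              # mean flux per phase bin

# Scan a grid of trial periods; the deepest folded dip hints at a transit.
time = np.arange(0, 90, 0.0204)                        # ~30-min cadence over 90 days (toy)
flux = 1 + 1e-4 * np.random.randn(time.size)
flux[(time % 0.7) < 0.02] -= 5e-4                      # inject a toy 0.7-day transit
periods = np.linspace(0.5, 1.0, 2000)
depths = [phase_fold(time, flux, p).min() for p in periods]
best_period = periods[int(np.argmin(depths))]
```

In GPFC the folded phase curves are scored by the trained CNN rather than by the raw dip depth used in this toy.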
Electronically tunable metasurfaces, or Intelligent Reflective Surfaces (IRSs), are a popular technology for achieving high spectral efficiency in modern wireless systems by shaping channels using a multitude of tunable passive reflective elements. Capitalizing on key practical limitations of IRS-aided beamforming pertaining to system modeling and channel sensing/estimation, we propose a novel, fully data-driven Zeroth-order Stochastic Gradient Ascent (ZoSGA) algorithm for general two-stage (i.e., short/long-term), fully-passive IRS-aided stochastic utility maximization. ZoSGA learns long-term optimal IRS beamformers jointly with short-term optimal precoders (e.g., WMMSE-based) via minimal zeroth-order reinforcement and in a strictly model-free fashion, relying solely on the \textit{effective} compound channels observed at the terminals, while being independent of channel models or network/IRS configurations. Another remarkable feature of ZoSGA is that it is amenable to analysis, enabling us to establish a state-of-the-art (SOTA) convergence rate of the order of $\boldsymbol{O}(\sqrt{S}\epsilon^{-4})$ under minimal assumptions, where $S$ is the total number of IRS elements and $\epsilon$ is a desired suboptimality target. Our numerical results on a standard MISO downlink IRS-aided sumrate maximization setting establish SOTA empirical behavior of ZoSGA as well, consistently and substantially outperforming standard fully model-based baselines. Lastly, we demonstrate that ZoSGA can in fact operate \textit{in the field}, by directly optimizing the capacitances of a varactor-based electromagnetic IRS model (unknown to ZoSGA) on a multiple user/IRS, compute-heavy network setting, with essentially no computational overheads or performance degradation.
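A minimal sketch of a two-point zeroth-order gradient-ascent step of the kind ZoSGA relies on: probe the black-box utility along a random direction and ascend the resulting finite-difference gradient estimate. The `sumrate` function here is a toy concave surrogate standing in for the utility measured from the effective channels at the terminals; step sizes and the smoothing radius are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 64                                                 # number of IRS elements
theta = np.zeros(S)                                    # IRS parameters (decision variable)

def sumrate(theta):
    """Stand-in for the utility observed at the terminals (model unknown to the learner)."""
    return -np.sum((theta - 1.0) ** 2)                 # toy concave surrogate

mu, lr = 1e-3, 1e-2                                    # smoothing radius and step size
for _ in range(500):
    u = rng.standard_normal(S)                         # random probing direction
    g = (sumrate(theta + mu * u) - sumrate(theta - mu * u)) / (2 * mu) * u
    theta += lr * g                                    # zeroth-order stochastic gradient ascent
```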
We study building multi-task agents in open-world environments. Without human demonstrations, learning to accomplish long-horizon tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills. A novel Finding-skill that performs exploration to find diverse items provides better initialization for other skills, improving the sample efficiency for skill learning. In skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks. The project's website and code can be found at //sites.google.com/view/plan4mc.
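A minimal sketch of the skill-planning step: walk a prerequisite graph depth-first and emit prerequisite skills before the skills that depend on them. The hand-written `skill_graph` below is a toy example; in the paper the graph is built from relationships extracted with an LLM, and the search operates on the learned skills.

```python
skill_graph = {
    "craft_wooden_pickaxe": ["get_log", "craft_planks", "craft_sticks", "craft_table"],
    "craft_table": ["craft_planks"],
    "craft_sticks": ["craft_planks"],
    "craft_planks": ["get_log"],
    "get_log": [],                                     # leaf skill reached via the Finding-skill
}

def plan(skill, graph, done=None, order=None):
    """Depth-first walk that emits prerequisites before the skill itself."""
    done, order = (set() if done is None else done), ([] if order is None else order)
    if skill in done:
        return order
    for pre in graph.get(skill, []):
        plan(pre, graph, done, order)
    done.add(skill)
    order.append(skill)
    return order

print(plan("craft_wooden_pickaxe", skill_graph))
# ['get_log', 'craft_planks', 'craft_sticks', 'craft_table', 'craft_wooden_pickaxe']
```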
This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization for $Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the desired value by construction. We demonstrate that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails and compressed weight samples which are more robust to pruning.
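A minimal sketch of the Mean-KL idea for a single Gaussian weight: given the variational mean and a per-weight KL budget, recover the standard deviation that meets the budget by construction. The closed-form Gaussian KL is standard; the bisection solver and the restriction to $\sigma_q \le \sigma_p$ are my assumptions for illustration, not the paper's parameterization code.

```python
import math

def kl_gauss(mu_q, sig_q, mu_p=0.0, sig_p=1.0):
    """KL[N(mu_q, sig_q^2) || N(mu_p, sig_p^2)] in nats."""
    return math.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

def sigma_from_kl(mu_q, kl_target, sig_p=1.0, iters=100):
    """Solve kl_gauss(mu_q, sig) = kl_target for sig in (0, sig_p].

    Assumes kl_target >= (mu_q - mu_p)^2 / (2 * sig_p^2); on this interval the KL
    is monotonically decreasing in sig, so bisection applies.
    """
    lo, hi = 1e-8, sig_p
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_gauss(mu_q, mid, sig_p=sig_p) > kl_target:
            lo = mid                                   # sigma too small -> KL too large
        else:
            hi = mid
    return 0.5 * (lo + hi)

sig = sigma_from_kl(mu_q=0.3, kl_target=0.1)
print(sig, kl_gauss(0.3, sig))                         # KL matches the budget up to solver tolerance
```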
Dynamic Bayesian Networks (DBNs), renowned for their interpretability, have become increasingly vital in representing complex stochastic processes in various domains such as gene expression analysis, healthcare, and traffic prediction. Structure learning of DBNs from data is challenging, particularly for datasets with thousands of variables. Most current algorithms for DBN structure learning are adaptations from those used in static Bayesian Networks (BNs), and are typically focused on small-scale problems. In order to solve large-scale problems while taking full advantage of existing algorithms, this paper introduces a novel divide-and-conquer strategy, originally developed for static BNs, and adapts it for large-scale DBN structure learning. In this work, we specifically concentrate on 2 Time-sliced Bayesian Networks (2-TBNs), a special class of DBNs. Furthermore, we leverage the prior knowledge of 2-TBNs to enhance the performance of the strategy we introduce. Our approach significantly improves the scalability and accuracy of 2-TBN structure learning. Experimental results demonstrate the effectiveness of our method, showing substantial improvements over existing algorithms in both computational efficiency and structure learning accuracy. On problem instances with more than 1,000 variables, our approach improves two accuracy metrics by 74.45% and 110.94% on average, respectively, while reducing runtime by 93.65% on average.
Large language models (LLMs) have shown remarkable performance in natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucinations, yet they are currently evaluated only for coarse-grained object hallucinations (i.e., generated objects that do not exist in the input image). Fine-grained object attributes and behaviors absent from the image may still be generated but are not measured by current evaluation methods. In this paper, we thus focus on reducing fine-grained hallucinations of LVLMs. We propose \textit{ReCaption}, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named \textit{Fine-Grained Object Hallucination Evaluation} (\textit{FGHE}). Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality. The code can be found at //github.com/Anonymousanoy/FOHE.
This paper proposes a simple but highly efficient expansion-based model for continual learning. The recent feature transformation, masking and factorization-based methods are efficient, but they grow the model only over the global or shared parameters. Therefore, these approaches do not fully utilize the previously learned information, because the same task-specific parameters forget the earlier knowledge. Thus, these approaches show limited transfer learning ability. Moreover, most of these models have constant parameter growth for all tasks, irrespective of task complexity. Our work proposes a simple filter- and channel-expansion-based method that grows the model over the previous task parameters and not just over the global parameters. Therefore, it fully utilizes all the previously learned information without forgetting, which results in better knowledge transfer. The growth rate in our proposed model is a function of task complexity; therefore, for a simple task the model has smaller parameter growth, while for complex tasks the model requires more parameters to adapt to the current task. Recent expansion-based models show promising results for task incremental learning (TIL). However, for class incremental learning (CIL), prediction of the task id is a crucial challenge; hence, their results degrade rapidly as the number of tasks increases. In this work, we propose a robust task prediction method that leverages entropy-weighted data augmentations and the model's gradient using pseudo labels. We evaluate our model on various datasets and architectures in the TIL, CIL and generative continual learning settings. The proposed approach shows state-of-the-art results in all these settings. Our extensive ablation studies show the efficacy of the proposed components.
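A minimal sketch of entropy-based task-id prediction at inference time: score a test example under every task head, weight augmented views by how confident they are, and pick the head whose averaged prediction has the lowest entropy. The heads and the augmentation are placeholders, and the paper's gradient/pseudo-label component is omitted here.

```python
import torch, torch.nn as nn

num_tasks, num_classes, feat_dim = 3, 10, 64
heads = [nn.Linear(feat_dim, num_classes) for _ in range(num_tasks)]  # one head per task

def augment(x, k=8):
    return x + 0.05 * torch.randn(k, *x.shape)         # toy augmentations of one feature vector

def predict_task(x):
    views = augment(x)
    entropies = []
    for head in heads:
        p = torch.softmax(head(views), dim=-1)         # predictions over augmented views
        h = -(p * p.clamp_min(1e-8).log()).sum(-1)     # per-view entropy
        w = torch.softmax(-h, dim=0)                   # entropy-based weights: confident views count more
        p_bar = (w.unsqueeze(-1) * p).sum(0)           # weighted average prediction
        entropies.append(-(p_bar * p_bar.clamp_min(1e-8).log()).sum())
    return int(torch.stack(entropies).argmin())        # most confident head -> predicted task id

task_id = predict_task(torch.randn(feat_dim))
```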
Existing knowledge graph (KG) embedding models have primarily focused on static KGs. However, real-world KGs do not remain static, but rather evolve and grow in tandem with the development of KG applications. Consequently, new facts and previously unseen entities and relations continually emerge, necessitating an embedding model that can quickly learn and transfer new knowledge through growth. Motivated by this, we delve into an expanding field of KG embedding in this paper, i.e., lifelong KG embedding. We consider knowledge transfer and retention when learning on growing snapshots of a KG, without having to learn embeddings from scratch. The proposed model includes a masked KG autoencoder for embedding learning and update, an embedding transfer strategy to inject the learned knowledge into the new entity and relation embeddings, and an embedding regularization method to avoid catastrophic forgetting. To investigate the impacts of different aspects of KG growth, we construct four datasets to evaluate the performance of lifelong KG embedding. Experimental results show that the proposed model outperforms the state-of-the-art inductive and lifelong embedding baselines.
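A minimal sketch of the embedding-regularization idea: while training on a new KG snapshot with a grown embedding table, penalize drift of the previously learned entity embeddings toward their old values. The plain L2 anchor and the TransE-style scoring function below are illustrative placeholders, not the paper's model or regularizer.

```python
import torch, torch.nn as nn

dim, n_old, n_new = 100, 1000, 200
emb = nn.Embedding(n_old + n_new, dim)                 # grown embedding table for the new snapshot
old_emb = emb.weight[:n_old].detach().clone()          # snapshot of previously learned embeddings

def score(h, r, t):
    return -(emb(h) + r - emb(t)).norm(dim=-1)         # toy TransE-style score; r is a relation vector

opt = torch.optim.Adam(emb.parameters(), lr=1e-3)
for _ in range(10):                                    # toy training loop on new facts
    h = torch.randint(0, n_old + n_new, (32,))
    t = torch.randint(0, n_old + n_new, (32,))
    r = torch.randn(32, dim)
    task_loss = -score(h, r, t).mean()
    reg = (emb.weight[:n_old] - old_emb).pow(2).mean() # keep old knowledge close to its previous values
    loss = task_loss + 0.1 * reg
    opt.zero_grad(); loss.backward(); opt.step()
```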
Multi-relation Question Answering is a challenging task, due to the need for elaborate analysis of questions and reasoning over multiple fact triples in a knowledge base. In this paper, we present a novel model called the Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis, thereby allowing manual manipulation in predicting the final answer.
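A minimal sketch of a hop-by-hop reasoning loop in the spirit described above: at each hop, analyze part of the question, predict a relation, and update both the question representation and the reasoning state. All modules below are placeholders chosen for illustration, not the model's actual architecture.

```python
import torch, torch.nn as nn

d, n_relations, n_hops = 128, 50, 3
analyze = nn.Linear(2 * d, d)                          # decides what to analyze given question + state
rel_clf = nn.Linear(d, n_relations)                    # predicts a relation for the current hop
update_q = nn.GRUCell(d, d)                            # updates the question representation
update_s = nn.GRUCell(n_relations, d)                  # updates the reasoning state
rel_emb = nn.Embedding(n_relations, n_relations)       # placeholder relation features

q = torch.randn(1, d)                                  # question representation
s = torch.zeros(1, d)                                  # reasoning state
hops = []
for _ in range(n_hops):
    focus = torch.tanh(analyze(torch.cat([q, s], dim=-1)))
    rel = rel_clf(focus).argmax(dim=-1)                # traceable intermediate prediction
    hops.append(int(rel))
    q = update_q(focus, q)                             # "consume" the analyzed part of the question
    s = update_s(rel_emb(rel), s)                      # advance the reasoning state
print(hops)                                            # relation id predicted at each hop
```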
We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, using a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference time of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference time of 13.0 milliseconds per image.
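A minimal sketch of a global-activation-style block: globally pool a detection feature map, learn per-channel activations, and reweight the channels. This squeeze-and-excitation-style recalibration is an illustrative stand-in for the channel/class relationship learning described above, not the exact module from the paper.

```python
import torch, torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (N, C, H, W) detection feature map
        w = self.fc(x.mean(dim=(2, 3)))                # global average pool -> per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)       # recalibrate the channels

feat = torch.randn(2, 512, 38, 38)                     # e.g. a VGG16 conv feature map
out = GlobalActivation(512)(feat)
```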