Quality-Diversity (QD) algorithms have recently gained traction as optimisation methods due to their effectiveness at escaping local optima and capability of generating wide-ranging and high-performing solutions. Recently, Multi-Objective MAP-Elites (MOME) extended the QD paradigm to the multi-objective setting by maintaining a Pareto front in each cell of a map-elites grid. MOME achieved a global performance that competed with NSGA-II and SPEA2, two well-established Multi-Objective Evolutionary Algorithms (MOEA), while also acquiring a diverse repertoire of solutions. However, MOME is limited by non-directed genetic search mechanisms which struggle in high-dimensional search spaces. In this work, we present Multi-Objective MAP-Elites with Policy-Gradient Assistance and Crowding-based Exploration (MOME-PGX): a new QD algorithm that extends MOME to improve its data efficiency and performance. MOME-PGX uses gradient-based optimisation to efficiently drive solutions towards higher performance. It also introduces crowding-based mechanisms to create an improved exploration strategy and to encourage uniformity across Pareto fronts. We evaluate MOME-PGX in four simulated robot locomotion tasks and demonstrate that it converges faster and to a higher performance than all other baselines. We show that MOME-PGX is between 4.3 and 42 times more data-efficient than MOME and doubles the performance of MOME, NSGA-II and SPEA2 in challenging environments.
The Graph Exploration problem asks a searcher to explore an unknown environment. The environment is modeled as a graph, where the searcher needs to visit each vertex beginning at some vertex $s$. Furthermore, Treasure Hunt problems are a variation of Graph Exploration, in which the searcher needs to find a hidden treasure, which is located at some vertex $t$. In these online problems, any online algorithm performs poorly because it has too little knowledge about the instance to react adequately to the requests of the adversary. Thus, the impact of a priori knowledge is of interest. In graph problems, one form of a priori knowledge is a map of the graph. We survey the graph exploration and treasure hunt problem with an unlabeled map, which is an isomorphic copy of the graph, that is provided to the searcher. We formulate decision variants of both problems by interpreting the online problems as a game between the online algorithm (the searcher) and the adversary. The map, however, is not controllable by the adversary. The question is, whether the searcher is able to explore the graph fully or find the treasure for all possible decisions of the adversary. We prove the PSPACE-completeness of these games, whereby we analyze the variations which ask for the mere existence of a tour through the graph or path to the treasure and the variations that include costs. Additionally, we analyze the complexity of related problems that relax the path constraint, allowing multiple visits of vertices or edges, or have additional constraints, like requiring to visit specific edges.
The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state which has high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we can achieve super-human performance on the notoriously hard Atari game, Montezuma's Revenge, with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
Annotation quality and quantity positively affect the learning performance of sequence labeling, a vital task in Natural Language Processing. Hiring domain experts to annotate a corpus is very costly in terms of money and time. Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been deployed to assist in this purpose. However, the annotations collected this way are prone to human errors due to the lack of expertise of the crowd workers. Existing literature in annotation aggregation assumes that annotations are independent and thus faces challenges when handling the sequential label aggregation tasks with complex dependencies. To conquer the challenges, we propose an optimization-based method that infers the ground truth labels using annotations provided by workers for sequential labeling tasks. The proposed Aggregation method for Sequential Labels from Crowds ($AggSLC$) jointly considers the characteristics of sequential labeling tasks, workers' reliabilities, and advanced machine learning techniques. Theoretical analysis on the algorithm's convergence further demonstrates that the proposed $AggSLC$ halts after a finite number of iterations. We evaluate $AggSLC$ on different crowdsourced datasets for Named Entity Recognition (NER) tasks and Information Extraction tasks in biomedical (PICO), as well as a simulated dataset. Our results show that the proposed method outperforms the state-of-the-art aggregation methods. To achieve insights into the framework, we study the effectiveness of $AggSLC$'s components through ablation studies.
In this paper, we address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples. Unlike previous methods that only consider ID samples to be useful and aim to filter out OOD ones completely during training, we argue that the exploration and exploitation of both ID and OOD samples can benefit SSL. To support our claim, i) we propose a prototype-based clustering and identification algorithm that explores the inherent similarity and difference among samples at feature level and effectively cluster them around several predefined ID and OOD prototypes, thereby enhancing feature learning and facilitating ID/OOD identification; ii) we propose an importance-based sampling method that exploits the difference in importance of each ID and OOD sample to SSL, thereby reducing the sampling bias and improving the training. Our proposed method achieves state-of-the-art in several challenging benchmarks, and improves upon existing SSL methods even when ID samples are totally absent in unlabeled data.
Goal-conditioned hierarchical reinforcement learning (GCHRL) decomposes long-horizon tasks into sub-tasks through a hierarchical framework and it has demonstrated promising results across a variety of domains. However, the high-level policy's action space is often excessively large, presenting a significant challenge to effective exploration and resulting in potentially inefficient training. Moreover, the dynamic variability of the low-level policy introduces non-stationarity to the high-level state transition function, significantly impeding the learning of the high-level policy. In this paper, we design a measure of prospect for subgoals by planning in the goal space based on the goal-conditioned value function. Building upon the measure of prospect, we propose a landmark-guided exploration strategy by integrating the measures of prospect and novelty which aims to guide the agent to explore efficiently and improve sample efficiency. To address the non-stationarity arising from the dynamic changes of the low-level policy, we apply a state-specific regularization to the learning of low-level policy, which facilitates stable learning of the hierarchical policy. The experimental results demonstrate that our proposed exploration strategy significantly outperforms the baseline methods across multiple tasks.
In Japan, the Housing and Land Survey (HLS) provides municipality-level grouped data on household incomes. Although these data can be used for effective local policymaking, their analyses are hindered by several challenges, such as limited information attributed to grouping, the presence of non-sampled areas, and the very low frequency of implementing surveys. To address these challenges, we propose a novel grouped-data-based spatio-temporal finite mixture model to model the income distributions of multiple spatial units at multiple time points. A unique feature of the proposed method is that all the areas share common latent distributions and that the mixing proportions that include the spatial and temporal effects capture the potential area-wise heterogeneity. Thus, incorporating these effects can smooth out the quantities of interest over time and space, impute missing values, and predict future values. By treating the HLS data with the proposed method, we obtain complete maps of the income and poverty measures at an arbitrary time point, which can be used to facilitate rapid and efficient policymaking with fine granularity.
The use of smart devices (e.g., smartphones, smartwatches) and other wearables to deliver digital interventions to improve health outcomes has grown significantly in the past few years. Mobile health (mHealth) systems are excellent tools for the delivery of adaptive interventions that aim to provide the right type and amount of support, at the right time, by adapting to an individual's changing context. Micro-randomized trials (MRTs) are an increasingly common experimental design that is the main source for data-driven evidence of mHealth intervention effectiveness. To assess time-varying causal effect moderation in an MRT, individuals are intensively randomized to receive treatment over time. In addition, measurements, including individual characteristics, and context are also collected throughout the study. The effective utilization of covariate information to improve inferences regarding causal effects has been well-established in the context of randomized control trials (RCTs), where covariate adjustment is applied to leverage baseline data to address chance imbalances and improve the asymptotic efficiency of causal effect estimation. However, the application of this approach to longitudinal data, such as MRTs, has not been thoroughly explored. Recognizing the connection to Neyman Orthogonality, we propose a straightforward and intuitive method to improve the efficiency of moderated causal excursion effects by incorporating auxiliary variables. We compare the robust standard errors of our method with those of the benchmark method. The efficiency gain of our approach is demonstrated through simulation studies and an analysis of data from the Intern Health Study (NeCamp et al., 2020).
Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach the fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from an observation that energy-based models exhibit free energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically reveal that a simple yet efficient energy-based sampling strategy sheds light on selecting the most valuable target samples than existing approaches requiring particular architectures or computation of the distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of targe data that incorporate both domain characteristic and instance uncertainty into every selection round. Meanwhile, by aligning the free energy of target data compact around the source domain via a regularization term, domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at //github.com/BIT-DA/EADA.
Data in Knowledge Graphs often represents part of the current state of the real world. Thus, to stay up-to-date the graph data needs to be updated frequently. To utilize information from Knowledge Graphs, many state-of-the-art machine learning approaches use embedding techniques. These techniques typically compute an embedding, i.e., vector representations of the nodes as input for the main machine learning algorithm. If a graph update occurs later on -- specifically when nodes are added or removed -- the training has to be done all over again. This is undesirable, because of the time it takes and also because downstream models which were trained with these embeddings have to be retrained if they change significantly. In this paper, we investigate embedding updates that do not require full retraining and evaluate them in combination with various embedding models on real dynamic Knowledge Graphs covering multiple use cases. We study approaches that place newly appearing nodes optimally according to local information, but notice that this does not work well. However, we find that if we continue the training of the old embedding, interleaved with epochs during which we only optimize for the added and removed parts, we obtain good results in terms of typical metrics used in link prediction. This performance is obtained much faster than with a complete retraining and hence makes it possible to maintain embeddings for dynamic Knowledge Graphs.
Bid optimization for online advertising from single advertiser's perspective has been thoroughly investigated in both academic research and industrial practice. However, existing work typically assume competitors do not change their bids, i.e., the wining price is fixed, leading to poor performance of the derived solution. Although a few studies use multi-agent reinforcement learning to set up a cooperative game, they still suffer the following drawbacks: (1) They fail to avoid collusion solutions where all the advertisers involved in an auction collude to bid an extremely low price on purpose. (2) Previous works cannot well handle the underlying complex bidding environment, leading to poor model convergence. This problem could be amplified when handling multiple objectives of advertisers which are practical demands but not considered by previous work. In this paper, we propose a novel multi-objective cooperative bid optimization formulation called Multi-Agent Cooperative bidding Games (MACG). MACG sets up a carefully designed multi-objective optimization framework where different objectives of advertisers are incorporated. A global objective to maximize the overall profit of all advertisements is added in order to encourage better cooperation and also to protect self-bidding advertisers. To avoid collusion, we also introduce an extra platform revenue constraint. We analyze the optimal functional form of the bidding formula theoretically and design a policy network accordingly to generate auction-level bids. Then we design an efficient multi-agent evolutionary strategy for model optimization. Offline experiments and online A/B tests conducted on the Taobao platform indicate both single advertiser's objective and global profit have been significantly improved compared to state-of-art methods.