Designing online algorithms with machine learning predictions is a recent technique beyond the worst-case paradigm for various practically relevant online problems (scheduling, caching, clustering, ski rental, etc.). While most previous learning-augmented algorithm approaches focus on integrating the predictions of a single oracle, we study the design of online algorithms with \emph{multiple} experts. To go beyond the popular benchmark of a static best expert in hindsight, we propose a new \emph{dynamic} benchmark (linear combinations of predictions that change over time). We present a competitive algorithm in the new dynamic benchmark with a performance guarantee of $O(\log K)$, where $K$ is the number of experts, for $0-1$ online optimization problems. Furthermore, our multiple-expert approach provides a new perspective on how to combine in an online manner several online algorithms - a long-standing central subject in the online algorithm research community.
Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases, it is desirable for LLMs to be controllable \textit{at inference time}, so that they can be used in multiple contexts with diverse needs. We illustrate this with the \textbf{Pink Elephant Problem}: instructing an LLM to avoid discussing a certain entity (a ``Pink Elephant''), and instead discuss a preferred entity (``Grey Elephant''). We apply a novel simplification of Constitutional AI, \textbf{Direct Principle Feedback}, which skips the ranking of responses and uses DPO directly on critiques and revisions. Our results show that after DPF fine-tuning on our synthetic Pink Elephants dataset, our 13B fine-tuned LLaMA 2 model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and performs as well as GPT-4 in on our curated test set assessing the Pink Elephant Problem.
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data that involves multiple sub-components in a flexible and interpretable fashion. Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications. The key idea is to dynamically distance data samples in the latent space and thus enhance the output diversity. Our dynamic latent separation method, inspired by atomic physics, relies on the jointly learned structures of each data sample, which also reveal the importance of each sub-component for distinguishing data samples. This approach, atom modeling, requires no supervision of the latent space and allows us to learn extra partially interpretable representations besides the original goal of a model. We empirically demonstrate that the algorithm also enhances the performance of small to larger-scale models in various classification and generation problems.
We develop a general theory to optimize the frequentist regret for sequential learning problems, where efficient bandit and reinforcement learning algorithms can be derived from unified Bayesian principles. We propose a novel optimization approach to generate "algorithmic beliefs" at each round, and use Bayesian posteriors to make decisions. The optimization objective to create "algorithmic beliefs," which we term "Algorithmic Information Ratio," represents an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematical approach to make Bayesian-type algorithms prior-free and applicable to adversarial settings, in a generic and optimal manner. Moreover, the algorithms are simple and often efficient to implement. As a major application, we present a novel algorithm for multi-armed bandits that achieves the "best-of-all-worlds" empirical performance in the stochastic, adversarial, and non-stationary environments. And we illustrate how these principles can be used in linear bandits, bandit convex optimization, and reinforcement learning.
We introduce model editing with canonical examples, a setting in which (1) a single learning example is provided per desired behavior, (2) evaluation is performed exclusively out-of-distribution, and (3) deviation from an initial model is strictly limited. A canonical example is a simple instance of good behavior, e.g., The capital of Mauritius is Port Louis) or bad behavior, e.g., An aspect of researchers is coldhearted). The evaluation set contains more complex examples of each behavior (like a paragraph in which the capital of Mauritius is called for.) We create three datasets and modify three more for model editing with canonical examples, covering knowledge-intensive improvements, social bias mitigation, and syntactic edge cases. In our experiments on Pythia language models, we find that LoRA outperforms full finetuning and MEMIT. We then turn to the Backpack language model architecture because it is intended to enable targeted improvement. The Backpack defines a large bank of sense vectors--a decomposition of the different uses of each word--which are weighted and summed to form the output logits of the model. We propose sense finetuning, which selects and finetunes a few ($\approx$ 10) sense vectors for each canonical example, and find that it outperforms other finetuning methods, e.g., 4.8% improvement vs 0.3%. Finally, we improve GPT-J-6B by an inference-time ensemble with just the changes from sense finetuning of a 35x smaller Backpack, in one setting outperforming editing GPT-J itself (4.1% vs 1.0%).
The adaptive processing of structured data is a long-standing research topic in machine learning that investigates how to automatically learn a mapping from a structured input to outputs of various nature. Recently, there has been an increasing interest in the adaptive processing of graphs, which led to the development of different neural network-based methodologies. In this thesis, we take a different route and develop a Bayesian Deep Learning framework for graph learning. The dissertation begins with a review of the principles over which most of the methods in the field are built, followed by a study on graph classification reproducibility issues. We then proceed to bridge the basic ideas of deep learning for graphs with the Bayesian world, by building our deep architectures in an incremental fashion. This framework allows us to consider graphs with discrete and continuous edge features, producing unsupervised embeddings rich enough to reach the state of the art on several classification tasks. Our approach is also amenable to a Bayesian nonparametric extension that automatizes the choice of almost all model's hyper-parameters. Two real-world applications demonstrate the efficacy of deep learning for graphs. The first concerns the prediction of information-theoretic quantities for molecular simulations with supervised neural models. After that, we exploit our Bayesian models to solve a malware-classification task while being robust to intra-procedural code obfuscation techniques. We conclude the dissertation with an attempt to blend the best of the neural and Bayesian worlds together. The resulting hybrid model is able to predict multimodal distributions conditioned on input graphs, with the consequent ability to model stochasticity and uncertainty better than most works. Overall, we aim to provide a Bayesian perspective into the articulated research field of deep learning for graphs.
Triple extraction is an essential task in information extraction for natural language processing and knowledge graph construction. In this paper, we revisit the end-to-end triple extraction task for sequence generation. Since generative triple extraction may struggle to capture long-term dependencies and generate unfaithful triples, we introduce a novel model, contrastive triple extraction with a generative transformer. Specifically, we introduce a single shared transformer module for encoder-decoder-based generation. To generate faithful results, we propose a novel triplet contrastive training object. Moreover, we introduce two mechanisms to further improve model performance (i.e., batch-wise dynamic attention-masking and triple-wise calibration). Experimental results on three datasets (i.e., NYT, WebNLG, and MIE) show that our approach achieves better performance than that of baselines.
Benefit from the quick development of deep learning techniques, salient object detection has achieved remarkable progresses recently. However, there still exists following two major challenges that hinder its application in embedded devices, low resolution output and heavy model weight. To this end, this paper presents an accurate yet compact deep network for efficient salient object detection. More specifically, given a coarse saliency prediction in the deepest layer, we first employ residual learning to learn side-output residual features for saliency refinement, which can be achieved with very limited convolutional parameters while keep accuracy. Secondly, we further propose reverse attention to guide such side-output residual learning in a top-down manner. By erasing the current predicted salient regions from side-output features, the network can eventually explore the missing object parts and details which results in high resolution and accuracy. Experiments on six benchmark datasets demonstrate that the proposed approach compares favorably against state-of-the-art methods, and with advantages in terms of simplicity, efficiency (45 FPS) and model size (81 MB).
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
We propose a new method for event extraction (EE) task based on an imitation learning framework, specifically, inverse reinforcement learning (IRL) via generative adversarial network (GAN). The GAN estimates proper rewards according to the difference between the actions committed by the expert (or ground truth) and the agent among complicated states in the environment. EE task benefits from these dynamic rewards because instances and labels yield to various extents of difficulty and the gains are expected to be diverse -- e.g., an ambiguous but correctly detected trigger or argument should receive high gains -- while the traditional RL models usually neglect such differences and pay equal attention on all instances. Moreover, our experiments also demonstrate that the proposed framework outperforms state-of-the-art methods, without explicit feature engineering.