We study reinforcement learning with multinomial logistic (MNL) function approximation where the underlying transition probability kernel of the Markov decision processes (MDPs) is parametrized by an unknown transition core with features of state and action. For the finite horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. For our first algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency and establish that $\texttt{RRL-MNL}$ is both statistically and computationally efficient, achieving a $\tilde{O}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its regret bound scales with $\kappa^{-1}$, which is potentially large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose $\texttt{ORRL-MNL}$, which estimates the value function using local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve both computational and statistical efficiency. Numerical experiments demonstrate the superior performance of the proposed algorithms.
In this study, we propose a new approach to compute the majority vote (MV) function based on modulation on conjugate-reciprocal zeros (MOCZ) and introduce three different methods. In these methods, each transmitter maps the votes to the zeros of a Huffman polynomial, and the corresponding polynomial coefficients are transmitted. The receiver evaluates the polynomial constructed by the elements of the superposed sequence at conjugate-reciprocal zero pairs and detects the MV with a direct zero-testing (DiZeT) decoder. With differential and index-based encoders, we eliminate the need for power-delay information at the receiver while improving the computation error rate (CER) performance. The proposed methods do not use instantaneous channel state information at the transmitters and receiver. Thus, they provide robustness against phase and time synchronization errors. We theoretically analyze the CERs of the proposed methods. Finally, we demonstrate their efficacy in a distributed median computation scenario.
We provide an algorithm for the simultaneous system identification and model predictive control of nonlinear systems. The algorithm has finite-time near-optimality guarantees and asymptotically converges to the optimal (non-causal) controller. Particularly, the algorithm enjoys sublinear dynamic regret, defined herein as the suboptimality against an optimal clairvoyant controller that knows how the unknown disturbances and system dynamics will adapt to its actions. The algorithm is self-supervised and applies to control-affine systems with unknown dynamics and disturbances that can be expressed in reproducing kernel Hilbert spaces. Such spaces can model external disturbances and modeling errors that can even be adaptive to the system's state and control input. For example, they can model wind and wave disturbances to aerial and marine vehicles, or inaccurate model parameters such as inertia of mechanical systems. The algorithm first generates random Fourier features that are used to approximate the unknown dynamics or disturbances. Then, it employs model predictive control based on the current learned model of the unknown dynamics (or disturbances). The model of the unknown dynamics is updated online using least squares based on the data collected while controlling the system. We validate our algorithm in both hardware experiments and physics-based simulations. The simulations include (i) a cart-pole aiming to maintain the pole upright despite inaccurate model parameters, and (ii) a quadrotor aiming to track reference trajectories despite unmodeled aerodynamic drag effects. The hardware experiments include a quadrotor aiming to track a circular trajectory despite unmodeled aerodynamic drag effects, ground effects, and wind disturbances.
Despite recent advances in improving the sample-efficiency of reinforcement learning (RL) algorithms, designing an RL algorithm that can be practically deployed in real-world environments remains a challenge. In this paper, we present Coarse-to-fine Reinforcement Learning (CRL), a framework that trains RL agents to zoom-into a continuous action space in a coarse-to-fine manner, enabling the use of stable, sample-efficient value-based RL algorithms for fine-grained continuous control tasks. Our key idea is to train agents that output actions by iterating the procedure of (i) discretizing the continuous action space into multiple intervals and (ii) selecting the interval with the highest Q-value to further discretize at the next level. We then introduce a concrete, value-based algorithm within the CRL framework called Coarse-to-fine Q-Network (CQN). Our experiments demonstrate that CQN significantly outperforms RL and behavior cloning baselines on 20 sparsely-rewarded RLBench manipulation tasks with a modest number of environment interactions and expert demonstrations. We also show that CQN robustly learns to solve real-world manipulation tasks within a few minutes of online training.
We study nested conditions, a generalization of first-order logic to a categorical setting, and provide a tableau-based (semi-decision) procedure for checking (un)satisfiability and finite model generation. This generalizes earlier results on graph conditions. Furthermore we introduce a notion of witnesses, allowing the detection of infinite models in some cases. To ensure completeness, paths in a tableau must be fair, where fairness requires that all parts of a condition are processed eventually. Since the correctness arguments are non-trivial, we rely on coinductive proof methods and up-to techniques that structure the arguments. We distinguish between two types of categories: categories where all sections are isomorphisms, allowing for a simpler tableau calculus that includes finite model generation; in categories where this requirement does not hold, model generation does not work, but we still obtain a sound and complete calculus.
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, lack of training, etc, human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach -- $R^3M$, which models the potentially corrupted preference label as sparse outliers. Accordingly, we formulate the robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which only incurs negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^3M$ can consistently learn the underlying reward and identify outliers, provided that the number of outlier labels scales sublinearly with the preference sample size. Furthermore, we remark that $R^3M$ is versatile and can be extended to various preference optimization methods, including direct preference optimization (DPO). Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves robustness of the reward against several types of perturbations to the preference data.
We propose CoNSAL (Combining Neural networks and Symbolic regression for Analytical Lyapunov function) to construct analytical Lyapunov functions for nonlinear dynamic systems. This framework contains a neural Lyapunov function and a symbolic regression component, where symbolic regression is applied to distill the neural network to precise analytical forms. Our approach utilizes symbolic regression not only as a tool for translation but also as a means to uncover counterexamples. This procedure terminates when no counterexamples are found in the analytical formulation. Compared with previous results, our algorithm directly produces an analytical form of the Lyapunov function with improved interpretability in both the learning process and the final results. We apply our algorithm to 2-D inverted pendulum, path following, Van Der Pol Oscillator, 3-D trig dynamics, 4-D rotating wheel pendulum, 6-D 3-bus power system, and demonstrate that our algorithm successfully finds their valid Lyapunov functions.
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
Graph Neural Networks (GNNs) have received considerable attention on graph-structured data learning for a wide variety of tasks. The well-designed propagation mechanism which has been demonstrated effective is the most fundamental part of GNNs. Although most of GNNs basically follow a message passing manner, litter effort has been made to discover and analyze their essential relations. In this paper, we establish a surprising connection between different propagation mechanisms with a unified optimization problem, showing that despite the proliferation of various GNNs, in fact, their proposed propagation mechanisms are the optimal solution optimizing a feature fitting function over a wide class of graph kernels with a graph regularization term. Our proposed unified optimization framework, summarizing the commonalities between several of the most representative GNNs, not only provides a macroscopic view on surveying the relations between different GNNs, but also further opens up new opportunities for flexibly designing new GNNs. With the proposed framework, we discover that existing works usually utilize naive graph convolutional kernels for feature fitting function, and we further develop two novel objective functions considering adjustable graph kernels showing low-pass or high-pass filtering capabilities respectively. Moreover, we provide the convergence proofs and expressive power comparisons for the proposed models. Extensive experiments on benchmark datasets clearly show that the proposed GNNs not only outperform the state-of-the-art methods but also have good ability to alleviate over-smoothing, and further verify the feasibility for designing GNNs with our unified optimization framework.
We propose a new method for event extraction (EE) task based on an imitation learning framework, specifically, inverse reinforcement learning (IRL) via generative adversarial network (GAN). The GAN estimates proper rewards according to the difference between the actions committed by the expert (or ground truth) and the agent among complicated states in the environment. EE task benefits from these dynamic rewards because instances and labels yield to various extents of difficulty and the gains are expected to be diverse -- e.g., an ambiguous but correctly detected trigger or argument should receive high gains -- while the traditional RL models usually neglect such differences and pay equal attention on all instances. Moreover, our experiments also demonstrate that the proposed framework outperforms state-of-the-art methods, without explicit feature engineering.
Recently, deep learning has achieved very promising results in visual object tracking. Deep neural networks in existing tracking methods require a lot of training data to learn a large number of parameters. However, training data is not sufficient for visual object tracking as annotations of a target object are only available in the first frame of a test sequence. In this paper, we propose to learn hierarchical features for visual object tracking by using tree structure based Recursive Neural Networks (RNN), which have fewer parameters than other deep neural networks, e.g. Convolutional Neural Networks (CNN). First, we learn RNN parameters to discriminate between the target object and background in the first frame of a test sequence. Tree structure over local patches of an exemplar region is randomly generated by using a bottom-up greedy search strategy. Given the learned RNN parameters, we create two dictionaries regarding target regions and corresponding local patches based on the learned hierarchical features from both top and leaf nodes of multiple random trees. In each of the subsequent frames, we conduct sparse dictionary coding on all candidates to select the best candidate as the new target location. In addition, we online update two dictionaries to handle appearance changes of target objects. Experimental results demonstrate that our feature learning algorithm can significantly improve tracking performance on benchmark datasets.