The Reward-Penalty-Selection Problem (RPSP) can be seen as a combination of the Set Cover Problem (SCP) and the Hitting Set Problem (HSP). Given a set of elements, a set of reward sets, and a set of penalty sets, one tries to find a subset of elements such that as many reward sets as possible are covered, i.e. all elements are contained in the subset, and at the same time as few penalty sets as possible are hit, i.e. the intersection of the subset with the penalty set is non-empty. In this paper we define a cooperative game based on the RPSP where the elements of the RPSP are the players. We prove structural results and show that RPS games are convex, superadditive and totally balanced. Furthermore, the Shapley value can be computed in polynomial time. In addition to that, we provide a characterization of the core elements as a feasible flow in a network graph depending on the instance of the underlying RPSP. By using this characterization, a core element can be computed efficiently.
The instruction learning paradigm -- where a model learns to perform new tasks from task descriptions alone -- has become popular in general-purpose model research. The capabilities of large transformer models as instruction learners, however, remain poorly understood. We use a controlled synthetic environment to characterize such capabilities. Specifically, we use the task of deciding whether a given string matches a regular expression (viewed as an instruction) to identify properties of tasks, instructions, and instances that make instruction learning challenging. For instance, we find that our model, a fine-tuned T5-based text2text transformer, struggles with large regular languages, suggesting that less precise instructions are challenging for models. Additionally, instruction executions that require tracking longer contexts of prior steps are also more difficult. We use our findings to systematically construct a challenging instruction learning dataset, which we call Hard RegSet. Fine-tuning on Hard RegSet, our large transformer learns to correctly interpret only 65.6% of test instructions (with at least 90% accuracy), and 11%-24% of the instructions in out-of-distribution generalization settings. We propose Hard RegSet as a challenging instruction learning task, and a controlled environment for studying instruction learning.
Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.
In this paper we generalize Dillon's switching method to characterize the exact $c$-differential uniformity of functions constructed via this method. More precisely, we modify some PcN/APcN and other functions with known $c$-differential uniformity in a controllable number of coordinates to render more such functions. We present several applications of the method in constructing PcN and APcN functions with respect to all $c\neq 1$. As a byproduct, we generalize some result of [Y. Wu, N. Li, X. Zeng, {\em New PcN and APcN functions over finite fields}, Designs Codes Crypt. 89 (2021), 2637--2651]. Computational results rendering functions with low differential uniformity, as well as, other good cryptographic properties are sprinkled throughout the paper.
We present a data-efficient framework for solving sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models. The framework, called GenRL, trains deep policies by introducing an action latent variable such that the feed-forward policy search can be divided into two parts: (i) training a sub-policy that outputs a distribution over the action latent variable given a state of the system, and (ii) unsupervised training of a generative model that outputs a sequence of motor actions conditioned on the latent action variable. GenRL enables safe exploration and alleviates the data-inefficiency problem as it exploits prior knowledge about valid sequences of motor actions. Moreover, we provide a set of measures for evaluation of generative models such that we are able to predict the performance of the RL policy training prior to the actual training on a physical robot. We experimentally determine the characteristics of generative models that have most influence on the performance of the final policy training on two robotics tasks: shooting a hockey puck and throwing a basketball. Furthermore, we empirically demonstrate that GenRL is the only method which can safely and efficiently solve the robotics tasks compared to two state-of-the-art RL methods.
Conventionally, generation of natural language for dialogue agents may be viewed as a statistical learning problem: determine the patterns in human-provided data and generate appropriate responses with similar statistical properties. However, dialogue can also be regarded as a goal directed process, where speakers attempt to accomplish a specific task. Reinforcement learning (RL) algorithms are designed specifically for solving such goal-directed problems, but the most direct way to apply RL -- through trial-and-error learning in human conversations, -- is costly. In this paper, we study how offline reinforcement learning can instead be used to train dialogue agents entirely using static datasets collected from human speakers. Our experiments show that recently developed offline RL methods can be combined with language models to yield realistic dialogue agents that better accomplish task goals.
Given an increasing graph property $\cal F$, the strong Avoider-Avoider $\cal F$ game is played on the edge set of a complete graph. Two players, Red and Blue, take turns in claiming previously unclaimed edges with Red going first, and the player whose graph possesses $\cal F$ first loses the game. If the property $\cal F$ is "containing a fixed graph $H$", we refer to the game as the $H$ game. We prove that Blue has a winning strategy in two strong Avoider-Avoider games, $P_4$ game and ${\cal CC}_{>3}$ game, where ${\cal CC}_{>3}$ is the property of having at least one connected component on more than three vertices. We also study a variant, the strong CAvoider-CAvoider games, with additional requirement that the graph of each of the players must stay connected throughout the game. We prove that Blue has a winning strategy in the strong CAvoider-CAvoider games $S_3$ and $P_4$, as well as in the $Cycle$ game, where the players aim at avoiding all cycles.
The minimum energy path (MEP) describes the mechanism of reaction, and the energy barrier along the path can be used to calculate the reaction rate in thermal systems. The nudged elastic band (NEB) method is one of the most commonly used schemes to compute MEPs numerically. It approximates an MEP by a discrete set of configuration images, where the discretization size determines both computational cost and accuracy of the simulations. In this paper, we consider a discrete MEP to be a stationary state of the NEB method and prove an optimal convergence rate of the discrete MEP with respect to the number of images. Numerical simulations for the transitions of some several proto-typical model systems are performed to support the theory.
White noise is a fundamental and fairly well understood stochastic process that conforms the conceptual basis for many other processes, as well as for the modeling of time series. Here we push a fresh perspective toward white noise that, grounded on combinatorial considerations, contributes to give new interesting insights both for modelling and theoretical purposes. To this aim, we incorporate the ordinal pattern analysis approach which allows us to abstract a time series as a sequence of patterns and their associated permutations, and introduce a simple functional over permutations that partitions them into classes encoding their level of asymmetry. We compute the exact probability mass function (p.m.f.) of this functional over the symmetric group of degree $n$, thus providing the description for the case of an infinite white noise realization. This p.m.f. can be conveniently approximated by a continuous probability density from an exponential family, the Gaussian, hence providing natural sufficient statistics that render a convenient and simple statistical analysis through ordinal patterns. Such analysis is exemplified on experimental data for the spatial increments from tracks of gold nanoparticles in 3D diffusion.
This manuscript portrays optimization as a process. In many practical applications the environment is so complex that it is infeasible to lay out a comprehensive theoretical model and use classical algorithmic theory and mathematical optimization. It is necessary as well as beneficial to take a robust approach, by applying an optimization method that learns as one goes along, learning from experience as more aspects of the problem are observed. This view of optimization as a process has become prominent in varied fields and has led to some spectacular success in modeling and systems that are now part of our daily lives.