In this paper, we consider a two-user multiple access channel with noisy feedback. There are two senders with independent messages who transmit symbols across an additive white Gaussian channel to a receiver, who in turn sends back a symbol that is received by the two senders through two independent noisy Gaussian channels. We consider the case when the feedback is active, i.e., the receiver actively encodes the feedback using a linear state process. We pose this as a problem of linear sequential coding at the senders and the receiver to minimize the terminal mean square probability of error at the receiver. This is an instance of decentralized control with no common information at the senders and the receiver. In this paper, we construct two linear controllers at the sender and the receiver. Due to linearity of the policies and the controllers, all the random variables involved are jointly Gaussian. Moreover, the corresponding covariance matrix at the receiver of the estimation process of the senders' messages is a deterministic process, which is a function of the parameters of the controllers and the strategies of the players, and is thus perfectly observed by the senders. Based on this observation, we use deterministic dynamic programming to find the optimal policies and the optimal linear controllers at both the senders and the receiver. The problem with passive feedback can be considered as a special case.
We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.
Active Queue Management (AQM) aims to prevent bufferbloat and serial drops in router and switch FIFO packet buffers that usually employ drop-tail queueing. AQM describes methods to send proactive feedback to TCP flow sources to regulate their rate using selective packet drops or markings. Traditionally, AQM policies relied on heuristics to approximately provide Quality of Service (QoS) such as a target delay for a given flow. These heuristics are usually based on simple network and TCP control models together with the monitored buffer filling. A primary drawback of these heuristics is that the way they incorporate flow characteristics into the feedback mechanism, and the corresponding effect on the state of congestion, are not well understood. In this work, we show that, given a probabilistic model for the flow rates and the dequeueing pattern, a Semi-Markov Decision Process (SMDP) can be formulated to obtain an optimal packet dropping policy. This policy-based AQM, denoted PAQMAN, takes into account a steady-state model of TCP and a target delay for the flows. Additionally, we present an inference algorithm that builds on TCP congestion control in order to calibrate the model parameters governing underlying network conditions. Finally, we evaluate the performance of our approach in simulation against state-of-the-art AQM algorithms.
In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communications. We prove lower bounds of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N\times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}}$ for the SYRK computation of $\mathbf{A}\cdot\mathbf{A}^{T}$, where $\mathbf{A}$ is an $N\times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$. In addition, we present two out-of-core, sequential algorithms with matching communication volume: TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}} + O(NM\log N)$, and LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}} + O(N^{5/2})$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of corresponding non-symmetric kernels (GEMM and LU factorization).
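To give a feel for the magnitudes involved, the following sketch evaluates only the leading terms of the two lower bounds stated above. The function names and the sample sizes are illustrative assumptions, and the lower-order terms are deliberately ignored.

```python
import math


def cholesky_lower_bound(n: int, s: int) -> float:
    """Leading term N^3 / (3 * sqrt(2) * sqrt(S)) of the Cholesky bound."""
    return n**3 / (3.0 * math.sqrt(2.0) * math.sqrt(s))


def syrk_lower_bound(n: int, m: int, s: int) -> float:
    """Leading term N^2 * M / (sqrt(2) * sqrt(S)) of the SYRK bound."""
    return n**2 * m / (math.sqrt(2.0) * math.sqrt(s))


if __name__ == "__main__":
    # Hypothetical sizes: a 10,000 x 10,000 matrix and a fast memory
    # holding one million matrix elements.
    n, m, s = 10_000, 10_000, 1_000_000
    print(f"Cholesky: {cholesky_lower_bound(n, s):.3e} elements moved")
    print(f"SYRK:     {syrk_lower_bound(n, m, s):.3e} elements moved")
```

Because both expressions scale as $1/\sqrt{S}$, quadrupling the fast memory halves the leading term of either bound.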
In this paper, a safe and learning-based control framework for model predictive control (MPC) is proposed to optimize nonlinear systems with a non-differentiable objective function under uncertain environmental disturbances. The control framework integrates a learning-based MPC with an auxiliary controller in a way of minimal intervention. The learning-based MPC augments the prior nominal model with incremental Gaussian Processes to learn the uncertain disturbances. The cross-entropy method (CEM) is utilized as the sampling-based optimizer for the MPC with a non-differentiable objective function. A minimal intervention controller is devised with a control Lyapunov function and a control barrier function to guide the sampling process and endow the system with high probabilistic safety. The proposed algorithm shows a safe and adaptive control performance on a simulated quadrotor in the tasks of trajectory tracking and obstacle avoidance under uncertain wind disturbances.
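As a rough illustration of why a sampling-based optimizer suits a non-differentiable objective, the sketch below runs a generic cross-entropy method on a toy L1 cost. It is not the controller described above: the function names, the toy cost, and all hyperparameters are assumptions, and no dynamics model, Gaussian Process, or safety filter is included.

```python
import random
import statistics


def cem_minimize(objective, mean, std, n_samples=200, n_elite=20,
                 n_iters=30, seed=0):
    """Generic cross-entropy method with a diagonal Gaussian sampler."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    dim = len(mean)
    for _ in range(n_iters):
        # Draw candidates from the current sampling distribution.
        samples = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                   for _ in range(n_samples)]
        # Keep the lowest-cost candidates; no gradient is needed.
        samples.sort(key=objective)
        elite = samples[:n_elite]
        # Refit the sampling distribution to the elite set.
        for d in range(dim):
            column = [x[d] for x in elite]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column) + 1e-6
    return mean


def l1_cost(x):
    """Toy cost that is non-differentiable at its minimizer (1, -2)."""
    target = [1.0, -2.0]
    return sum(abs(a - b) for a, b in zip(x, target))


if __name__ == "__main__":
    print(cem_minimize(l1_cost, mean=[0.0, 0.0], std=[2.0, 2.0]))
```

In an MPC setting the sampled vector would typically be a sequence of control inputs over the prediction horizon, with the cost evaluated by rolling out the learned model.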
As a class of state-dependent channels, Markov channels have been long studied in information theory for characterizing the feedback capacity and error exponent. This paper studies a more general variant of such channels where the state evolves via a general stochastic process, not necessarily Markov or ergodic. The states are assumed to be unknown to the transmitter and the receiver, but the underlying probability distributions are known. For this setup, we derive an upper bound on the feedback error exponent and the feedback capacity with variable-length codes. The bounds are expressed in terms of the directed mutual information and directed relative entropy. The bounds on the error exponent are simplified to Burnashev's expression for discrete memoryless channels. Our method relies on tools from the theory of martingales to analyze a stochastic process defined based on the entropy of the message given the past channel's outputs.
In this paper, we propose a constructive interference (CI)-based block-level precoding (CI-BLP) approach for the downlink of a multi-user multiple-input single-output (MU-MISO) communication system. Contrary to existing CI precoding approaches which have to be designed on a symbol-by-symbol level, here a constant precoding matrix is applied to a block of symbol slots within a channel coherence interval, thus significantly reducing the computational costs over traditional CI-based symbol-level precoding (CI-SLP) as the CI-BLP optimization problem only needs to be solved once per block. For both PSK and QAM modulation, we formulate an optimization problem to maximize the minimum CI effect over the block subject to a block- rather than symbol-level power budget. We mathematically derive the optimal precoding matrix for CI-BLP as a function of the Lagrange multipliers in closed form. By formulating the dual problem, the original CI-BLP optimization problem is further shown to be equivalent to a quadratic programming (QP) optimization. Numerical results validate our derivations, and show that the proposed CI-BLP scheme achieves improved performance over the traditional CI-SLP method, thanks to the relaxed power constraint over the considered block of symbol slots.
This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible that is both instance-specific and flexible enough to take problem structure into account. This understanding relies on two key notions: that of local-uninformativeness, when the optimal policy does not provide sufficient excitation for identification of the optimal policy and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret lower bounds of order $\sqrt{T}$ in the time horizon $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain; these can be arbitrarily hard to (learn to) control.
We determine the exact minimax rate of a Gaussian sequence model under bounded convex constraints, purely in terms of the local geometry of the given constraint set $K$. Our main result shows that the minimax risk (up to constant factors) under the squared $L_2$ loss is given by $\epsilon^{*2} \wedge \operatorname{diam}(K)^2$ with \begin{align*} \epsilon^* = \sup \bigg\{\epsilon : \frac{\epsilon^2}{\sigma^2} \leq \log M^{\operatorname{loc}}(\epsilon)\bigg\}, \end{align*} where $\log M^{\operatorname{loc}}(\epsilon)$ denotes the local entropy of the set $K$, and $\sigma^2$ is the variance of the noise. We utilize our abstract result to re-derive known minimax rates for some special sets $K$ such as hyperrectangles, ellipses, and more generally quadratically convex orthosymmetric sets. Finally, we extend our results to the unbounded case with known $\sigma^2$ to show that the minimax rate in that case is $\epsilon^{*2}$.
Discovering causal structure among a set of variables is a fundamental problem in many empirical sciences. Traditional score-based causal discovery methods rely on various local heuristics to search for a Directed Acyclic Graph (DAG) according to a predefined score function. While these methods, e.g., greedy equivalence search, may have attractive results with infinite samples and certain model assumptions, they are usually less satisfactory in practice due to finite data and possible violation of assumptions. Motivated by recent advances in neural combinatorial optimization, we propose to use Reinforcement Learning (RL) to search for the DAG with the best score. Our encoder-decoder model takes observable data as input and generates graph adjacency matrices that are used to compute rewards. The reward incorporates both the predefined score function and two penalty terms for enforcing acyclicity. In contrast with typical RL applications where the goal is to learn a policy, we use RL as a search strategy, and our final output is the graph, among all graphs generated during training, that achieves the best reward. We conduct experiments on both synthetic and real datasets, and show that the proposed approach not only has an improved search ability but also allows a flexible score function under the acyclicity constraint.
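One way such a reward could be assembled is sketched below. The abstract does not give the form of the two penalty terms, so everything here is an assumption: a cycle indicator plus a smooth penalty built from traces of adjacency-matrix powers, hypothetical weights `lam1` and `lam2`, and a score for which lower is better.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity_penalty(adj):
    """Sum over k = 1..d of trace(A^k) / k! for a non-negative d x d matrix.

    The value is zero exactly when the graph has no directed cycle, since
    any cycle contains a simple cycle of length at most d.
    """
    d = len(adj)
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    total, factorial = 0.0, 1.0
    for k in range(1, d + 1):
        power = matmul(power, adj)
        factorial *= k
        total += sum(power[i][i] for i in range(d)) / factorial
    return total


def penalized_reward(score, adj, lam1=1.0, lam2=1.0):
    """Assumed reward: negated score plus two acyclicity penalties."""
    h = acyclicity_penalty(adj)
    is_cyclic = 1.0 if h > 1e-12 else 0.0
    return -(score + lam1 * is_cyclic + lam2 * h)


if __name__ == "__main__":
    chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # 0 -> 1 -> 2, a DAG
    loop = [[0, 1], [1, 0]]                     # 0 <-> 1, a 2-cycle
    print(penalized_reward(5.0, chain), penalized_reward(5.0, loop))
```

With this form, an acyclic candidate is rewarded by its score alone, while a cyclic candidate is pushed down by both penalty terms.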
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
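The Boltzmann policy mentioned above can be sketched as a softmax over the K-values of a single state, with the temperature set to the risk-seeking parameter. The function name and the example values are assumptions, and the computation of the K-values themselves (the reward bonus and the Bellman equation) is not shown.

```python
import math


def boltzmann_policy(k_values, temperature):
    """Return action probabilities proportional to exp(K / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    k_max = max(k_values)
    weights = [math.exp((k - k_max) / temperature) for k in k_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical K-values for three actions in one state.
    k_values = [1.0, 1.5, 0.2]
    for tau in (5.0, 0.5, 0.05):
        probs = boltzmann_policy(k_values, tau)
        print(tau, [round(p, 3) for p in probs])
```

A large temperature spreads probability almost uniformly across actions, while a small temperature concentrates it on the action with the largest K-value.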