Consider the problem of power control for an energy harvesting communication system, where the transmitter is equipped with a finite-sized rechargeable battery and is able to look ahead to observe a fixed number of future energy arrivals. An implicit characterization of the maximum average throughput over an additive white Gaussian noise channel and the associated optimal power control policy is provided via the Bellman equation under the assumption that the energy arrival process is stationary and memoryless. A more explicit characterization is obtained for the case of Bernoulli energy arrivals by means of asymptotically tight upper and lower bounds on both the maximum average throughput and the optimal power control policy. Apart from their pivotal role in deriving the desired analytical results, such bounds are highly valuable from a numerical perspective as they can be efficiently computed using convex optimization solvers.
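As a concrete illustration of how such bounds can be evaluated with off-the-shelf convex solvers, the following minimal cvxpy sketch maximizes the classic offline (non-causal) AWGN throughput for a toy Bernoulli arrival sequence under energy-causality and battery constraints; the instance, horizon, and parameters are hypothetical, and the program illustrates the style of computation rather than the paper's exact bounds.

```python
import cvxpy as cp
import numpy as np

# Illustrative offline throughput maximization for an energy harvesting
# transmitter with battery capacity B over T slots (a toy instance, not
# the paper's exact upper/lower bounds).
T, B = 10, 2.0
rng = np.random.default_rng(0)
arrivals = B * rng.binomial(1, 0.3, size=T)   # Bernoulli energy arrivals

g = cp.Variable(T, nonneg=True)               # per-slot transmit energy
throughput = cp.sum(cp.log1p(g)) / 2          # 0.5*log(1+g) per slot (AWGN)
constraints = []
for t in range(T):
    # Energy causality: cannot spend more than what has arrived so far.
    constraints.append(cp.sum(g[:t + 1]) <= arrivals[:t + 1].sum())
    # Finite battery: stored energy may never exceed the capacity B.
    constraints.append(arrivals[:t + 1].sum() - cp.sum(g[:t + 1]) <= B)

cp.Problem(cp.Maximize(throughput), constraints).solve()
print("offline throughput bound:", throughput.value)
```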
This paper develops a novel control-theoretic framework to analyze the non-asymptotic convergence of Q-learning. We show that the dynamics of asynchronous Q-learning with a constant step-size can be naturally formulated as a discrete-time stochastic affine switching system. Moreover, the evolution of the Q-learning estimation error is bounded from above and below by the trajectories of two simpler dynamical systems. Based on these two systems, we derive a new finite-time error bound for asynchronous Q-learning with a constant step-size. Our analysis also sheds light on the overestimation phenomenon of Q-learning. We further illustrate and validate the analysis through numerical simulations.
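For reference, the following toy sketch shows the asynchronous Q-learning iteration with a constant step-size that the switching-system model captures; the random MDP and all parameters are hypothetical placeholders.

```python
import numpy as np

# Minimal sketch of asynchronous Q-learning with a constant step-size on a
# random MDP (illustrates the setting analyzed, not the error bound itself).
rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel P[s, a]
R = rng.random((S, A))                        # rewards

Q = np.zeros((S, A))
s = 0
for _ in range(50_000):
    a = rng.integers(A)                       # behavior policy: uniform random
    s_next = rng.choice(S, p=P[s, a])
    td_target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])  # asynchronous update of one entry
    s = s_next
print(Q)
```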
We develop an essentially optimal finite element approach for solving ergodic stochastic two-scale elliptic equations whose two-scale coefficient may also depend on the slow variable. We solve the limiting stochastic two-scale homogenized equation obtained from the stochastic two-scale convergence in the mean (A. Bourgeat, A. Mikelic and S. Wright, J. reine angew. Math, Vol. 456, 1994), whose solution comprises the solution to the homogenized equation and the corrector, by truncating the infinite domain of the fast variable and using the sparse tensor product finite elements. We show that the convergence rate in terms of the truncation level is equivalent to that for solving the cell problems in the same truncated domain. Solving this equation, we obtain the solution to the homogenized equation and the corrector at the same time, using only a number of degrees of freedom that is essentially equivalent to that required for solving one cell problem. Optimal complexity is obtained when the corrector possesses sufficient regularity with respect to both the fast and the slow variables. Although the regularity norm of the corrector depends on the size of the truncated domain, we show that the convergence rate of the approximation for the solution to the homogenized equation is independent of the size of the truncated domain. With the availability of an analytic corrector, we construct a numerical corrector for the solution of the original stochastic two-scale equation from the finite element solution to the truncated stochastic two-scale homogenized equation. Numerical examples of quasi-periodic two-scale equations, and a stochastic two-scale equation of the checkerboard type, whose coefficient is discontinuous, confirm the theoretical results.
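For orientation, the problem class can be sketched in standard homogenization notation (assumed here, not quoted from the paper):

```latex
% Schematic form of the problem class (standard homogenization notation):
-\nabla\cdot\big(A(x, x/\varepsilon, \omega)\,\nabla u^{\varepsilon}\big) = f
  \ \text{ in } D, \qquad u^{\varepsilon} = 0 \ \text{ on } \partial D,
% with first-order two-scale expansion
u^{\varepsilon}(x) \approx u_{0}(x) + \varepsilon\, u_{1}(x, x/\varepsilon, \omega),
% where u_0 solves the homogenized equation and u_1 is the corrector that the
% sparse tensor-product discretization approximates simultaneously with u_0.
```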
This paper investigates the adaptive trajectory and communication scheduling design for an unmanned aerial vehicle (UAV) relaying random data traffic generated by ground nodes to a base station. The goal is to minimize the expected average communication delay to serve requests, subject to an average UAV mobility power constraint. It is shown that the problem can be cast as a semi-Markov decision process with a two-scale structure, which is optimized efficiently: in the outer decision, the UAV radial velocity for waiting phases and end radius for communication phases optimize the average long-term delay-power trade-off; given the outer decisions, the inner decisions greedily minimize the instantaneous delay-power cost, yielding the optimal angular velocity in waiting states, and the optimal relay strategy and UAV trajectory in communication states. A constrained particle swarm optimization algorithm is designed to solve these trajectory problems, with computation roughly 100 times faster than successive convex approximation methods. Simulations demonstrate that an intelligent adaptive design exploiting realistic UAV mobility features, such as helicopter translational lift, reduces the average communication delay and UAV mobility power consumption by 44% and 7%, respectively, with respect to an optimal hovering strategy, and by 2% and 13%, respectively, with respect to a greedy delay minimization scheme.
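To give a flavor of the solver, here is a minimal penalty-based constrained particle swarm sketch in the usual inertia/cognitive/social form; the cost and constraint are hypothetical placeholders rather than the paper's delay-power trajectory objective.

```python
import numpy as np

# Minimal constrained particle swarm sketch (penalty-based). The cost and
# constraint below are placeholders standing in for a trajectory subproblem.
rng = np.random.default_rng(0)

def cost(x):                 # placeholder trajectory cost
    return np.sum(x**2, axis=1)

def violation(x):            # placeholder constraint g(x) <= 0
    return np.maximum(np.sum(x, axis=1) - 1.0, 0.0)

n, dim, iters, rho = 30, 4, 200, 100.0
x = rng.uniform(-1, 1, (n, dim)); v = np.zeros((n, dim))
pbest = x.copy(); pval = cost(x) + rho * violation(x)
gbest = pbest[pval.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n, dim))
    # Inertia + pull toward personal best + pull toward global best.
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
    x = x + v
    f = cost(x) + rho * violation(x)
    better = f < pval
    pbest[better], pval[better] = x[better], f[better]
    gbest = pbest[pval.argmin()].copy()
print("best penalized cost:", pval.min(), "at", gbest)
```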
Estimating and reacting to external disturbances is of fundamental importance for robust control of quadrotors. Existing estimators typically require significant tuning or training with a large amount of data, including the ground truth, to achieve satisfactory performance. This paper proposes a data-efficient differentiable moving horizon estimation (DMHE) algorithm that can automatically tune the MHE parameters online and also adapt to different scenarios. We achieve this by deriving the analytical gradient of the estimated trajectory from MHE with respect to the tuning parameters, enabling end-to-end learning for auto-tuning. Most interestingly, we show that the gradient can be calculated efficiently from a Kalman filter in a recursive form. Moreover, we develop a model-based policy gradient algorithm to learn the parameters directly from the trajectory tracking errors without the need for the ground truth. The proposed DMHE can be further embedded as a layer with other neural networks for joint optimization. Finally, we demonstrate the effectiveness of the proposed method via both simulation and experiments on quadrotors, where challenging scenarios such as sudden payload change and flying in downwash are examined.
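As a toy illustration of the auto-tuning loop, the sketch below tunes the process and measurement weights of a quadratic moving-horizon estimator for a scalar system by gradient descent; for simplicity it uses a numerical gradient and a ground-truth loss, whereas the paper derives the gradient analytically via a Kalman-filter recursion and also learns without ground truth.

```python
import numpy as np

# Toy MHE auto-tuning for x_{k+1} = a*x_k + w_k, y_k = x_k + v_k. The horizon
# problem is a least-squares, so the estimate is a smooth function of the
# weights theta = log(q, r), tuned here by (numerical) gradient descent.
rng = np.random.default_rng(0)
a, N = 0.95, 8
x_true = np.cumsum(0.1 * rng.standard_normal(N))
y = x_true + 0.3 * rng.standard_normal(N)

def mhe(theta):
    q, r = np.exp(theta)                        # positive weights
    rows, rhs = [], []
    for k in range(N - 1):                      # dynamics residuals
        e = np.zeros(N); e[k + 1], e[k] = 1.0, -a
        rows.append(np.sqrt(q) * e); rhs.append(0.0)
    for k in range(N):                          # measurement residuals
        e = np.zeros(N); e[k] = 1.0
        rows.append(np.sqrt(r) * e); rhs.append(np.sqrt(r) * y[k])
    return np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]

def loss(theta):                                # estimation error drives tuning
    return np.mean((mhe(theta) - x_true) ** 2)

theta, eps = np.zeros(2), 1e-5
for _ in range(200):                            # forward-difference gradient step
    g = np.array([(loss(theta + eps * np.eye(2)[i]) - loss(theta)) / eps
                  for i in range(2)])
    theta -= 0.5 * g
print("tuned weights (q, r):", np.exp(theta))
```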
In backward error analysis, an approximate solution to an equation is compared to the exact solution to a nearby "modified" equation. In numerical ordinary differential equations, the two agree up to any power of the step size. If the differential equation has a geometric property then the modified equation may share it. In this way, known properties of differential equations can be applied to the approximation. But for partial differential equations, the known modified equations are of higher order, limiting applicability of the theory. Therefore, we study symmetric solutions of discretized partial differential equations that arise from a discrete variational principle. These symmetric solutions obey infinite-dimensional functional equations. We show that these equations admit second-order modified equations which are Hamiltonian and also possess first-order Lagrangians in modified coordinates. The modified equation and its associated structures are computed explicitly for the case of rotating travelling waves in the nonlinear wave equation.
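For background, the classical ODE statement of backward error analysis (standard material, with notation assumed here) reads:

```latex
% Classical backward error analysis for ODEs: a one-step method of order p
% applied to \dot y = f(y) with step size h produces points that lie, up to
% any order N, on the flow of a modified equation
\dot{\tilde y} = f(\tilde y) + h^{p} f_{p}(\tilde y)
  + h^{p+1} f_{p+1}(\tilde y) + \cdots + h^{N} f_{N}(\tilde y),
% so geometric properties of the modified vector field (e.g. Hamiltonian
% structure) transfer to the numerical method.
```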
We devise a cooperative planning framework to generate optimal trajectories for a tethered robot duo tasked with gathering objects scattered over a large area using a flexible net. Specifically, the proposed planning framework first produces a set of dense waypoints for each robot, serving as the initialization for optimization. Next, we formulate an iterative optimization scheme to generate smooth and collision-free trajectories while ensuring cooperation within the robot duo to efficiently gather objects and properly avoid obstacles. We validate the generated trajectories in simulation and implement them on physical robots using a Model Reference Adaptive Controller (MRAC) to handle the unknown dynamics of carried payloads. In a series of studies, we find that: (i) a U-shaped cost function is effective for planning the cooperative robot duo, and (ii) the task efficiency is not always proportional to the tethered net's length. Given an environment configuration, our framework can gauge the optimal net length. To the best of our knowledge, ours is the first framework to provide such an estimate for a tethered robot duo.
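As a simplified illustration of the iterative optimization stage, the sketch below smooths dense waypoints by gradient descent on an acceleration penalty plus an obstacle term; the duo coupling, net model, and U-shaped cost of the paper are omitted, and all quantities are placeholders.

```python
import numpy as np

# Waypoint-smoothing sketch: gradient descent on a smoothness (second-
# difference) term plus an obstacle penalty over dense waypoints.
rng = np.random.default_rng(0)
T = 40
wp = np.linspace([0, 0], [5, 0], T) + 0.05 * rng.standard_normal((T, 2))
obs, radius = np.array([2.5, 0.2]), 0.8        # disk obstacle (placeholder)

for _ in range(300):
    grad = np.zeros_like(wp)
    # Smoothness: gradient of sum of squared discrete accelerations.
    acc = wp[:-2] - 2 * wp[1:-1] + wp[2:]
    grad[:-2] += acc; grad[1:-1] -= 2 * acc; grad[2:] += acc
    # Obstacle: push waypoints out of the inflated disk around `obs`.
    d = wp - obs
    dist = np.linalg.norm(d, axis=1, keepdims=True)
    pen = np.where(dist < radius, (dist - radius) / np.maximum(dist, 1e-9), 0.0)
    grad += 5.0 * pen * d
    grad[0] = grad[-1] = 0.0                   # keep start and goal fixed
    wp -= 0.05 * grad
print("final smoothness cost:", np.sum(acc**2))
```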
The current paper studies the problem of minimizing a loss $f(\boldsymbol{x})$ subject to constraints of the form $\boldsymbol{D}\boldsymbol{x} \in S$, where $S$ is a closed set, convex or not, and $\boldsymbol{D}$ is a matrix that fuses parameters. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns. To tackle this generic class of problems, we combine the Beltrami-Courant penalty method with the proximal distance principle. The latter is driven by minimization of penalized objectives $f(\boldsymbol{x})+\frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x},S)^2$ involving large tuning constants $\rho$ and the squared Euclidean distance of $\boldsymbol{D}\boldsymbol{x}$ from $S$. The next iterate $\boldsymbol{x}_{n+1}$ of the corresponding proximal distance algorithm is constructed from the current iterate $\boldsymbol{x}_n$ by minimizing the majorizing surrogate function $f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x}-\mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$. For a fixed $\rho$, a subanalytic loss $f(\boldsymbol{x})$, and a subanalytic constraint set $S$, we prove convergence to a stationary point. Under stronger assumptions, we provide convergence rates and demonstrate linear local convergence. We also construct a steepest descent (SD) variant to avoid costly linear system solves. To benchmark our algorithms, we compare against the alternating direction method of multipliers (ADMM). Our extensive numerical tests include problems on metric projection, convex regression, convex clustering, total variation image denoising, and projection of a matrix to a good condition number. These experiments demonstrate the superior speed and acceptable accuracy of our steepest descent variant on high-dimensional problems.
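A minimal sketch of one instance of this scheme, assuming $f(\boldsymbol{x})=\frac{1}{2}\|\boldsymbol{x}-\boldsymbol{y}\|^2$, a first-difference fusion matrix $\boldsymbol{D}$, and the nonconvex set $S$ of $k$-sparse vectors (so each surrogate minimization reduces to a linear solve):

```python
import numpy as np

# Proximal distance sketch: f(x) = 0.5||x - y||^2 with fusion constraint
# Dx in S, where D takes first differences and S is the set of k-sparse
# vectors; P_S keeps the k largest-magnitude entries. Illustrative instance,
# not the paper's code.
rng = np.random.default_rng(0)
n, k, rho = 100, 5, 50.0
y = np.repeat(rng.standard_normal(6), [20, 15, 20, 15, 15, 15])  # piecewise const
y = y + 0.1 * rng.standard_normal(n)

D = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]        # (n-1) x n first differences

def proj_sparse(z, k):                          # projection onto k-sparse set
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

x = y.copy()
A = np.eye(n) + rho * D.T @ D                   # normal matrix of the surrogate
for _ in range(100):
    p = proj_sparse(D @ x, k)                   # anchor P_S(D x_n)
    x = np.linalg.solve(A, y + rho * D.T @ p)   # minimize the majorizer
print("nonzero jumps:", np.count_nonzero(np.abs(D @ x) > 1e-3))
```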
In many iterative optimization methods, fixed-point theory enables the analysis of the convergence rate via the contraction factor associated with the linear approximation of the fixed-point operator. While this factor characterizes the asymptotic linear rate of convergence, it does not explain the non-linear behavior of these algorithms in the non-asymptotic regime. In this letter, we take into account the effect of the first-order approximation error and present a closed-form bound on the number of iterations required for the distance between the iterate and the limit point to reach an arbitrarily small fraction of the initial distance. Our bound comprises two terms: one corresponds to the number of iterations required by the linearized version of the fixed-point operator, and the other accounts for the overhead associated with the approximation error. With a focus on convergence in the scalar case, the tightness of the proposed bound is proven for positively quadratic first-order difference equations.
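Schematically, the scalar error model behind such two-term bounds can be written as follows (our notation, assumed rather than taken from the letter):

```latex
% Scalar error model: with contraction factor \gamma \in (0,1) of the
% linearized operator and approximation-error constant c,
e_{k+1} \le \gamma\, e_k + c\, e_k^{2},
% so the iterations needed to reach \epsilon\, e_0 split into a linear-rate
% term \mathcal{O}\big(\log(1/\epsilon)/\log(1/\gamma)\big) plus an overhead
% term covering the phase in which the quadratic perturbation c\,e_k is
% non-negligible.
```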
Sub-optimal control policies in intersection traffic signal controllers (TSC) contribute to congestion and lead to negative effects on human health and the environment. Reinforcement learning (RL) for traffic signal control is a promising approach to design better control policies and has attracted considerable research interest in recent years. However, most work in this area has used simplified simulation environments of traffic scenarios to train RL-based TSCs. To deploy RL in real-world traffic systems, the gap between simplified simulation environments and real-world applications has to be closed. Therefore, we propose LemgoRL, a benchmark tool to train RL agents as TSCs in a realistic simulation environment of Lemgo, a medium-sized town in Germany. In addition to the realistic simulation model, LemgoRL encompasses a traffic signal logic unit that ensures compliance with all regulatory and safety requirements. LemgoRL offers the same interface as the well-known OpenAI gym toolkit to enable easy deployment in existing research work. To demonstrate the functionality and applicability of LemgoRL, we train a state-of-the-art deep RL algorithm on a CPU cluster, utilizing a framework for distributed and parallel RL, and compare its performance with other methods. Our benchmark tool drives the development of RL algorithms towards real-world applications.
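Since LemgoRL mirrors the OpenAI gym interface, a standard interaction loop applies; the environment id and the old-style gym API below are assumptions for illustration, and a trained policy would replace the random action.

```python
import gym

# Generic interaction loop against a gym-compatible TSC environment.
# "LemgoRL-v0" is a placeholder id, assumed to be registered by the tool.
env = gym.make("LemgoRL-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # stand-in for a trained RL policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```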
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism-based and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
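The recipe admits a compact tabular sketch: augment the reward with a bonus and solve a soft (log-sum-exp) Bellman equation whose temperature equals the risk-seeking parameter; the MDP, bonus, and constants below are toy placeholders rather than the paper's exact construction.

```python
import numpy as np

# Tabular sketch of the K-learning recipe: bonus-augmented reward plus a
# soft Bellman equation with temperature tau (the risk-seeking parameter).
rng = np.random.default_rng(0)
S, A, gamma, tau = 4, 2, 0.9, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))     # transition kernel P[s, a]
R = rng.random((S, A))
counts = np.ones((S, A))                       # visit counts (placeholder)
bonus = 1.0 / np.sqrt(counts)                  # illustrative count-based bonus

K = np.zeros((S, A))
for _ in range(500):                           # soft value iteration
    V = tau * np.log(np.exp(K / tau).sum(axis=1))   # log-sum-exp over actions
    K = R + bonus + gamma * P @ V              # bonus-augmented Bellman backup

# Boltzmann exploration policy induced by the K-values.
policy = np.exp(K / tau) / np.exp(K / tau).sum(axis=1, keepdims=True)
print(policy)
```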