Online allocation problems with resource constraints are central problems in revenue management and online advertising. In these problems, requests arrive sequentially during a finite horizon and, for each request, a decision maker needs to choose an action that consumes a certain amount of resources and generates reward. The objective is to maximize cumulative rewards subject to a constraint on the total consumption of resources. In this paper, we consider a data-driven setting in which the reward and resource consumption of each request are generated using an input model that is unknown to the decision maker. We design a general class of algorithms that attain good performance in various input models without knowing which type of input they are facing. In particular, our algorithms are asymptotically optimal under independent and identically distributed inputs as well as various non-stationary stochastic input models, and they attain an asymptotically optimal fixed competitive ratio when the input is adversarial. Our algorithms operate in the Lagrangian dual space: they maintain a dual multiplier for each resource that is updated using online mirror descent. By choosing the reference function accordingly, we recover the dual sub-gradient descent and dual multiplicative weights update algorithm. The resulting algorithms are simple, fast, and do not require convexity in the revenue function, consumption function and action space, in contrast to existing methods for online allocation problems. We discuss applications to network revenue management, online bidding in repeated auctions with budget constraints, online proportional matching with high entropy, and personalized assortment optimization with limited inventory.
Power allocation is one of the fundamental problems in wireless networks and a wide variety of algorithms address this problem from different perspectives. A common element among these algorithms is that they rely on an estimation of the channel state, which may be inaccurate on account of hardware defects, noisy feedback systems, and environmental and adversarial disturbances. Therefore, it is essential that the output power allocation of these algorithms is stable with respect to input perturbations, to the extent that the variations in the output are bounded for bounded variations in the input. In this paper, we focus on UWMMSE -- a modern algorithm leveraging graph neural networks --, and illustrate its stability to additive input perturbations of bounded energy through both theoretical analysis and empirical validation.
We formulate selecting the best optimizing system (SBOS) problems and provide solutions for those problems. In an SBOS problem, a finite number of systems are contenders. Inside each system, a continuous decision variable affects the system's expected performance. An SBOS problem compares different systems based on their expected performances under their own optimally chosen decision to select the best, without advance knowledge of expected performances of the systems nor the optimizing decision inside each system. We design easy-to-implement algorithms that adaptively chooses a system and a choice of decision to evaluate the noisy system performance, sequentially eliminates inferior systems, and eventually recommends a system as the best after spending a user-specified budget. The proposed algorithms integrate the stochastic gradient descent method and the sequential elimination method to simultaneously exploit the structure inside each system and make comparisons across systems. For the proposed algorithms, we prove exponential rates of convergence to zero for the probability of false selection, as the budget grows to infinity. We conduct three numerical examples that represent three practical cases of SBOS problems. Our proposed algorithms demonstrate consistent and stronger performances in terms of the probability of false selection over benchmark algorithms under a range of problem settings and sampling budgets.
We consider Broyden's method and some accelerated schemes for nonlinear equations having a strongly regular singularity of first order with a one-dimensional nullspace. Our two main results are as follows. First, we show that the use of a preceding Newton-like step ensures convergence for starting points in a starlike domain with density 1. This extends the domain of convergence of these methods significantly. Second, we establish that the matrix updates of Broyden's method converge q-linearly with the same asymptotic factor as the iterates. This contributes to the long-standing question whether the Broyden matrices converge by showing that this is indeed the case for the setting at hand. Furthermore, we prove that the Broyden directions violate uniform linear independence, which implies that existing results for convergence of the Broyden matrices cannot be applied. Numerical experiments of high precision confirm the enlarged domain of convergence, the q-linear convergence of the matrix updates, and the lack of uniform linear independence. In addition, they suggest that these results can be extended to singularities of higher order and that Broyden's method can converge r-linearly without converging q-linearly. The underlying code is freely available.
Nash equilibrium is a central concept in game theory. Several Nash solvers exist, yet none scale to normal-form games with many actions and many players, especially those with payoff tensors too big to be stored in memory. In this work, we propose an approach that iteratively improves an approximation to a Nash equilibrium through joint play. It accomplishes this by tracing a previously established homotopy that defines a continuum of equilibria for the game regularized with decaying levels of entropy. This continuum asymptotically approaches the limiting logit equilibrium, proven by McKelvey and Palfrey (1995) to be unique in almost all games, thereby partially circumventing the well-known equilibrium selection problem of many-player games. To encourage iterates to remain near this path, we efficiently minimize average deviation incentive via stochastic gradient descent, intelligently sampling entries in the payoff tensor as needed. Monte Carlo estimates of the stochastic gradient from joint play are biased due to the appearance of a nonlinear max operator in the objective, so we introduce additional innovations to the algorithm to alleviate gradient bias. The descent process can also be viewed as repeatedly constructing and reacting to a polymatrix approximation to the game. In these ways, our proposed approach, average deviation incentive descent with adaptive sampling (ADIDAS), is most similar to three classical approaches, namely homotopy-type, Lyapunov, and iterative polymatrix solvers. The lack of local convergence guarantees for biased gradient descent prevents guaranteed convergence to Nash, however, we demonstrate through extensive experiments the ability of this approach to approximate a unique Nash in normal-form games with as many as seven players and twenty one actions (several billion outcomes) that are orders of magnitude larger than those possible with prior algorithms.
We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.
In the paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving the nonconvex-strongly-concave minimax problems based on unified adaptive matrices, which include almost existing coordinate-wise and global adaptive learning rates. Specifically, we propose a fast Adaptive Gradient Decent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower gradient complexity of $O(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, which improves the results of the existing adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. At the same time, we present an accelerated version of AdaGDA (VR-AdaGDA) method based on the momentum-based variance reduced technique, which achieves a lower gradient complexity of $O(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, which improves the results of the existing adaptive GDA methods by a factor of $O(\epsilon^{-1})$. Moreover, we prove that our VR-AdaGDA method reaches the best known gradient complexity of $O(\kappa^{3}\epsilon^{-3})$ with the mini-batch size $O(\kappa^3)$. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Some experimental results on policy evaluation and fair classifier tasks demonstrate the efficiency of our algorithms.
By executing offloaded tasks from mobile users, edge computing augments mobile user equipments (UEs) with computing/communications resources from edge nodes (ENs), enabling new services (e.g., real-time gaming). However, despite being more resourceful than UEs, allocating ENs' resources to a given favorable set of users (e.g., closer to ENs) may block other UEs from their services. This is often the case for most existing approaches that only aim to maximize the network social welfare or minimize the total energy consumption but do not consider the computing/battery status of each UE. This work develops an energy-based proportional-fair framework to serve all users with multiple tasks while considering both their service requirements and energy/battery levels in a multi-layer edge network. The resulting problem for offloading tasks and allocating resources toward the tasks is a Mixed-Integer Nonlinear Programming, which is NP-hard. To tackle it, we leverage the fact that the relaxed problem is convex and propose a distributed algorithm, namely the dynamic branch-and-bound Benders decomposition (DBBD). DBBD decomposes the original problem into a master problem (MP) for the offloading decisions and multiple subproblems (SPs) for resource allocation. To quickly eliminate inefficient offloading solutions, MP is integrated with powerful Benders cuts exploiting the ENs' resource constraints. We then develop a dynamic branch-and-bound algorithm (DBB) to efficiently solve MP considering the load balance among ENs. SPs can either be solved for their closed-form solutions or be solved in parallel at ENs, thus reducing the complexity. The numerical results show that DBBD returns the optimal solution in maximizing the proportional fairness among UEs. DBBD has higher fairness indexes, i.e., Jain's index and min-max ratio, in comparison with the existing ones that minimize the total consumed energy.
Many-user MAC is an important model for understanding energy efficiency of massive random access in 5G and beyond. Introduced in Polyanskiy'2017 for the AWGN channel, subsequent works have provided improved bounds on the asymptotic minimum energy-per-bit required to achieve a target per-user error at a given user density and payload, going beyond the AWGN setting. The best known rigorous bounds use spatially coupled codes along with the optimal AMP algorithm. But these bounds are infeasible to compute beyond a few (around 10) bits of payload. In this paper, we provide new achievability bounds for the many-user AWGN and quasi-static Rayleigh fading MACs using the spatially coupled codebook design along with a scalar AMP algorithm. The obtained bounds are computable even up to 100 bits and outperform the previous ones at this payload.
This work considers the problem of provably optimal reinforcement learning for episodic finite horizon MDPs, i.e. how an agent learns to maximize his/her long term reward in an uncertain environment. The main contribution is in providing a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where the $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret for any constant $\epsilon$ is $O(SA)$, which is a number of samples that is far less than that required to learn any non-trivial estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is largely the motivation for algorithms such as $Q$-learning and other "model free" approaches. vUCQ algorithm also enjoys minimax optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound. Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive refinement method in which the algorithm reduces the variance in $Q$-value estimates and couples this estimation scheme with an upper confidence based algorithm. Technically, the coupling of both of these techniques is what leads to the algorithm enjoying both the sub-linear regret property and the asymptotically optimal regret.