
In this work, we consider the problem of jointly minimizing the average cost of sampling and transmitting status updates by users over a wireless channel subject to average Age of Information (AoI) constraints. Transmission errors may occur, and a scheduling policy has to decide whether the users sample a new packet or attempt to retransmit the previously sampled packet. The cost consists of both sampling and transmission costs. The sampling of a new packet after a failure imposes an additional cost on the system. We formulate a stochastic optimization problem with the average cost in the objective under average AoI constraints. To solve this problem, we propose three scheduling policies: a) a dynamic policy that is centralized and requires full knowledge of the state of the system, and b) two stationary randomized policies that require no knowledge of the state of the system. We utilize tools from Lyapunov optimization theory to provide the dynamic policy, and we prove that its solution is arbitrarily close to the optimal one. To provide the randomized policies, we model the system as a Discrete Time Markov Chain (DTMC). We provide closed-form and approximated expressions for the average AoI and its distribution for each randomized policy. Simulation results show the importance of providing the option to transmit an old packet in order to minimize the total average cost.
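
To make the notion of average AoI under a stationary randomized policy concrete, the following is a minimal single-user sketch and not the model analyzed in the abstract. It assumes that in every slot the user transmits with probability `attempt_prob`, that each attempt succeeds with probability `success_prob`, and that every attempt carries a freshly sampled packet, so the retransmission option and all costs are ignored. The function name and both parameters are hypothetical. Under these assumptions the age resets to 1 on a delivery and grows by 1 otherwise, so the long-run average AoI should be close to `1 / (attempt_prob * success_prob)`.

```python
import random


def simulate_average_aoi(attempt_prob, success_prob, num_slots, seed=0):
    """Estimate the time-average AoI of one user under a randomized policy.

    Simplifying assumptions (not from the abstract): every attempt sends a
    fresh sample, and the age resets to 1 at the end of a successful slot.
    """
    rng = random.Random(seed)  # seeded so that the run is reproducible
    age = 1
    total_age = 0
    for _ in range(num_slots):
        attempted = rng.random() < attempt_prob
        delivered = attempted and rng.random() < success_prob
        # A delivery resets the age; otherwise the information gets one slot older.
        age = 1 if delivered else age + 1
        total_age += age
    return total_age / num_slots


if __name__ == "__main__":
    estimate = simulate_average_aoi(0.5, 0.8, 200_000)
    print(f"simulated average AoI: {estimate:.3f}")
    print(f"1 / (attempt_prob * success_prob): {1 / (0.5 * 0.8):.3f}")
```

Allowing retransmission of an old packet would change the reset value of the age on delivery, which this sketch deliberately leaves out.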

Related content

Peridynamic (PD) theory is significant and promising in engineering and materials science; however, it poses challenges owing to the enormous computational cost caused by its nonlocality. Our main contribution, which overcomes the restrictions of the existing fast method, is a general computational framework for linear bond-based peridynamic models based on the meshfree method, called the matrix-structure-based fast method (MSBFM). It is suitable for the general case, including 2D/3D problems, static/dynamic problems, and problems with general boundary conditions, in particular problems with crack propagation. We also provide a general calculation flow chart. The proposed computational framework is practical and easily embedded into existing computational algorithms. With this framework, the computational cost is reduced from $O(N^2)$ to $O(N\log N)$, and the storage requirement is reduced from $O(N^2)$ to $O(N)$, where $N$ is the number of degrees of freedom. Finally, the vast reduction in computational and memory requirements is verified by numerical examples.

In this paper, we study a distributed privacy-preserving learning problem in social networks with general topology. The agents can communicate with each other over the network, which may result in privacy disclosure, since the trustworthiness of the agents cannot be guaranteed. Given a set of options which yield unknown stochastic rewards, each agent is required to learn the best one, aiming at maximizing the resulting expected average cumulative reward. To serve the above goal, we propose a four-stage distributed algorithm which efficiently exploits the collaboration among the agents while preserving the local privacy of each of them. In particular, our algorithm proceeds iteratively, and in every round, each agent i) randomly perturbs its adoption for privacy preservation, ii) disseminates the perturbed adoption over the social network in a nearly uniform manner through random walks, iii) selects an option by referring to the perturbed suggestions received from its peers, and iv) decides whether or not to adopt the selected option as its preference according to its latest reward feedback. Through theoretical analysis, we quantify the trade-off among the number of agents (or communication overhead), privacy preservation, and learning utility. We also perform extensive simulations to verify the efficacy of our proposed social learning algorithm.

Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring the constraint is met, or at least that the violation is bounded over time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions, such as knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our method is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$, constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action and horizon $H$ CMDP.

Proactive edge association is capable of improving wireless connectivity at the cost of increased handover (HO) frequency and energy consumption, while relying on a large amount of private information sharing required for decision making. In order to improve the connectivity-cost trade-off without privacy leakage, we investigate the privacy-preserving joint edge association and power allocation (JEAPA) problem in the face of the environmental uncertainty and the infeasibility of individual learning. Upon modelling the problem by a decentralized partially observable Markov Decision Process (Dec-POMDP), it is solved by federated multi-agent reinforcement learning (FMARL) through only sharing encrypted training data for federatively learning the policy sought. Our simulation results show that the proposed solution strikes a compelling trade-off, while preserving a higher privacy level than the state-of-the-art solutions.

This work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent's objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of the cost function parameters and of the MAEVI algorithm, and an analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two-node network with multiple links to validate it. We first identify the optimal policy, that is, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work is a generalization of learning the stochastic shortest path problem.

The Age of Incorrect Information (AoII) is a metric that can combine the freshness of the information available to a gateway in an Internet of Things (IoT) network with the accuracy of that information. As such, minimizing the AoII can allow the operators of IoT systems to have a more precise and up-to-date picture of the environment in which the sensors are deployed. However, most IoT systems do not allow for centralized scheduling or explicit coordination, as sensors need to be extremely simple and consume as little power as possible. Finding a decentralized policy to minimize the AoII can be extremely challenging in this setting. This paper presents a heuristic to optimize AoII for a slotted ALOHA system, starting from a threshold-based policy and using dual methods to converge to a better solution. This method can significantly outperform state-independent policies, finding an efficient balance between frequent updates and a low number of packet collisions.

This work studies efficient solution methods for cluster-based control policies of transition-independent Markov decision processes (TI-MDPs). We focus on control of multi-agent systems, whereby a central planner (CP) influences agents to select desirable group behavior. The agents are partitioned into disjoint clusters whereby agents in the same cluster receive the same controls but agents in different clusters may receive different controls. Under mild assumptions, this process can be modeled as a TI-MDP where each factor describes the behavior of one cluster. The action space of the TI-MDP becomes exponential with respect to the number of clusters. To efficiently find a policy in this rapidly scaling space, we propose a clustered Bellman operator that optimizes over the action space for one cluster at any evaluation. We present Clustered Value Iteration (CVI), which uses this operator to iteratively perform "round robin" optimization across the clusters. CVI converges exponentially faster than standard value iteration (VI), and can find policies that closely approximate the MDP's true optimal value. A special class of TI-MDPs with separable reward functions is investigated, and it is shown that CVI will find optimal policies on this class of problems. Next, the optimal clustering assignment problem is explored. The value functions of TI-MDPs with submodular reward functions are shown to be submodular functions, so submodular set optimization may be used to find a near-optimal clustering assignment. We propose an iterative greedy cluster splitting algorithm, which yields monotonic submodular improvement in value at each iteration. Finally, simulations offer empirical assessment of the proposed methods.

To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle the guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions where bias and variance can be traded off. Second, we propose novel surrogates for the trust region method expressed with Q-functions using the reparameterization trick. Additionally, depending on initial policy settings, there may be no policy satisfying the constraints within a trust region. To handle this infeasibility issue, we propose a gradient integration method which is guaranteed to find a policy satisfying all constraints from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.

Consider robot swarm wireless networks where mobile robots offload their computing tasks to a computing server located at the mobile edge. Our aim is to maximize the swarm lifetime through efficient exploitation of the correlation between distributed data sources. The optimization problem is handled by selecting appropriate robot subsets to send their sensed data to the server. In this work, the data correlation between distributed robot subsets is modelled as an undirected graph. A least-degree iterative partitioning (LDIP) algorithm is proposed to partition the graph into a set of subgraphs. Each subgraph has at least one vertex (i.e., subset), termed representative vertex (R-Vertex), which shares edges with and only with all other vertices within the subgraph; only R-Vertices are selected for data transmissions. When the number of subgraphs is maximized, the proposed subset selection approach is shown to be optimum in the AWGN channel. For independent fading channels, the max-min principle can be incorporated into the proposed approach to achieve the best performance.

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
