Reinforcement Learning (RL) approaches are lately deployed for orchestrating wireless communications empowered by Reconfigurable Intelligent Surfaces (RISs), leveraging their online optimization capabilities. Most commonly, in RL-based formulations for realistic RISs with low resolution phase-tunable elements, each configuration is modeled as a distinct reflection action, resulting to inefficient exploration due to the exponential nature of the search space. In this paper, we consider RISs with 1-bit phase resolution elements, and model the action of each of them as a binary vector including the feasible reflection coefficients. We then introduce two variations of the well-established Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) agents, aiming for effective exploration of the binary action spaces. For the case of DQN, we make use of an efficient approximation of the Q-function, whereas a discretization post-processing step is applied to the output of DDPG. Our simulation results showcase that the proposed techniques greatly outperform the baseline in terms of the rate maximization objective, when large-scale RISs are considered. In addition, when dealing with moderate scale RIS sizes, where the conventional DQN based on configuration-based action spaces is feasible, the performance of the latter technique is similar to the proposed learning approach.
In this paper, we are concerned with the numerical solution for the two-dimensional time fractional Fokker-Planck equation with tempered fractional derivative of order $\alpha$. Although some of its variants are considered in many recent numerical analysis papers, there are still some significant differences. Here we first provide the regularity estimates of the solution. And then a modified $L$1 scheme inspired by the middle rectangle quadrature formula on graded meshes is employed to compensate for the singularity of the solution at $t\rightarrow 0^{+}$, while the five-point difference scheme is used in space. Stability and convergence are proved in the sence of $L^{\infty}$ norm, then a sharp error estimate $\mathscr{O}(\tau^{\min\{2-\alpha, r\alpha\}})$ is derived on graded meshes. Furthermore, unlike the bounds proved in the previous works, the constant multipliers in our analysis do not blow up as the Caputo fractional derivative $\alpha$ approaches the classical value of 1. Finally, we perform the numerical experiments to verify the effectiveness and convergence order of the presented algorithms.
This paper contains two major contributions. First we derive, following the discrete de Rham (DDR) and Virtual Element (VEM) paradigms, pressure-robust methods for the Stokes equations that support arbitrary orders and polyhedral meshes. Unlike other methods presented in the literature, pressure-robustness is achieved here without resorting to an $\boldsymbol{H}({\rm div})$-conforming construction on a submesh, but rather projecting the volumetric force onto the discrete $\boldsymbol{H}({\bf curl})$ space. The cancellation of the pressure error contribution stems from key commutation properties of the underlying DDR and VEM complexes. The pressure-robust error estimates in $h^{k+1}$ (with $h$ denoting the meshsize and $k\ge 0$ the polynomial degree of the DDR or VEM complex) are proven theoretically and supported by a panel of three-dimensional numerical tests. The second major contribution of the paper is an in-depth study of the relations between the DDR and VEM approaches. We show, in particular, that a complex developed following one paradigm admits a reformulation in the other, and that couples of related DDR and VEM complexes satisfy commuting diagram properties with the degrees of freedom maps.
Timed automata are a common formalism for the verification of concurrent systems subject to timing constraints. They extend finite-state automata with clocks, that constrain the system behavior in locations, and to take transitions. While timed automata were originally designed for safety (in the wide sense of correctness w.r.t. a formal property), they were progressively used in a number of works to guarantee security properties. In this work, we review works studying security properties for timed automata in the last two decades. We notably review theoretical works, with a particular focus on opacity, as well as more practical works, with a particular focus on attack trees and their extensions. We derive main conclusions concerning open perspectives, as well as tool support.
This paper focuses on improving the resource allocation algorithm in terms of packet delivery ratio (PDR), i.e., the number of successfully received packets sent by end devices (EDs) in a long-range wide-area network (LoRaWAN). Setting the transmission parameters significantly affects the PDR. Employing reinforcement learning (RL), we propose a resource allocation algorithm that enables the EDs to configure their transmission parameters in a distributed manner. We model the resource allocation problem as a multi-armed bandit (MAB) and then address it by proposing a two-phase algorithm named MIX-MAB, which consists of the exponential weights for exploration and exploitation (EXP3) and successive elimination (SE) algorithms. We evaluate the MIX-MAB performance through simulation results and compare it with other existing approaches. Numerical results show that the proposed solution performs better than the existing schemes in terms of convergence time and PDR.
Visible Light Communication (VLC) is one the most promising enabling technology for future 6G networks to overcome Radio-Frequency (RF)-based communication limitations thanks to a broader bandwidth, higher data rate, and greater efficiency. However, from the security perspective, VLCs suffer from all known wireless communication security threats (e.g., eavesdropping and integrity attacks). For this reason, security researchers are proposing innovative Physical Layer Security (PLS) solutions to protect such communication. Among the different solutions, the novel Reflective Intelligent Surface (RIS) technology coupled with VLCs has been successfully demonstrated in recent work to improve the VLC communication capacity. However, to date, the literature still lacks analysis and solutions to show the PLS capability of RIS-based VLC communication. In this paper, we combine watermarking and jamming primitives through the Watermark Blind Physical Layer Security (WBPLSec) algorithm to secure VLC communication at the physical layer. Our solution leverages RIS technology to improve the security properties of the communication. By using an optimization framework, we can calculate RIS phases to maximize the WBPLSec jamming interference schema over a predefined area in the room. In particular, compared to a scenario without RIS, our solution improves the performance in terms of secrecy capacity without any assumption about the adversary's location. We validate through numerical evaluations the positive impact of RIS-aided solution to increase the secrecy capacity of the legitimate jamming receiver in a VLC indoor scenario. Our results show that the introduction of RIS technology extends the area where secure communication occurs and that by increasing the number of RIS elements the outage probability decreases.
We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\alpha+ 1/\sqrt{M}\right) \sqrt{dT}$ regret for each good agent; here, $d$ is the model-dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.
Bilevel optimization has arisen as a powerful tool in modern machine learning. However, due to the nested structure of bilevel optimization, even gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations, which can be costly and unscalable in practice. Recently, Hessian-free bilevel schemes have been proposed to resolve this issue, where the general idea is to use zeroth- or first-order methods to approximate the full hypergradient of the bilevel problem. However, we empirically observe that such approximation can lead to large variance and unstable training, but estimating only the response Jacobian matrix as a partial component of the hypergradient turns out to be extremely effective. To this end, we propose a new Hessian-free method, which adopts the zeroth-order-like method to approximate the response Jacobian matrix via taking difference between two optimization paths. Theoretically, we provide the convergence rate analysis for the proposed algorithms, where our key challenge is to characterize the approximation and smoothness properties of the trajectory-dependent estimator, which can be of independent interest. This is the first known convergence rate result for this type of Hessian-free bilevel algorithms. Experimentally, we demonstrate that the proposed algorithms outperform baseline bilevel optimizers on various bilevel problems. Particularly, in our experiment on few-shot meta-learning with ResNet-12 network over the miniImageNet dataset, we show that our algorithm outperforms baseline meta-learning algorithms, while other baseline bilevel optimizers do not solve such meta-learning problems within a comparable time frame.
This paper investigates the problem of regret minimization in linear time-varying (LTV) dynamical systems. Due to the simultaneous presence of uncertainty and non-stationarity, designing online control algorithms for unknown LTV systems remains a challenging task. At a cost of NP-hard offline planning, prior works have introduced online convex optimization algorithms, although they suffer from nonparametric rate of regret. In this paper, we propose the first computationally tractable online algorithm with regret guarantees that avoids offline planning over the state linear feedback policies. Our algorithm is based on the optimism in the face of uncertainty (OFU) principle in which we optimistically select the best model in a high confidence region. Our algorithm is then more explorative when compared to previous approaches. To overcome non-stationarity, we propose either a restarting strategy (R-OFU) or a sliding window (SW-OFU) strategy. With proper configuration, our algorithm is attains sublinear regret $O(T^{2/3})$. These algorithms utilize data from the current phase for tracking variations on the system dynamics. We corroborate our theoretical findings with numerical experiments, which highlight the effectiveness of our methods. To the best of our knowledge, our study establishes the first model-based online algorithm with regret guarantees under LTV dynamical systems.
The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments. These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling preferences instead as arising from a different statistic: each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences. We also prove that the previous partial return model lacks this identifiability property without preference noise that reveals rewards' relative proportions, and we empirically show that our proposed regret preference model outperforms it with finite training data in otherwise the same setting. Additionally, our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research.
We study the dynamics of (synchronous) one-dimensional cellular automata with cyclical boundary conditions that evolve according to the majority rule with radius $ r $. We introduce a notion that we term cell stability with which we express the structure of the possible configurations that could emerge in this setting. Our main finding is that apart from the configurations of the form $ (0^{r+1}0^* + 1^{r+1}1^*)^* $, which are always fixed-points, the other configurations that the automata could possibly converge to, which are known to be either fixed-points or 2-cycles, have a particular spatially periodic structure. Namely, each of these configurations is of the form $ s^* $ where $ s $ consists of $ O(r^2) $ consecutive sequences of cells with the same state, each such sequence is of length at most $ r $, and the total length of $ s $ is $ O(r^2) $ as well. We show that an analogous result also holds for the minority rule.