The many Artificial Intelligence systems now deployed need to be aligned with our ethical considerations. However, such considerations may change over time: our society is not fixed, and our social mores evolve. This poses a difficulty for these AI systems; within the field of Machine Ethics in particular, it has remained an under-studied challenge. In this paper, we present two algorithms, named QSOM and QDSOM, which are able to adapt to changes in the environment, and especially in the reward function, which represents the ethical considerations with which we want these systems to be aligned. They combine the well-known Q-Table with (Dynamic) Self-Organizing Maps to handle continuous and multi-dimensional state and action spaces. We evaluate them on a use case of multi-agent energy distribution within a small Smart Grid neighborhood, and demonstrate their ability to adapt and their higher performance compared to baseline Reinforcement Learning algorithms.
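To make the QSOM construction concrete, here is a minimal sketch (our illustration, not the authors' reference implementation) of how a Q-Table can be bridged by two Self-Organizing Maps, one discretizing the continuous state space and one the continuous action space; all class names and hyperparameters are assumptions:

```python
import numpy as np

class SOM:
    """Flat Self-Organizing Map: a set of prototype vectors."""
    def __init__(self, n_units, dim, lr=0.5, seed=0):
        self.w = np.random.default_rng(seed).random((n_units, dim))
        self.lr = lr

    def bmu(self, x):
        """Best-matching unit: index of the prototype closest to x."""
        return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

    def update(self, x, unit):
        self.w[unit] += self.lr * (x - self.w[unit])  # pull prototype toward x

class QSOMAgent:
    """Q-Table whose rows/columns are indexed by SOM units."""
    def __init__(self, n_states, n_actions, state_dim, action_dim):
        self.s_som = SOM(n_states, state_dim)
        self.a_som = SOM(n_actions, action_dim)
        self.q = np.zeros((n_states, n_actions))

    def act(self, state, eps=0.1):
        s = self.s_som.bmu(state)
        a = (np.random.randint(self.q.shape[1]) if np.random.rand() < eps
             else int(np.argmax(self.q[s])))
        return s, a, self.a_som.w[a]          # continuous action = prototype

    def learn(self, s, a, reward, next_state, gamma=0.9, alpha=0.1):
        s2 = self.s_som.bmu(next_state)
        self.s_som.update(next_state, s2)     # maps keep tracking drifting data
        self.q[s, a] += alpha * (reward + gamma * self.q[s2].max() - self.q[s, a])
```

Because the reward enters only through the temporal-difference update, a change in the reward function gradually reshapes the Q-values while the maps keep tracking the state and action distributions.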
In this paper, we propose the global quaternion full orthogonalization (Gl-QFOM) and global quaternion generalized minimum residual (Gl-QGMRES) methods, which are built upon global orthogonal and oblique projections onto a quaternion matrix Krylov subspace, for solving quaternion linear systems with multiple right-hand sides. We first develop the global quaternion Arnoldi procedure to preserve the quaternion Hessenberg form during the iterations. We then establish the convergence analysis of the proposed methods, and show how to apply them to solve the Sylvester quaternion matrix equation. Numerical examples are provided to illustrate the effectiveness of our methods compared with the traditional Gl-FOM and Gl-GMRES iterations for the real representations of the original linear systems.
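As a point of reference, the real-valued global Arnoldi procedure underlying Gl-FOM/Gl-GMRES can be sketched as follows; the paper's contribution replaces the real arithmetic below with quaternion arithmetic while preserving the Hessenberg form (this is our illustrative NumPy sketch, assuming no breakdown):

```python
import numpy as np

def global_arnoldi(A, B, m):
    """Blocks V_0..V_m spanning the matrix Krylov subspace
    span{B, A@B, ..., A^(m-1)@B}, orthonormal w.r.t. the trace inner
    product <X, Y> = tr(X^T Y), plus the (m+1) x m Hessenberg matrix H."""
    n, s = B.shape
    V = np.zeros((m + 1, n, s))
    H = np.zeros((m + 1, m))
    V[0] = B / np.linalg.norm(B, 'fro')
    for j in range(m):
        W = A @ V[j]
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i, j] = np.sum(V[i] * W)     # Frobenius/trace inner product
            W -= H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(W, 'fro')   # assumes no breakdown
        V[j + 1] = W / H[j + 1, j]
    return V, H
```

Gl-FOM then solves a small Hessenberg system for the expansion coefficients (an orthogonal projection), while Gl-GMRES minimizes the residual norm over the same subspace (an oblique projection).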
We present new Dirichlet-Neumann and Neumann-Dirichlet algorithms with a time domain decomposition applied to unconstrained parabolic optimal control problems. After a spatial semi-discretization, we use the Lagrange multiplier approach to derive a coupled forward-backward optimality system, which can then be solved using a time domain decomposition. Due to the forward-backward structure of the optimality system, three variants can be found for the Dirichlet-Neumann and Neumann-Dirichlet algorithms. We analyze their convergence behavior and determine the optimal relaxation parameter for each algorithm. Our analysis reveals that the most natural algorithms are actually only good smoothers, and there are better choices which lead to efficient solvers. We illustrate our analysis with numerical experiments.
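For orientation, a representative semi-discretized model problem of this type (our notation; the paper's setting may differ in details) is the tracking-type control problem
\[
\min_{u}\;\frac{1}{2}\int_0^T \|y(t)-\hat{y}(t)\|^2\,dt+\frac{\nu}{2}\int_0^T \|u(t)\|^2\,dt
\quad\text{subject to}\quad y'(t)=Ay(t)+u(t),\; y(0)=y_0,
\]
whose Lagrange multiplier optimality system, after eliminating $u=-\lambda/\nu$, couples a forward equation in the state $y$ with a backward equation in the adjoint $\lambda$:
\[
y'(t)=Ay(t)-\tfrac{1}{\nu}\lambda(t),\quad y(0)=y_0,
\qquad
-\lambda'(t)=A^{\top}\lambda(t)+y(t)-\hat{y}(t),\quad \lambda(T)=0.
\]
Dirichlet-Neumann and Neumann-Dirichlet algorithms then split $(0,T)$ into subintervals and exchange, at the interface times, function values and derivative data of this coupled system.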
Deep neural networks (NNs) are known for their high prediction performance. However, NNs are prone to yield unreliable predictions when encountering completely new situations, without indicating their uncertainty. Bayesian variants of NNs (BNNs), such as Monte Carlo (MC) dropout BNNs, do provide uncertainty measures and simultaneously increase prediction performance. The only disadvantage of BNNs is their higher computation time at test time, because they rely on a sampling approach. Here we present a single-shot MC dropout approximation that preserves the advantages of BNNs while being as fast as NNs. Our approach is based on moment propagation (MP) and allows us to analytically approximate the expected value and the variance of the MC dropout signal for commonly used layers in NNs, i.e., convolution, max pooling, dense, softmax, and dropout layers. The MP approach can convert an NN into a BNN without re-training, provided the NN has been trained with standard dropout. We evaluate our approach on different benchmark datasets and a simulated toy example in classification and regression settings. We demonstrate that our single-shot MC dropout approximation resembles the point estimate and the uncertainty estimate of the predictive distribution achieved with an MC approach, while being fast enough for real-time deployment of BNNs. We show that using part of the saved time to combine our MP approach with deep ensemble techniques further improves the uncertainty measures.
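As a flavor of the MP computation, here is a hedged NumPy sketch of moment propagation through a dropout layer followed by a dense layer; the independence assumption across units and the function names are ours:

```python
import numpy as np

def mp_dropout(mean, var, p):
    """Inverted dropout z = x * Bernoulli(1-p) / (1-p):
    E[z] = E[x],  Var[z] = Var[x]/(1-p) + E[x]^2 * p/(1-p)."""
    return mean, var / (1 - p) + mean**2 * p / (1 - p)

def mp_dense(mean, var, W, b):
    """Affine layer y = W x + b, assuming independent inputs:
    E[y] = W E[x] + b,  Var[y] = (W * W) Var[x]."""
    return W @ mean + b, (W * W) @ var

# A single analytic pass replaces many stochastic MC dropout forward passes.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
x = rng.normal(size=5)
mean, var = mp_dense(*mp_dropout(x, np.zeros(5), p=0.5), W, b)
```

Nonlinear layers such as ReLU or max pooling additionally require analytic rules for the moments of transformed variables, which is what the layer-by-layer MP approach supplies.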
We introduce a novel algebraic approach for attacking the McEliece cryptosystem. It consists of introducing a subspace of matrices representing quadratic forms. Those are associated with quadratic relationships for the component-wise product in the dual of the code used in the cryptosystem. Depending on the characteristic of the code field, this space of matrices consists only of symmetric matrices or of skew-symmetric matrices. This matrix space is shown to contain unusually low-rank matrices (rank $2$ or $3$ depending on the characteristic), which reveal the secret polynomial structure of the code. Finding such matrices can then be used to recover the secret key of the scheme. We devise a dedicated approach in characteristic $2$ based on a Gr\"obner basis modeling of the condition that a skew-symmetric matrix has rank $2$. This allows us to analyze the complexity of solving the corresponding algebraic system with Gr\"obner basis techniques. This computation behaves differently when applied to the skew-symmetric matrix space associated with a random code rather than with a Goppa or an alternant code, which yields a distinguisher for the latter code family. We give a bound on its complexity which turns out to interpolate nicely between polynomial and exponential depending on the code parameters. A distinguisher for alternant/Goppa codes was already known [FGO+11]; it has polynomial complexity but works only in a narrow parameter regime. The new distinguisher is also polynomial in the parameter regime required by [FGO+11] but, contrary to the previous one, is able to operate for virtually all code parameters relevant to cryptography. Moreover, we use this matrix space to mount a polynomial-time attack on the McEliece cryptosystem, provided that the Goppa code is distinguishable by the method of [FGO+11] and its degree is less than $q-1$, where $q$ is the alphabet size of the code.
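To give one standard way of expressing the rank condition exploited here (our illustration; the paper's exact algebraic modeling may differ): a skew-symmetric matrix $M=(m_{ij})$ has rank at most $2$ exactly when it can be written as
\[
M = \mathbf{a}\mathbf{b}^{\top} - \mathbf{b}\mathbf{a}^{\top}
\]
for two vectors $\mathbf{a},\mathbf{b}$, which is equivalent to the vanishing of all $4\times 4$ principal Pfaffians,
\[
m_{ij}m_{kl} - m_{ik}m_{jl} + m_{il}m_{jk} = 0 \quad \text{for all } i<j<k<l.
\]
Feeding such quadratic equations, restricted to the matrix space above, into a Gr\"obner basis computation is what makes the complexity analysis, and hence the distinguisher, possible.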
We study the power of randomness in the Number-on-Forehead (NOF) model of communication complexity. We construct an explicit 3-player function $f:[N]^3 \to \{0,1\}$ such that: (i) there exists a randomized NOF protocol computing it that sends a constant number of bits; but (ii) any deterministic or nondeterministic NOF protocol computing it requires sending about $(\log N)^{1/3}$ bits. This exponentially improves upon the previously best-known such separation. At the core of our proof is an extension of a recent result of the first and third authors, on sets of integers without 3-term arithmetic progressions, to a non-arithmetic setting.
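To illustrate why randomness is powerful in the NOF model (a classical toy, not the explicit function $f$ from the paper): in the Exactly-$N$ problem, player 1 sees $y$ and $z$ and can reduce the problem to testing equality between $N-y-z$ and $x$, which the other players see; with shared randomness, equality needs only a constant number of bits. A hedged Python sketch:

```python
import random

def randomized_nof_exactly_n(x, y, z, N, reps=20):
    """Decide whether x + y + z == N with about `reps` bits of communication.
    One-sided error <= ~2**-reps when x + y + z != N. Assumes inputs below p."""
    p = 2**61 - 1                        # fixed prime for fingerprinting
    t = (N - y - z) % p                  # player 1 computes this from y, z
    for _ in range(reps):
        a = random.randrange(1, p)       # shared (public) random coins
        b = random.randrange(p)
        bit = ((a * t + b) % p) & 1      # 1-bit fingerprint sent by player 1
        if bit != (((a * (x % p) + b) % p) & 1):  # player 2 checks against x
            return False
    return True
```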
In this work, we prove uniform continuity bounds for entropic quantities related to the sandwiched R\'enyi divergences, such as the sandwiched R\'enyi conditional entropy. We follow three different approaches. The first is the axiomatic approach, which exploits the sub-/superadditivity and joint concavity/convexity of the exponential of the divergence. In our second approach, termed the "operator space approach", we express the entropic measures as norms and utilize their properties to establish the bounds. These norms draw inspiration from interpolation space norms. We not only establish the norm properties relying solely on matrix analysis tools, but also extend their applicability to a context relevant to resource theories. By this, we extend the strategies of Marwah and Dupuis as well as Beigi and Goodarzi employed in the sandwiched R\'enyi conditional entropy context. Finally, we merge the approaches into a mixed approach that has some advantageous properties, and then discuss the regimes in which each bound performs best. Our results improve over the previous best continuity bounds, or in some cases give the first continuity bounds available. In a separate contribution, we use the ALAAF method, developed in a previous article by some of the authors, to study the stability of approximate quantum Markov chains.
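For reference, one common convention for the central quantity (stated here for orientation; the paper may use additional variants): for a state $\rho$, a positive semi-definite $\sigma$, and $\alpha\in(0,1)\cup(1,\infty)$, the sandwiched R\'enyi divergence is
\[
\widetilde{D}_{\alpha}(\rho\|\sigma)=\frac{1}{\alpha-1}\log\operatorname{Tr}\!\left[\left(\sigma^{\frac{1-\alpha}{2\alpha}}\,\rho\,\sigma^{\frac{1-\alpha}{2\alpha}}\right)^{\alpha}\right],
\]
and the sandwiched R\'enyi conditional entropy is obtained by optimizing over the conditioning system,
\[
\widetilde{H}_{\alpha}(A|B)_{\rho}=-\inf_{\sigma_B}\widetilde{D}_{\alpha}\!\left(\rho_{AB}\,\middle\|\,\mathbb{1}_A\otimes\sigma_B\right).
\]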
Deep neural networks (DNNs) are singular statistical models which exhibit complex degeneracies. In this work, we illustrate how a quantity known as the \emph{learning coefficient}, introduced in singular learning theory, precisely quantifies the degree of degeneracy in deep neural networks. Importantly, we demonstrate that degeneracy in DNNs cannot be accounted for by simply counting the number of "flat" directions. We propose a computationally scalable approximation of a localized version of the learning coefficient using stochastic gradient Langevin dynamics. To validate our approach, we demonstrate its accuracy in low-dimensional models with known theoretical values. Importantly, the local learning coefficient correctly recovers the ordering of degeneracy between various parameter regions of interest. An experiment on MNIST shows that the local learning coefficient can reveal the inductive bias of stochastic optimizers towards more or less degenerate critical points.
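A hedged sketch of the kind of estimator we have in mind (names, step sizes, and the localizing term are illustrative assumptions): sample from a posterior tempered at a WBIC-style inverse temperature $\beta=1/\log n$ and localized at the trained parameter $w^*$ via SGLD, then compare the expected loss under those samples with the loss at $w^*$:

```python
import numpy as np

def local_learning_coefficient(loss, grad, w_star, n,
                               steps=5000, eps=1e-5, gamma=100.0, seed=0):
    """Estimate lambda_hat = n * beta * (E[L_n(w)] - L_n(w_star)), with w
    sampled by SGLD from exp(-n*beta*L_n(w) - gamma/2 * |w - w_star|^2)."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)                  # WBIC-style temperature
    w = w_star.copy()
    running = []
    for _ in range(steps):
        drift = n * beta * grad(w) + gamma * (w - w_star)  # localized posterior
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.normal(size=w.shape)
        running.append(loss(w))
    return n * beta * (np.mean(running) - loss(w_star))
```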
Despite the recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda remains challenging. In this work, we show that combining self-supervised pre-training, a simple curriculum schedule during fine-tuning, and semi-supervised learning on large unlabelled speech data significantly improves speech recognition performance for Kinyarwanda. Our approach uses public-domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data over four successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.9% WER on the Mozilla Common Voice benchmark, which is, to the best of our knowledge, state-of-the-art. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.
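The curriculum step can be summarized by a short sketch (placeholder names; `wer` stands for the per-example word error rate under the clean baseline model):

```python
def curriculum_stages(examples, wer, n_stages=4):
    """Rank noisy examples by the clean baseline's score and grow the
    training pool from easy to hard over successive stages."""
    ranked = sorted(examples, key=wer)           # easy -> hard
    for stage in range(1, n_stages + 1):
        yield ranked[: len(ranked) * stage // n_stages]

# for subset in curriculum_stages(noisy_data, wer): fine_tune_on(subset)
```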
We hypothesize that, due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate a model's dependence on each modality, we compute the gain in accuracy when the model has access to that modality in addition to another one. We refer to this gain as the conditional utilization rate. In our experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
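In code, the conditional utilization rate is just an accuracy difference; a minimal sketch, where `accuracy` is a placeholder evaluating the model restricted to a given set of modalities:

```python
def conditional_utilization_rate(accuracy, m1, m2):
    """Gain in accuracy from having modality m1 in addition to m2."""
    return accuracy({m1, m2}) - accuracy({m2})

# A large gap between u(audio | video) and u(video | audio) signals that
# the model is greedily relying on one modality.
```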
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of an undesirable spectrum, and then discuss practical solutions, including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods, and distributed methods, together with theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, the lottery ticket hypothesis, and infinite-width analysis.
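As one concrete example of the "careful initialization" discussed above (a standard recipe, not specific to this survey): scaling the weight variance to the layer fan-in keeps signal variance roughly constant across layers, which mitigates gradient explosion/vanishing.

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    """He initialization (suited to ReLU): Var[W] = 2 / fan_in."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_init(fan_in, fan_out, seed=0):
    """Xavier/Glorot initialization (suited to tanh): Var[W] = 2/(fan_in+fan_out)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)),
                      size=(fan_out, fan_in))
```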