We establish a phase transition known as the "all-or-nothing" phenomenon for noiseless discrete channels. This class of models includes the Bernoulli group testing model and the planted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenon for such models was only known in a limited range of parameters. Our work extends the results to all signals with arbitrary sublinear sparsity. Over the past several years, the all-or-nothing phenomenon has been established in various models as an outcome of two seemingly disjoint results: one positive result establishing the "all" half of all-or-nothing, and one impossibility result establishing the "nothing" half. Our main technique in the present work is to show that for noiseless discrete channels, the "all" half implies the "nothing" half, that is a proof of "all" can be turned into a proof of "nothing." Since the "all" half can often be proven by straightforward means -- for instance, by the first-moment method -- our equivalence gives a powerful and general approach towards establishing the existence of this phenomenon in other contexts.
Minimum variance controllers have been employed in a wide-range of industrial applications. A key challenge experienced by many adaptive controllers is their poor empirical performance in the initial stages of learning. In this paper, we address the problem of initializing them so that they provide acceptable transients, and also provide an accompanying finite-time regret analysis, for adaptive minimum variance control of an auto-regressive system with exogenous inputs (ARX). Following [3], we consider a modified version of the Certainty Equivalence (CE) adaptive controller, which we call PIECE, that utilizes probing inputs for exploration. We show that it has a $C \log T$ bound on the regret after $T$ time-steps for bounded noise, and $C\log^2 T$ in the case of sub-Gaussian noise. The simulation results demonstrate the advantage of PIECE over the algorithm proposed in [3] as well as the standard Certainty Equivalence controller especially in the initial learning phase. To the best of our knowledge, this is the first work that provides finite-time regret bounds for an adaptive minimum variance controller.
Recent ODE/SDE-based generative models, such as diffusion models, rectified flows, and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at //github.com/sangyun884/fast-ode.
This paper explores the implicit bias of overparameterized neural networks of depth greater than two layers. Our framework considers a family of networks of varying depths that all have the same capacity but different implicitly defined representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding linear layers to a ReLU network yields a representation cost that favors functions that can be approximated by a low-rank linear operator composed with a function with low representation cost using a two-layer network. Specifically, using a neural network to fit training data with minimum representation cost yields an interpolating function that is nearly constant in directions orthogonal to a low-dimensional subspace. This means that the learned network will approximately be a single- or multiple-index model. Our experiments show that when this active subspace structure exists in the data, adding linear layers can improve generalization and result in a network that is well-aligned with the true active subspace.
This paper characterizes the trade-offs between information and energy transmission over an additive white Gaussian noise channel in the finite block-length regime with finite sets of channel input symbols. These trade-offs are characterized using impossibility and achievability bounds on the information transmission rate, energy transmission rate, decoding error probability (DEP) and energy outage probability (EOP) for a finite block-length code. Given a set of channel input symbols, the impossibility results identify the tuples of information rate, energy rate, DEP and EOP that cannot be achieved by any code using the given set of channel inputs. A novel method for constructing a family of codes that satisfy a target information rate, energy rate, DEP and EOP is also proposed. The achievability bounds identify the set of tuples of information rate, energy rate, DEP and EOP that can be simultaneously achieved by the constructed family of codes. The proposed construction matches the impossibility bounds for the information rate, energy rate, and the EOP. However, for a given information rate, energy rate and EOP, the achieved DEP is higher than the impossibility bound due to the choice of the decoding sets made during the code construction.
The examination of uncertainty in the predictions of machine learning (ML) models is receiving increasing attention. One uncertainty modeling technique used for this purpose is Monte-Carlo (MC)-Dropout, where repeated predictions are generated for a single input. Therefore, clustering is required to describe the resulting uncertainty, but only through efficient clustering is it possible to describe the uncertainty from the model attached to each object. This article uses Bayesian Gaussian Mixture (BGM) to solve this problem. In addition, we investigate different values for the dropout rate and other techniques, such as focal loss and calibration, which we integrate into the Mask-RCNN model to obtain the most accurate uncertainty approximation of each instance and showcase it graphically.
Time-sensitive networks require timely and accurate monitoring of the status of the network. To achieve this, many devices send packets periodically, which are then aggregated and forwarded to the controller. Bounding the aggregate burstiness of the traffic is then crucial for effective resource management. In this paper, we are interested in bounding this aggregate burstiness for independent and periodic flows. A deterministic bound is tight only when flows are perfectly synchronized, which is highly unlikely in practice and would be overly pessimistic. We compute the probability that the aggregate burstiness exceeds some value. When all flows have the same period and packet size, we obtain a closed-form bound using the Dvoretzky-Kiefer-Wolfowitz inequality. In the heterogeneous case, we group flows and combine the bounds obtained for each group using the convolution bound. Our bounds are numerically close to simulations and thus fairly tight. The resulting aggregate burstiness estimated for a non-zero violation probability is considerably smaller than the deterministic one: it grows in $\sqrt{n\log{n}}$, instead of $n$, where $n$ is the number of flows.
Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting and chopping off the positive orthant that RM+ works in. We show that these fixes are sufficient to get $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM+ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM+ and prove desirable results akin to recent works for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM+-based algorithms in matrix and extensive-form games.
Linear discriminant analysis (LDA) has been a useful tool in pattern recognition and data analysis research and practice. While linearity of class boundaries cannot always be expected, nonlinear projections through pre-trained deep neural networks have served to map complex data onto feature spaces in which linear discrimination has served well. The solution to binary LDA is obtained by eigenvalue analysis of within-class and between-class scatter matrices. It is well known that the multiclass LDA is solved by an extension to the binary LDA, a generalised eigenvalue problem, from which the largest subspace that can be extracted is of dimension one lower than the number of classes in the given problem. In this paper, we show that, apart from the first of the discriminant directions, the generalised eigenanalysis solution to multiclass LDA does neither yield orthogonal discriminant directions nor maximise discrimination of projected data along them. Surprisingly, to the best of our knowledge, this has not been noted in decades of literature on LDA. To overcome this drawback, we present a derivation with a strict theoretical support for sequentially obtaining discriminant directions that are orthogonal to previously computed ones and maximise in each step the Fisher criterion. We show distributions of projections along these axes and demonstrate that discrimination of data projected onto these discriminant directions has optimal separation, which is much higher than those from the generalised eigenvectors of the multiclass LDA. Using a wide range of benchmark tasks, we present a comprehensive empirical demonstration that on a number of pattern recognition and classification problems, the optimal discriminant subspaces obtained by the proposed method, referred to as GO-LDA (Generalised Optimal LDA), can offer superior accuracy.
The need for deep neural network (DNN) models with higher performance and better functionality leads to the proliferation of very large models. Model training, however, requires intensive computation time and energy. Memristor-based compute-in-memory (CIM) modules can perform vector-matrix multiplication (VMM) in situ and in parallel, and have shown great promises in DNN inference applications. However, CIM-based model training faces challenges due to non-linear weight updates, device variations, and low-precision in analog computing circuits. In this work, we experimentally implement a mixed-precision training scheme to mitigate these effects using a bulk-switching memristor CIM module. Lowprecision CIM modules are used to accelerate the expensive VMM operations, with high precision weight updates accumulated in digital units. Memristor devices are only changed when the accumulated weight update value exceeds a pre-defined threshold. The proposed scheme is implemented with a system-on-chip (SoC) of fully integrated analog CIM modules and digital sub-systems, showing fast convergence of LeNet training to 97.73%. The efficacy of training larger models is evaluated using realistic hardware parameters and shows that that analog CIM modules can enable efficient mix-precision DNN training with accuracy comparable to full-precision software trained models. Additionally, models trained on chip are inherently robust to hardware variations, allowing direct mapping to CIM inference chips without additional re-training.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.