We investigate the function-space optimality (specifically, the Banach-space optimality) of a large class of shallow neural architectures with multivariate nonlinearities/activation functions. To that end, we construct a new family of Banach spaces defined via a regularization operator, the $k$-plane transform, and a sparsity-promoting norm. We prove a representer theorem that states that the solution sets to learning problems posed over these Banach spaces are completely characterized by neural architectures with multivariate nonlinearities. These optimal architectures have skip connections and are tightly connected to orthogonal weight normalization and multi-index models, both of which have received recent interest in the neural network community. Our framework is compatible with a number of classical nonlinearities including the rectified linear unit (ReLU) activation function, the norm activation function, and the radial basis functions found in the theory of thin-plate/polyharmonic splines. We also show that the underlying spaces are special instances of reproducing kernel Banach spaces and variation spaces. Our results shed light on the regularity of functions learned by neural networks trained on data, particularly with multivariate nonlinearities, and provide new theoretical motivation for several architectural choices found in practice.
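To make the characterized architectures concrete, the sketch below (NumPy) implements a shallow network whose neurons apply a multivariate nonlinearity, here the Euclidean-norm activation, to low-dimensional projections with orthonormal rows, together with an affine skip connection. The shapes, the choice of activation, and the weight construction are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch (not the paper's construction): a shallow network whose
# neurons apply a multivariate nonlinearity to a low-dimensional projection
# W_k @ x (a "multi-index" unit), plus an affine skip connection.
import numpy as np

def norm_activation(z):
    # Euclidean-norm nonlinearity acting on a multivariate argument.
    return np.linalg.norm(z, axis=-1)

def shallow_multivariate_net(x, weights, coeffs, skip_a, skip_b):
    # x: (d,) input; weights: list of (k, d) projection matrices with
    # orthonormal rows (cf. orthogonal weight normalization); coeffs: (K,).
    out = skip_a @ x + skip_b                       # affine skip connection
    for W_k, v_k in zip(weights, coeffs):
        out = out + v_k * norm_activation(W_k @ x)  # multivariate neuron
    return out

# Tiny usage example with hypothetical shapes.
rng = np.random.default_rng(0)
d, k, K = 5, 2, 3
weights = []
for _ in range(K):
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns
    weights.append(Q.T)                                # -> orthonormal rows
coeffs = rng.standard_normal(K)
x = rng.standard_normal(d)
print(shallow_multivariate_net(x, weights, coeffs, np.ones(d), 0.0))
```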
Heterogeneous Bayesian decentralized data fusion captures the set of problems in which two robots must combine two probability density functions over non-equal, but overlapping, sets of random variables. In the context of multi-robot dynamic systems, this enables robots to take a "divide and conquer" approach to reason and share data over complementary tasks instead of over the full joint state space. For example, in a target tracking application, this allows robots to track different subsets of targets and share data only on common targets. This paper presents a framework by which robots can each use a local factor graph to represent relevant partitions of a complex global joint probability distribution, thus allowing them to avoid reasoning over the entirety of a more complex model and to save communication as well as computation costs. From a theoretical point of view, this paper makes contributions by casting the heterogeneous decentralized fusion problem in terms of a factor graph, analyzing the challenges that arise due to dynamic filtering, and then developing a new conservative filtering algorithm that ensures statistical correctness. From a practical point of view, we show how this framework can be used to represent different multi-robot applications and then test it with simulations and hardware experiments to validate and demonstrate its statistical conservativeness, applicability, and robustness to real-world challenges.
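As a generic illustration of fusing estimates over only the common variables, the sketch below (NumPy) applies covariance intersection, a standard conservative fusion rule, to the shared marginal of two robots' Gaussian estimates; it is not the conservative filtering algorithm developed in the paper, and the target assignments are hypothetical.

```python
# A minimal, generic sketch of conservative fusion over only the *common*
# variables of two robots' Gaussian estimates (covariance intersection).
# This standard rule is shown for illustration, not the paper's algorithm.
import numpy as np

def covariance_intersection(m1, P1, m2, P2, w=0.5):
    # Fuse two Gaussian estimates of the same variables conservatively.
    I1, I2 = np.linalg.inv(P1), np.linalg.inv(P2)
    P = np.linalg.inv(w * I1 + (1.0 - w) * I2)
    m = P @ (w * I1 @ m1 + (1.0 - w) * I2 @ m2)
    return m, P

# Robot A tracks targets {1, 2}; robot B tracks targets {2, 3}.
# Only the marginal over the shared target 2 is exchanged and fused.
mA, PA = np.array([0.0, 1.0]), np.diag([1.0, 2.0])   # targets 1, 2
mB, PB = np.array([1.5, 3.0]), np.diag([0.5, 4.0])   # targets 2, 3
m_common, P_common = covariance_intersection(mA[1:], PA[1:, 1:], mB[:1], PB[:1, :1])
print(m_common, P_common)
```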
Specialized function gradient computing hardware could greatly improve the performance of state-of-the-art optimization algorithms, e.g., those based on gradient descent or conjugate gradient methods that are at the core of control, machine learning, and operations research applications. Prior work on such hardware, performed in the context of Ising machines and related concepts, is limited to quadratic polynomials and is not scalable to commonly used higher-order functions. Here, we propose a novel approach for massively parallel gradient calculations of high-degree polynomials, which is conducive to efficient mixed-signal in-memory computing circuit implementations and whose area complexity scales linearly with the number of variables and terms in the function and, most importantly, is independent of its degree. Two flavors of such an approach are proposed. The first is limited to binary-variable polynomials typical of combinatorial optimization problems, while the second is broader at the cost of a more complex periphery. To validate the former approach, we experimentally demonstrate solving a small-scale third-order Boolean satisfiability problem with a competitive heuristic algorithm on integrated metal-oxide memristor crossbar circuits, one of the most promising in-memory computing device technologies. Simulation results for larger-scale, more practical problems show orders-of-magnitude improvements in area, with related advantages in speed and energy efficiency, compared to the state of the art. We discuss how our work could enable even higher-performance systems after co-designing algorithms to exploit massively parallel gradient computation.
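The sketch below (NumPy) illustrates the term-parallel structure of such gradient computations for a multilinear, binary-variable polynomial; the incidence-matrix formulation is a software stand-in for illustration only and does not reproduce the mixed-signal crossbar mapping.

```python
# Sketch: all partial derivatives of a multilinear (binary-variable) polynomial
# f(x) = sum_t c_t * prod_{i in S_t} x_i, computed term-parallel with a
# 0/1 term-variable incidence matrix A (rows = terms, columns = variables).
import numpy as np

def polynomial_gradient(A, c, x):
    # A: (T, n) incidence matrix, c: (T,) coefficients, x: (n,) in {0, 1}.
    # For each term t containing variable i, the contribution to df/dx_i is
    # c_t * prod_{j in S_t, j != i} x_j.
    T, n = A.shape
    grad = np.zeros(n)
    for i in range(n):                      # each column could be a parallel lane
        mask = A.copy()
        mask[:, i] = 0                      # exclude x_i from its own terms
        prods = np.prod(np.where(mask == 1, x, 1.0), axis=1)
        grad[i] = np.sum(A[:, i] * c * prods)
    return grad

# Example: f = 2*x0*x1*x2 - 3*x1*x3 (degree 3), so df/dx0 = 2*x1*x2, etc.
A = np.array([[1, 1, 1, 0],
              [0, 1, 0, 1]])
c = np.array([2.0, -3.0])
x = np.array([1, 1, 0, 1])
print(polynomial_gradient(A, c, x))   # -> [0., -3., 2., -3.]
```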
The loss function plays an important role in optimizing the performance of a learning system. A crucial aspect of the loss function is the assignment of sample weights within a mini-batch during loss computation. In the context of continual learning (CL), most existing strategies treat samples uniformly when calculating the loss value, thereby assigning equal weights to each sample. While this approach can be effective on certain standard benchmarks, whether it is optimal, particularly in more complex scenarios, remains underexplored. This is particularly pertinent when training "in the wild," such as with self-training, where labeling is automated using a reference model. This paper introduces the Online Meta-learning for Sample Importance (OMSI) strategy, which approximates sample weights for a mini-batch in an online CL stream using an inner- and meta-update mechanism. This is done by first estimating sample-weight parameters for each sample in the mini-batch and then updating the model with the adapted sample weights. We evaluate OMSI in two distinct experimental settings. First, we show that OMSI enhances both learning and retained accuracy on a controlled noisy-labeled data stream. Then, we test the strategy on three standard benchmarks and compare it with other popular replay-based strategies. This research aims to foster ongoing exploration in the area of self-adaptive CL.
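A minimal sketch of the inner- and meta-update idea is given below (PyTorch): per-sample weights for the current mini-batch are learned by differentiating, through one weighted inner step, a meta loss computed on a replay batch. The tiny linear model, the use of a replay batch for the meta loss, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of learning per-sample weights via an inner update followed by a
# meta update, in the spirit of OMSI. Model, data, and hyperparameters are
# illustrative placeholders.
import torch
import torch.nn.functional as F

def learn_sample_weights(w, b, x_batch, y_batch, x_meta, y_meta,
                         inner_lr=0.1, weight_lr=1.0):
    s = torch.zeros(x_batch.size(0), requires_grad=True)   # weight logits
    # Inner update: one weighted gradient step on the current mini-batch.
    per_sample = F.cross_entropy(x_batch @ w + b, y_batch, reduction="none")
    inner_loss = (torch.softmax(s, dim=0) * per_sample).sum()
    gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
    w_adapted, b_adapted = w - inner_lr * gw, b - inner_lr * gb
    # Meta update: evaluate adapted parameters on a replay/held-out batch and
    # backpropagate through the inner step to the sample-weight logits s.
    meta_loss = F.cross_entropy(x_meta @ w_adapted + b_adapted, y_meta)
    (gs,) = torch.autograd.grad(meta_loss, (s,))
    return torch.softmax(s - weight_lr * gs, dim=0).detach()

# Usage: learn the weights, then update the model with the weighted loss.
torch.manual_seed(0)
w = torch.randn(8, 4, requires_grad=True)
b = torch.zeros(4, requires_grad=True)
x, y = torch.randn(16, 8), torch.randint(0, 4, (16,))
x_mem, y_mem = torch.randn(16, 8), torch.randint(0, 4, (16,))
weights = learn_sample_weights(w, b, x, y, x_mem, y_mem)
opt = torch.optim.SGD([w, b], lr=0.1)
loss = (weights * F.cross_entropy(x @ w + b, y, reduction="none")).sum()
loss.backward()
opt.step()
print(weights)
```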
Numerical models have long been used to understand geoscientific phenomena, including tidal currents, which are crucial for renewable energy production and coastal engineering. However, their computational cost hinders the generation of data at varying resolutions. As an alternative, deep learning-based downscaling methods have gained traction due to their faster inference speeds. However, most of them are limited to inference at a fixed scale and overlook important characteristics of the target geoscientific data. In this paper, we propose a novel downscaling framework for tidal current data that addresses its unique characteristics, which differ from those of natural images: heterogeneity and local dependency. Moreover, our framework can generate output at any arbitrary scale by utilizing a continuous representation model. Our proposed framework improves flow velocity predictions by 93.21% (MSE) and 63.85% (MAE) over the baseline model while achieving a 33.2% reduction in FLOPs.
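The sketch below (PyTorch) illustrates the arbitrary-scale idea with a LIIF-style continuous decoder: features from a low-resolution grid are sampled at any query coordinate and decoded by an MLP. The network sizes and the bilinear feature sampling are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of arbitrary-scale output via a continuous representation:
# low-resolution features are queried at any coordinate and decoded by an MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousDecoder(nn.Module):
    def __init__(self, feat_dim=32, out_dim=2):   # out_dim=2: (u, v) velocity
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, feat_grid, coords):
        # feat_grid: (N, C, H, W) low-res features; coords: (N, Q, 2) in [-1, 1].
        grid = coords.unsqueeze(1)                       # (N, 1, Q, 2)
        sampled = F.grid_sample(feat_grid, grid, mode="bilinear",
                                align_corners=False)     # (N, C, 1, Q)
        sampled = sampled.squeeze(2).transpose(1, 2)     # (N, Q, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))

# Query the same low-res field at two different output resolutions.
feat = torch.randn(1, 32, 16, 16)
decoder = ContinuousDecoder()
for side in (32, 77):                                    # any scale, not just 2x
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, side),
                            torch.linspace(-1, 1, side), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2)
    print(decoder(feat, coords).shape)                   # (1, side*side, 2)
```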
Proximal causal inference is a recently proposed framework for evaluating causal effects in the presence of unmeasured confounding. For point identification of causal effects, it leverages a pair of so-called treatment and outcome confounding proxy variables to identify a bridge function that matches the dependence of potential outcomes or treatment variables on the hidden factors to corresponding functions of observed proxies. Unique identification of a causal effect via a bridge function crucially requires that the proxies are sufficiently relevant for the hidden factors, a requirement that has previously been formalized as a completeness condition. However, completeness is well known not to be empirically testable, and even when a bridge function is well defined, a lack of completeness, sometimes manifested by the availability of only a single type of proxy, may severely limit prospects for identifying a bridge function and thus a causal effect, potentially restricting the application of the proximal causal framework. In this paper, we propose partial identification methods that do not require completeness and obviate the need to identify a bridge function. That is, we establish that proxies of unobserved confounders can be leveraged to obtain bounds on the causal effect of the treatment on the outcome even if the available information does not suffice to identify either a bridge function or a corresponding causal effect of interest. Our bounds are non-smooth functionals of the observed data distribution. Consequently, for inference we first provide a smooth approximation of our bounds and then construct bootstrap confidence intervals for the approximated bounds. We further establish analogous partial identification results in related settings where identification hinges upon hidden mediators for which proxies are available.
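The inferential recipe, smoothing a non-smooth bound and then bootstrapping, can be illustrated generically as below (NumPy); the toy max-type "bound" over strata of a proxy is only a placeholder for the bounds derived in the paper.

```python
# Generic illustration of the inferential recipe described above: a
# non-smooth, max-type bound is replaced by a smooth log-sum-exp
# approximation and then bootstrapped. The toy "bound" over strata of a
# proxy variable is a placeholder, not the paper's identified bound.
import numpy as np

def smooth_max(values, temperature=0.05):
    # log-sum-exp upper-smooths max(values); smaller temperature = tighter.
    return temperature * np.log(np.sum(np.exp(np.asarray(values) / temperature)))

def toy_upper_bound(data, temperature=0.05):
    # Placeholder bound: smoothed maximum over proxy strata of a mean outcome.
    strata_means = [data[data[:, 0] == z, 1].mean() for z in np.unique(data[:, 0])]
    return smooth_max(strata_means, temperature)

def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    reps = [statistic(data[rng.integers(0, len(data), len(data))])
            for _ in range(n_boot)]
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(1)
proxy = rng.integers(0, 3, size=500)                 # a discrete proxy Z
outcome = rng.normal(loc=proxy * 0.2, scale=1.0)     # an outcome Y
data = np.column_stack([proxy, outcome])
print(toy_upper_bound(data), bootstrap_ci(data, toy_upper_bound))
```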
The assumption that data are invariant under the action of a compact group is implicit in many statistical modeling assumptions such as normality, or the assumption of independence and identical distributions. Hence, testing for the presence of such invariances offers a principled way to falsify various statistical models. In this article, we develop sequential, anytime-valid tests of distributional symmetry under the action of general compact groups. The tests that are developed allow for the continuous monitoring of data as it is collected while keeping type-I error guarantees, and include tests for exchangeability and rotational symmetry as special cases. The main tool to this end is the machinery developed for conformal prediction. The resulting test statistic, called a conformal martingale, can be interpreted as a likelihood ratio. We use this interpretation to show that the test statistics are optimal -- in a specific log-optimality sense -- against certain alternatives. Furthermore, we draw a connection between conformal prediction, anytime-valid tests of distributional invariance, and current developments on anytime-valid testing. In particular, we extend existing anytime-valid tests of independence, which leverage exchangeability, to work under general group invariances. Additionally, we discuss testing for invariance under subgroups of the permutation group and the orthogonal group, the latter of which corresponds to testing the assumptions behind linear regression models.
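A small sketch of a conformal test martingale for exchangeability is given below (NumPy): smoothed conformal p-values are combined through the power betting function b_eps(p) = eps * p**(eps - 1). The nonconformity score (distance to the bag mean) and the value of eps are illustrative choices.

```python
# Sketch of a conformal test martingale for exchangeability. Log-wealth is
# accumulated to avoid overflow; large positive values are evidence against
# exchangeability. Score and eps are illustrative choices.
import numpy as np

def conformal_martingale_logwealth(xs, eps=0.2, seed=0):
    rng = np.random.default_rng(seed)
    log_wealth = 0.0
    for t in range(1, len(xs) + 1):
        bag = np.asarray(xs[:t])
        scores = np.abs(bag - bag.mean())       # symmetric nonconformity scores
        last, u = scores[-1], rng.uniform()
        # Smoothed conformal p-value of the newest observation.
        p = (np.sum(scores > last) + u * np.sum(scores == last)) / t
        log_wealth += np.log(eps) + (eps - 1.0) * np.log(p)   # power betting
    return log_wealth

rng = np.random.default_rng(42)
iid = rng.normal(size=300)                               # exchangeable stream
drift = rng.normal(size=300) + np.linspace(0, 3, 300)    # violates exchangeability
print(conformal_martingale_logwealth(iid), conformal_martingale_logwealth(drift))
```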
The development of nonlinear optimization algorithms capable of performing reliably in the presence of noise has garnered considerable attention lately. This paper advocates for strategies to create noise-tolerant nonlinear optimization algorithms by adapting classical deterministic methods. These adaptations follow certain design guidelines described here, which make use of estimates of the noise level in the problem. The application of our methodology is illustrated by the development of a line search gradient projection method, which is tested on an engineering design problem. It is shown that a new self-calibrated line search and noise-aware finite-difference techniques are effective even in the high noise regime. Numerical experiments investigate the resiliency of key algorithmic components. A convergence analysis of the line search gradient projection method establishes convergence to a neighborhood of the solution.
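The flavor of such adaptations can be sketched as below (NumPy): a projected-gradient step with a backtracking line search whose sufficient-decrease test is relaxed by an estimate eps_f of the noise in the objective, and a finite-difference interval chosen from the same estimate. The relaxation form, constants, and toy problem are illustrative, not the paper's exact rules.

```python
# Sketch of a noise-tolerant projected-gradient method: the Armijo test is
# relaxed by 2*eps_f and the finite-difference interval is scaled with the
# noise level. Constants and the toy bound-constrained problem are illustrative.
import numpy as np

def project(x, lo, hi):
    return np.clip(x, lo, hi)

def noisy_f(x, rng):                      # noisy objective (toy)
    return np.sum((x - 0.7) ** 2) + 1e-3 * rng.normal()

def noise_tolerant_pg(x0, lo, hi, eps_f, rng, iters=50, alpha0=1.0, c1=1e-4):
    x = project(np.asarray(x0, float), lo, hi)
    h = 2.0 * np.sqrt(eps_f)                              # noise-aware FD interval
    for _ in range(iters):
        # Noise-aware forward-difference gradient estimate.
        g = np.array([(noisy_f(x + h * e, rng) - noisy_f(x, rng)) / h
                      for e in np.eye(len(x))])
        fx, alpha = noisy_f(x, rng), alpha0
        while alpha > 1e-8:
            x_new = project(x - alpha * g, lo, hi)
            # Relaxed Armijo condition: allow an extra 2*eps_f of slack.
            if noisy_f(x_new, rng) <= fx - c1 * np.dot(g, x - x_new) + 2.0 * eps_f:
                break
            alpha *= 0.5
        x = x_new
    return x

rng = np.random.default_rng(0)
print(noise_tolerant_pg([0.0, 0.0], lo=0.0, hi=0.5, eps_f=1e-3, rng=rng))
# Expected to approach the projected solution (0.5, 0.5).
```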
Markov processes are widely used mathematical models for describing dynamic systems in various fields. However, accurately simulating large-scale systems at long time scales is computationally expensive due to the short time steps required for accurate integration. In this paper, we introduce an inference process that maps complex systems into a simplified representational space and models large jumps in time. To achieve this, we propose Time-lagged Information Bottleneck (T-IB), a principled objective rooted in information theory, which aims to capture relevant temporal features while discarding high-frequency information to simplify the simulation task and minimize the inference error. Our experiments demonstrate that T-IB learns information-optimal representations for accurately modeling the statistical properties and dynamics of the original process at a selected time lag, outperforming existing time-lagged dimensionality reduction methods.
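A compact sketch of a variational, time-lagged bottleneck in this spirit is given below (PyTorch): x_t is encoded into a stochastic representation that must predict the state at t + tau, while a KL term limits the information retained. The loss form and networks are illustrative surrogates, not the exact T-IB objective.

```python
# Variational time-lagged bottleneck sketch: encode x_t, predict x_{t+tau},
# and penalize the KL of the encoder against a standard-normal prior.
import torch
import torch.nn as nn

class TimeLaggedBottleneck(nn.Module):
    def __init__(self, x_dim, z_dim=2, beta=1e-2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 2 * z_dim))   # mean and log-var
        self.predictor = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                       nn.Linear(64, x_dim))
        self.beta = beta

    def loss(self, x_t, x_lag):
        mu, log_var = self.encoder(x_t).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterize
        recon = ((self.predictor(z) - x_lag) ** 2).sum(dim=-1).mean()  # lagged prediction
        kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(dim=-1).mean()
        return recon + self.beta * kl

# Usage on a toy trajectory with lag tau = 10.
traj = torch.randn(1000, 3).cumsum(dim=0)     # hypothetical 3-D trajectory
tau = 10
model = TimeLaggedBottleneck(x_dim=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = model.loss(traj[:-tau], traj[tau:])
loss.backward()
opt.step()
print(float(loss))
```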
The recent introduction of the Least-Squares Support Vector Regression (LS-SVR) algorithm for solving differential and integral equations has sparked interest. In this study, we expand the application of this algorithm to address systems of differential-algebraic equations (DAEs). Our work presents a novel approach to solving general DAEs in an operator format by establishing connections between the LS-SVR machine learning model, weighted residual methods, and Legendre orthogonal polynomials. To assess the effectiveness of our proposed method, we conduct simulations involving various DAE scenarios, such as nonlinear systems, fractional-order derivatives, integro-differential, and partial DAEs. Finally, we carry out comparisons between our proposed method and currently established state-of-the-art approaches, demonstrating its reliability and effectiveness.
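The weighted-residual and Legendre-polynomial ingredients can be illustrated on a toy semi-explicit index-1 DAE solved by linear least squares, as sketched below (NumPy); the LS-SVR dual formulation itself is not reproduced.

```python
# Sketch of the weighted-residual / Legendre flavor of the method on a toy
# semi-explicit index-1 DAE:  x'(t) = y(t),  x(t) + y(t) = exp(t),
# x(-1) = exp(-1)/2, whose exact solution is x = y = exp(t)/2. Both unknowns
# are expanded in Legendre polynomials and the residuals are minimized in the
# least-squares sense. This shows only the collocation structure.
import numpy as np
from numpy.polynomial import legendre as L

N = 10                                        # polynomial degree
t = np.linspace(-1, 1, 40)                    # collocation points
V = L.legvander(t, N)                         # P_k(t_j), shape (40, N+1)
dV = np.column_stack([L.legval(t, L.legder(np.eye(N + 1)[k]))
                      for k in range(N + 1)])  # P_k'(t_j)

# Rows: differential residual, algebraic residual, initial condition.
A = np.block([[dV, -V],
              [V,   V],
              [L.legvander([-1.0], N), np.zeros((1, N + 1))]])
b = np.concatenate([np.zeros(len(t)), np.exp(t), [np.exp(-1.0) / 2.0]])

coef, *_ = np.linalg.lstsq(A, b, rcond=None)
c, d = coef[:N + 1], coef[N + 1:]
print(L.legval(0.5, c), np.exp(0.5) / 2.0)    # approximate vs. exact x(0.5)
```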
Polar codes are the first class of structured channel codes that achieve the symmetric capacity of binary channels with efficient encoding and decoding. In 2019, Arikan proposed a new polar coding scheme referred to as polarization-adjusted convolutional (PAC) codes. In contrast to polar codes, PAC codes precode the information word with a convolutional code prior to polar encoding. This yields a material coding gain over polar codes under Fano sequential decoding as well as successive cancellation list (SCL) decoding. Given the advantages of SCL decoding over Fano decoding in certain scenarios, such as the low-SNR regime or when a constraint on the worst-case decoding latency exists, in this paper we focus on SCL decoding and present a simplified SCL (SSCL) decoding algorithm for PAC codes. SSCL decoding of PAC codes reduces the decoding latency by identifying special nodes in the decoding tree and processing them at intermediate stages of the graph. Our simulation results show that the performance of PAC codes under SSCL decoding is nearly identical to that under SCL decoding, while having lower decoding latency.
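The encoding pipeline that SSCL decoding targets can be sketched as below (NumPy): rate profiling, rate-1 convolutional precoding, and the polar transform. The generator [1, 0, 1, 1, 0, 1, 1] (octal 133) is a commonly used choice in the PAC literature, the rate profile here is a random placeholder, and decoding is omitted.

```python
# Sketch of PAC encoding: rate profiling, rate-1 convolutional precoding,
# then the polar transform. SCL/SSCL decoding is not shown.
import numpy as np

def convolutional_precode(v, g):
    # Rate-1 binary convolution: u_i = sum_j g_j * v_{i-j} (mod 2).
    return np.array([np.sum(g[:i + 1] * v[i::-1][:len(g)]) % 2
                     for i in range(len(v))])

def polar_transform(u):
    n = int(np.log2(len(u)))
    G = np.array([[1, 0], [1, 1]])
    for _ in range(n - 1):
        G = np.kron(G, np.array([[1, 0], [1, 1]]))
    return (u @ G) % 2

g = np.array([1, 0, 1, 1, 0, 1, 1])          # convolution generator (octal 133)
N, K = 16, 8
rate_profile = np.argsort(np.random.default_rng(0).random(N))[:K]  # placeholder profile
v = np.zeros(N, dtype=int)
v[np.sort(rate_profile)] = np.random.default_rng(1).integers(0, 2, K)  # data bits
u = convolutional_precode(v, g)              # precoding (absent in plain polar codes)
x = polar_transform(u)                       # polar encoding
print(v, u, x, sep="\n")
```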