We prove an exponential size separation between depth 2 and depth 3 neural networks (with real inputs), when approximating a $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds that were used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, the target accuracy, or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension, whereas in our result these parameters are fixed to be \emph{constants} independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst-case to average-case random self-reducibility argument, allowing us to leverage lower bounds for depth 2 threshold circuits in a new domain.
Deep neural networks are widely used today because they can handle a variety of complex tasks, and this generality makes them powerful tools in modern technology. However, deep neural networks are often overparameterized, and running these large models consumes substantial computational resources. In this paper, we introduce \textbf{T}ill the \textbf{L}ayers \textbf{C}ollapse (TLC), a method that compresses deep neural networks through the lens of batch normalization layers. By reducing the depth of these networks, our method decreases their computational requirements and overall latency. We validate TLC on popular models such as Swin-T, MobileNet-V2, and RoBERTa, across both image classification and natural language processing (NLP) tasks.
Let $G$ be a graph of a network system with vertices, $V(G)$, representing physical locations and edges, $E(G)$, representing informational connectivity. A \emph{locating-dominating (LD)} set $S \subseteq V(G)$ is a subset of vertices representing detectors capable of sensing an "intruder" at precisely their location or somewhere in their open-neighborhood -- an LD set must be capable of locating an intruder anywhere in the graph. We explore three types of fault-tolerant LD sets: \emph{redundant LD} sets, which allow a detector to be removed, \emph{error-detecting LD} sets, which allow at most one false negative, and \emph{error-correcting LD} sets, which allow at most one error (false positive or negative). In particular, we determine lower and upper bounds for the minimum density of these three fault-tolerant locating-dominating sets in the \emph{infinite king grid}.
Schr\"{o}dinger Bridges (SB) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. Furthermore, our method generalizes naturally to broader classes of dynamical systems, such as controllable Linear Time-Varying systems, which cannot currently be handled by traditional neural SB approaches. We showcase the potential of this approach in low- to moderate-dimensional problems, such as image-to-image translation in the latent space of an autoencoder, and various other examples. We also benchmark our approach on an Entropic Optimal Transport (EOT) problem and show that it outperforms state-of-the-art methods in cases where the boundary distributions are mixture models, while requiring virtually no training.
Deep generative models, which aim to reproduce a given data distribution in order to produce novel samples, have made unprecedented advancements in recent years. Their technical breakthroughs have enabled unparalleled quality in the synthesis of visual content. However, one critical prerequisite for their tremendous success is the availability of a sufficient number of training samples, which requires massive computational resources. When trained on limited data, generative models tend to suffer from severe performance deterioration due to overfitting and memorization. Accordingly, researchers have recently devoted considerable attention to developing novel models capable of generating plausible and diverse images from limited training data. Despite numerous efforts to enhance training stability and synthesis quality in limited-data scenarios, there is no systematic survey that provides 1) a clear problem definition, the critical challenges, and a taxonomy of the various tasks; 2) an in-depth analysis of the pros, cons, and remaining limitations of the existing literature; and 3) a thorough discussion of potential applications and future directions in the field of image synthesis under limited data. To fill this gap and provide an informative introduction for researchers who are new to this topic, this survey offers a comprehensive review and a novel taxonomy of the development of image synthesis under limited data. In particular, it covers the problem definition, requirements, main solutions, popular benchmarks, and remaining challenges in a comprehensive and all-around manner.
Parallel kinematic manipulators (PKM) are characterized by closed kinematic loops, due to the parallel arrangement of limbs but also due to the existence of kinematic loops within the limbs. Moreover, many PKM are built with limbs constructed by serially combining kinematic loops. Such limbs are called hybrid, and they form a particular class of complex limbs. Design and model-based control require accurate dynamic PKM models, desirably without model simplifications. Dynamics modeling then necessitates the kinematic relations of all members of the PKM, in contrast to the standard kinematics modeling of PKM, where only the forward and inverse kinematics solutions for the manipulator (relating input and output motions) are computed. This becomes more involved for PKM with hybrid limbs. In this paper a modular modeling approach is employed, where limbs are treated separately and the individual dynamic equations of motion (EOM) are subsequently assembled into the overall model. Key to the kinematic modeling is the constraint resolution for the individual loops within the limbs. This local constraint resolution is a special case of the general \emph{constraint embedding} technique. The proposed method finally allows for a systematic modeling of general PKM. The method is demonstrated for the IRSBot-2, where each limb comprises two independent loops.
Cyber resilience is the ability of a system to resist and recover from a cyber attack, thereby restoring the system's functionality. Effective design and development of a cyber resilient system requires experimental methods and tools for quantitative measurement of cyber resilience. This paper describes an experimental method and test bed for obtaining resilience-relevant data as a system (in our case -- a truck) traverses its route, in repeatable, systematic experiments. We model a truck equipped with an autonomous cyber-defense system that also includes inherent physical resilience features. When attacked by malware, this ensemble of cyber-physical features (i.e., "bonware") strives to resist and recover from the performance degradation caused by the malware's attack. We propose parsimonious mathematical models to aid in quantifying a system's resilience to cyber attacks. Using these models, we identify quantitative characteristics obtainable from experimental data, and show that these characteristics can serve as useful quantitative measures of cyber resilience.
Multimedia recommendation incorporates various modalities (e.g., images, texts, etc.) into user or item representations to improve recommendation quality, and self-supervised learning has carried multimedia recommendation to a plateau of performance because of its strength in aligning different modalities. However, a growing body of research finds that aligning all modal representations is suboptimal because it damages the unique attributes of each modality. These studies use subtraction and orthogonal constraints in geometric space to learn the unique parts. However, our rigorous analysis reveals flaws in this approach: subtraction does not necessarily yield the desired modal-unique features, and orthogonal constraints are ineffective in the high-dimensional representation spaces of users and items. To address these weaknesses, we propose Separate Learning (SEA) for multimedia recommendation, which centers on a mutual-information view of modal-unique and modal-generic learning. Specifically, we first use a GNN to learn the representations of users and items in different modalities and split each modal representation into generic and unique parts. We employ the contrastive log-ratio upper bound to minimize the mutual information between the generic and unique parts within the same modality, distancing their representations and thus learning modal-unique features. Then, we design Solosimloss to maximize the lower bound of mutual information, aligning the generic parts of different modalities and thus learning higher-quality modal-generic features. Finally, extensive experiments on three datasets demonstrate the effectiveness and generalization of our proposed framework. The code is available at SEA, along with the full training record of the main experiment.
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
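The two identities underlying the analysis above are standard: the total variance equals the mean squared norm minus the squared norm of the mean, and, for clustered data, it decomposes exactly into within-cluster plus between-cluster variance. A minimal NumPy sketch can confirm both numerically; the synthetic "embeddings" (three token clusters, 50 contexts each, dimension 8) are illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "contextualized embeddings": 3 token types (clusters), 50 contexts each, dim 8.
clusters = [rng.normal(loc=c, scale=1.0, size=(50, 8)) for c in (-1.0, 0.0, 1.0)]
all_emb = np.vstack(clusters)

grand_mean = all_emb.mean(axis=0)
# Total variance: mean squared distance of all embeddings from the grand mean.
total_var = np.mean(np.sum((all_emb - grand_mean) ** 2, axis=1))

# Identity 1: total variance = mean squared norm - squared norm of the mean.
second_moment = np.mean(np.sum(all_emb ** 2, axis=1))
assert np.isclose(total_var, second_moment - np.sum(grand_mean ** 2))

# Identity 2 (equal-size clusters): total = within-cluster + between-cluster variance.
within = np.mean([np.sum((c - c.mean(axis=0)) ** 2, axis=1).mean() for c in clusters])
between = np.mean([np.sum((c.mean(axis=0) - grand_mean) ** 2) for c in clusters])
assert np.isclose(total_var, within + between)
```

The first identity is what links the norm of the mean embedding to the variance (for a fixed second moment, a mean closer to the origin forces a larger variance); the second is the decomposition used when token embedding sets are treated as clusters.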
The existence of representative datasets is a prerequisite for many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a major challenge. Leveraging additional, already existing sources of knowledge is key to overcoming the limitations of purely data-driven approaches, and eventually to increasing the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories of integration, extraction, and conformity. Special attention is given to applications in the field of autonomous driving.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch can lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style and illumination, and 2) the instance-level shift, such as object appearance and size. We build our approach on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on the image level and the instance level, to reduce the domain discrepancy. The two domain adaptation components are based on $\mathcal{H}$-divergence theory, and are implemented by learning a domain classifier in an adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach on multiple datasets, including Cityscapes, KITTI, and SIM10K. The results demonstrate the effectiveness of our approach for robust object detection in various domain shift scenarios.