It is well known that the spectral gap of the down-up walk over an $n$-partite simplicial complex (also known as Glauber dynamics) cannot be better than $O(1/n)$ due to natural obstructions such as coboundaries. We study an alternative random walk over partite simplicial complexes known as the sequential sweep or the systematic scan Glauber dynamics: Whereas the down-up walk at each step selects a random coordinate and updates it based on the remaining coordinates, the sequential sweep goes through each of the coordinates one by one in a deterministic order and applies the same update operation. It is natural, thus, to compare $n$-steps of the down-up walk with a single step of the sequential sweep. Interestingly, while the spectral gap of the $n$-th power of the down-up walk is still bounded from above by a constant, under a strong enough local spectral assumption (in the sense of Gur, Lifschitz, Liu, STOC 2022) we can show that the spectral gap of this walk can be arbitrarily close to 1. We also study other isoperimetric inequalities for these walks, and show that under the assumptions of local entropy contraction (related to the considerations of Gur, Lifschitz, Liu), these walks satisfy an entropy contraction inequality.
We consider Rote words, which are infinite binary words with factor complexity $2n$. We prove that the repetition threshold for this class is $5/2$. Our technique is purely computational, using the Walnut theorem prover and a new technique for generating automata from morphisms due to the first author and his co-authors.
The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. It is therefore unsurprising that the practice of machine learning is rife with considerably richer algorithmic techniques for successfully controlling model capacity. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam's Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian reasoning. Most notably, we introduce an optimal learner which relaxes structural risk minimization on two dimensions: it allows the regularization function to be "local" to datapoints, and uses an unsupervised learning stage to learn this regularizer at the outset. We justify these relaxations by showing that they are necessary: removing either dimension fails to yield a near-optimal learner. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem's transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs -- judged using nodes' outdegrees minus a system of node-dependent credits -- characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.
Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem into a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.
Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating additional architectural components. In this paper, we propose an enhanced approach that boosts MAE performance by integrating pseudo labelling for both class and data tokens, alongside replacing the traditional pixel-level reconstruction with token-level reconstruction. This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network, while token reconstruction requires generation of discrete tokens encapturing local context. The targets for pseudo labelling and reconstruction needs to be generated by a teacher network. To disentangle the generation of target pseudo labels and the reconstruction of the token features, we decouple the teacher into two distinct models, where one serves as a labelling teacher and the other as a reconstruction teacher. This separation proves empirically superior to a single teacher, while having negligible impact on throughput and memory consumption. Incorporating pseudo-labelling as an auxiliary task has demonstrated notable improvements in ImageNet-1K and other downstream tasks, including classification, semantic segmentation, and detection.
Database systems are often confronted with queries that join many tables but ultimately only output comparatively small aggregate information. Despite all advances in query optimisation, the explosion of intermediate results as opposed to a much smaller final result challenges modern relational database management systems (DBMSs). In this work, we propose the integration of optimisation techniques into relational DBMSs that aim at minimising, and often entirely eliminating, the need for materialising join results for aggregate queries, provided that they satisfy certain conditions. Apart from novel logical optimisations aimed at practicability, we also provide new, natural, physical operators for combining joins and counting with the aim of reducing the size of intermediate results. We experimentally validate the efficacy of our optimisations through their implementation in Spark SQL, but we note that they are naturally applicable in any RDBMS. Our experiments show consistent significant speed-ups -- often by factor 2 and higher -- for analytical and graph queries. At the same time, we observe no performance degradation, even on queries which, from a theoretical point of view, are least amenable to the proposed optimisations.
2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use a decision layer fusion with multiple memory banks to avoid loss of information and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on MVTec-3D AD dataset. Code is available at //github.com/nomewang/M3DM.
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying and disentangling the underlying factors hidden in the observable data in representation form. The process of separating underlying factors of variation into variables with semantic meaning benefits in learning explainable representations of data, which imitates the meaningful understanding process of humans when observing an object or relation. As a general learning strategy, DRL has demonstrated its power in improving the model explainability, controlability, robustness, as well as generalization capacity in a wide range of scenarios such as computer vision, natural language processing, data mining etc. In this article, we comprehensively review DRL from various aspects including motivations, definitions, methodologies, evaluations, applications and model designs. We discuss works on DRL based on two well-recognized definitions, i.e., Intuitive Definition and Group Theory Definition. We further categorize the methodologies for DRL into four groups, i.e., Traditional Statistical Approaches, Variational Auto-encoder Based Approaches, Generative Adversarial Networks Based Approaches, Hierarchical Approaches and Other Approaches. We also analyze principles to design different DRL models that may benefit different tasks in practical applications. Finally, we point out challenges in DRL as well as potential research directions deserving future investigations. We believe this work may provide insights for promoting the DRL research in the community.
Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in \causalml into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.