Low-rankness plays an important role in traditional machine learning but has received less attention in deep learning. Most previous low-rank network compression methods compress networks by approximating a pre-trained model and then re-training. However, the optimal solution in Euclidean space may be quite different from the one on the low-rank manifold, so a well-pre-trained model is not a good initialization for a model with low-rank constraints, and the performance of a low-rank compressed network degrades significantly. Compared to other network compression methods such as pruning, low-rank methods have attracted less attention in recent years. In this paper, we devise a new training method, low-rank projection with energy transfer (LRPET), that trains low-rank compressed networks from scratch and achieves competitive performance. First, we propose to alternate between stochastic gradient descent training and projection onto the low-rank manifold. Compared to re-training a compact model, this enables full utilization of model capacity, since the solution space is relaxed back to Euclidean space after each projection. Second, the matrix energy (the sum of squares of singular values) reduction caused by projection is compensated by energy transfer: we uniformly transfer the energy of the pruned singular values to the remaining ones. We theoretically show that energy transfer eases the trend of gradient vanishing caused by projection. Third, we propose batch normalization (BN) rectification to cut off its effect on the optimal low-rank approximation of the weight matrix, which further improves performance. Comprehensive experiments on CIFAR-10 and ImageNet show that our method is superior to other low-rank compression methods and also outperforms recent state-of-the-art pruning methods. Our code is available at //github.com/BZQLin/LRPET.
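To make the projection-plus-energy-transfer step concrete, here is a minimal NumPy sketch of one plausible reading of the abstract: truncate the SVD to the target rank, then rescale the retained singular values so that the total matrix energy (sum of squared singular values) is preserved. The function name and the proportional rescaling rule are our assumptions; the paper's exact transfer rule and its interleaving with SGD training are not reproduced here.

```python
import numpy as np

def project_with_energy_transfer(W, rank):
    # Truncated SVD projection onto the set of rank-`rank` matrices.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    kept = s[:rank]
    total_energy = np.sum(s ** 2)        # energy of the full spectrum
    kept_energy = np.sum(kept ** 2)
    # Assumed transfer rule: rescale the retained singular values so the
    # truncated matrix carries the full energy of the original one.
    scale = np.sqrt(total_energy / kept_energy) if kept_energy > 0 else 1.0
    return (U[:, :rank] * (kept * scale)) @ Vt[:rank, :]

W = np.random.randn(64, 128)
W_lr = project_with_energy_transfer(W, 16)
# Rank drops to 16 while the Frobenius energy matches the original.
print(np.linalg.matrix_rank(W_lr), np.linalg.norm(W_lr), np.linalg.norm(W))
```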
In turbulence modeling, and more particularly in the Large-Eddy Simulation (LES) framework, we are concerned with finding closure models that represent the effect of the unresolved subgrid scales on the resolved scales. Recent approaches gravitate towards machine learning techniques to construct such models. However, the stability of machine-learned closure models and their consistency with physical structure (e.g. symmetries, conservation laws) are still open problems. To tackle both issues, we take the 'discretize first, filter next' approach, in which we apply a spatial averaging filter to existing energy-conserving (fine-grid) discretizations. The main novelty is that we extend the system of equations describing the filtered solution with a set of equations that describe the evolution of (a compressed version of) the energy of the subgrid scales. Having an estimate of this energy, we can use the concept of energy conservation to derive stability. The compressed variables are determined via a data-driven technique such that the energy of the subgrid scales is matched. For the extended system, the closure model should be energy-conserving, and we propose a new skew-symmetric convolutional neural network architecture that has this property. Stability is thus guaranteed, independent of the actual weights and biases of the network. Importantly, our framework allows energy exchange between resolved scales and compressed subgrid scales and thus enables backscatter. To model dissipative systems (e.g. viscous flows), the framework is extended with a diffusive component. The introduced neural network architecture is constructed such that it also satisfies momentum conservation. We apply the new methodology to both the viscous Burgers' equation and the Korteweg-de Vries equation in 1D and show superior stability properties when compared to a vanilla convolutional neural network.
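The stability argument rests on skew-symmetry: if the right-hand-side operator A satisfies A^T = -A, then u^T A u = 0, so the closure can redistribute energy but never create it. Below is a toy NumPy illustration of this trick for periodic (circular) convolutions, where the transpose of convolution with an odd-length kernel equals convolution with the reversed kernel; the paper's actual CNN over the extended resolved-plus-subgrid state is not shown, and all names here are ours.

```python
import numpy as np

def circ_conv(u, k):
    # Circulant application of an odd-length kernel k under periodic BCs.
    r = len(k) // 2
    return sum(k[j] * np.roll(u, r - j) for j in range(len(k)))

def skew_rhs(u, k):
    # A = C_k - C_k^T; under periodic BCs the transpose of a circulant
    # is convolution with the reversed kernel, so A is skew-symmetric
    # and u @ (A u) = 0 for every u: no energy is created or destroyed.
    return circ_conv(u, k) - circ_conv(u, k[::-1])

u = np.random.randn(128)
k = np.random.randn(5)              # stand-in for learned CNN weights
print(np.dot(u, skew_rhs(u, k)))    # ~0 up to round-off, for any k
```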
The computation of the distance between two time series is time-consuming for any elastic distance function that accounts for misalignments. Among those functions, DTW is the most prominent. However, a recent extensive evaluation has shown that the move-split-merge (MSM) metric is superior to DTW regarding the accuracy of the 1-NN classifier. Unfortunately, the running time of the standard dynamic programming algorithm for the MSM distance is $\Omega(n^2)$, where $n$ is the length of the longest time series. In this paper, we provide approaches to reducing the cost of MSM distance computations by using lower and upper bounds for early pruning of paths in the underlying dynamic programming table. For the case where one time series is constant, we present a linear-time algorithm. In addition, we propose new linear-time heuristics and adapt heuristics known from DTW to computing the MSM distance. One heuristic employs the metric property of MSM and the previously introduced linear-time algorithm. Our experimental studies demonstrate substantial speed-ups of our approaches compared to previous MSM algorithms. In particular, the running time for MSM is faster than a state-of-the-art DTW distance computation for a majority of the popular UCR data sets.
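For reference, below is the standard quadratic-time dynamic program for the MSM distance (Stefan et al., 2013) that the paper accelerates; the lower/upper-bound pruning techniques from the paper are not reproduced, and variable names are ours.

```python
import numpy as np

def _split_merge_cost(new, x, y, c):
    # Cost of a split/merge step: c if `new` lies between x and y,
    # otherwise c plus the distance to the nearer of the two.
    if (x <= new <= y) or (y <= new <= x):
        return c
    return c + min(abs(new - x), abs(new - y))

def msm_distance(a, b, c=1.0):
    # Standard O(n*m) dynamic program for the MSM distance.
    n, m = len(a), len(b)
    D = np.full((n, m), np.inf)
    D[0, 0] = abs(a[0] - b[0])
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + _split_merge_cost(a[i], a[i - 1], b[0], c)
    for j in range(1, m):
        D[0, j] = D[0, j - 1] + _split_merge_cost(b[j], a[0], b[j - 1], c)
    for i in range(1, n):
        for j in range(1, m):
            D[i, j] = min(
                D[i - 1, j - 1] + abs(a[i] - b[j]),                        # move
                D[i - 1, j] + _split_merge_cost(a[i], a[i - 1], b[j], c),  # split/merge
                D[i, j - 1] + _split_merge_cost(b[j], a[i], b[j - 1], c))  # split/merge
    return D[n - 1, m - 1]

print(msm_distance([1, 2, 3, 4], [1, 3, 3, 5]))
```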
The paper tackles the problem of clustering multiple networks that do not share the same set of vertices into groups of networks with similar topology. A model-based statistical approach relying on a finite mixture of stochastic block models is proposed. A clustering is obtained by maximizing the integrated classification likelihood criterion via a hierarchical agglomerative algorithm that starts from singleton clusters and successively merges clusters of networks. As such, a sequence of nested clusterings is computed that can be represented by a dendrogram, providing valuable insights into the collection of networks. Using a Bayesian framework, model selection is performed in an automated way, since the algorithm stops when the best number of clusters is attained. The algorithm is computationally efficient when carefully implemented. The aggregation of groups of networks requires a means to overcome the label-switching problem of the stochastic block model and to match the block labels of the graphs. To address this problem, a new tool is proposed, based on a comparison of the graphons of the associated stochastic block models. The clustering approach is assessed on synthetic data, and an application to a collection of ecological networks illustrates the interpretability of the obtained results.
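To illustrate the graphon-comparison idea: the graphon of a K-block SBM is a piecewise-constant function on [0,1]² determined by the block proportions and the connectivity matrix, and two SBMs can be compared via an L2 distance between their graphons minimized over relabelings of the blocks. The brute-force permutation search below is our simplification, feasible only for small K; the paper's actual matching procedure may differ.

```python
import numpy as np
from itertools import permutations

def sbm_graphon(pi, B, grid=200):
    # Piecewise-constant graphon of an SBM with block proportions pi and
    # K x K connectivity matrix B, evaluated on a uniform grid over [0, 1].
    edges = np.cumsum(pi)
    u = (np.arange(grid) + 0.5) / grid
    blocks = np.searchsorted(edges, u)       # block of each grid point
    return B[np.ix_(blocks, blocks)]

def graphon_distance(pi1, B1, pi2, B2, grid=200):
    # L2 distance between the two graphons, minimized over relabelings
    # of the second model's blocks (brute force; small K only).
    pi2, B2 = np.asarray(pi2), np.asarray(B2)
    W1 = sbm_graphon(np.asarray(pi1), np.asarray(B1), grid)
    best = np.inf
    for perm in permutations(range(len(pi2))):
        p = list(perm)
        W2 = sbm_graphon(pi2[p], B2[np.ix_(p, p)], grid)
        best = min(best, np.sqrt(np.mean((W1 - W2) ** 2)))
    return best

# Same SBM with permuted block labels: distance is (numerically) zero.
pi, B = [0.5, 0.5], [[0.8, 0.1], [0.1, 0.3]]
print(graphon_distance(pi, B, pi[::-1], np.array(B)[::-1, ::-1]))
```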
The ability to plan actions on multiple levels of abstraction enables intelligent agents to solve complex tasks effectively. However, learning the models for both low- and high-level planning from demonstrations has proven challenging, especially with high-dimensional inputs. To address this issue, we propose to use reinforcement learning to identify subgoals in expert trajectories by associating the magnitude of the rewards with the predictability of low-level actions given the state and the chosen subgoal. We build a vector-quantized generative model for the identified subgoals to perform subgoal-level planning. In experiments, the algorithm excels at solving complex, long-horizon decision-making problems, outperforming the state of the art. Because of its ability to plan, our algorithm can find better trajectories than those in the training set.
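As a minimal illustration of the vector-quantization component only (the subgoal-discovery reward and the planner are not sketched), the following maps continuous subgoal embeddings to their nearest entries in a learned codebook; shapes and names are hypothetical.

```python
import numpy as np

def vector_quantize(z, codebook):
    # Nearest-neighbour lookup: map each continuous embedding in z
    # (batch, d) to its closest codebook vector (K, d).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

codebook = np.random.randn(32, 8)   # 32 hypothetical discrete subgoals
z = np.random.randn(4, 8)           # encoder outputs for 4 states
subgoals, ids = vector_quantize(z, codebook)
```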
Detecting out-of-distribution (OOD) inputs is crucial for the safe deployment of natural language processing (NLP) models. Though existing methods, especially those based on the statistics in the feature space of fine-tuned pre-trained language models (PLMs), are claimed to be effective, their effectiveness on different types of distribution shifts remains underexplored. In this work, we take the first step towards comprehensively evaluating the mainstream textual OOD detection methods for detecting semantic and non-semantic shifts. We find that: (1) no existing method behaves well in both settings; (2) fine-tuning PLMs on in-distribution data benefits the detection of semantic shifts but severely degrades the detection of non-semantic shifts, which can be attributed to the distortion of task-agnostic features. To alleviate the issue, we present a simple yet effective general OOD score named GNOME that integrates the confidence scores derived from the task-agnostic and task-specific representations. Experiments show that GNOME works well in both semantic and non-semantic shift scenarios, and further brings significant improvement on two cross-task benchmarks where both kinds of shifts take place simultaneously. Our code is available at //github.com/lancopku/GNOME.
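A hedged sketch of the scoring idea: compute one confidence score from the task-agnostic (pre-trained) features and one from the task-specific (fine-tuned) features, z-normalize each using held-out in-distribution statistics, and sum them. The Mahalanobis-based score and all names below are our assumptions about one natural instantiation, not necessarily GNOME's exact formulation.

```python
import numpy as np

def maha_confidence(feats, mean, prec):
    # Negative Mahalanobis distance to a Gaussian fit of in-distribution
    # features: larger values mean "more in-distribution".
    d = feats - mean
    return -np.einsum('nd,de,ne->n', d, prec, d)

def combined_ood_score(feat_generic, feat_task, fit_g, fit_t, norm_g, norm_t):
    # Score from task-agnostic features + score from task-specific
    # features, each z-normalized with held-out in-distribution stats
    # (norm_* = (mean, std) of the score on held-out ID data).
    s_g = maha_confidence(feat_generic, *fit_g)
    s_t = maha_confidence(feat_task, *fit_t)
    return (s_g - norm_g[0]) / norm_g[1] + (s_t - norm_t[0]) / norm_t[1]

# Toy usage with random stand-ins for the two feature spaces.
rng = np.random.default_rng(0)
fg, ft = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
fit = (np.zeros(4), np.eye(4))      # (mean, precision) per feature space
score = combined_ood_score(fg, ft, fit, fit, (0.0, 1.0), (0.0, 1.0))
```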
As the size of pre-trained language models (PLMs) continues to increase, numerous parameter-efficient transfer learning (PETL) methods have been proposed recently to compensate for the tremendous cost of fine-tuning. Despite the impressive results achieved by large PLMs and various PETL methods on sundry benchmarks, it remains unclear whether they can effectively handle distributionally shifted inputs. In this study, we systematically explore how the ability to detect out-of-distribution (OOD) inputs changes as the size of the PLM grows or the transfer method is altered. Specifically, we evaluate fine-tuning and several PETL techniques, including Adapter, LoRA, and prefix-tuning, on three different intent classification tasks, each using language models of different scales.
In this work, we develop an online adaptive enrichment method within the framework of the Constraint Energy Minimizing Generalized Multiscale Finite Element Method (CEM-GMsFEM) for solving linear heterogeneous poroelasticity models with high-contrast coefficients. The proposed method uses residual-driven error indicators to enrich the multiscale spaces for both the displacement and the pressure variables in the model. Additional online basis functions are constructed in oversampled regions accordingly and are adaptively chosen to reduce the error the most. A complete theoretical analysis of the online enrichment algorithm is provided and corroborated by thorough numerical experiments.
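The enrichment loop can be summarized by a generic skeleton like the following, where `solve`, `indicators`, and `enrich` are problem-specific CEM-GMsFEM operations treated here as hypothetical callbacks, and the Dörfler-style fraction `theta` is our illustrative choice rather than the paper's marking strategy.

```python
import numpy as np

def online_adaptive_enrichment(solve, indicators, enrich, space,
                               tol=1e-4, theta=0.3, max_iter=20):
    # Generic residual-driven enrichment loop: solve on the current
    # multiscale space, estimate a per-region error indicator, then add
    # online basis functions in the worst theta-fraction of regions.
    u = solve(space)
    for _ in range(max_iter):
        eta = indicators(space, u)          # one indicator per region
        if np.sqrt(np.sum(eta ** 2)) < tol:
            break
        worst = np.argsort(eta)[::-1][:max(1, int(theta * len(eta)))]
        space = enrich(space, worst, u)     # add online bases there
        u = solve(space)
    return space, u
```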
Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed Lévy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.
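The amortization idea can be sketched as a simulation-based classifier: draw datasets from each candidate hierarchical simulator, train a classifier on them, and read approximate posterior model probabilities off its softmax output (under a uniform model prior). The toy simulators, hand-crafted summaries, and sklearn network below are stand-ins of our choosing; the paper's method uses permutation-invariant deep networks on the raw nested data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Two toy "hierarchical" simulators standing in for probabilistic programs.
def simulate_m0(n=50):
    mu = np.random.normal(0, 1)                  # group-level parameter
    return np.random.normal(mu, 1, size=n)       # observations

def simulate_m1(n=50):
    mu = np.random.normal(0, 1)
    return np.random.standard_t(3, size=n) + mu  # heavier-tailed alternative

def summarize(x):
    return [x.mean(), x.std(), np.abs(x - x.mean()).max()]

# Amortized BMC sketch: the classifier's predicted class probabilities
# approximate posterior model probabilities under a uniform model prior.
X = [summarize(simulate_m0()) for _ in range(2000)] + \
    [summarize(simulate_m1()) for _ in range(2000)]
y = [0] * 2000 + [1] * 2000
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)

x_obs = simulate_m1()
print(clf.predict_proba([summarize(x_obs)]))     # approx. P(model | data)
```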
Signalized intersections in arterial roads result in persistent vehicle idling and excess accelerations, contributing to fuel consumption and CO2 emissions. There has thus been a line of work studying eco-driving control strategies to reduce fuel consumption and emission levels at intersections. However, methods to devise effective control strategies across a variety of traffic settings remain elusive. In this paper, we propose a reinforcement learning (RL) approach to learn effective eco-driving control strategies. We analyze the potential impact of a learned strategy on fuel consumption, CO2 emissions, and travel time, and compare it with naturalistic driving and model-based baselines. We further demonstrate the generalizability of the learned policies under mixed traffic scenarios. Simulation results indicate that scenarios with 100% penetration of connected autonomous vehicles (CAVs) may yield up to an 18% reduction in fuel consumption and a 25% reduction in CO2 emissions while improving travel speed by 20%. Furthermore, results indicate that even 25% CAV penetration can deliver at least 50% of the total fuel and emission reduction benefits.
Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped before reaching fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from the observation that energy-based models exhibit free-energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically show that a simple yet efficient energy-based sampling strategy selects more valuable target samples than existing approaches, which require particular architectures or the computation of distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of target data that incorporate both domain characteristics and instance uncertainty in every selection round. Meanwhile, by constraining the free energy of target data to be compact around the source domain via a regularization term, the domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at //github.com/BIT-DA/EADA.
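The free-energy score itself is easy to state: for classifier logits f(x) and temperature T, F(x) = -T · log Σ_y exp(f_y(x)/T), and an EADA-style round can query the target samples with the highest free energy. The sketch below shows only this selection step; the paper additionally folds instance uncertainty into the query criterion and adds a free-energy alignment loss, neither of which is reproduced here.

```python
import numpy as np
from scipy.special import logsumexp

def free_energy(logits, T=1.0):
    # F(x) = -T * logsumexp(f(x) / T): higher free energy flags inputs
    # that the source-trained classifier finds unfamiliar.
    return -T * logsumexp(logits / T, axis=1)

def select_round(logits, budget):
    # One selection round (sketch): query the `budget` target samples
    # with the highest free energy for annotation.
    return np.argsort(free_energy(logits))[::-1][:budget]

logits = np.random.randn(1000, 31)   # e.g. 31-way classifier outputs
queried = select_round(logits, budget=50)
```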