This work presents a user-friendly reduced basis method for solving a family of parametric PDEs by preconditioned Krylov subspace methods, including the conjugate gradient method, the generalized minimal residual method, and the bi-conjugate gradient method. The proposed methods apply a preconditioned Krylov subspace method to a high-fidelity discretization of one parameter instance to generate orthogonal basis vectors of the reduced basis subspace. The large-scale discrete parameter-dependent problems are then approximately solved in this low-dimensional Krylov subspace. As shown by theory and experiments, only a small number of Krylov subspace iterations are needed to simultaneously generate approximate solutions of a family of high-fidelity, large-scale systems in the reduced basis subspace. This reduces the computational cost dramatically because (1) to construct the reduced basis vectors, we only solve one large-scale problem at the high-fidelity level; and (2) the family of large-scale problems restricted to the reduced basis subspace has a much smaller size.
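To make the two-stage structure concrete, here is a minimal numpy sketch, assuming (purely for illustration) an affine parametric family A(mu) = A0 + mu*A1: one Arnoldi run on a single parameter instance produces the reduced basis, and every other instance is then solved in the resulting low-dimensional subspace.

```python
import numpy as np

def krylov_basis(A, b, m):
    """Orthonormal basis of the Krylov subspace K_m(A, b) via Arnoldi
    (a preconditioned CG/GMRES run would play this role in practice)."""
    V = np.zeros((b.size, m))
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(1, m):
        w = A @ V[:, j - 1]
        w -= V[:, :j] @ (V[:, :j].T @ w)  # Gram-Schmidt orthogonalization
        V[:, j] = w / np.linalg.norm(w)
    return V

rng = np.random.default_rng(0)
n, m = 500, 20
A0 = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # hypothetical affine family
A1 = 0.05 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

# One high-fidelity solve yields the reduced basis vectors ...
V = krylov_basis(A0, b, m)

# ... after which each parameter instance reduces to an m x m system.
for mu in (0.0, 0.5, 1.0):
    y = np.linalg.solve(V.T @ (A0 + mu * A1) @ V, V.T @ b)
    x_rb = V @ y  # reduced basis approximation for this parameter
```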
Adaptive gradient algorithms have been widely adopted for training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) have not been fully understood, especially in the large-batch setting commonly used in practice. This is because the only theoretical result demonstrating a benefit of Adagrad over SGD was obtained in the original Adagrad paper for nonsmooth objective functions. However, for nonsmooth objectives, convergence can slow down linearly as the batch size increases, so an analysis based on a nonsmoothness assumption cannot justify large-batch algorithms. In this work, we resolve this gap between theory and practice by providing a new analysis of Adagrad on both convex and nonconvex smooth objectives that is suitable for the large-batch setting. We show that under anisotropic smoothness and noise conditions, an increased batch size does not slow down convergence for Adagrad, which therefore retains a faster convergence guarantee than SGD even in the large-batch setting. We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods. Experiments on logistic regression and instruction-following fine-tuning tasks provide strong evidence supporting our theoretical analysis.
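As a toy illustration of why per-coordinate adaptivity interacts well with anisotropy, the sketch below runs a diagonal Adagrad update on a quadratic whose smoothness differs sharply across coordinates, with gradient noise shrinking at the usual 1/sqrt(batch size) rate; the problem and constants are invented for illustration.

```python
import numpy as np

def adagrad_step(x, grad, G, lr=0.1, eps=1e-8):
    """One diagonal Adagrad update: per-coordinate step sizes shrink with
    the accumulated squared gradients, adapting to anisotropic geometry."""
    G += grad ** 2
    x -= lr * grad / (np.sqrt(G) + eps)
    return x, G

# Toy anisotropic quadratic f(x) = 0.5 * x^T diag(h) x with noisy gradients.
rng = np.random.default_rng(1)
h = np.array([100.0, 1.0])      # very different smoothness per coordinate
x, G = np.array([1.0, 1.0]), np.zeros(2)
batch_size = 4096
for _ in range(200):
    noise = rng.standard_normal(2) / np.sqrt(batch_size)  # large-batch noise
    grad = h * x + noise
    x, G = adagrad_step(x, grad, G)
```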
The noisy permutation channel is a useful abstraction, introduced by Makur, for point-to-point communication networks and biological storage. While asymptotic capacity results exist for this model, a characterization of the second-order asymptotics has been unavailable. We therefore analyze converse bounds for the noisy permutation channel in the finite-blocklength regime. To do so, we present a modified minimax meta-converse for noisy permutation channels obtained by symbol relaxation. To derive the second-order asymptotics of the converse bound, we propose a way to use divergence covering in the analysis, which yields the second-order asymptotics and the strong converse via Berry-Esseen type bounds. Both conclusions hold for noisy permutation channels with (entry-wise) strictly positive matrices. In addition, we obtain computable bounds for the noisy permutation channel with the binary symmetric channel (BSC), including a novel computable converse bound based on the modified minimax meta-converse, an asymptotic expansion derived from our subset covering technique, and the ε-capacity. We find that a smaller crossover probability yields a higher upper bound at a fixed finite blocklength, although the ε-capacity is agnostic to the BSC parameter. Finally, numerical results show that the normal approximation is remarkably precise and that our new converse bound is stronger than previous bounds.
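For context, the (unmodified) minimax meta-converse of Polyanskiy, Poor, and Verdú that the symbol-relaxed version builds on can be stated as follows; this paraphrase is background, not a formula from the abstract:

```latex
% Minimax meta-converse: every (n, M, \epsilon) code satisfies
\log M \;\le\; \min_{Q_{Y^n}} \, \max_{x \in \mathcal{X}^n} \,
  \log \frac{1}{\beta_{1-\epsilon}\!\left(P_{Y^n \mid X^n = x},\, Q_{Y^n}\right)},
% where \beta_{1-\epsilon}(P, Q) is the minimal type-II error of a binary
% hypothesis test between P and Q whose type-I error is at most \epsilon.
```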
We present a new extended resolution clause learning (ERCL) algorithm, implemented as part of a conflict-driven clause-learning (CDCL) SAT solver, wherein new variables are dynamically introduced as definitions for Dual Implication Points (DIPs) in the implication graph constructed by the solver at runtime. DIPs are generalizations of unique implication points and can be informally viewed as a pair of dominator nodes on the paths from the decision variable at the highest decision level to the conflict node in an implication graph. We perform an extensive experimental evaluation to establish the efficacy of our ERCL method, implemented on top of the MapleLCM SAT solver and dubbed xMapleLCM, against several leading solvers, including the baseline MapleLCM as well as CDCL solvers such as Kissat 3.1.1, CryptoMiniSat 5.11, and SBVA+CaDiCaL, the winner of SAT Competition 2023. We show that xMapleLCM outperforms these solvers on Tseitin and XORified formulas. We further compare xMapleLCM with GlucoseER, a system that implements extended resolution in a different way, and provide a detailed comparative analysis of their performance.
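To make the "pair of dominator nodes" picture concrete, here is a brute-force sketch (not the solver's actual algorithm, which must be far more efficient) that treats a DIP as a two-node cut whose joint removal separates the decision node from the conflict node in the implication graph:

```python
from itertools import combinations
from collections import deque

def reaches(adj, src, dst, removed):
    """BFS reachability in the implication graph, skipping 'removed' nodes."""
    if src in removed or dst in removed:
        return False
    seen, q = {src}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return True
        for v in adj.get(u, ()):
            if v not in seen and v not in removed:
                seen.add(v)
                q.append(v)
    return False

def find_dips(adj, decision, conflict):
    """All pairs {u, v} of internal nodes such that every path from the
    decision node to the conflict node passes through u or v."""
    internal = set(adj) | {w for vs in adj.values() for w in vs}
    internal -= {decision, conflict}
    return [p for p in combinations(sorted(internal), 2)
            if not reaches(adj, decision, conflict, set(p))]

# Toy implication graph: decision 'd' implies a and b, which feed the conflict.
adj = {"d": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["k"], "k": ["conflict"]}
print(find_dips(adj, "d", "conflict"))
# ('a', 'b') plus every pair containing a single dominator ('c' or 'k').
```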
Randomized Kaczmarz methods are a popular and effective family of iterative methods for solving large-scale linear systems of equations, and they have also been applied to linear feasibility problems. In this work, we propose a new block variant of the randomized Kaczmarz method, B-MRK, for solving linear feasibility problems defined by matrices. We show that B-MRK converges linearly in expectation to the feasible region. Furthermore, we extend the method to solve tensor linear feasibility problems defined under the tensor t-product. A tensor randomized Kaczmarz (TRK) method, TRK-L, is proposed for solving linear feasibility problems that involve mixed equality and inequality constraints. Additionally, we introduce another TRK method, TRK-LB, specifically tailored for cases where the feasible region is defined by linear equality constraints coupled with bound constraints on the variables. We show that both TRK methods converge linearly in expectation to the feasible region. Moreover, the effectiveness of our methods is demonstrated through numerical experiments on Gaussian random data and applications in image deblurring.
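The core projection step shared by these variants is easy to state; below is a minimal single-row sketch for the feasibility problem Ax <= b (B-MRK samples blocks of rows and the TRK methods work with t-product slices, but the violated-half-space projection is the same idea):

```python
import numpy as np

def randomized_kaczmarz_feasibility(A, b, x0, iters=5000, seed=0):
    """Randomized Kaczmarz sketch for Ax <= b: pick a row with probability
    proportional to its squared norm and project onto the violated
    half-space (a no-op when the constraint is already satisfied)."""
    rng = np.random.default_rng(seed)
    probs = np.sum(A ** 2, axis=1) / np.linalg.norm(A, "fro") ** 2
    x = x0.astype(float).copy()
    for _ in range(iters):
        i = rng.choice(len(b), p=probs)
        residual = A[i] @ x - b[i]
        if residual > 0:  # constraint i is violated; project onto its boundary
            x -= residual * A[i] / (A[i] @ A[i])
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
b = A @ rng.standard_normal(50) + 1.0  # guarantees a nonempty feasible region
x = randomized_kaczmarz_feasibility(A, b, np.zeros(50))
print(np.max(A @ x - b))  # maximum violation; should be near or below zero
```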
To generate coherent responses, language models must infer unobserved meaning from their input text sequence. One potential explanation for this capability arises from theories of delay embeddings in dynamical systems, which prove that unobserved variables can be recovered from the history of only a handful of observed variables. To test whether language models effectively construct delay embeddings, we measure the capacity of sequence models to reconstruct unobserved dynamics. We trained 1-layer transformer decoders and state-space sequence models on next-step prediction from noisy, partially observed time series data. We found that each sequence layer can learn a viable embedding of the underlying system. However, state-space models have a stronger inductive bias than transformers: in particular, they more effectively reconstruct unobserved information at initialization, leading to more parameter-efficient models and lower error on dynamics tasks. Our work thus forges a novel connection between dynamical systems and deep learning sequence models via delay embedding theory.
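The dynamical-systems side of this claim is the classical delay embedding; the sketch below builds one from a single observed coordinate of a simulated Lorenz system (step size and embedding parameters are illustrative), which is the kind of reconstruction the sequence models are hypothesized to learn.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Takens-style delay embedding: stack dim time-delayed copies of a
    scalar observable so unobserved state variables can be recovered."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

def lorenz(n_steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Crude Euler simulation of the Lorenz system (illustration only)."""
    s = np.array([1.0, 1.0, 1.0])
    out = np.empty((n_steps, 3))
    for t in range(n_steps):
        x, y, z = s
        s = s + dt * np.array([sigma * (y - x), x * (rho - z) - y,
                               x * y - beta * z])
        out[t] = s
    return out

traj = lorenz(5000)
observed = traj[:, 0]  # only one coordinate is observed
embedding = delay_embed(observed, dim=3, tau=10)
# Under mild conditions (Takens' theorem), 'embedding' is diffeomorphic to
# the full (x, y, z) attractor, despite being built from x alone.
```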
Graph neural networks (GNNs) are widely used to learn powerful representations of graph-structured data. Recent work demonstrates that transferring knowledge from self-supervised tasks to downstream tasks can further improve graph representations. However, there is an inherent gap between self-supervised tasks and downstream tasks in terms of optimization objective and training data, and conventional pre-training methods may not transfer knowledge effectively because they make no adaptation to downstream tasks. To address this, we propose a new transfer learning paradigm for GNNs that effectively leverages self-supervised tasks as auxiliary tasks to help the target task. Our method adaptively selects and combines different auxiliary tasks with the target task in the fine-tuning stage. We design an adaptive auxiliary loss weighting model that learns the weights of auxiliary tasks by quantifying the consistency between each auxiliary task and the target task, and we train this weighting model through meta-learning. Our method can be applied to various transfer learning settings: it performs well not only in multi-task learning but also in pre-training and fine-tuning. Comprehensive experiments on multiple downstream tasks demonstrate that the proposed method effectively combines auxiliary tasks with the target task and significantly improves performance compared to state-of-the-art methods.
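The paper's weighting model is learned by meta-learning; as a crude illustrative stand-in, the sketch below weights each auxiliary task by the cosine consistency between its gradient and the target-task gradient.

```python
import numpy as np

def auxiliary_weights(target_grad, aux_grads, temperature=1.0):
    """Weight each auxiliary task by the consistency (cosine similarity)
    of its gradient with the target-task gradient, via a softmax.
    Illustrative stand-in for the learned, meta-trained weighting model."""
    def cos(g, h):
        return g @ h / (np.linalg.norm(g) * np.linalg.norm(h) + 1e-12)
    scores = np.array([cos(target_grad, g) for g in aux_grads])
    e = np.exp(scores / temperature)
    return e / e.sum()

target_grad = np.array([1.0, 0.0, 0.5])
aux_grads = [np.array([0.9, 0.1, 0.4]),    # consistent with target: high weight
             np.array([-1.0, 0.2, -0.5])]  # conflicting: low weight
w = auxiliary_weights(target_grad, aux_grads)
# Combined update direction: target_grad + sum_k w[k] * aux_grads[k]
```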
Recent contrastive representation learning methods rely on estimating the mutual information (MI) between multiple views of an underlying context. For example, we can derive multiple views of a given image by applying data augmentation, or we can split a sequence into views comprising the past and future of some step in the sequence. Contrastive lower bounds on MI are easy to optimize but suffer a strong underestimation bias when the true MI is large. We propose decomposing the full MI estimation problem into a sum of smaller estimation problems by splitting one of the views into progressively more informed subviews and applying the chain rule of MI to the decomposed views. This expression contains a sum of unconditional and conditional MI terms, each measuring a modest chunk of the total MI, which facilitates approximation via contrastive bounds. To maximize the sum, we formulate a contrastive lower bound on the conditional MI that can be approximated efficiently. We refer to our general approach as Decomposed Estimation of Mutual Information (DEMI). We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting and learns better representations in a vision domain and for dialogue generation.
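A sketch of the underlying quantities, with untrained bilinear critics standing in for learned ones: a standard InfoNCE bound saturates at log(batch size), and the chain-rule decomposition I(x; y1, y2) = I(x; y1) + I(x; y2 | y1) sums several such bounds; the "conditional" critic below is a crude placeholder, not the paper's construction.

```python
import numpy as np

def infonce_lower_bound(scores):
    """InfoNCE contrastive lower bound on MI from a critic score matrix:
    scores[i, j] scores view-pair (x_i, y_j); diagonal entries are the
    positives. The bound saturates at log(batch_size)."""
    n = scores.shape[0]
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return np.log(n) + np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
n, d = 128, 16
x = rng.standard_normal((n, d))
y1 = x + 0.5 * rng.standard_normal((n, d))  # first subview
y2 = x + 0.5 * rng.standard_normal((n, d))  # more informed subview

# Unconditional term I(x; y1) via a bilinear critic (untrained here).
mi_1 = infonce_lower_bound(x @ y1.T)
# Conditional term I(x; y2 | y1): DEMI uses a dedicated conditional critic;
# as a crude placeholder we score the views concatenated with y1.
mi_2_given_1 = infonce_lower_bound(np.hstack([x, y1]) @ np.hstack([x, y2]).T)
total_bound = mi_1 + mi_2_given_1  # each term is capped at log(n), the sum is not
```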
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph structure is available. In practice, however, real-world graphs are often noisy and incomplete, or may not be available at all. In this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution over the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where no graph is available. We conduct a series of experiments analyzing the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
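As an illustration of the generative side of this approach, the sketch below samples a graph from independent Bernoulli edge probabilities (a stand-in for the learned discrete distribution) and runs one standard GCN layer on it; the outer bilevel optimization of the edge parameters is not shown.

```python
import numpy as np

def sample_graph(edge_probs, rng):
    """Sample a symmetric adjacency matrix from independent Bernoulli
    edge probabilities (the discrete distribution over graph edges)."""
    upper = rng.random(edge_probs.shape) < edge_probs
    adj = np.triu(upper, k=1)
    return (adj + adj.T).astype(float)

def gcn_layer(adj, X, W):
    """One graph convolution with symmetric normalization:
    ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ X @ W, 0.0)

rng = np.random.default_rng(0)
n, d, h = 10, 8, 4
edge_probs = rng.random((n, n)) * 0.3  # stand-in for learned edge parameters
X, W = rng.standard_normal((n, d)), rng.standard_normal((d, h))

# An outer bilevel loop would average a validation loss over sampled graphs
# and update edge_probs (e.g. with hypergradients or score-function estimates).
adj = sample_graph(edge_probs, rng)
H = gcn_layer(adj, X, W)
```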
Providing model-generated explanations in recommender systems is important to the user experience. State-of-the-art recommendation algorithms, especially the collaborative filtering (CF) based approaches with shallow or deep models, usually work with various unstructured information sources, such as textual reviews, visual images, and implicit or explicit feedback. Although structured knowledge bases were considered in content-based approaches, they have been largely ignored recently due to the availability of vast amounts of data and the learning power of many complex models. However, structured knowledge bases offer unique advantages in personalized recommendation systems: when explicit knowledge about users and items is taken into account, the system can provide highly customized recommendations based on users' historical behaviors, and the knowledge helps provide informed explanations for the recommended items. In this work, we propose to reason over knowledge base embeddings for explainable recommendation. Specifically, we propose a knowledge base representation learning framework that embeds heterogeneous entities for recommendation and, based on the embedded knowledge base, a soft matching algorithm that generates personalized explanations for the recommended items. Experimental results on real-world e-commerce datasets verify the superior recommendation performance and explainability of our approach compared with state-of-the-art baselines.
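As a rough sketch of how such a pipeline could look, the code below uses a TransE-style translation score (one common knowledge base embedding; the paper's framework may differ) to rank items for a user and to softly match an explanation relation; all entities and relations here are hypothetical.

```python
import numpy as np

def transe_score(head, relation, tail):
    """TransE-style plausibility: smaller ||h + r - t|| means a more
    plausible (head, relation, tail) triple in the embedded knowledge base."""
    return -np.linalg.norm(head + relation - tail)

rng = np.random.default_rng(0)
d = 32
user = rng.standard_normal(d)
buy = rng.standard_normal(d)          # hypothetical 'purchase' relation
items = rng.standard_normal((100, d))

# Recommend by ranking items under the user + relation translation ...
scores = -np.linalg.norm(user + buy - items, axis=1)
top_item = int(np.argmax(scores))

# ... then softly match the relation that best links the user to the
# recommended item, yielding a template-based explanation such as
# "recommended because of its category".
explain_relations = {"category": rng.standard_normal(d),
                     "brand": rng.standard_normal(d)}
best = max(explain_relations,
           key=lambda r: transe_score(user, explain_relations[r], items[top_item]))
```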
Multi-relation question answering is a challenging task because it requires elaborate analysis of the question and reasoning over multiple fact triples in a knowledge base. In this paper, we present a novel model, the Interpretable Reasoning Network, that employs an interpretable, hop-by-hop reasoning process for question answering. At each hop, the model dynamically decides which part of the input question should be analyzed; predicts a relation corresponding to the current parsed results; uses the predicted relation to update the question representation and the state of the reasoning process; and then drives the next hop of reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model offers traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.
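A minimal numpy sketch of one such hop, with attention over question words, a soft relation prediction, and updates to the question representation and reasoning state; the architectural details are invented for illustration, and the intermediate attention and relation scores are what make the process traceable.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def reasoning_hop(question, word_vecs, relations, state):
    """One hop of interpretable reasoning: attend to the part of the question
    relevant now, predict a relation, then update the question representation
    and the reasoning state (illustrative sketch, not the paper's equations)."""
    attn = softmax(word_vecs @ state)        # which question words matter now
    focus = attn @ word_vecs                 # hop-specific question content
    rel_scores = softmax(relations @ focus)  # predict the current relation
    relation = rel_scores @ relations        # soft relation embedding
    question = question - focus              # "consume" the analyzed content
    state = state + relation                 # advance the reasoning state
    return question, state, attn, rel_scores

rng = np.random.default_rng(0)
d, n_words, n_rels = 16, 6, 4
question = rng.standard_normal(d)
word_vecs = rng.standard_normal((n_words, d))
relations = rng.standard_normal((n_rels, d))
state = question.copy()  # initialize the state from the question

for hop in range(2):     # a two-hop question
    question, state, attn, rel_scores = reasoning_hop(
        question, word_vecs, relations, state)
    # attn and rel_scores are the traceable intermediate predictions
```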