Many recent theoretical works on \emph{meta-learning} aim to achieve guarantees in leveraging similar representational structures from related tasks towards simplifying a target task. Importantly, the main aim in theory works on the subject is to understand the extent to which convergence rates -- in learning a common representation -- \emph{may scale with the number $N$ of tasks} (as well as the number of samples per task). First steps in this setting demonstrate this property when both the shared representation amongst tasks, and task-specific regression functions, are linear. This linear setting readily reveals the benefits of aggregating tasks, e.g., via averaging arguments. In practice, however, the representation is often highly nonlinear, introducing nontrivial biases in each task that cannot easily be averaged out as in the linear case. In the present work, we derive theoretical guarantees for meta-learning with nonlinear representations. In particular, assuming the shared nonlinearity maps to an infinite-dimensional RKHS, we show that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions,
Contrastive learning often relies on comparing positive anchor samples with multiple negative samples to perform Self-Supervised Learning (SSL). However, non-contrastive approaches like BYOL, SimSiam, and Barlow Twins achieve SSL without explicit negative samples. In this paper, we introduce a unified matrix information-theoretic framework that explains many contrastive and non-contrastive learning methods. We then propose a novel method Matrix-SSL based on matrix information theory. Experimental results reveal that Matrix-SSL significantly outperforms state-of-the-art methods on the ImageNet dataset under linear evaluation settings and on MS-COCO for transfer learning tasks. Specifically, when performing 100 epochs pre-training, our method outperforms SimCLR by 4.6%, and when performing transfer learning tasks on MS-COCO, our method outperforms previous SOTA methods such as MoCo v2 and BYOL up to 3.3% with only 400 epochs compared to 800 epochs pre-training. Code available at //github.com/yifanzhang-pro/Matrix-SSL.
Multi-task learning (MTL) compresses the information from multiple tasks into a unified backbone to improve computational efficiency and generalization. Recent work directly merges multiple independently trained models to perform MTL instead of collecting their raw data for joint training, greatly expanding the application scenarios of MTL. However, by visualizing the representation distribution of existing model merging schemes, we find that the merged model often suffers from the dilemma of representation bias. That is, there is a significant discrepancy in the representation distribution between the merged and individual models, resulting in poor performance of merged MTL. In this paper, we propose a representation surgery solution called "Surgery" to reduce representation bias in the merged model. Specifically, Surgery is a lightweight task-specific module that takes the representation of the merged model as input and attempts to output the biases contained in the representation from the merged model. We then designed an unsupervised optimization objective that updates the Surgery module by minimizing the distance between the merged model's representation and the individual model's representation. Extensive experiments demonstrate significant MTL performance improvements when our Surgery module is applied to state-of-the-art (SOTA) model merging schemes.
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable. While CMDPs serve as an important framework to model many real-world applications with time-varying environments, they are largely unexplored from theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they enjoy guaranteed $\epsilon$-suboptimality gap with desired polynomial sample complexity. In particular, instantiating our result for the first model to the tabular CMDP improves the existing result by removing the reachability assumption. Our result for the second model is the first-known result for such a type of function approximation models. Comparison between our results for the two models further indicates that having context-varying features leads to much better sample efficiency than having common representations for all contexts under linear CMDPs.
Efficient implementation of massive multiple-input-multiple-output (MIMO) transceivers is essential for the next-generation wireless networks. To reduce the high computational complexity of the massive MIMO transceiver, in this paper, we propose a new massive MIMO architecture using finite-precision arithmetic. First, we conduct the rounding error analysis and derive the lower bound of the achievable rate for single-input-multiple-output (SIMO) using maximal ratio combining (MRC) and multiple-input-single-output (MISO) systems using maximal ratio transmission (MRT) with finite-precision arithmetic. Then, considering the multi-user scenario, the rounding error analysis of zero-forcing (ZF) detection and precoding is derived by using the normal equations (NE) method. The corresponding lower bounds of the achievable sum rate are also derived and asymptotic analyses are presented. Built upon insights from these analyses and lower bounds, we propose a mixed-precision architecture for massive MIMO systems to offset performance gaps due to finite-precision arithmetic. The corresponding analysis of rounding errors and computational costs is obtained. Simulation results validate the derived bounds and underscore the superiority of the proposed mixed-precision architecture to the conventional structure.
To address the communication bottleneck challenge in distributed learning, our work introduces a novel two-stage quantization strategy designed to enhance the communication efficiency of distributed Stochastic Gradient Descent (SGD). The proposed method initially employs truncation to mitigate the impact of long-tail noise, followed by a non-uniform quantization of the post-truncation gradients based on their statistical characteristics. We provide a comprehensive convergence analysis of the quantized distributed SGD, establishing theoretical guarantees for its performance. Furthermore, by minimizing the convergence error, we derive optimal closed-form solutions for the truncation threshold and non-uniform quantization levels under given communication constraints. Both theoretical insights and extensive experimental evaluations demonstrate that our proposed algorithm outperforms existing quantization schemes, striking a superior balance between communication efficiency and convergence performance.
Enhancing accurate molecular property prediction relies on effective and proficient representation learning. It is crucial to incorporate diverse molecular relationships characterized by multi-similarity (self-similarity and relative similarities) between molecules. However, current molecular representation learning methods fall short in exploring multi-similarity and often underestimate the complexity of relationships between molecules. Additionally, previous multi-similarity approaches require the specification of positive and negative pairs to attribute distinct predefined weights to different relative similarities, which can introduce potential bias. In this work, we introduce Graph Multi-Similarity Learning for Molecular Property Prediction (GraphMSL) framework, along with a novel approach to formulate a generalized multi-similarity metric without the need to define positive and negative pairs. In each of the chemical modality spaces (e.g.,molecular depiction image, fingerprint, NMR, and SMILES) under consideration, we first define a self-similarity metric (i.e., similarity between an anchor molecule and another molecule), and then transform it into a generalized multi-similarity metric for the anchor through a pair weighting function. GraphMSL validates the efficacy of the multi-similarity metric across MoleculeNet datasets. Furthermore, these metrics of all modalities are integrated into a multimodal multi-similarity metric, which showcases the potential to improve the performance. Moreover, the focus of the model can be redirected or customized by altering the fusion function. Last but not least, GraphMSL proves effective in drug discovery evaluations through post-hoc analyses of the learnt representations.
In the current era of vast data and transparent machine learning, it is essential for techniques to operate at a large scale while providing a clear mathematical comprehension of the internal workings of the method. Although there already exist interpretable semi-parametric regression methods for large-scale applications that take into account non-linearity in the data, the complexity of the models is still often limited. One of the main challenges is the absence of interactions in these models, which are left out for the sake of better interpretability but also due to impractical computational costs. To overcome this limitation, we propose a new approach using a factorization method to derive a highly scalable higher-order tensor product spline model. Our method allows for the incorporation of all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We further develop a meaningful penalization scheme and examine the induced optimization problem. We conclude by evaluating the predictive and estimation performance of our method.
Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework to provide predictions imbued with statistical guarantees, however, its application to text generation is challenging since any i.i.d. assumptions are not realistic. In this paper, we bridge this gap by leveraging recent results on non-exchangeable conformal prediction, which still ensures bounds on coverage. The result, non-exchangeable conformal nucleus sampling, is a novel extension of the conformal prediction framework to generation based on nearest neighbors. Our method can be used post-hoc for an arbitrary model without extra training and supplies token-level, calibrated prediction sets equipped with statistical guarantees. Experiments in machine translation and language modeling show encouraging results in generation quality. By also producing tighter prediction sets with good coverage, we thus give a more theoretically principled way to perform sampling with conformal guarantees.
We propose GAN-Supervised Learning, a framework for learning discriminative models and their GAN-generated training data jointly end-to-end. We apply our framework to the dense visual alignment problem. Inspired by the classic Congealing method, our GANgealing algorithm trains a Spatial Transformer to map random samples from a GAN trained on unaligned data to a common, jointly-learned target mode. We show results on eight datasets, all of which demonstrate our method successfully aligns complex data and discovers dense correspondences. GANgealing significantly outperforms past self-supervised correspondence algorithms and performs on-par with (and sometimes exceeds) state-of-the-art supervised correspondence algorithms on several datasets -- without making use of any correspondence supervision or data augmentation and despite being trained exclusively on GAN-generated data. For precise correspondence, we improve upon state-of-the-art supervised methods by as much as $3\times$. We show applications of our method for augmented reality, image editing and automated pre-processing of image datasets for downstream GAN training.
This paper aims to mitigate straggler effects in synchronous distributed learning for multi-agent reinforcement learning (MARL) problems. Stragglers arise frequently in a distributed learning system, due to the existence of various system disturbances such as slow-downs or failures of compute nodes and communication bottlenecks. To resolve this issue, we propose a coded distributed learning framework, which speeds up the training of MARL algorithms in the presence of stragglers, while maintaining the same accuracy as the centralized approach. As an illustration, a coded distributed version of the multi-agent deep deterministic policy gradient(MADDPG) algorithm is developed and evaluated. Different coding schemes, including maximum distance separable (MDS)code, random sparse code, replication-based code, and regular low density parity check (LDPC) code are also investigated. Simulations in several multi-robot problems demonstrate the promising performance of the proposed framework.