Transfer learning - i.e., further fine-tuning a pre-trained model on a downstream task - can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model merging techniques have emerged as a solution to combine multiple task-specific models into a single multitask model without performing additional training. However, existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. In this paper, we demonstrate that prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter's values across models. To address this, we propose our method, TrIm, Elect Sign & Merge (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign. We find that TIES-Merging outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, highlight the importance of resolving sign interference. Our code is available at //github.com/prateeky2806/ties-merging
Owing to the nature of privacy protection, federated recommender systems (FedRecs) have garnered increasing interest in the realm of on-device recommender systems. However, most existing FedRecs only allow participating clients to collaboratively train a recommendation model of the same public parameter size. Training a model of the same size for all clients can lead to suboptimal performance since clients possess varying resources. For example, clients with limited training data may prefer to train a smaller recommendation model to avoid excessive data consumption, while clients with sufficient data would benefit from a larger model to achieve higher recommendation accuracy. To address the above challenge, this paper introduces HeteFedRec, a novel FedRec framework that enables the assignment of personalized model sizes to participants. In HeteFedRec, we present a heterogeneous recommendation model aggregation strategy, including a unified dual-task learning mechanism and a dimensional decorrelation regularization, to allow knowledge aggregation among recommender models of different sizes. Additionally, a relation-based ensemble knowledge distillation method is proposed to effectively distil knowledge from heterogeneous item embeddings. Extensive experiments conducted on three real-world recommendation datasets demonstrate the effectiveness and efficiency of HeteFedRec in training federated recommender systems under heterogeneous settings.
The $k$-tensor Ising model is an exponential family on a $p$-dimensional binary hypercube for modeling dependent binary data, where the sufficient statistic consists of all $k$-fold products of the observations, and the parameter is an unknown $k$-fold tensor, designed to capture higher-order interactions between the binary variables. In this paper, we describe an approach based on a penalization technique that helps us recover the signed support of the tensor parameter with high probability, assuming that no entry of the true tensor is too close to zero. The method is based on an $\ell_1$-regularized node-wise logistic regression, that recovers the signed neighborhood of each node with high probability. Our analysis is carried out in the high-dimensional regime, that allows the dimension $p$ of the Ising model, as well as the interaction factor $k$ to potentially grow to $\infty$ with the sample size $n$. We show that if the minimum interaction strength is not too small, then consistent recovery of the entire signed support is possible if one takes $n = \Omega((k!)^8 d^3 \log \binom{p-1}{k-1})$ samples, where $d$ denotes the maximum degree of the hypernetwork in question. Our results are validated in two simulation settings, and applied on a real neurobiological dataset consisting of multi-array electro-physiological recordings from the mouse visual cortex, to model higher-order interactions between the brain regions.
Federated learning (FL) has emerged as a key technique for distributed machine learning (ML). Most literature on FL has focused on ML model training for (i) a single task/model, with (ii) a synchronous scheme for uplink/downlink transfer of model parameters, and (iii) a static data distribution setting across devices. These assumptions are often not well representative of conditions encountered in practical FL environments. To address this, we develop DMA-FL, which considers dynamic FL with multiple downstream tasks to be trained over an asynchronous model transmission architecture. We first characterize the convergence of ML model training under DMA-FL via introducing a family of scheduling tensors and rectangular functions to capture the scheduling of devices. Our convergence analysis sheds light on the impact of resource allocation, device scheduling, and individual model states on the performance of ML models. We then formulate a non-convex mixed integer optimization problem for jointly configuring the resource allocation and device scheduling to strike an efficient trade-off between energy consumption and ML performance. We develop a solution methodology employing successive convex approximations with convergence guarantee to a stationary point. Through numerical simulations, we reveal the advantages of DMA-FL in terms of model performance and network resource savings.
Dimension reduction techniques have long been an important topic in statistics, and active subspaces (AS) have received much attention this past decade in the computer experiments literature. The most common approach towards estimating the AS is to use Monte Carlo with numerical gradient evaluation. While sensible in some settings, this approach has obvious drawbacks. Recent research has demonstrated that active subspace calculations can be obtained in closed form, conditional on a Gaussian process (GP) surrogate, which can be limiting in high-dimensional settings for computational reasons. In this paper, we produce the relevant calculations for a more general case when the model of interest is a linear combination of tensor products. These general equations can be applied to the GP, recovering previous results as a special case, or applied to the models constructed by other regression techniques including multivariate adaptive regression splines (MARS). Using a MARS surrogate has many advantages including improved scaling, better estimation of active subspaces in high dimensions and the ability to handle a large number of prior distributions in closed form. In one real-world example, we obtain the active subspace of a radiation-transport code with 240 inputs and 9,372 model runs in under half an hour.
Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks over the last years. However, their high computational costs remain a significant barrier to their practical deployment. In particular, the complexity of transformer models is quadratic with respect to the number of input tokens. Therefore techniques that reduce the number of input tokens that need to be processed have been proposed. This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning. LTMP uses learned threshold masking modules that dynamically determine which tokens to merge and which to prune. We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task. Our results demonstrate that LTMP achieves state-of-the-art accuracy across reduction rates while requiring only a single fine-tuning epoch, which is an order of magnitude faster than previous methods. Code is available at //github.com/Mxbonn/ltmp .
Extracting noisy or incorrectly labeled samples from a labeled dataset with hard/difficult samples is an important yet under-explored topic. Two general and often independent lines of work exist, one focuses on addressing noisy labels, and another deals with hard samples. However, when both types of data are present, most existing methods treat them equally, which results in a decline in the overall performance of the model. In this paper, we first design various synthetic datasets with custom hardness and noisiness levels for different samples. Our proposed systematic empirical study enables us to better understand the similarities and more importantly the differences between hard-to-learn samples and incorrectly-labeled samples. These controlled experiments pave the way for the development of methods that distinguish between hard and noisy samples. Through our study, we introduce a simple yet effective metric that filters out noisy-labeled samples while keeping the hard samples. We study various data partitioning methods in the presence of label noise and observe that filtering out noisy samples from hard samples with this proposed metric results in the best datasets as evidenced by the high test accuracy achieved after models are trained on the filtered datasets. We demonstrate this for both our created synthetic datasets and for datasets with real-world label noise. Furthermore, our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
This work proposes novel techniques for the efficient numerical simulation of parameterized, unsteady partial differential equations. Projection-based reduced order models (ROMs) such as the reduced basis method employ a (Petrov-)Galerkin projection onto a linear low-dimensional subspace. In unsteady applications, space-time reduced basis (ST-RB) methods have been developed to achieve a dimension reduction both in space and time, eliminating the computational burden of time marching schemes. However, nonaffine parameterizations dilute any computational speedup achievable by traditional ROMs. Computational efficiency can be recovered by linearizing the nonaffine operators via hyper-reduction, such as the empirical interpolation method in matrix form. In this work, we implement new hyper-reduction techniques explicitly tailored to deal with unsteady problems and embed them in a ST-RB framework. For each of the proposed methods, we develop a posteriori error bounds. We run numerical tests to compare the performance of the proposed ROMs against high-fidelity simulations, in which we combine the finite element method for space discretization on 3D geometries and the Backward Euler time integrator. In particular, we consider a heat equation and an unsteady Stokes equation. The numerical experiments demonstrate the accuracy and computational efficiency our methods retain with respect to the high-fidelity simulations.
As a distributed machine learning technique, federated learning (FL) requires clients to collaboratively train a shared model with an edge server without leaking their local data. However, the heterogeneous data distribution among clients often leads to a decrease in model performance. To tackle this issue, this paper introduces a prototype-based regularization strategy to address the heterogeneity in the data distribution. Specifically, the regularization process involves the server aggregating local prototypes from distributed clients to generate a global prototype, which is then sent back to the individual clients to guide their local training. The experimental results on MNIST and Fashion-MNIST show that our proposal achieves improvements of 3.3% and 8.9% in average test accuracy, respectively, compared to the most popular baseline FedAvg. Furthermore, our approach has a fast convergence rate in heterogeneous settings.
Since distribution shifts are common in real-world applications, there is a pressing need for developing prediction models that are robust against such shifts. Existing frameworks, such as empirical risk minimization or distributionally robust optimization, either lack generalizability for unseen distributions or rely on postulated distance measures. Alternatively, causality offers a data-driven and structural perspective to robust predictions. However, the assumptions necessary for causal inference can be overly stringent, and the robustness offered by such causal models often lacks flexibility. In this paper, we focus on causality-oriented robustness and propose Distributional Robustness via Invariant Gradients (DRIG), a method that exploits general additive interventions in training data for robust predictions against unseen interventions, and naturally interpolates between in-distribution prediction and causality. In a linear setting, we prove that DRIG yields predictions that are robust among a data-dependent class of distribution shifts. Furthermore, we show that our framework includes anchor regression (Rothenh\"ausler et al.\ 2021) as a special case, and that it yields prediction models that protect against more diverse perturbations. We extend our approach to the semi-supervised domain adaptation setting to further improve prediction performance. Finally, we empirically validate our methods on synthetic simulations and on single-cell data.
Analyzing observational data from multiple sources can be useful for increasing statistical power to detect a treatment effect; however, practical constraints such as privacy considerations may restrict individual-level information sharing across data sets. This paper develops federated methods that only utilize summary-level information from heterogeneous data sets. Our federated methods provide doubly-robust point estimates of treatment effects as well as variance estimates. We derive the asymptotic distributions of our federated estimators, which are shown to be asymptotically equivalent to the corresponding estimators from the combined, individual-level data. We show that to achieve these properties, federated methods should be adjusted based on conditions such as whether models are correctly specified and stable across heterogeneous data sets.