Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase, where the workers independently execute multiple optimization steps on their own local data, and an outer optimization step, where the inner updates are synchronized. While such approaches require orders of magnitude less communication than standard data-parallel training, in settings where the workers are datacenters, even the limited communication requirements of these approaches can still cause significant slowdowns due to the blocking necessary at each outer optimization step. In this paper, we investigate techniques to mitigate this issue by overlapping communication with computation in a manner that allows the outer optimization step to fully overlap with the inner optimization phase. We show that a particular variant, dubbed eager updates, provides competitive performance with standard DiLoCo in settings with low bandwidth between workers.
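To make the inner/outer structure and the overlapped outer step concrete, the following is a minimal single-process sketch on toy quadratic losses. Each simulated worker keeps training from its own locally updated parameters instead of blocking, and the cross-worker average is folded in only once the (overlapped) communication would have finished, approximated here at the next round boundary. All names, hyperparameters, and the exact correction rule are illustrative assumptions, not the paper's algorithm.

```python
# Minimal single-process sketch of a DiLoCo-style inner/outer scheme with an
# "eager"-style, non-blocking outer step. Toy quadratic losses; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, inner_steps, rounds = 8, 4, 25, 40
inner_lr = 0.05

targets = rng.normal(size=(n_workers, dim))        # worker k minimizes 0.5 * ||x - targets[k]||^2
x = [np.zeros(dim) for _ in range(n_workers)]      # per-worker parameters
prev_deltas = [np.zeros(dim) for _ in range(n_workers)]
prev_avg = np.zeros(dim)                           # averaged delta still "in flight"

for r in range(rounds):
    new_deltas = []
    for k in range(n_workers):
        # The all-reduce launched last round has now finished: swap last round's
        # local contribution for the cross-worker average (the eager correction).
        x[k] += prev_deltas[k] - prev_avg
        start = x[k].copy()
        for _ in range(inner_steps):               # inner phase: local SGD, no blocking
            x[k] -= inner_lr * (x[k] - targets[k])
        new_deltas.append(start - x[k])            # this worker's outer "pseudo-gradient"
    prev_deltas, prev_avg = new_deltas, np.mean(new_deltas, axis=0)

for k in range(n_workers):                         # fold in the final round's average
    x[k] += prev_deltas[k] - prev_avg
print("distance to consensus optimum:", np.linalg.norm(x[0] - targets.mean(axis=0)))
```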
Memory retention mechanisms play a central role in determining the efficiency of computational architectures designed for processing extended sequences. Conventional methods for token management often impose fixed retention thresholds or rely on uniform attention weight distributions, leading to inefficient memory utilization and premature information loss in extended sequence modeling. Structured Token Retention (STR) introduces a probabilistic selection framework that dynamically adjusts token persistence based on contextual significance, ensuring that computational resources are allocated to semantically relevant elements. Computational Memory Paths (CMP) extend this framework through hierarchical memory allocation, refining retention efficiency through structured reallocation of token embeddings. Comparative assessments against baseline models demonstrate that STR and CMP improve token survival rates across long input sequences while reducing cumulative error propagation across processing layers. Experimental results further indicate reductions in computational overhead, improving inference speed without degrading contextual coherence. Token distribution analyses reveal that structured memory allocation prevents excessive redundancy in attention weight calculations, optimizing information retrieval efficiency in large-scale generative architectures. The integration of STR and CMP into an open-source model illustrates the adaptability of structured memory retention methodologies, highlighting their applicability in generative text processing, long-context comprehension, and scalable sequence modeling.
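As one way to picture a probabilistic retention policy of this kind, the sketch below scores each cached token by the attention mass it has received, converts the scores into retention probabilities, and keeps only a fixed memory budget. The function `retain_tokens`, the softmax scoring, and the budget rule are illustrative assumptions, not the STR/CMP formulation itself.

```python
# Hedged sketch of attention-based probabilistic token retention (illustrative only).
import numpy as np

def retain_tokens(attn_weights: np.ndarray, budget: int, temperature: float = 1.0):
    """attn_weights: (num_queries, num_cached_tokens) attention matrix.
    Returns indices of cached tokens kept under the memory budget."""
    # Contextual significance: total attention mass each cached token received.
    significance = attn_weights.sum(axis=0)
    # Turn significance into retention probabilities (temperature-scaled softmax).
    logits = significance / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Keep the `budget` most probable tokens (a deterministic variant; a truly
    # probabilistic policy could instead sample without replacement).
    keep = np.argsort(-probs)[:budget]
    return np.sort(keep)

rng = np.random.default_rng(0)
attn = rng.random((16, 128))        # toy attention from 16 queries over 128 cached tokens
kept = retain_tokens(attn, budget=32)
print(f"kept {len(kept)} of 128 cached tokens")
```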
Adversarial training can be used to learn models that are robust against perturbations. For linear models, it can be formulated as a convex optimization problem. Leveraging this optimization structure allows significantly faster convergence rates than methods proposed in the context of deep learning. Still, the use of generic convex solvers can be inefficient for large-scale problems. Here, we propose tailored optimization algorithms for the adversarial training of linear models, which render large-scale regression and classification problems more tractable. For regression problems, we propose a family of solvers based on iterative ridge regression and, for classification, a family of solvers based on projected gradient descent. The methods are based on extended-variable reformulations of the original problem. We illustrate their efficiency in numerical examples.
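For the regression case, a solver in the spirit described above can be sketched as follows: the adversarial squared loss with $\ell_\infty$-bounded perturbations has the closed form $(|y - w^\top x| + \epsilon \|w\|_1)^2$, and the variational bound $(a+b)^2 \le a^2/\theta + b^2/(1-\theta)$ turns each iteration into a weighted ridge problem. This is a hedged sketch under those assumptions; the paper's extended-variable reformulation may differ in its details.

```python
# Hedged iterative-ridge-style solver for adversarially robust linear regression.
import numpy as np

def adv_linear_regression(X, y, eps, n_iter=50, tol=1e-8):
    n, d = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]               # ordinary least-squares init
    for _ in range(n_iter):
        r = y - X @ w
        l1 = np.abs(w).sum()
        theta = np.abs(r) / (np.abs(r) + eps * l1 + tol)    # per-sample split variables
        lam = np.abs(w) / (l1 + tol)                        # per-coefficient split variables
        D = 1.0 / np.maximum(theta, tol)                    # residual weights
        c = np.sum(1.0 / np.maximum(1.0 - theta, tol))      # total penalty weight
        A = X.T @ (D[:, None] * X) + eps**2 * c * np.diag(1.0 / np.maximum(lam, tol))
        w = np.linalg.solve(A, X.T @ (D * y))               # closed-form weighted ridge step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
print(np.round(adv_linear_regression(X, y, eps=0.1), 3))
```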
Knowledge tracing (KT) is a popular approach for modeling students' learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, whose annotation is time-consuming and error-prone; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world math learning datasets, where we achieve consistent performance improvements.
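To illustrate contrastive alignment with false negative elimination, here is a hedged numpy sketch of an InfoNCE-style loss in which question/solution-step embeddings are pulled toward the embeddings of their KCs, while in-batch negatives that share a KC with the anchor are masked out. The function name `masked_info_nce`, the loss form, and the masking rule are assumptions about the general idea, not KCQRL's exact training objective.

```python
# Hedged sketch of an InfoNCE loss with false-negative masking (illustrative only).
import numpy as np

def masked_info_nce(q_emb, kc_emb, kc_ids, temperature=0.1):
    """q_emb: (B, d) question/step embeddings; kc_emb: (B, d) embeddings of their KCs;
    kc_ids: (B,) integer KC label per row. Returns the mean loss over the batch."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    k = kc_emb / np.linalg.norm(kc_emb, axis=1, keepdims=True)
    logits = q @ k.T / temperature                     # (B, B) similarity matrix
    same_kc = kc_ids[:, None] == kc_ids[None, :]       # rows sharing a KC with the anchor
    false_neg = same_kc & ~np.eye(len(kc_ids), dtype=bool)
    logits = np.where(false_neg, -np.inf, logits)      # eliminate false negatives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positive pairs sit on the diagonal

rng = np.random.default_rng(0)
B, d = 8, 16
loss = masked_info_nce(rng.normal(size=(B, d)), rng.normal(size=(B, d)),
                       kc_ids=rng.integers(0, 3, size=B))
print(round(float(loss), 4))
```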
As deep generative models have progressed, recent work has shown them to be capable of memorizing and reproducing training datapoints when deployed. These findings call into question the usability of generative models, especially in light of the legal and privacy risks brought about by memorization. To better understand this phenomenon, we propose the manifold memorization hypothesis (MMH), a geometric framework that uses the manifold hypothesis to provide a clear language in which to reason about memorization. We propose to analyze memorization in terms of the relationship between the dimensionalities of (i) the ground truth data manifold and (ii) the manifold learned by the model. This framework provides a formal standard for "how memorized" a datapoint is and systematically categorizes memorized data into two types: memorization driven by overfitting and memorization driven by the underlying data distribution. By analyzing prior work in the context of the MMH, we explain and unify assorted observations in the literature. We empirically validate the MMH using synthetic data and image datasets up to the scale of Stable Diffusion, developing new tools for detecting and preventing generation of memorized samples in the process.
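The kind of dimensionality comparison the MMH calls for can be illustrated as follows: estimate the local intrinsic dimension (LID) around a training point under (i) the ground-truth data and (ii) samples from the model, and flag the point when the model's LID is much lower. The Levina-Bickel MLE estimator, the toy manifolds, and the threshold below are illustrative assumptions, not the paper's actual detection tools.

```python
# Hedged sketch: compare local intrinsic dimension under data vs. model samples.
import numpy as np

def lid_mle(point, samples, k=20):
    """Levina-Bickel maximum-likelihood estimate of local intrinsic dimension."""
    dists = np.sort(np.linalg.norm(samples - point, axis=1))
    dists = dists[dists > 0][:k]          # k nearest neighbours, excluding the point itself
    return 1.0 / np.mean(np.log(dists[-1] / dists[:-1]))

rng = np.random.default_rng(0)
ambient = 10
# Ground-truth data lies near a 5-dimensional subspace of a 10-dimensional ambient space.
basis = rng.normal(size=(5, ambient))
data = rng.normal(size=(2000, 5)) @ basis + 0.01 * rng.normal(size=(2000, ambient))
x = data[0]
# A "memorizing" model whose samples near x vary along only a single direction.
direction = rng.normal(size=ambient)
model_samples = x + rng.normal(size=(2000, 1)) * direction

lid_data, lid_model = lid_mle(x, data), lid_mle(x, model_samples)
print(f"LID under data ~ {lid_data:.1f}, LID under model ~ {lid_model:.1f}")
if lid_model < 0.5 * lid_data:
    print("flagged: model manifold around x is much lower-dimensional (overfitting-driven)")
```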
As hyperconnected devices and decentralized data architectures expand, securing IoT transactions becomes increasingly challenging. Blockchain offers a promising solution, but its effectiveness relies on the underlying consensus algorithm. Traditional mechanisms like PoW and PoS are often impractical for resource-constrained IoT environments. To address these limitations, this work introduces a fair and lightweight hybrid consensus algorithm tailored for IoT. The proposed approach minimizes resource demands on the nodes while ensuring a secure and fair agreement process. Specifically, it leverages a distributed lottery mechanism to fairly propose blocks without requiring specialized hardware. In addition, a reputation-based block voting mechanism is incorporated to enhance trust and establish finality. Finally, an experimental evaluation was conducted to validate the key features of the consensus algorithm.
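A hedged sketch of the two ingredients described above: a hash-based distributed lottery that selects a block proposer without specialized hardware, and reputation-weighted voting that establishes finality. The functions `lottery_ticket`, `select_proposer`, and `finalize`, the quorum, and the use of SHA-256 are illustrative assumptions, not the proposed protocol's specification.

```python
# Hedged sketch of a lottery-based proposer election plus reputation-weighted voting.
import hashlib

def lottery_ticket(node_id: str, round_no: int, seed: str) -> int:
    """Deterministic, verifiable ticket: every node can recompute everyone's draw."""
    digest = hashlib.sha256(f"{seed}:{round_no}:{node_id}".encode()).hexdigest()
    return int(digest, 16)

def select_proposer(nodes, round_no, seed):
    # Lowest ticket wins; approximately uniform because SHA-256 output is effectively uniform.
    return min(nodes, key=lambda n: lottery_ticket(n, round_no, seed))

def finalize(votes, reputation, quorum=2 / 3):
    """votes: {node_id: True/False}; reputation: {node_id: weight in (0, 1]}."""
    total = sum(reputation[n] for n in votes)
    approve = sum(reputation[n] for n, v in votes.items() if v)
    return approve / total >= quorum

nodes = [f"iot-node-{i}" for i in range(7)]
reputation = {n: 1.0 for n in nodes}
proposer = select_proposer(nodes, round_no=42, seed="prev-block-hash")
votes = {n: True for n in nodes}       # all but one node approve in this toy round
votes[nodes[3]] = False
print(proposer, "finalized:", finalize(votes, reputation))
```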
The estimation of regression parameters in one-dimensional broken stick models is a research area of statistics with an extensive literature. We are interested in extending such models by aiming to recover two or more intersecting (hyper)planes in multiple dimensions. In contrast to approaches aiming to recover a given number of piecewise linear components using either a grid search or local smoothing around the change points, we show how to use Nesterov smoothing to obtain a smooth and everywhere differentiable approximation to a piecewise linear regression model with a uniform error bound. The parameters of the smoothed approximation are then efficiently found by minimizing a least squares objective function using a quasi-Newton algorithm. Our main contribution is threefold: We show that the estimates of the Nesterov smoothed approximation of the broken plane model are also $\sqrt{n}$ consistent and asymptotically normal, where $n$ is the number of data points on the two planes. Moreover, we show that as the degree of smoothing goes to zero, the smoothed estimates converge to the unsmoothed estimates, and we present an algorithm to perform parameter estimation. We conclude by presenting results on simulated data, together with some guidance on suitable parameter choices for practical applications.
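As a concrete illustration of the smoothing-then-quasi-Newton recipe, the sketch below fits a max-of-hyperplanes ("broken plane") model by replacing the non-differentiable max with a LogSumExp smoother $\mu \log \sum_k \exp(z_k/\mu)$ and minimizing least squares with L-BFGS-B from SciPy. The max-of-planes parameterization and this particular smoother are assumptions; the paper's Nesterov smoothing construction and estimator may differ.

```python
# Hedged sketch: LogSumExp-smoothed max-of-planes regression fitted with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def smooth_max_planes(params, X, n_planes, mu):
    d = X.shape[1]
    A = params[: n_planes * d].reshape(n_planes, d)
    b = params[n_planes * d:]
    Z = X @ A.T + b                        # (n, n_planes) plane values
    return mu * logsumexp(Z / mu, axis=1)  # smooth, everywhere-differentiable surrogate for max

def fit(X, y, n_planes=2, mu=0.05, seed=0):
    rng = np.random.default_rng(seed)
    p0 = rng.normal(scale=0.1, size=n_planes * (X.shape[1] + 1))
    obj = lambda p: np.mean((y - smooth_max_planes(p, X, n_planes, mu)) ** 2)
    return minimize(obj, p0, method="L-BFGS-B").x

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y_true = np.maximum(X @ np.array([1.0, -0.5]) + 0.2, X @ np.array([-0.8, 1.2]) - 0.1)
y = y_true + 0.05 * rng.normal(size=500)
print("fitted parameters:", np.round(fit(X, y), 2))
```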
The perceived advantage of machine learning (ML) models is that they are flexible and can incorporate a large number of features. However, many of these features are typically correlated or dependent, and incorporating all of them can hinder model stability and generalizability. In fact, it is desirable to do some form of feature screening and incorporate only the relevant features. The best approaches should involve subject-matter knowledge and information on causal relationships. This paper deals with an approach based on the Markov boundary (MB), which is related to causal discovery, using directed acyclic graphs to represent potential relationships and statistical tests to determine the connections. An MB is the minimal set of features such that, given the boundary, all other potential predictors have no effect on the target, while maximal predictive accuracy is preserved. Identifying the Markov boundary is straightforward under assumptions of Gaussianity on the features and linear relationships between them, but these assumptions are often not satisfied in practice. This paper outlines common problems associated with identifying the Markov boundary in structured data when relationships are non-linear and the predictors are of mixed data types. We propose a multi-group forward-backward selection strategy that addresses these challenges and demonstrate its capabilities on simulated and real datasets.
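To give a flavor of forward-backward selection when relationships are non-linear, the sketch below judges conditional relevance by whether adding or removing a feature changes the cross-validated error of a gradient-boosting model given the currently selected set. This surrogate criterion, the threshold `eps`, and the single-group version shown are illustrative assumptions; the proposed method uses a multi-group strategy with statistical tests.

```python
# Hedged sketch of a forward-backward search for an approximate Markov boundary.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def cv_error(X, y, cols):
    if not cols:
        return np.var(y)                               # error of predicting the mean
    model = HistGradientBoostingRegressor(random_state=0)
    scores = cross_val_score(model, X[:, cols], y, cv=3,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

def forward_backward(X, y, eps=1e-3):
    selected, remaining = [], list(range(X.shape[1]))
    improved = True
    while improved and remaining:                      # forward phase: add useful features
        improved = False
        base = cv_error(X, y, selected)
        gain, best = max((base - cv_error(X, y, selected + [j]), j) for j in remaining)
        if gain > eps:
            selected.append(best); remaining.remove(best); improved = True
    for j in list(selected):                           # backward phase: drop redundant ones
        without = [k for k in selected if k != j]
        if cv_error(X, y, without) <= cv_error(X, y, selected) + eps:
            selected = without
    return sorted(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=500)   # non-linear target
print("approximate Markov boundary:", forward_backward(X, y))
```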
General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation - at most affine - estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment - estimating transformations between anchors - can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.
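The anchor-based alignment underlying this approach can be sketched as follows: embed the same anchor samples with both models, fit an affine map $T(z) = W z + b$ by least squares, and use it to stitch the general encoder to the specialised model's latent space. The estimator `fit_affine` and the synthetic data below are illustrative assumptions; the paper's estimation procedure may differ in detail.

```python
# Hedged sketch of estimating an affine map between two latent spaces from anchors.
import numpy as np

def fit_affine(src: np.ndarray, dst: np.ndarray):
    """src, dst: (n_anchors, d_src) and (n_anchors, d_dst) embeddings of the same anchors."""
    src_aug = np.hstack([src, np.ones((src.shape[0], 1))])   # append a bias column
    M, *_ = np.linalg.lstsq(src_aug, dst, rcond=None)         # solves src_aug @ M ~ dst
    return M[:-1], M[-1]                                      # weight matrix W, bias b

rng = np.random.default_rng(0)
d_general, d_medical, n_anchors = 64, 32, 200
# Toy ground-truth relation between the two latent spaces (affine plus noise).
W_true = rng.normal(size=(d_general, d_medical))
b_true = rng.normal(size=d_medical)
anchors_general = rng.normal(size=(n_anchors, d_general))
anchors_medical = (anchors_general @ W_true + b_true
                   + 0.01 * rng.normal(size=(n_anchors, d_medical)))

W, b = fit_affine(anchors_general, anchors_medical)
z_new = rng.normal(size=d_general)          # embedding of a new image from the general encoder
z_stitched = z_new @ W + b                  # mapped into the specialised model's latent space
print("alignment residual:", np.linalg.norm(anchors_general @ W + b - anchors_medical))
```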
Conventional methods for object detection typically require a substantial amount of training data, and preparing such high-quality training data is very labor-intensive. In this paper, we propose a novel few-shot object detection network that aims at detecting objects of unseen categories with only a few annotated examples. Central to our method are our Attention-RPN, Multi-Relation Detector and Contrastive Training strategy, which exploit the similarity between the few-shot support set and query set to detect novel objects while suppressing false detection in the background. To train our network, we contribute a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is one of the first datasets specifically designed for few-shot object detection. Once our few-shot network is trained, it can detect objects of unseen categories without further training or fine-tuning. Our method is general and has a wide range of potential applications. We achieve new state-of-the-art performance on different datasets in the few-shot setting. The dataset link is //github.com/fanq15/Few-Shot-Object-Detection-Dataset.
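The support-query similarity idea can be illustrated with a small numpy sketch in which a globally pooled support feature acts as a per-channel (depth-wise) kernel correlated with the query feature map, and the resulting similarity map re-weights the query features before proposals are scored. The shapes, the pooling choice, and the sigmoid squashing are assumptions about the general mechanism, not the exact Attention-RPN design.

```python
# Hedged sketch of support-conditioned attention over a query feature map.
import numpy as np

def attention_rpn_features(query_feat: np.ndarray, support_feat: np.ndarray):
    """query_feat: (C, H, W) query feature map; support_feat: (C, Hs, Ws) support features."""
    kernel = support_feat.mean(axis=(1, 2))                   # (C,) global-average-pooled support
    attention = (query_feat * kernel[:, None, None]).sum(0)   # (H, W) depth-wise correlation
    attention = 1.0 / (1.0 + np.exp(-attention))              # squash similarities to (0, 1)
    return query_feat * attention[None, :, :]                 # re-weighted query features

rng = np.random.default_rng(0)
query = rng.normal(size=(256, 32, 32))      # toy query feature map
support = rng.normal(size=(256, 7, 7))      # toy support feature map for one category
weighted = attention_rpn_features(query, support)
print(weighted.shape)
```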
We advocate the use of implicit fields for learning generative models of shapes and introduce an implicit field decoder for shape generation, aimed at improving the visual quality of the generated shapes. An implicit field assigns a value to each point in 3D space, so that a shape can be extracted as an iso-surface. Our implicit field decoder is trained to perform this assignment by means of a binary classifier. Specifically, it takes a point coordinate, along with a feature vector encoding a shape, and outputs a value indicating whether the point lies inside or outside the shape. By replacing conventional decoders with our decoder for representation learning and generative modeling of shapes, we demonstrate superior results for tasks such as shape autoencoding, generation, interpolation, and single-view 3D reconstruction, particularly in terms of visual quality.
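A hedged PyTorch sketch of such a decoder is shown below: an MLP takes a 3D point concatenated with a shape feature vector and outputs the logit of the point being inside the shape, trained as a binary classifier; a mesh can then be extracted as an iso-surface of the predicted field. Layer sizes, activations, and the toy labels are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of an implicit-field decoder: point + shape code -> inside/outside logit.
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 1),            # logit: inside (1) vs. outside (0)
        )

    def forward(self, points: torch.Tensor, shape_feat: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3); shape_feat: (B, feat_dim), broadcast to every query point.
        feat = shape_feat[:, None, :].expand(-1, points.shape[1], -1)
        return self.net(torch.cat([points, feat], dim=-1)).squeeze(-1)

decoder = ImplicitDecoder()
points = torch.rand(2, 1024, 3) * 2 - 1          # query points in [-1, 1]^3
shape_feat = torch.randn(2, 128)                 # per-shape code, e.g. from an encoder
logits = decoder(points, shape_feat)
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, logits.shape).float())    # toy inside/outside labels
print(logits.shape, float(loss))
```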