In many categorical response regression applications, the response categories admit a multiresolution structure. That is, subsets of the response categories may naturally be combined into coarser response categories. In such applications, practitioners are often interested in estimating the resolution at which a predictor affects the response category probabilities. In this article, we propose a method for fitting the multinomial logistic regression model in high dimensions that addresses this problem in a unified and data-driven way. In particular, our method allows practitioners to identify which predictors distinguish between coarse categories but not fine categories, which predictors distinguish between fine categories, and which predictors are irrelevant. For model fitting, we propose a scalable algorithm that can be applied when the coarse categories are defined by either overlapping or nonoverlapping sets of fine categories. Statistical properties of our method reveal that it can take advantage of this multiresolution structure in a way existing estimators cannot. We use our method to model cell type probabilities as a function of a cell's gene expression profile (i.e., cell type annotation). Our fitted model provides novel biological insights which may be useful for future automated and manual cell type annotation methodology.
Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
We present a method for computing nearly singular integrals that occur when single or double layer surface integrals, for harmonic potentials or Stokes flow, are evaluated at points nearby. Such values could be needed in solving an integral equation when one surface is close to another or to obtain values at grid points. We replace the singular kernel with a regularized version having a length parameter $\delta$ in order to control discretization error. Analysis near the singularity leads to an expression for the error due to regularization which has terms with unknown coefficients multiplying known quantities. By computing the integral with three choices of $\delta$ we can solve for an extrapolated value that has regularization error reduced to $O(\delta^5)$. In examples with $\delta/h$ constant and moderate resolution we observe total error about $O(h^5)$. For convergence as $h \to 0$ we can choose $\delta$ proportional to $h^q$ with $q < 1$ to ensure the discretization error is dominated by the regularization error. With $q = 4/5$ we find errors about $O(h^4)$. For harmonic potentials we extend the approach to a version with $O(\delta^7)$ regularization; it typically has smaller errors but the order of accuracy is less predictable.
Our research proposes a novel method for reducing the dimensionality of functional data, specifically for the case where the response is a scalar and the predictor is a random function. Our method utilizes distance covariance, and has several advantages over existing methods. Unlike current techniques which require restrictive assumptions such as linear conditional mean and constant covariance, our method has mild requirements on the predictor. Additionally, our method does not involve the use of the unbounded inverse of the covariance operator. The link function between the response and predictor can be arbitrary, and our proposed method maintains the advantage of being model-free, without the need to estimate the link function. Furthermore, our method is naturally suited for sparse longitudinal data. We utilize functional principal component analysis with truncation as a regularization mechanism in the development of our method. We provide justification for the validity of our proposed method, and establish statistical consistency of the estimator under certain regularization conditions. To demonstrate the effectiveness of our proposed method, we conduct simulation studies and real data analysis. The results show improved performance compared to existing methods.
We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models, which defines population-level clusters based on the independence of the maxima of a multivariate stationary mixing random process among clusters. This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference. We also present an algorithm for recovering the clusters of variables without specifying the number of clusters \emph{a priori}. Our work provides some theoretical insights into the consistency of our algorithm, demonstrating that under certain conditions it can effectively identify clusters in the data with a computational complexity that is polynomial in the dimension. This implies that groups can be learned nonparametrically in which block maxima of a dependent process are only sub-asymptotic. To further illustrate the significance of our work, we applied our method to neuroscience and environmental real-datasets. These applications highlight the potential and versatility of the proposed approach.
Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, in general, the performance of these end-to-end models, especially attention-based models, is particularly degraded in the case of long utterances. To address this limitation, we propose adding a fully-differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can enrich the generalization for longer utterances since it allows the system to store and retrieve more information recurrently. Notably, we explore the neural Turing machine (NTM) that results in our proposed Conformer-NTM model architecture for ASR. Experimental results using Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory for long utterances.
Discovering causal relationships from observational data is a fundamental yet challenging task. In some applications, it may suffice to learn the causal features of a given response variable, instead of learning the entire underlying causal structure. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings. ICP assumes that the mechanism for generating the response from its direct causes is the same in all settings and exploits this invariance to output a subset of the causal features. The framework of ICP has been extended to general additive noise models and to nonparametric settings using conditional independence testing. However, nonparametric conditional independence testing often suffers from low power (or poor type I error control) and the aforementioned parametric models are not suitable for applications in which the response is not measured on a continuous scale, but rather reflects categories or counts. To bridge this gap, we develop ICP in the context of transformation models (TRAMs), allowing for continuous, categorical, count-type, and uninformatively censored responses (we show that, in general, these model classes do not allow for identifiability when there is no exogenous heterogeneity). We propose TRAM-GCM, a test for invariance of a subset of covariates, based on the expected conditional covariance between environments and score residuals which satisfies uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we propose an additional invariance test, TRAM-Wald, based on the Wald statistic. We implement both proposed methods in the open-source R package "tramicp" and show in simulations that under the correct model specification, our approach empirically yields higher power than nonparametric ICP based on conditional independence testing.
A rigidity circuit (in 2D) is a minimal dependent set in the rigidity matroid, i.e. a minimal graph supporting a non-trivial stress in any generic placement of its vertices in $\mathbb R^2$. Any rigidity circuit on $n\geq 5$ vertices can be obtained from rigidity circuits on a fewer number of vertices by applying the combinatorial resultant (CR) operation. The inverse operation is called a combinatorial resultant decomposition (CR-decomp). Any rigidity circuit on $n\geq 5$ vertices can be successively decomposed into smaller circuits, until the complete graphs $K_4$ are reached. This sequence of CR-decomps has the structure of a rooted binary tree called the combinatorial resultant tree (CR-tree). A CR-tree encodes an elimination strategy for computing circuit polynomials via Sylvester resultants. Different CR-trees lead to elimination strategies that can vary greatly in time and memory consumption. It is an open problem to establish criteria for optimal CR-trees, or at least to characterize those CR-trees that lead to good elimination strategies. In [12] we presented an algorithm for enumerating CR-trees where we give the algorithms for decomposing 3-connected rigidity circuits in polynomial time. In this paper we focus on those circuits that are not 3-connected, which we simply call 2-connected. In order to enumerate CR-decomps of 2-connected circuits $G$, a brute force exp-time search has to be performed among the subgraphs induced by the subsets of $V(G)$. This exp-time bottleneck is not present in the 3-connected case. In this paper we will argue that we do not have to account for all possible CR-decomps of 2-connected rigidity circuits to find a good elimination strategy; we only have to account for those CR-decomps that are a 2-split, all of which can be enumerated in polynomial time. We present algorithms and computational evidence in support of this heuristic.
In surgery, the application of appropriate force levels is critical for the success and safety of a given procedure. While many studies are focused on measuring in situ forces, little attention has been devoted to relating these observed forces to surgical techniques. Answering questions like "Can certain changes to a surgical technique result in lower forces and increased safety margins?" could lead to improved surgical practice, and importantly, patient outcomes. However, such studies would require a large number of trials and professional surgeons, which is generally impractical to arrange. Instead, we show how robots can learn several variations of a surgical technique from a smaller number of surgical demonstrations and interpolate learnt behaviour via a parameterised skill model. This enables a large number of trials to be performed by a robotic system and the analysis of surgical techniques and their downstream effects on tissue. Here, we introduce a parameterised model of the elliptical excision skill and apply a Bayesian optimisation scheme to optimise the excision behaviour with respect to expert ratings, as well as individual characteristics of excision forces. Results show that the proposed framework can successfully align the generated robot behaviour with subjects across varying levels of proficiency in terms of excision forces.
Neuromorphic computing is one of the few current approaches that have the potential to significantly reduce power consumption in Machine Learning and Artificial Intelligence. Imam & Cleland presented an odour-learning algorithm that runs on a neuromorphic architecture and is inspired by circuits described in the mammalian olfactory bulb. They assess the algorithm's performance in "rapid online learning and identification" of gaseous odorants and odorless gases (short "gases") using a set of gas sensor recordings of different odour presentations and corrupting them by impulse noise. We replicated parts of the study and discovered limitations that affect some of the conclusions drawn. First, the dataset used suffers from sensor drift and a non-randomised measurement protocol, rendering it of limited use for odour identification benchmarks. Second, we found that the model is restricted in its ability to generalise over repeated presentations of the same gas. We demonstrate that the task the study refers to can be solved with a simple hash table approach, matching or exceeding the reported results in accuracy and runtime. Therefore, a validation of the model that goes beyond restoring a learned data sample remains to be shown, in particular its suitability to odour identification tasks.
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communications and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. The combination of these algorithms enable each agent to improve their task allocation strategy through reinforcement learning, while changing how much they explore the system in response to how optimal they believe their current strategy is, given their past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, to then carry out those tasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to 6.7% of the theoretical optimal within the system configurations considered. It provides 5x better performance recovery over no-knowledge retention approaches when system connectivity is impacted, and is tested against systems up to 100 agents with less than a 9% impact on the algorithms' performance.