Accurately selecting and estimating smooth functional effects in additive models with potentially many functions is a challenging task. We introduce a novel Demmler-Reinsch basis expansion to model the functional effects that allows us to orthogonally decompose an effect into its linear and nonlinear parts. We show that our representation allows both parts to be estimated consistently, in contrast to commonly employed mixed model representations. Equipping the reparameterized regression coefficients with normal beta prime spike-and-slab priors allows us to determine whether a continuous covariate has a linear effect, a nonlinear effect, or no effect at all. We provide new theoretical results for the prior and a compelling explanation for its superior Markov chain Monte Carlo mixing performance compared to the spike-and-slab group lasso. We establish an efficient posterior estimation scheme and illustrate our approach with effect selection on the hazard rate of a time-to-event response in the geoadditive Cox regression model, using simulations and data on survival with leukemia.
This thesis explores the generation of local explanations for already deployed machine learning models, aiming to identify optimal conditions for producing meaningful explanations considering both data and user requirements. The primary goal is to develop methods for generating explanations for any model while ensuring that these explanations remain faithful to the underlying model and comprehensible to the users. The thesis is divided into two parts. The first enhances a widely used rule-based explanation method. It then introduces a novel approach for evaluating the suitability of linear explanations to approximate a model. Additionally, it conducts a comparative experiment between two families of counterfactual explanation methods to analyze the advantages of one over the other. The second part focuses on user experiments to assess the impact of three explanation methods and two distinct representations. These experiments measure how users perceive their interaction with the model in terms of understanding and trust, depending on the explanations and representations. This research contributes to better explanation generation, with potential implications for enhancing the transparency, trustworthiness, and usability of deployed AI systems.
Self-distillation (SD) is the process of training a student model using the outputs of a teacher model, with both models sharing the same architecture. Our study theoretically examines SD in multi-class classification with cross-entropy loss, exploring both multi-round SD and SD with refined teacher outputs, inspired by partial label learning (PLL). By deriving a closed-form solution for the student model's outputs, we discover that SD essentially functions as label averaging among instances with high feature correlations. Initially beneficial, this averaging helps the model focus on feature clusters correlated with a given instance for predicting the label. However, it leads to diminishing performance with increasing distillation rounds. Additionally, we demonstrate SD's effectiveness in label noise scenarios and identify the label corruption condition and minimum number of distillation rounds needed to achieve 100% classification accuracy. Our study also reveals that one-step distillation with refined teacher outputs surpasses the efficacy of multi-step SD using the teacher's direct output in high noise rate regimes.
Robotic manipulation of deformable linear objects (DLOs) is an active area of research, though emerging applications, like automotive wire harness installation, introduce constraints that have not been considered in prior work. Confined workspaces and limited visibility complicate prior assumptions of multi-robot manipulation and direct measurement of DLO configuration (state). This work focuses on single-arm manipulation of stiff DLOs (StDLOs) connected to form a DLO network (DLON), for which the measurements (output) are the endpoint poses of the DLON, which are subject to unknown dynamics during manipulation. To demonstrate feasibility of output-based control without state estimation, direct input-output dynamics are shown to exist by training neural network models on simulated trajectories. Output dynamics are then approximated with polynomials and found to contain well-known rigid body dynamics terms. A composite model consisting of a rigid body model and an online data-driven residual is developed, which predicts output dynamics more accurately than either model alone, and without prior experience with the system. An adaptive model predictive controller is developed with the composite model for DLON manipulation, which completes DLON installation tasks, both in simulation and with a physical automotive wire harness.
Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models leading to underfitting and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity.
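The $k$-fold CV baseline referred to above can be sketched as follows. This is a minimal illustration in plain Python, not the experimental setup of the study: the function names (`k_fold_indices`, `cross_validate`) and the toy mean-predictor model are hypothetical, and the folds are contiguous rather than shuffled or stratified.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def cross_validate(fit, evaluate, X, y, k):
    """Return the mean held-out score over k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(evaluate(model, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / len(scores)


# Hypothetical toy model: always predict the mean training target.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def mse(model, X_test, y_test):
    """Mean squared error of the constant prediction `model`."""
    return sum((model - t) ** 2 for t in y_test) / len(y_test)
```

Each model is refit $k$ times, once per held-out fold, so the cost of CV scales with $k$; this is the computational cost against which MV is compared in the abstract.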
Disentangled representation learning strives to extract the intrinsic factors within observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image into a set of concept tokens and treat them as the condition of the latent diffusion model for image reconstruction, where cross-attention over the concept tokens bridges the interaction between the encoder and the diffusion model. Without any additional regularization, this framework achieves superior disentanglement performance on the benchmark datasets, surpassing all previous methods with intricate designs. We have conducted comprehensive ablation studies and visualization analysis, shedding light on the functioning of this model. This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs. We anticipate that our findings will inspire further investigation into diffusion for disentangled representation learning, towards more sophisticated data analysis and understanding.
In research on manufacturing systems and autonomous robots, the term capability is used for a machine-interpretable specification of a system function. Approaches in this research area develop information models that capture all information relevant to interpreting the requirements, effects and behavior of functions. These approaches are intended to overcome the heterogeneity resulting from the various types of processes and from the large number of different vendors. However, these models and associated methods do not offer solutions for automated process planning, i.e. finding a sequence of individual capabilities required to manufacture a certain product or to accomplish a mission using autonomous robots. Instead, this is a typical task for AI planning approaches, which unfortunately require considerable effort to create the respective planning problem descriptions. In this paper, we present an approach that combines these two topics: starting from a semantic capability model, an AI planning problem is automatically generated. The planning problem is encoded using Satisfiability Modulo Theories, and an existing solver is used to find valid capability sequences including required parameter values. The approach also offers possibilities to integrate existing human expertise and to provide explanations for human operators in order to help them understand planning decisions.
Prior studies on emergence in large models have primarily focused on how the functional capabilities of large language models (LLMs) scale with model size. Our research goes beyond this paradigm: we aim to deepen the understanding of emergence within LLMs by emphasizing not just model size but, more significantly, the complex behavior of neuron interactions during training. By introducing the concepts of "self-organization" and "multifractal analysis," we explore how neuron interactions dynamically evolve during training, leading to "emergence," mirroring the phenomenon in natural systems where simple micro-level interactions give rise to complex macro-level behaviors. To quantitatively analyze the continuously evolving interactions among neurons in large models during training, we propose the Neuron-based Multifractal Analysis (NeuroMFA). Utilizing NeuroMFA, we conduct a comprehensive examination of the emergent behavior in LLMs through the lens of both model size and training process, paving new avenues for research into emergence in large models.
We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.
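For readers unfamiliar with the dot-product self-attention component named above, a minimal single-head sketch in plain Python follows. It is an illustration only and is not taken from the paper: the function names are hypothetical, the scaling by the square root of the key dimension follows the standard Transformer formulation and is an assumption here, and the learned query, key and value projections and the positional encoding are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(queries, keys, values):
    """Single-head attention: each output row is a weighted average of `values`.

    queries, keys: lists of vectors with equal dimension d.
    values: list of vectors, one per key.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d) (assumed).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical the weights are uniform and the output reduces to the plain mean of the value vectors; sharper score differences concentrate the weights on a few positions, which is how the layer can retrieve sparse memory.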
The existence of representative datasets is a prerequisite for many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcoming the limitations of purely data-driven approaches, and eventually to increasing the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.
While existing machine learning models have achieved great success for sentiment classification, they typically do not explicitly capture sentiment-oriented word interaction, which can lead to poor results for fine-grained analysis at the snippet level (a phrase or sentence). Factorization Machines provide a possible approach to learning element-wise interaction for recommender systems, but they are not directly applicable to our task because they cannot model contexts and word sequences. In this work, we develop two Position-aware Factorization Machines which consider word interaction, context and position information. Such information is jointly encoded in a set of sentiment-oriented word interaction (SWI) vectors. Compared to traditional word embeddings, SWI vectors explicitly capture sentiment-oriented word interaction and simplify the parameter learning. Experimental results show that while the proposed models have comparable performance with state-of-the-art methods for document-level classification, they benefit the snippet/sentence-level sentiment analysis.
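As background for the Factorization Machine mentioned above, the sketch below scores one feature vector with a standard second-order factorization machine in plain Python. It illustrates only the generic model, not the position-aware variants proposed in the paper; the function names and the parameters `w0`, `w` and `V` are hypothetical. The fast version relies on the well-known identity $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \tfrac{1}{2} \sum_f \big[ (\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2 \big]$.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score, linear in len(x) * k.

    x: feature vector, w0: bias, w: linear weights,
    V: one k-dimensional latent vector per feature.
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score computed directly over all feature pairs i < j."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score
```

Every pair of features interacts through the inner product of its latent vectors, with no notion of word order or surrounding context, which is the limitation the abstract points to.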