Deep learning models, while achieving state-of-the-art performance on many tasks, are susceptible to adversarial attacks that exploit inherent vulnerabilities in their architectures. Adversarial attacks manipulate the input data with imperceptible perturbations, causing the model to misclassify the data or produce erroneous outputs. This work focuses on enhancing the robustness of targeted classifier models against adversarial attacks. To achieve this, a convolutional autoencoder-based approach is employed that effectively counters adversarial perturbations introduced to the input images. By generating images that closely resemble the input images, the proposed methodology aims to restore the model's accuracy.
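A minimal sketch of the kind of convolutional autoencoder defense described above, used as an input-purification front end before a frozen classifier; the 3-channel/32x32 input and the layer sizes are illustrative assumptions, not the exact architecture of the paper.

```python
# Sketch of a convolutional autoencoder that reconstructs clean-looking images
# from (possibly adversarial) inputs; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train to reconstruct clean images from perturbed inputs; at test time the
# reconstruction is fed to the frozen target classifier.
ae = DenoisingAutoencoder()
x_adv = torch.rand(8, 3, 32, 32)      # stand-in for adversarial inputs
x_clean = torch.rand(8, 3, 32, 32)    # corresponding clean images
loss = nn.functional.mse_loss(ae(x_adv), x_clean)
loss.backward()
```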
We investigate the applicability of machine learning technologies to the development of parsimonious, interpretable, catchment-scale hydrologic models using directed-graph architectures based on the mass-conserving perceptron (MCP) as the fundamental computational unit. Here, we focus on architectural complexity (depth) at a single location, rather than universal applicability (breadth) across large samples of catchments. The goal is to discover a minimal representation (numbers of cell-states and flow paths) of the dominant processes that can explain the input-state-output behaviors of a given catchment, with particular emphasis on simulating the full range (high, medium, and low) of flow dynamics. We find that a HyMod-like architecture with three cell-states and two major flow pathways achieves such a representation at our study location, but that the additional incorporation of an input-bypass mechanism significantly improves the timing and shape of the hydrograph, while the inclusion of bi-directional groundwater mass exchanges significantly enhances the simulation of baseflow. Overall, our results demonstrate the importance of using multiple diagnostic metrics for model evaluation, while highlighting the need for designing training metrics that are better suited to extracting information across the full range of flow dynamics. Further, they set the stage for interpretable regional-scale MCP-based hydrological modeling (using large sample data) by using neural architecture search to determine appropriate minimal representations for catchments in different hydroclimatic regimes.
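A simplified sketch of a mass-conserving cell-state update in the spirit of the MCP unit above: gated fractions of the available mass are routed to outflow paths and the remainder is retained, so storage change plus outflows always equals inflow. The sigmoid gating and the two outflow paths are illustrative assumptions, not the MCP formulation itself.

```python
# Toy mass-conserving cell: mass in = mass retained + mass routed out.
import numpy as np

def mass_conserving_cell(state, inflow, w_fast, w_slow):
    available = state + inflow                       # total mass to allocate
    f_fast = 1.0 / (1.0 + np.exp(-w_fast))           # gate in (0, 1)
    f_slow = (1.0 - f_fast) / (1.0 + np.exp(-w_slow))
    q_fast = f_fast * available                      # quick-flow path
    q_slow = f_slow * available                      # slow/baseflow path
    new_state = available - q_fast - q_slow          # retained storage
    return new_state, q_fast, q_slow

state, q_fast, q_slow = mass_conserving_cell(state=10.0, inflow=2.0,
                                             w_fast=-1.0, w_slow=0.0)
assert abs((state + q_fast + q_slow) - 12.0) < 1e-9  # mass balance holds
```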
People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying "excuse me" to pass people in a busy corridor. We would like robots to also demonstrate expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalities or social situations, while data-driven methods require specialized datasets for each social situation the robot is used in. We propose to leverage the rich social context available from large language models (LLMs), and their ability to generate motion based on instructions or user preferences, to produce expressive robot motion that is adaptable and composable, allowing behaviors to build upon one another. Our approach utilizes few-shot chain-of-thought prompting to translate human language instructions into parametrized control code using the robot's available and learned skills. Through user studies and simulation experiments, we demonstrate that our approach produces behaviors that users found competent and easy to understand. Supplementary material can be found at //generative-expressive-motion.github.io/.
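A minimal sketch of the few-shot chain-of-thought prompting step described above: the prompt pairs a social situation with reasoning and parametrized calls to robot skills. The `llm_generate` callable and the skill names (`robot.nod`, `robot.say`, etc.) are placeholders, not the authors' actual API.

```python
# Few-shot chain-of-thought prompt mapping instructions to parametrized skill calls.
FEW_SHOT = """
Instruction: Acknowledge the person who just glanced at you.
Reasoning: A brief nod communicates acknowledgement without interrupting them.
Code: robot.nod(speed=0.5); robot.gaze_at(person_id=0, duration=1.0)

Instruction: Politely get past people blocking a corridor.
Reasoning: Slowing down and saying "excuse me" signals intent without startling anyone.
Code: robot.say("Excuse me"); robot.lean(direction="forward", angle=5); robot.move(speed=0.3)
"""

def generate_expressive_behavior(instruction: str, llm_generate) -> str:
    prompt = FEW_SHOT + f"\nInstruction: {instruction}\nReasoning:"
    completion = llm_generate(prompt)          # any text-completion LLM endpoint
    # Keep only the generated control-code line; the reasoning serves as a rationale.
    code_lines = [l for l in completion.splitlines() if l.startswith("Code:")]
    return code_lines[0].removeprefix("Code:").strip() if code_lines else ""
```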
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines have been created by carefully considering the impact of various model hyperparameters controlling model shape on the efficiency of the underlying computation kernels executed on the GPU. We find that the throughput of models with efficient shapes is up to 39\% higher than that of models with a similar number of parameters but unoptimized shapes, while accuracy is preserved.
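An illustrative helper in the spirit of the guidelines above, following the common GPU efficiency rule of thumb that GEMM dimensions (hidden size, FFN width, vocabulary size) should be divisible by a multiple of 8, often 64 or 128, to keep tensor-core kernels efficient; the specific multiples chosen here are assumptions, not values prescribed by the paper.

```python
# Round transformer shape hyperparameters up to hardware-friendly multiples.
def round_up(value: int, multiple: int) -> int:
    return ((value + multiple - 1) // multiple) * multiple

def efficient_shape(hidden: int, ffn: int, vocab: int,
                    dim_multiple: int = 64, vocab_multiple: int = 128) -> dict:
    return {
        "hidden_size": round_up(hidden, dim_multiple),
        "ffn_size": round_up(ffn, dim_multiple),
        "vocab_size": round_up(vocab, vocab_multiple),
    }

print(efficient_shape(hidden=1000, ffn=4000, vocab=50257))
# {'hidden_size': 1024, 'ffn_size': 4032, 'vocab_size': 50304}
```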
Modelling learning objects (LOs) within their context enables the learner to advance from a basic, remembering-level learning objective to a higher-order one, i.e., a level with application and analysis objectives. While hierarchical data models are commonly used in digital learning platforms, graph-based models make it possible to represent the context of LOs in those platforms. This provides a foundation for personalized recommendations of learning paths. In this paper, the transformation of hierarchical data models into knowledge graph (KG) models of LOs using text mining is introduced and evaluated. We utilize custom text mining pipelines to mine semantic relations between elements of an expert-curated hierarchical model. We evaluate the KG structure and relation extraction using graph quality-control metrics and by comparing algorithmic semantic similarities to expert-defined ones. The results show that the relations in the KG are semantically comparable to those defined by domain experts, and that the proposed KG improves the representation and linking of LO contexts, as reflected in increased graph communities and betweenness centrality.
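A minimal sketch of turning mined (subject, relation, object) triples into a knowledge graph and inspecting it with structural metrics of the kind mentioned above; the example triples and the use of networkx are illustrative assumptions, not the paper's pipeline.

```python
# Build a small KG from mined relations and compute quality-control-style metrics.
import networkx as nx

triples = [
    ("Linear regression", "prerequisite_of", "Logistic regression"),
    ("Gradient descent", "used_in", "Linear regression"),
    ("Logistic regression", "example_of", "Classification"),
]

kg = nx.DiGraph()
for subj, rel, obj in triples:
    kg.add_edge(subj, obj, relation=rel)

print(nx.number_weakly_connected_components(kg))   # community-style structure
print(nx.betweenness_centrality(kg))                # centrality of linking concepts
```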
Influence diagnosis is an integral part of data analysis, yet most existing methodological frameworks presume a deterministic submodel and are designed for low-dimensional data (i.e., the number of predictors p is smaller than the sample size n). However, the stochastic selection of a submodel from high-dimensional data, where p exceeds n, has become ubiquitous. Thus, methods for identifying observations that could exert undue influence on the choice of a submodel can play an important role in this setting. To date, discussion of this topic has been limited, falling short in two domains: (i) constrained ability to detect multiple influential points, and (ii) applicability only in restrictive settings. After describing the problem, we characterize and formalize the concept of influential observations in variable selection. We then propose a generalized diagnostic measure, extended from an available metric, that accommodates different model selectors and multiple influential observations, and we establish its asymptotic distribution for large p, thereby providing guidelines to ascertain influential observations. A high-dimensional clustering procedure is further incorporated into our proposed scheme to detect multiple influential points. Simulations are conducted to assess the performance of various diagnostic approaches. The proposed procedure further demonstrates its value in improving predictive power when analyzing thermally stimulated pain based on fMRI data.
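A toy leave-one-out illustration of the general idea of influence on variable selection: an observation is a candidate influential point when removing it changes the lasso-selected support. This is a simplified stand-in for the concept, not the paper's proposed diagnostic measure; the contamination setup is invented for illustration.

```python
# Flag observations whose removal changes which variables the lasso selects.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                         # high-dimensional: p > n
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(n)
y[0] += 25.0                           # contaminate one observation

def support(X, y, alpha=0.1):
    coef = Lasso(alpha=alpha, max_iter=5000).fit(X, y).coef_
    return frozenset(np.flatnonzero(coef))

full = support(X, y)
influence = [len(full ^ support(np.delete(X, i, 0), np.delete(y, i)))
             for i in range(n)]
print(np.argsort(influence)[::-1][:5])  # candidates for influential observations
```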
Despite the recent progress in deep learning, most approaches still opt for a silo-like solution, learning each task in isolation by training a separate neural network for each individual task. Many real-world problems, however, call for a multi-modal approach and, therefore, for multi-tasking models. Multi-task learning (MTL) aims to leverage useful information across tasks to improve the generalization capability of a model. This thesis is concerned with multi-task learning in the context of computer vision. First, we review existing approaches for MTL. Next, we propose several methods that tackle important aspects of multi-task learning. The proposed methods are evaluated on various benchmarks. The results show several advances over the state of the art in multi-task learning. Finally, we discuss several possibilities for future work.
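For concreteness, a generic hard-parameter-sharing setup often used as an MTL baseline in vision: one shared encoder with per-task heads and a summed loss. This is a common baseline pattern illustrating the MTL idea, not any specific method proposed in the thesis; the tasks and layer sizes are assumptions.

```python
# Hard parameter sharing: shared encoder, task-specific heads, joint loss.
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(8), nn.Flatten())
        self.cls_head = nn.Linear(32 * 8 * 8, num_classes)   # classification task
        self.reg_head = nn.Linear(32 * 8 * 8, 1)              # regression task (e.g., depth)

    def forward(self, x):
        z = self.encoder(x)
        return self.cls_head(z), self.reg_head(z)

model = SharedEncoderMTL()
x = torch.rand(4, 3, 64, 64)
logits, pred = model(x)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,))) \
       + nn.functional.mse_loss(pred.squeeze(1), torch.rand(4))
loss.backward()
```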
Causality knowledge is vital to building robust AI systems. Deep learning models often perform poorly on tasks that require causal reasoning, which is often derived using some form of commonsense knowledge that is not immediately available in the input but is implicitly inferred by humans. Prior work has unraveled spurious observational biases that models fall prey to in the absence of causality. While language representation models preserve contextual knowledge within learned embeddings, they do not factor in causal relationships during training. By blending causal relationships with the input features of an existing model that performs visual cognition tasks (such as scene understanding, video captioning, and video question-answering), better performance can be achieved owing to the insight that causal relationships bring. Recently, several models have been proposed that tackle the task of mining causal data from either the visual or the textual modality. However, little research mines causal relationships by juxtaposing the visual and language modalities. While images offer a rich and easy-to-process resource for mining causality knowledge, videos are denser and consist of naturally time-ordered events. Moreover, textual information offers details that may be implicit in videos. We propose iReason, a framework that infers visual-semantic commonsense knowledge using both videos and natural language captions. Furthermore, iReason's architecture integrates a causal rationalization module to aid interpretability, error analysis, and bias detection. We demonstrate the effectiveness of iReason through a two-pronged comparative analysis with language representation learning models (BERT, GPT-2) as well as current state-of-the-art multimodal causality models.
Data augmentation has been widely used to improve the generalizability of machine learning models. However, comparatively little work studies data augmentation for graphs. This is largely due to the complex, non-Euclidean structure of graphs, which limits possible manipulation operations. Augmentation operations commonly used in vision and language have no analogs for graphs. Our work studies graph data augmentation for graph neural networks (GNNs) in the context of improving semi-supervised node classification. We discuss practical and theoretical motivations, considerations, and strategies for graph data augmentation. Our work shows that neural edge predictors can effectively encode class-homophilic structure to promote intra-class edges and demote inter-class edges in a given graph structure, and our main contribution, the GAug graph data augmentation framework, leverages these insights to improve performance in GNN-based node classification via edge prediction. Extensive experiments on multiple benchmarks show that augmentation via GAug improves performance across GNN architectures and datasets.
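A sketch of the core augmentation idea above: score candidate edges with a learned edge predictor (here a GAE-style inner-product decoder over node embeddings), then add high-scoring missing edges and drop low-scoring existing ones before training the node classifier. The thresholds and the source of the embeddings are illustrative assumptions, not GAug's exact procedure.

```python
# Edge-prediction-based graph augmentation: promote likely edges, demote unlikely ones.
import torch

def augment_adjacency(adj: torch.Tensor, z: torch.Tensor,
                      add_thresh=0.9, drop_thresh=0.1) -> torch.Tensor:
    scores = torch.sigmoid(z @ z.t())                    # predicted edge probabilities
    adj_aug = adj.clone()
    adj_aug[(adj == 0) & (scores > add_thresh)] = 1.0    # add likely (intra-class) edges
    adj_aug[(adj == 1) & (scores < drop_thresh)] = 0.0   # remove unlikely (inter-class) edges
    return adj_aug

adj = torch.bernoulli(torch.full((6, 6), 0.3))           # toy adjacency matrix
z = torch.randn(6, 16)                                   # node embeddings from an edge predictor
adj_aug = augment_adjacency(adj, z)                      # feed adj_aug to the GNN classifier
```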
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch leads to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style and illumination, and 2) the instance-level shift, such as object appearance and size. We build our approach on the recent state-of-the-art Faster R-CNN model and design two domain adaptation components, at the image level and the instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory and are implemented by learning a domain classifier in an adversarial training manner. The domain classifiers at the two levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our approach on multiple datasets, including Cityscapes, KITTI, and SIM10K. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.
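A sketch of the adversarial domain classifier pattern used for this kind of image-/instance-level alignment: a gradient reversal layer flips gradients so the backbone learns domain-invariant features while the classifier tries to tell source from target. The feature dimension and network widths are illustrative assumptions, not the paper's exact components.

```python
# Gradient reversal + domain classifier for adversarial feature alignment.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None   # reverse gradients into the backbone

class DomainClassifier(nn.Module):
    def __init__(self, feat_dim=256, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats):
        return self.net(GradReverse.apply(feats, self.lamb))

clf = DomainClassifier()
feats = torch.randn(8, 256, requires_grad=True)                    # backbone/ROI features
domain_labels = torch.cat([torch.zeros(4, 1), torch.ones(4, 1)])   # source vs. target
loss = nn.functional.binary_cross_entropy_with_logits(clf(feats), domain_labels)
loss.backward()
```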
While existing machine learning models have achieved great success in sentiment classification, they typically do not explicitly capture sentiment-oriented word interaction, which can lead to poor results for fine-grained analysis at the snippet level (a phrase or sentence). Factorization Machines provide a possible approach to learning element-wise interaction for recommender systems, but they are not directly applicable to our task because they cannot model contexts and word sequences. In this work, we develop two Position-aware Factorization Machines that consider word interaction, context, and position information. Such information is jointly encoded in a set of sentiment-oriented word interaction (SWI) vectors. Compared to traditional word embeddings, SWI vectors explicitly capture sentiment-oriented word interaction and simplify parameter learning. Experimental results show that while the proposed models achieve comparable performance to state-of-the-art methods for document-level classification, they benefit snippet- and sentence-level sentiment analysis.
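A simplified sketch of a factorization-machine-style score with a position-aware weight on each pairwise word interaction; the exponential decay over word distance and the dot-product interaction are illustrative assumptions, not the paper's exact formulation.

```python
# Position-weighted pairwise interaction score over word interaction vectors.
import numpy as np

def position_aware_fm_score(embeddings: np.ndarray, gamma: float = 0.5) -> float:
    """embeddings: (seq_len, k) sentiment-oriented word interaction vectors."""
    n = embeddings.shape[0]
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            pos_weight = np.exp(-gamma * (j - i - 1))     # nearer words interact more strongly
            score += pos_weight * float(embeddings[i] @ embeddings[j])
    return score

words = np.random.randn(5, 8)      # toy 5-word snippet, k = 8
print(position_aware_fm_score(words))
```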