One of the bottlenecks of automated driving technologies is safe and socially acceptable interactions with human-driven vehicles, for example during merging. Driver models that provide accurate predictions of joint and individual driver behaviour of high-level decisions, safety margins, and low-level control inputs are required to improve the interactive capabilities of automated driving. Existing driver models typically focus on one of these aspects. Unified models capturing all aspects are missing which hinders understanding of the principles that govern human traffic interactions. This in turn limits the ability of automated vehicles to resolve merging interactions. Here, we present a communication-enabled interaction model based on risk perception with the potential to capture merging interactions on all three levels. Our model accurately describes human behaviour in a simplified merging scenario, addressing both individual actions (such as velocity adjustments) and joint actions (such as the order of merging). Contrary to other interaction models, our model does not assume humans are rational and explicitly accounts for communication between drivers. Our results demonstrate that communication and risk-based decision-making explain observed human interactions on multiple levels. This explanation improves our understanding of the underlying mechanisms of human traffic interactions and poses a step towards interaction-aware automated driving.
We present a family of automata networks that solve the k-parity problem when run in parallel. These solutions are constructed by connecting cliques in a non-cyclical fashion. The size of the local neighbourhood is linear in the size of the alphabet, and the convergence time is proven to always be the diameter of the interaction graph. We show that this family of solutions can be slightly altered to obtain an equivalent family of solutions to the k-synchronisation problem, which means that these solutions converge from any initial configuration to the cycle which contains all the uniform configurations over the alphabet, in order.
In a world of increasing closed-source commercial machine learning models, model evaluations from developers must be taken at face value. These benchmark results, whether over task accuracy, bias evaluations, or safety checks, are traditionally impossible to verify by a model end-user without the costly or impossible process of re-performing the benchmark on black-box model outputs. This work presents a method of verifiable model evaluation using model inference through zkSNARKs. The resulting zero-knowledge computational proofs of model outputs over datasets can be packaged into verifiable evaluation attestations showing that models with fixed private weights achieve stated performance or fairness metrics over public inputs. These verifiable attestations can be performed on any standard neural network model with varying compute requirements. For the first time, we demonstrate this across a sample of real-world models and highlight key challenges and design solutions. This presents a new transparency paradigm in the verifiable evaluation of private models.
Flow-based models are widely used in generative tasks, including normalizing flow, where a neural network transports from a data distribution $P$ to a normal distribution. This work develops a flow-based model that transports from $P$ to an arbitrary $Q$ where both distributions are only accessible via finite samples. We propose to learn the dynamic optimal transport between $P$ and $Q$ by training a flow neural network. The model is trained to optimally find an invertible transport map between $P$ and $Q$ by minimizing the transport cost. The trained optimal transport flow subsequently allows for performing many downstream tasks, including infinitesimal density ratio estimation (DRE) and distribution interpolation in the latent space for generative models. The effectiveness of the proposed model on high-dimensional data is demonstrated by strong empirical performance on high-dimensional DRE, OT baselines, and image-to-image translation.
A Gaussian process is proposed as a model for the posterior distribution of the local predictive ability of a model or expert, conditional on a vec- tor of covariates, from historical predictions in the form of log predictive scores. Assuming Gaussian expert predictions and a Gaussian data generat- ing process, a linear transformation of the predictive score follows a noncen- tral chi-squared distribution with one degree of freedom. Motivated by this we develop a non-central chi-squared Gaussian process regression to flexibly model local predictive ability, with the posterior distribution of the latent GP function and kernel hyperparameters sampled by Hamiltonian Monte Carlo. We show that a cube-root transformation of the log scores is approximately Gaussian with homoscedastic variance, which makes it possible to estimate the model much faster by marginalizing the latent GP function analytically. Linear pools based on learned local predictive ability are applied to predict daily bike usage in Washington DC.
Automatic Speech Recognition (ASR) systems are used in the financial domain to enhance the caller experience by enabling natural language understanding and facilitating efficient and intuitive interactions. Increasing use of ASR systems requires that such systems exhibit very low error rates. The predominant ASR models to collect numeric data are large, general-purpose commercial models -- Google Speech-to-text (STT), or Amazon Transcribe -- or open source (OpenAI's Whisper). Such ASR models are trained on hundreds of thousands of hours of audio data and require considerable resources to run. Despite recent progress large speech recognition models, we highlight the potential of smaller, specialized "micro" models. Such light models can be trained perform well on number recognition specific tasks, competing with general models like Whisper or Google STT while using less than 80 minutes of training time and occupying at least an order of less memory resources. Also, unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets, which makes them highly accurate, agile, and easy to retrain, while using low compute resources. We present our work on creating micro models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. Our work contributes to domain-specific ASR models, improving digit recognition accuracy, and privacy of data. An added advantage, their low resource consumption allows them to be hosted on-premise, keeping private data local instead uploading to an external cloud. Our results indicate that our micro-model makes less errors than the best-of-breed commercial or open-source ASRs in recognizing digits (1.8% error rate of our best micro-model versus 5.8% error rate of Whisper), and has a low memory footprint (0.66 GB VRAM for our model versus 11 GB VRAM for Whisper).
Autonomous vehicles rely on accurate trajectory prediction to inform decision-making processes related to navigation and collision avoidance. However, current trajectory prediction models show signs of overfitting, which may lead to unsafe or suboptimal behavior. To address these challenges, this paper presents a comprehensive framework that categorizes and assesses the definitions and strategies used in the literature on evaluating and improving the robustness of trajectory prediction models. This involves a detailed exploration of various approaches, including data slicing methods, perturbation techniques, model architecture changes, and post-training adjustments. In the literature, we see many promising methods for increasing robustness, which are necessary for safe and reliable autonomous driving.
The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio only proprietary data respectively, has led to a stronger need for large scale public ASR corpora and competitive open source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains unclear if decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under permissive license.
The extensive use of HPC infrastructures and frameworks for running dataintensive applications has led to a growing interest in data partitioning techniques and strategies. In fact, application performance can be heavily affected by how data are partitioned, which in turn depends on the selected size for data blocks, i.e. the block size. Therefore, finding an effective partitioning, i.e. a suitable block size, is a key strategy to speed-up parallel data-intensive applications and increase scalability. This paper describes a methodology, namely BLEST-ML (BLock size ESTimation through Machine Learning), for block size estimation that relies on supervised machine learning techniques. The proposed methodology was evaluated by designing an implementation tailored to dislib, a distributed computing library highly focused on machine learning algorithms built on top of the PyCOMPSs framework. We assessed the effectiveness of the provided implementation through an extensive experimental evaluation considering different algorithms from dislib, datasets, and infrastructures, including the MareNostrum 4 supercomputer. The results we obtained show the ability of BLEST-ML to efficiently determine a suitable way to split a given dataset, thus providing a proof of its applicability to enable the efficient execution of data-parallel applications in high performance environments.
Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates known as effect modifiers. Modeling of these effect differences is important for etiological goals and for purposes of optimizing treatment. Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. In longitudinal health studies, information on many demographic, behavioural, biological, and clinical covariates may be available, among which some might cause heterogeneous treatment effects. A data-driven approach for selecting the effect modifiers of an exposure may be necessary if these effect modifiers are \textit{a priori} unknown and need to be identified. Although variable selection techniques are available in the context of estimating conditional average treatment effects using marginal structural models, or in the context of estimating optimal dynamic treatment regimens, all of these methods consider an outcome measured at a single point in time. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study to evaluate the performance of the proposed estimator in finite samples and for verification of its double-robustness property. Our work is motivated by a study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Universit\'e de Montr\'eal.
Recent advances in 3D fully convolutional networks (FCN) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from the large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafting features or training class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that will first use a 3D FCN to roughly define a candidate region, which will then be used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital that includes 150 CT scans, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5 to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve a significantly higher performance in small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download: //github.com/holgerroth/3Dunet_abdomen_cascade.