In autonomous driving tasks, scene understanding is the first step towards predicting the future behavior of the surrounding traffic participants. Yet how to represent a given scene and extract its features remain open research questions. In this study, we propose a novel text-based representation of traffic scenes and process it with a pre-trained language encoder. First, we show that text-based representations, combined with classical rasterized image representations, lead to descriptive scene embeddings. Second, we benchmark our predictions on the nuScenes dataset and show significant improvements over the baselines. Third, we show in an ablation study that a joint encoder of text and rasterized images outperforms the individual encoders, confirming that the two representations have complementary strengths.
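A minimal sketch of such a joint encoder follows; this is not the authors' architecture, and the language model name, CNN layout, and fusion scheme are illustrative assumptions. A pre-trained language encoder embeds a textual scene description, a small CNN embeds the rasterized scene, and a linear layer fuses the two into one scene embedding.

```python
# Sketch of a joint text + raster scene encoder (all choices illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointSceneEncoder(nn.Module):
    def __init__(self, text_model="distilbert-base-uncased", embed_dim=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_model)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.raster_encoder = nn.Sequential(  # toy CNN for the rasterized scene
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        hidden = self.text_encoder.config.hidden_size
        self.fuse = nn.Linear(hidden + 64, embed_dim)  # joint scene embedding

    def forward(self, scene_texts, raster):
        tokens = self.tokenizer(scene_texts, return_tensors="pt", padding=True)
        text_emb = self.text_encoder(**tokens).last_hidden_state[:, 0]  # [CLS] token
        img_emb = self.raster_encoder(raster)
        return self.fuse(torch.cat([text_emb, img_emb], dim=-1))
```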
Collaborative vehicle routing occurs when carriers collaborate through sharing their transportation requests and performing transportation requests on behalf of each other. This achieves economies of scale, thus reducing cost, greenhouse gas emissions and road congestion. But which carrier should partner with whom, and how much should each carrier be compensated? Traditional game theoretic solution concepts are expensive to calculate as the characteristic function scales exponentially with the number of agents. This would require solving the vehicle routing problem (NP-hard) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game solved using deep multi-agent reinforcement learning, where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function; thus, when deployed in production, we only need to evaluate the expensive post-collaboration vehicle routing problem once. Our contribution is that we are the first to consider both the route allocation problem and gain sharing problem simultaneously - without access to the expensive characteristic function. Through decentralised machine learning, our agents bargain with each other and agree to outcomes that correlate well with the Shapley value - a fair profit allocation mechanism. Importantly, we are able to achieve a reduction in run-time of 88%.
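To see why avoiding the characteristic function matters, note that the exact Shapley value requires evaluating the characteristic function v on every coalition, i.e. on the order of 2^n calls, each of which would here mean solving an NP-hard vehicle routing problem. The toy sketch below, with a hypothetical stand-in for v, makes this combinatorial cost explicit.

```python
# Exact Shapley values via the permutation-weight formula.
# v is a toy stand-in; in collaborative routing each v(S) call
# would require solving a vehicle routing problem.
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                s = set(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (v(s | {p}) - v(s))  # marginal contribution
    return phi

# Toy characteristic function: savings grow superadditively with coalition size.
v = lambda s: len(s) ** 2
print(shapley_values(["A", "B", "C"], v))  # {'A': 3.0, 'B': 3.0, 'C': 3.0}
```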
We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. In contrast to this commonly used approach, we propose a weighted variance generative model, in which the contribution of each spectrogram time-frame to parameter learning is weighted. We impose a Gamma prior distribution on the weights, which effectively leads to a Student's t-distribution instead of a Gaussian for speech generative modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram auto-encoding and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
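The Gaussian-to-Student's t connection follows the standard scale-mixture argument, sketched below for a real-valued frame with illustrative notation (σ²(z) denotes the decoder variance; the complex-valued case is analogous):

```latex
% Standard Gaussian scale-mixture argument (notation illustrative):
% a Gamma-distributed per-frame weight, marginalised out of a
% zero-mean Gaussian, yields a Student's t likelihood.
\begin{align}
  s \mid w &\sim \mathcal{N}\!\left(0, \tfrac{\sigma^2(z)}{w}\right), \qquad
  w \sim \mathrm{Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right), \\
  p(s) &= \int_0^\infty p(s \mid w)\, p(w)\, \mathrm{d}w
        = \mathcal{T}_\nu\!\left(s \,;\, 0, \sigma^2(z)\right).
\end{align}
```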
With the advent of vehicles equipped with advanced driver-assistance systems, such as adaptive cruise control (ACC) and other automated driving features, the potential for cyberattacks on these automated vehicles (AVs) has emerged. While overt attacks that force vehicles to collide may be easily identified, more insidious attacks, which only slightly alter driving behavior, can result in network-wide increases in congestion, fuel consumption, and even crash risk without being easily detected. To address the detection of such attacks, we first present a traffic model framework for three types of potential cyberattacks: malicious manipulation of vehicle control commands, false data injection attacks on sensor measurements, and denial-of-service (DoS) attacks. We then investigate the impacts of these attacks at both the individual vehicle (micro) and traffic flow (macro) levels. A novel generative adversarial network (GAN)-based anomaly detection model is proposed for real-time identification of such attacks using vehicle trajectory data. We provide numerical evidence demonstrating the efficacy of our machine learning approach in detecting cyberattacks on ACC-equipped vehicles. The proposed method is compared against several recently proposed neural network models and is observed to have higher accuracy in identifying anomalous driving behaviors of ACC vehicles.
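A minimal sketch of GAN-based anomaly scoring on trajectory windows follows; this is not the paper's model, and the window length, feature set, and architectures are illustrative assumptions. A GAN is trained on normal driving windows, after which a low discriminator score on a real window flags anomalous behavior.

```python
# Sketch: GAN over fixed-length trajectory windows; the discriminator
# score is reused as an anomaly score at test time (choices illustrative).
import torch
import torch.nn as nn

WINDOW, FEATS = 50, 3  # e.g. 50 timesteps of (speed, gap, acceleration)

class Generator(nn.Module):
    def __init__(self, latent=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent, 128), nn.ReLU(),
            nn.Linear(128, WINDOW * FEATS),
        )
    def forward(self, z):
        return self.net(z).view(-1, WINDOW, FEATS)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(WINDOW * FEATS, 128),
            nn.LeakyReLU(0.2), nn.Linear(128, 1),
        )
    def forward(self, x):
        return self.net(x)  # higher logit = more "normal-looking"

def anomaly_score(disc, window):
    """Negated discriminator logit: large values suggest an attack."""
    with torch.no_grad():
        return -disc(window).squeeze(-1)

disc = Discriminator()
print(anomaly_score(disc, torch.randn(8, WINDOW, FEATS)))  # untrained demo
```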
Infinite-dimensional, holomorphic functions have been studied in detail over the last several decades, due to their relevance to parametric differential equations and computational uncertainty quantification. The approximation of such functions from finitely many samples is of particular interest, due to the practical importance of constructing surrogate models for complex mathematical models of physical processes. In a previous work [5], we studied the approximation of so-called Banach-valued, $(\boldsymbol{b},\varepsilon)$-holomorphic functions on the infinite-dimensional hypercube $[-1,1]^{\mathbb{N}}$ from $m$ (potentially adaptive) samples. In particular, we derived lower bounds for the adaptive $m$-widths for classes of such functions, which showed that certain algebraic rates of the form $m^{1/2-1/p}$ are the best possible regardless of the sampling-recovery pair. In this work, we continue this investigation by focusing on the practical case where the samples are pointwise evaluations drawn independently and identically from a probability measure. Specifically, for Hilbert-valued $(\boldsymbol{b},\varepsilon)$-holomorphic functions, we show that the same rates can be achieved (up to a small polylogarithmic or algebraic factor) for essentially arbitrary tensor-product Jacobi (ultraspherical) measures. Our reconstruction maps are based on least-squares and compressed sensing procedures using the corresponding orthonormal Jacobi polynomials. In doing so, we strengthen and generalize past work that derived weaker nonuniform guarantees for the uniform and Chebyshev measures (and corresponding polynomials) only. We also extend various best $s$-term polynomial approximation error bounds to arbitrary Jacobi polynomial expansions. Overall, we demonstrate that i.i.d.\ pointwise samples are near-optimal for the recovery of infinite-dimensional, holomorphic functions.
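As a one-dimensional illustration of the least-squares procedure (the paper works in infinite dimensions with tensor-product Jacobi polynomials; the sketch below uses Legendre polynomials, i.e. the ultraspherical case with the uniform measure, and an illustrative target function):

```python
# 1-D least-squares polynomial approximation from i.i.d. pointwise samples.
import numpy as np
from numpy.polynomial.legendre import legvander, legval

rng = np.random.default_rng(0)
f = lambda x: 1.0 / (1.0 + 0.5 * x)   # illustrative holomorphic target on [-1, 1]

m, n = 200, 20                         # m i.i.d. samples, n coefficients
x = rng.uniform(-1.0, 1.0, size=m)     # samples drawn i.i.d. from the uniform measure
A = legvander(x, n - 1)                # design matrix of Legendre polynomial values
c, *_ = np.linalg.lstsq(A, f(x), rcond=None)

xt = np.linspace(-1, 1, 1000)
print("max error:", np.abs(legval(xt, c) - f(xt)).max())
```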
Despite the continuous development of operational ensemble prediction systems over the past decades, ensemble forecasts may still lack calibration and/or display systematic bias, and thus require post-processing to improve their forecast skill. Here we focus on visibility, a quantity that plays a crucial role in, e.g., aviation, road safety, and ship navigation, and propose a parametric model where the predictive distribution is a mixture of a gamma and a truncated normal distribution, both right censored at the maximal reported visibility value. The new model is evaluated in two case studies based on visibility ensemble forecasts of the European Centre for Medium-Range Weather Forecasts covering two distinct domains in Central and Western Europe and two different time periods. The results of the case studies indicate that climatology is substantially superior to the raw ensemble; nevertheless, the forecast skill can be further improved by post-processing, at least for short lead times. Moreover, the proposed mixture model consistently outperforms the Bayesian model averaging approach used as a reference post-processing technique.
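A sketch of what such a censored mixture predictive law looks like; the notation is illustrative and not taken from the paper, with the truncation of the normal component to the nonnegative half-line assumed:

```latex
% Illustrative form of a gamma / truncated-normal mixture, right-censored
% at the maximal reported visibility y_max; the censored mass sits at y_max.
\begin{equation}
  f(y) = \omega\, g(y;\kappa,\theta)
       + (1-\omega)\, \mathcal{N}_0^{\infty}(y;\mu,\sigma^2),
  \qquad 0 \le y < y_{\max},
\end{equation}
\begin{equation}
  \mathbb{P}(Y = y_{\max}) = 1 - F(y_{\max}^{-}),
\end{equation}
% where g is a gamma density, N_0^inf a normal density truncated to
% [0, infinity), omega in [0,1] a mixture weight, and F the mixture CDF.
```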
3D scene graphs are an emerging 3D scene representation that models both the objects present in the scene and their relationships. However, learning 3D scene graphs is a challenging task because it requires not only object labels but also relationship annotations, which are very scarce in datasets. While it is widely accepted that pre-training is an effective approach to improve model performance in low-data regimes, in this paper we find that existing pre-training methods are ill-suited for 3D scene graphs. To solve this issue, we present the first language-based pre-training approach for 3D scene graphs, whereby we exploit the strong relationship between scene graphs and language. To this end, we leverage the language encoder of CLIP, a popular vision-language model, to distill its knowledge into our graph-based network. We formulate a contrastive pre-training objective, which aligns text embeddings of relationships (subject-predicate-object triplets) and predicted 3D graph features. Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark, showing improved effectiveness over pre-training baselines and outperforming all existing fully supervised scene graph prediction methods by a significant margin. Furthermore, since our scene graph features are language-aligned, we can query the language space of the features in a zero-shot manner. In this paper, we show an example of utilizing this property of the features to predict the room type of a scene without further training.
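A minimal sketch of the contrastive alignment idea (the loss form below is the symmetric InfoNCE objective popularized by CLIP; how the paper batches and encodes triplets is assumed): CLIP text embeddings of "subject predicate object" strings are aligned with the predicted 3D graph features of the matching relationships.

```python
# Symmetric contrastive (InfoNCE) alignment of graph and text features.
import torch
import torch.nn.functional as F

def contrastive_loss(graph_feats, text_feats, temperature=0.07):
    """graph_feats, text_feats: [batch, dim]; row i of each
    describes the same subject-predicate-object triplet."""
    g = F.normalize(graph_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = g @ t.T / temperature                 # pairwise cosine similarities
    targets = torch.arange(len(g), device=g.device)
    # cross-entropy over rows (graph -> text) and columns (text -> graph)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```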
This work introduces deep learning methodologies with the objective of creating a reliable intrusion detection mechanism that identifies malicious attacks. A deep learning based solution framework consisting of three approaches is developed. The first approach is a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) evaluated with seven optimizers: Adamax, SGD, Adagrad, Adam, RMSprop, Nadam, and Adadelta. This model is evaluated on the NSL-KDD dataset for multi-class attack classification and performs best with the Adamax optimizer in terms of accuracy, detection rate, and false alarm rate. The results of the LSTM-RNN with the Adamax optimizer are compared with those of existing shallow machine learning and deep learning models on the same metrics. A further multi-model methodology consists of a Recurrent Neural Network (RNN), an LSTM-RNN, and a Deep Neural Network (DNN), evaluated on the benchmark KDD99, NSL-KDD, and UNSW-NB15 datasets. These models learn the features themselves and classify the attack classes in a multi-attack classification setting. The RNN and LSTM-RNN models achieve considerable performance compared to other existing methods on the KDD99 and NSL-KDD datasets.
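An illustrative sketch of the best-performing configuration (hyperparameters such as layer width and learning rate are assumptions, not values from the study): an LSTM classifier over NSL-KDD-style records, compiled with the Adamax optimizer.

```python
# Sketch of an LSTM intrusion-detection classifier with Adamax
# (layer sizes and learning rate are illustrative assumptions).
import tensorflow as tf

NUM_FEATURES, NUM_CLASSES = 41, 5  # NSL-KDD features / attack categories

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1, NUM_FEATURES)),  # each record as a length-1 sequence
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```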
We introduce a Bayesian conditional autoregressive model for analyzing patient-specific and neighborhood risks of stillbirth and preterm birth within a city. Our fully Bayesian approach automatically learns the amount of spatial heterogeneity and spatial dependence between neighborhoods. Our model provides meaningful inferences and uncertainty quantification for both covariate effects and neighborhood risk probabilities through their posterior distributions. We apply our methodology to data from the city of Philadelphia. Using electronic health records (45,919 deliveries at hospitals within the University of Pennsylvania Health System) and United States Census Bureau data from 363 census tracts in Philadelphia, we find that both patient-level characteristics (e.g. self-identified race/ethnicity) and neighborhood-level characteristics (e.g. violent crime) are highly associated with patients' odds of stillbirth or preterm birth. Our neighborhood risk analysis further reveals that census tracts in West Philadelphia and North Philadelphia are at highest risk of these outcomes. Specifically, neighborhoods with higher rates of women in poverty or on public assistance have greater neighborhood risk for these outcomes, while neighborhoods with higher rates of college-educated women or women in the labor force have lower risk. Our findings could be useful for targeted individual and neighborhood interventions.
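A sketch of a typical Bayesian conditional autoregressive (CAR) specification of this kind; the notation is illustrative and not taken from the paper. Patient i in census tract j has outcome log-odds combining individual covariates with a spatially structured tract-level random effect:

```latex
% Illustrative proper-CAR specification for patient i in tract j.
\begin{align}
  \mathrm{logit}\, \mathbb{P}(y_{ij}=1)
    &= \mathbf{x}_{ij}^\top \boldsymbol\beta + \phi_j, \\
  \phi_j \mid \boldsymbol\phi_{-j}
    &\sim \mathcal{N}\!\left(\rho \,\frac{\sum_{k} w_{jk}\phi_k}{\sum_{k} w_{jk}},\;
      \frac{\tau^2}{\sum_{k} w_{jk}}\right),
\end{align}
% with w_{jk}=1 for adjacent census tracts and 0 otherwise; rho controls
% spatial dependence and tau^2 spatial heterogeneity, both learned from data.
```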
In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that such models benefit downstream uni-modal tasks and avoid the need to train a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey on VLP. We hope that this survey can shed light on future research in the VLP field.
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models.
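A minimal sketch of one such residual block (simplified: LayerNorm stands in for the paper's Affine transforms, and LayerScale is omitted): a cross-patch linear layer in which patches interact, followed by a per-patch two-layer feed-forward network in which channels interact, each with a residual connection.

```python
# Simplified ResMLP-style block (normalization simplified to LayerNorm).
import torch
import torch.nn as nn

class ResMLPBlock(nn.Module):
    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.patch_mix = nn.Linear(num_patches, num_patches)  # patches interact
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(                     # channels interact per patch
            nn.Linear(dim, expansion * dim), nn.GELU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):  # x: [batch, patches, dim]
        # mix across patches: apply the linear layer along the patch axis
        y = self.patch_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + y
        return x + self.channel_mlp(self.norm2(x))

block = ResMLPBlock(num_patches=196, dim=384)   # e.g. 14x14 patches
out = block(torch.randn(2, 196, 384))
```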