This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at //github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
Arbitrary Pattern Formation (APF) is a fundamental coordination problem in swarm robotics. It requires a set of autonomous robots (mobile computing units) to form any arbitrary pattern (given as input) starting from any initial pattern. The APF problem is well-studied in both continuous and discrete settings. This work concerns the discrete version of the problem. A set of robots is placed on the nodes of an infinite rectangular grid graph embedded in a Euclidean plane. Each robot can move only to one of the four grid nodes neighboring its current position. The robots are autonomous, anonymous, identical, and homogeneous, and operate in Look-Compute-Move cycles. Here we consider the classical $\mathcal{OBLOT}$ robot model, i.e., the robots have no persistent memory and no explicit means of communication. The robots have full unobstructed visibility. This work proposes an algorithm that solves the APF problem under a fully asynchronous scheduler in this setting, assuming the initial configuration is asymmetric. The performance measures considered are the space and the number of moves required by the robots. The algorithm is asymptotically move-optimal. A definition of space complexity is presented here. We observe an obvious lower bound $\mathcal{D}$ on the space complexity and show that the proposed algorithm has space complexity $\mathcal{D}+4$. Comparing with previous related works, we show that this is the first proposed algorithm for the $\mathcal{OBLOT}$ robot model that is asymptotically move-optimal and has the least space complexity, which is almost optimal.
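The movement model described above can be illustrated with a minimal sketch. This is not the proposed APF algorithm: it only encodes the four-neighbor restriction on a grid and the resulting fact that a single robot, considered in isolation and ignoring collisions with other robots, needs at least as many moves as the Manhattan distance to its target. The names `Node`, `neighbors`, and `min_moves` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# A grid node is identified by integer coordinates (assumed representation).
Node = Tuple[int, int]


def neighbors(node: Node) -> List[Node]:
    """Return the four grid nodes a robot may move to from `node`."""
    x, y = node
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]


def min_moves(start: Node, target: Node) -> int:
    """Lower bound on moves for one robot: the Manhattan distance.

    Every move changes exactly one coordinate by exactly 1, so no path
    between the two nodes can be shorter than this.
    """
    return abs(start[0] - target[0]) + abs(start[1] - target[1])


if __name__ == "__main__":
    print(neighbors((0, 0)))
    print(min_moves((0, 0), (2, -3)))
```

Summing such per-robot distances over a matching between initial and target positions gives one natural way to reason about move lower bounds, although the abstract does not state how the authors derive theirs.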
This paper proposes a framework to address the issue of data scarcity in Document-Grounded Dialogue Systems (DGDS). Our model leverages high-resource languages to enhance the capability of dialogue generation in low-resource languages. Specifically, we present a novel pipeline, CLEM (Cross-Lingual Enhanced Model), which includes adversarially trained retrieval (Retriever and Re-ranker) and a FiD (Fusion-in-Decoder) generator. To further leverage high-resource languages, we also propose an innovative architecture that performs alignment across different languages using translated training data. Extensive experimental results demonstrate the effectiveness of our model, which achieved 4th place in the DialDoc 2023 Competition. CLEM can therefore serve as a solution to resource scarcity in DGDS and provide useful guidance for multi-lingual alignment tasks.
Automated vehicle acceptance (AVA) has been measured mostly subjectively by questionnaires and interviews, with a main focus on drivers inside automated vehicles (AVs). For AVs to be widely accepted by the public, acceptance by both drivers and passengers is key. The in-vehicle experience of passengers will determine the extent to which they accept AVs. A comprehensive understanding of potential assessment methods to measure the passenger experience in AVs is needed to improve the in-vehicle experience of passengers and thereby their acceptance. The present work provides an overview of assessment methods that have been used to measure a driver's behavior and cognitive and emotional states during (automated) driving. The review shows that these assessment methods can be classified by type of data-collection method (e.g., questionnaires, interviews, direct input devices, sensors), object of measurement (i.e., perception, behavior, state), time of measurement, and degree of objectivity of the data collected. A conceptual model synthesizes the results of the literature review, formulating relationships between the factors constituting the in-vehicle experience and AVA. It is theorized that the in-vehicle experience influences the intention to use, with intention to use serving as a predictor of actual use. The model also formulates relationships between actual use and well-being. A combined approach using both subjective and objective assessment methods is needed to provide more accurate estimates of AVA and to advance the uptake and use of AVs.
For drones, as safety-critical systems, there is an increasing need for onboard detect & avoid (DAA) technology i) to see, sense or detect conflicting traffic or imminent non-cooperative threats due to their high mobility with multiple degrees of freedom and the complexity of deployed unstructured environments, and subsequently ii) to take the appropriate actions to avoid collisions depending upon the level of autonomy. The safe and efficient integration of UAV traffic management (UTM) systems with air traffic management (ATM) systems, using intelligent autonomous approaches, is an emerging requirement where the number of diverse UAV applications is increasing on a large scale in dense air traffic environments for completing swarms of multiple complex missions flexibly and simultaneously. Significant progress over the past few years has been made in detecting UAVs present in aerospace, identifying them, and determining their existing flight path. This study makes greater use of electronic conspicuity (EC) information made available by PilotAware Ltd in developing an advanced collision management methodology -- Drone Aware Collision Management (DACM) -- capable of determining and executing a variety of time-optimal evasive collision avoidance (CA) manoeuvres using a reactive geometric conflict detection and resolution (CDR) technique. The merits of the DACM methodology have been demonstrated through extensive simulations and real-world field tests in avoiding mid-air collisions (MAC) between UAVs and manned aeroplanes. The results show that the proposed methodology can be employed successfully in avoiding collisions while limiting the deviation from the original trajectory in highly dynamic aerospace without requiring sophisticated sensors and prior training.
In the quantum machine field, detecting two-dimensional (2D) materials in silicon chips is one of the most critical problems. Instance segmentation can be considered a potential approach to solving this problem. However, like other deep learning methods, instance segmentation requires a large-scale training dataset and high-quality annotation in order to achieve considerable performance. In practice, preparing the training dataset is a challenge, since annotators have to deal with large images (e.g., 2K resolution) and extremely dense objects. In this work, we present a novel method to tackle the problem of missing annotations in instance segmentation for 2D quantum material identification. We propose a new mechanism for automatically detecting false negative objects and an attention-based loss strategy that reduces the negative impact of these objects on the overall loss function. We experiment on 2D material detection datasets, and the experiments show that our method outperforms previous works.
In recent years, the employment of deep learning methods has led to several significant breakthroughs in artificial intelligence. Different from traditional machine learning models, deep learning-based approaches are able to extract features autonomously from raw data. This allows for bypassing the feature engineering process, which is generally considered to be both error-prone and tedious. Moreover, deep learning strategies often outperform traditional models in terms of accuracy.
Recently, Neural Topic Models (NTM), inspired by variational autoencoders, have attracted a lot of research interest; however, these methods have limited applications in the real world due to the challenge of incorporating human knowledge. This work presents a semi-supervised neural topic modeling method, vONTSS, which uses von Mises-Fisher (vMF) based variational autoencoders and optimal transport. When a few keywords per topic are provided, vONTSS in the semi-supervised setting generates potential topics and optimizes topic-keyword quality and topic classification. Experiments show that vONTSS outperforms existing semi-supervised topic modeling methods in classification accuracy and diversity. vONTSS also supports unsupervised topic modeling. Quantitative and qualitative experiments show that vONTSS in the unsupervised setting outperforms recent NTMs on multiple aspects: vONTSS discovers highly clustered and coherent topics on benchmark datasets. It is also much faster than the state-of-the-art weakly supervised text classification method while achieving similar classification performance. We further prove the equivalence of optimal transport loss and cross-entropy loss at the global minimum.
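Since the abstract relies on optimal transport without stating how it is computed, the following is a generic sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. It is an assumption chosen for illustration, not the vONTSS implementation: the function name `sinkhorn_plan`, the regularization strength `reg`, and the toy cost matrix (rows as documents, columns as topics) are all hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, reg=0.5, n_iter=500):
    """Entropy-regularized transport plan between marginals `a` and `b`.

    cost : (n, m) array of transport costs
    a    : (n,) source marginal, non-negative, sums to 1
    b    : (m,) target marginal, non-negative, sums to 1
    """
    cost = np.asarray(cost, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    kernel = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)            # rescale to match column marginal
        u = a / (kernel @ v)              # rescale to match row marginal
    return u[:, None] * kernel * v[None, :]


if __name__ == "__main__":
    # Toy example: 3 documents, 2 topics; lower cost means a better match.
    cost = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
    plan = sinkhorn_plan(cost, a=[1 / 3, 1 / 3, 1 / 3], b=[0.5, 0.5])
    print(plan)
```

In this toy setting the plan places most of each document's mass on its cheaper topic while keeping the topic totals balanced, which is the kind of soft assignment that optimal transport provides.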
The investigation of mixture models is key to understanding and visualizing the distribution of multivariate data. Most mixture model approaches are based on likelihoods and are not adapted to distributions with finite support or without a well-defined density function. This study proposes the Augmented Quantization method, a reformulation of the classical quantization problem that uses the p-Wasserstein distance. This metric can be computed in very general distribution spaces, in particular with varying supports. The clustering interpretation of quantization is revisited in a more general framework. The performance of Augmented Quantization is first demonstrated through analytical toy problems. Subsequently, it is applied to a practical case study involving river flooding, wherein mixtures of Dirac and Uniform distributions are built in the input space, enabling the identification of the most influential variables.
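To make concrete why the p-Wasserstein distance remains usable when a likelihood does not, the sketch below computes it between two one-dimensional empirical samples of equal size, where it reduces to comparing sorted values. This is only an illustration of the metric, not the Augmented Quantization method; the function name `wasserstein_1d` and the equal-size restriction are assumptions made for brevity.

```python
import numpy as np


def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1D samples of equal size.

    In one dimension the optimal coupling pairs the i-th smallest value
    of `x` with the i-th smallest value of `y`.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


if __name__ == "__main__":
    n = 1000
    dirac = np.full(n, 0.5)                  # Dirac mass at 0.5 (no density)
    uniform = (np.arange(n) + 0.5) / n       # regular sample of Uniform[0, 1]
    print(wasserstein_1d(dirac, uniform, p=2))
```

The example compares a Dirac mass, which has no density, with a uniform distribution; the result approaches the exact 2-Wasserstein distance $\sqrt{1/12}$ as the sample size grows.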
Retrieval augmentation, which enhances downstream models with a knowledge retriever and an external corpus rather than merely increasing the number of model parameters, has been successfully applied to many natural language processing (NLP) tasks such as text classification and question answering. However, existing methods train the retriever and the downstream model separately or asynchronously, mainly because of the non-differentiability between the two parts, which usually leads to degraded performance compared to end-to-end joint training. In this paper, we propose Differentiable Retrieval Augmentation via Generative lANguage modeling (Dragan) to address this problem through a novel differentiable reformulation. We demonstrate the effectiveness of our proposed method on a challenging NLP task in e-commerce search, namely query intent classification. Both the experimental results and the ablation study show that the proposed method significantly and reasonably improves on the state-of-the-art baselines in both offline evaluation and online A/B tests.
In the past few years, the emergence of pre-trained models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. A substantial body of work has shown that they benefit downstream uni-modal tasks and avoid the need to train a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey on VLP. We hope that this survey can shed light on future research in the VLP field.