Autonomous robots and vehicles are expected to soon become an integral part of our environment. Unresolved issues regarding interaction with existing road users, performance in mixed-traffic areas, and a lack of interpretable behavior remain key obstacles. To address these, we present a physics-based neural network, a hybrid approach combining a social force model extended by group force (SFMG) with a Multi-Layer Perceptron (MLP), to predict pedestrian trajectories considering their interactions with static obstacles, other pedestrians, and pedestrian groups. We quantitatively and qualitatively evaluate the model with respect to realistic prediction, prediction performance, and prediction "interpretability". Initial results suggest that the model, even when trained solely on a synthetic dataset, can predict realistic and interpretable trajectories with better than state-of-the-art accuracy.
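The physics side of such a hybrid can be illustrated with a minimal sketch of a generic social force step. This is an assumption-laden illustration, not the SFMG formulation: the group-force term and the MLP are omitted, and the function names and the parameters `desired_speed`, `tau`, `A`, and `B` are hypothetical placeholders.

```python
import math

def social_force(pos, vel, goal, others, desired_speed=1.3, tau=0.5, A=2.0, B=0.3):
    """Return (fx, fy): a goal-driving force plus exponential repulsion from others.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(gx, gy) or 1e-9
    # Driving force: relax the current velocity toward the desired velocity.
    fx = (desired_speed * gx / dist - vel[0]) / tau
    fy = (desired_speed * gy / dist - vel[1]) / tau
    for ox, oy in others:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy) or 1e-9
        # Repulsion decays exponentially with distance and points away from the other pedestrian.
        mag = A * math.exp(-d / B)
        fx += mag * dx / d
        fy += mag * dy / d
    return fx, fy

def step(pos, vel, goal, others, dt=0.1):
    """Advance one pedestrian by one explicit Euler step (unit mass assumed)."""
    fx, fy = social_force(pos, vel, goal, others)
    vel = (vel[0] + fx * dt, vel[1] + fy * dt)
    pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
    return pos, vel
```

In a hybrid model of this kind, a learned component would typically supply or correct quantities such as the force parameters, while the integration step keeps the prediction physically interpretable.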
Per-instance algorithm selection seeks to recommend, for a given problem instance and a given performance criterion, one or several suitable algorithms that are expected to perform well for the particular setting. The selection is classically done offline, using openly available information about the problem instance or features that are extracted from the instance during a dedicated feature extraction step. This ignores valuable information that the algorithms accumulate during the optimization process. In this work, we propose an alternative, online algorithm selection scheme which we coin per-run algorithm selection. In our approach, we start the optimization with a default algorithm, and, after a certain number of iterations, extract instance features from the observed trajectory of this initial optimizer to determine whether to switch to another optimizer. We test this approach using the CMA-ES as the default solver, and a portfolio of six different optimizers as potential algorithms to switch to. In contrast to other recent work on online per-run algorithm selection, we warm-start the second optimizer using information accumulated during the first optimization phase. We show that our approach outperforms static per-instance algorithm selection. We also compare two different feature extraction principles, based on exploratory landscape analysis and time series analysis of the internal state variables of the CMA-ES, respectively. We show that a combination of both feature sets provides the most accurate recommendations for our test cases, taken from the BBOB function suite from the COCO platform and the YABBOB suite from the Nevergrad platform.
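The two-phase scheme described above can be sketched as a toy example. Everything here is a stand-in chosen for brevity: a fixed-step (1+1)-ES replaces the CMA-ES, a single hypothetical trajectory feature replaces the exploratory landscape and time series features, and a hand-written threshold rule replaces a trained selector.

```python
import random

def sphere(x):
    """Simple test objective (assumption: minimization)."""
    return sum(v * v for v in x)

def one_plus_one_es(f, x0, sigma, iters, rng):
    """Elitist (1+1)-ES with a fixed step size; returns the best point and best-so-far history."""
    best, fbest = list(x0), f(x0)
    history = [fbest]
    for _ in range(iters):
        cand = [v + rng.gauss(0.0, sigma) for v in best]
        fc = f(cand)
        if fc <= fbest:
            best, fbest = cand, fc
        history.append(fbest)
    return best, history

def trajectory_features(history):
    """Hypothetical feature: relative improvement achieved during the first phase."""
    start, end = history[0], history[-1]
    return {"rel_improvement": (start - end) / (abs(start) + 1e-12)}

def select_second_phase(features):
    """Hand-written rule standing in for a trained selector over a portfolio."""
    return {"sigma": 0.05} if features["rel_improvement"] > 0.5 else {"sigma": 1.0}

def per_run_selection(f, x0, budget1=100, budget2=100, seed=0):
    rng = random.Random(seed)
    # Phase 1: run the default optimizer and record its trajectory.
    x1, hist1 = one_plus_one_es(f, x0, sigma=0.5, iters=budget1, rng=rng)
    # Decide what to run next from features of the observed trajectory.
    config = select_second_phase(trajectory_features(hist1))
    # Phase 2: warm-start the selected optimizer from the best point found so far.
    _, hist2 = one_plus_one_es(f, x1, sigma=config["sigma"], iters=budget2, rng=rng)
    return hist1[-1], hist2[-1]
```

Because both phases are elitist and the second phase starts from the first phase's best point, the final value can never be worse than the value at the switch.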
We propose the molecular omics network (MOOMIN), a multimodal graph neural network used by AstraZeneca oncologists to predict the synergy of drug combinations for cancer treatment. Our model learns drug representations at multiple scales based on a drug-protein interaction network and metadata. Structural properties of compounds and proteins are encoded to create vertex features for a message-passing scheme that operates on the bipartite interaction graph. Propagated messages form multi-resolution drug representations, which we utilize to create drug pair descriptors. By conditioning the drug combination representations on the cancer cell type, we define a synergy scoring function that can inductively score unseen pairs of drugs. Experimental results on the synergy scoring task demonstrate that MOOMIN outperforms state-of-the-art graph fingerprinting, proximity preserving node embedding, and existing deep learning approaches. Further results establish that the predictive performance of our model is robust to hyperparameter changes. We demonstrate that the model makes high-quality predictions over a wide range of cancer cell line tissues, that out-of-sample predictions can be validated with external synergy databases, and that the proposed model is data efficient at learning.
Demystifying the interactions among multiple agents from their past trajectories is fundamental to precise and interpretable trajectory prediction. However, previous works only consider pair-wise interactions with limited relational reasoning. To promote more comprehensive interaction modeling for relational reasoning, we propose GroupNet, a multiscale hypergraph neural network, which is novel in terms of both interaction capturing and representation learning. From the aspect of interaction capturing, we propose a trainable multiscale hypergraph to capture both pair-wise and group-wise interactions at multiple group sizes. From the aspect of interaction representation learning, we propose a three-element format that can be learnt end-to-end and explicitly reason about some relational factors, including the interaction strength and category. We apply GroupNet to both a CVAE-based prediction system and previous state-of-the-art prediction systems for predicting socially plausible trajectories with relational reasoning. To validate the ability of relational reasoning, we experiment with synthetic physics simulations to reflect the ability to capture group behaviors and to reason about interaction strength and interaction category. To validate the effectiveness of prediction, we conduct extensive experiments on three real-world trajectory prediction datasets, namely NBA, SDD, and ETH-UCY, and we show that with GroupNet, the CVAE-based prediction system outperforms state-of-the-art methods. We also show that adding GroupNet further improves the performance of previous state-of-the-art prediction systems.
We present a method to simulate movement in interaction with computers, using Model Predictive Control (MPC). The method starts from understanding interaction from an Optimal Feedback Control (OFC) perspective. We assume that users aim to minimize an internalized cost function, subject to the constraints imposed by the human body and the interactive system. In contrast to previous linear approaches used in HCI, MPC can compute optimal controls for nonlinear systems. This allows us to use state-of-the-art biomechanical models and handle nonlinearities that occur in almost any interactive system. Instead of torque actuation, our model employs second-order muscles acting directly at the joints. We compare three different cost functions and evaluate the simulated trajectories against user movements in a Fitts' Law type pointing study with four different interaction techniques. Our results show that the combination of distance, control, and joint acceleration cost matches individual users' movements best, and predicts movements with an accuracy that is within the between-user variance. To aid HCI researchers and designers, we introduce CFAT, a novel method to identify maximum voluntary torques in joint-actuated models based on experimental data, and give practical advice on how to simulate human movement for different users, interaction techniques, and tasks.
The fingerprint is an important biological feature of the human body and contains abundant gender information. At present, academic research on fingerprint gender characteristics remains largely at the level of understanding, while standardization research is quite limited. In this work, we propose a more robust method, Dense Dilated Convolution ResNet (DDC-ResNet), to extract valid gender information from fingerprints. By replacing the normal convolution operations with atrous convolution in the backbone, prior knowledge is provided to keep the edge details, and the global receptive field can be extended. We examine the results in three ways: 1) The efficiency of the DDC-ResNet. Six typical methods of automatic feature extraction coupled with nine mainstream classifiers are evaluated on our dataset with fair implementation details. Experimental results demonstrate that the combination built on our approach outperforms other combinations in terms of average accuracy and separate-gender accuracy. It reaches 96.5% average accuracy and 0.9752 (males)/0.9548 (females) separate-gender accuracy. 2) The effect of fingers. The best performance when classifying gender with separate fingers is achieved by the right ring finger. 3) The effect of specific features. Based on observations of the fingerprint regions on which our approach concentrates, as shown by visualization, it can be inferred that loops and whorls (level 1), bifurcations (level 2), and line shapes (level 3) are connected with gender. Finally, we will open source the dataset, which contains 6000 fingerprint images.
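The effect of atrous (dilated) convolution on the receptive field can be made concrete with a short calculation. The sketch below assumes a plain stack of stride-1 convolutions with a shared kernel size; the dilation rates shown are illustrative assumptions and are not taken from DDC-ResNet.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Three ordinary 3x3 layers versus three dilated 3x3 layers (rates are assumptions).
plain = receptive_field(3, [1, 1, 1])    # 7
dilated = receptive_field(3, [1, 2, 4])  # 15
```

With the same number of layers and parameters, the dilated stack covers roughly twice the span, which is the sense in which dilation extends the receptive field without downsampling away edge detail.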
Precisely modeling interactions and accurately predicting trajectories of surrounding vehicles are essential to the decision-making and path-planning of intelligent vehicles. This paper proposes a novel framework based on ensemble learning to improve the performance of trajectory predictions in interactive scenarios. The framework is termed Interactive Ensemble Trajectory Predictor (IETP). IETP assembles interaction-aware trajectory predictors as base learners to build an ensemble learner. Firstly, each base learner in IETP observes historical trajectories of vehicles in the scene. Then each base learner handles interactions between vehicles to predict trajectories. Finally, an ensemble learner is built to predict trajectories by applying two ensemble strategies on the predictions from all base learners. Predictions generated by the ensemble learner are final outputs of IETP. In this study, three experiments using different data are conducted based on the NGSIM dataset. Experimental results show that IETP improves the predicting accuracy and decreases the variance of errors compared to base learners. In addition, IETP exceeds baseline models with 50% of the training data, indicating that IETP is data-efficient. Moreover, the implementation of IETP is publicly available at //github.com/BIT-Jack/IETP.
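The abstract does not name the two ensemble strategies, so the following is only a sketch of one common choice, weighted averaging of the base learners' predicted trajectories. The function name, the trajectory layout as lists of `(x, y)` points, and the weights are all assumptions for illustration.

```python
def ensemble_average(predictions, weights=None):
    """Combine base-learner trajectories by a weighted mean at each time step.

    predictions: list of trajectories, each a list of (x, y) points of equal length.
    weights: optional per-learner weights; uniform if omitted.
    """
    n = len(predictions)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    horizon = len(predictions[0])
    combined = []
    for t in range(horizon):
        x = sum(w * p[t][0] for w, p in zip(weights, predictions)) / total
        y = sum(w * p[t][1] for w, p in zip(weights, predictions)) / total
        combined.append((x, y))
    return combined
```

Averaging independent predictors in this way tends to reduce the variance of the error, which is consistent with the variance reduction the abstract reports for the ensemble relative to its base learners.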
This paper presents a hybrid robot motion planner that generates long-horizon motion plans for robot navigation in environments with obstacles. We propose a hybrid planner, RRT* with segmented trajectory optimization (RRT*-sOpt), which combines the merits of sampling-based planning, optimization-based planning, and trajectory splitting to quickly produce a collision-free and dynamically feasible motion plan. When generating a plan, the RRT* layer quickly samples a semi-optimal path and sets it as an initial reference path. Then, the sOpt layer splits the reference path and performs optimization on each segment. It then splits the new trajectory again and repeats the process until the whole trajectory converges. We also propose to reduce the number of segments before convergence with the aim of further reducing computation time. Simulation results show that RRT*-sOpt benefits from the hybrid structure with trajectory splitting and performs robustly across various robot platforms and scenarios.
Bird's-eye-view (BEV) semantic segmentation is critical for autonomous driving because of its powerful spatial representation ability. It is challenging to estimate BEV semantic maps from monocular images due to the spatial gap, since it is implicitly required to realize both the perspective-to-BEV transformation and segmentation. We present a novel two-stage Geometry Prior-based Transformation framework named GitNet, consisting of (i) geometry-guided pre-alignment and (ii) a ray-based transformer. In the first stage, we decouple the BEV segmentation into perspective image segmentation and geometric prior-based mapping, with explicit supervision by projecting the BEV semantic labels onto the image plane to learn visibility-aware features and learnable geometry to translate into BEV space. In the second stage, the pre-aligned coarse BEV features are further deformed by ray-based transformers to take visibility knowledge into account. GitNet achieves leading performance on the challenging nuScenes and Argoverse datasets. The code will be publicly available.
Multi-object tracking (MOT) is a crucial component of situational awareness in military defense applications. With the growing use of unmanned aerial systems (UASs), MOT methods for aerial surveillance are in high demand. Application of MOT in UAS presents specific challenges such as a moving sensor, changing zoom levels, dynamic background, illumination changes, obscurations, and small objects. In this work, we present a robust object tracking architecture designed to accommodate the noise in real-time situations. We propose a kinematic prediction model, called Deep Extended Kalman Filter (DeepEKF), in which a sequence-to-sequence architecture is used to predict entity trajectories in latent space. DeepEKF utilizes a learned image embedding along with an attention mechanism trained to weight the importance of areas in an image to predict future states. For the visual scoring, we experiment with different similarity measures to calculate distance based on entity appearances, including a convolutional neural network (CNN) encoder pre-trained using Siamese networks. In initial evaluation experiments, we show that our method, which combines the scoring structure of the kinematic and visual models within an MHT framework, has improved performance, especially in edge cases where entity motion is unpredictable or the data presents frames with significant gaps.
In structure learning, the output is generally a structure that is used as supervision information to achieve good performance. Considering that the interpretation of deep learning models has attracted extensive attention in recent years, it would be beneficial to learn an interpretable structure from deep learning models. In this paper, we focus on Recurrent Neural Networks (RNNs), whose inner mechanism is still not clearly understood. We find that a Finite State Automaton (FSA), which processes sequential data, has a more interpretable inner mechanism and can be learned from RNNs as the interpretable structure. We propose two methods to learn an FSA from an RNN based on two different clustering methods. We first give a graphical illustration of the FSA for human beings to follow, which shows the interpretability. From the FSA's point of view, we then analyze how the performance of RNNs is affected by the number of gates, as well as the semantic meaning behind the transitions of numerical hidden states. Our results suggest that RNNs with a simple gated structure, such as the Minimal Gated Unit (MGU), are more desirable, and that the transitions in the FSA leading to a specific classification result are associated with corresponding words that are understandable by human beings.
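The general idea of reading an automaton off recorded hidden states can be sketched as follows. This is a simplified illustration under stated assumptions: grid quantization stands in for the two clustering methods (which the abstract does not name), the function names are hypothetical, and each transition is resolved by majority vote.

```python
from collections import Counter, defaultdict

def quantize(h, step=0.5):
    """Stand-in for clustering: snap a hidden-state vector to a grid cell (the FSA state)."""
    return tuple(round(v / step) for v in h)

def extract_fsa(runs):
    """Build a transition table {(state, symbol): next_state} from recorded RNN runs.

    runs: list of (symbols, hidden_states), where hidden_states has one more
    entry than symbols (the initial state plus one state per consumed symbol).
    """
    counts = defaultdict(Counter)
    for symbols, hidden in runs:
        states = [quantize(h) for h in hidden]
        for sym, src, dst in zip(symbols, states, states[1:]):
            counts[(src, sym)][dst] += 1
    # Keep the most frequently observed target for each (state, symbol) pair.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Two hypothetical recorded runs over a one-dimensional hidden state.
runs = [("ab", [[0.0], [1.0], [0.0]]), ("a", [[0.1], [0.9]])]
fsa = extract_fsa(runs)
```

The resulting table can be drawn as a state diagram, and inspecting which input symbols trigger transitions toward an accepting state is what makes the extracted structure readable.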