Age and gender recognition in the wild is a highly challenging task: apart from the variability of conditions, pose complexities, and varying image quality, there are cases where the face is partially or completely occluded. We present MiVOLO (Multi Input VOLO), a straightforward approach for age and gender estimation using the latest vision transformer. Our method integrates both tasks into a unified dual input/output model, leveraging not only facial information but also person image data. This improves the generalization ability of our model and enables it to deliver satisfactory results even when the face is not visible in the image. To evaluate our proposed model, we conduct experiments on four popular benchmarks and achieve state-of-the-art performance, while demonstrating real-time processing capabilities. Additionally, we introduce a novel benchmark based on images from the Open Images Dataset. The ground truth annotations for this benchmark have been meticulously generated by human annotators, resulting in high accuracy answers due to the smart aggregation of votes. Furthermore, we compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges. Finally, we grant public access to our models, along with the code for validation and inference. In addition, we provide extra annotations for used datasets and introduce our new benchmark.
Lane detection is a critical function for autonomous driving. With the recent development of deep learning and the publication of camera lane datasets and benchmarks, camera lane detection networks (CLDNs) have been remarkably developed. Unfortunately, CLDNs rely on camera images which are often distorted near the vanishing line and prone to poor lighting condition. This is in contrast with Lidar lane detection networks (LLDNs), which can directly extract the lane lines on the bird's eye view (BEV) for motion planning and operate robustly under various lighting conditions. However, LLDNs have not been actively studied, mostly due to the absence of large public lidar lane datasets. In this paper, we introduce KAIST-Lane (K-Lane), the world's first and the largest public urban road and highway lane dataset for Lidar. K-Lane has more than 15K frames and contains annotations of up to six lanes under various road and traffic conditions, e.g., occluded roads of multiple occlusion levels, roads at day and night times, merging (converging and diverging) and curved lanes. We also provide baseline networks we term Lidar lane detection networks utilizing global feature correlator (LLDN-GFC). LLDN-GFC exploits the spatial characteristics of lane lines on the point cloud, which are sparse, thin, and stretched along the entire ground plane of the point cloud. From experimental results, LLDN-GFC achieves the state-of-the-art performance with an F1- score of 82.1%, on the K-Lane. Moreover, LLDN-GFC shows strong performance under various lighting conditions, which is unlike CLDNs, and also robust even in the case of severe occlusions, unlike LLDNs using the conventional CNN. The K-Lane, LLDN-GFC training code, pre-trained models, and complete development kits including evaluation, visualization and annotation tools are available at //github.com/kaist-avelab/k-lane.
Autonomous driving has become an important research area in recent years, and the corresponding system creates an enormous demand for computations. Heterogeneous computing platforms such as systems-on-chip that combine CPUs with reprogrammable hardware offer both computational performance and flexibility and are thus interesting targets for autonomous driving architectures. The de-facto software architecture standard in robotics, including autonomous driving systems, is ROS 2. ReconROS is a framework for creating robotics applications that extends ROS 2 with the possibility of mapping compute-intense functions to hardware. This paper presents AutonomROS, an autonomous driving unit based on the ReconROS framework. AutonomROS serves as a blueprint for a larger robotics application developed with ReconROS and demonstrates its suitability and extendability. The application integrates the ROS 2 package Navigation 2 with custom-developed software and hardware-accelerated functions for point cloud generation, obstacle detection, and lane detection. In addition, we detail a new communication middleware for shared memory communication between software and hardware functions. We evaluate AutonomROS and show the advantage of hardware acceleration and the new communication middleware for improving turnaround times, achievable frame rates, and, most importantly, reducing CPU load.
Physics-informed neural networks (PINNs) have gained prominence for their capability to tackle supervised learning tasks that conform to physical laws, notably nonlinear partial differential equations (PDEs). This paper presents "PINNs-TF2", a Python package built on the TensorFlow V2 framework. It not only accelerates PINNs implementation but also simplifies user interactions by abstracting complex PDE challenges. We underscore the pivotal role of compilers in PINNs, highlighting their ability to boost performance by up to 119x. Across eight diverse examples, our package, integrated with XLA compilers, demonstrated its flexibility and achieved an average speed-up of 18.12 times over TensorFlow V1. Moreover, a real-world case study is implemented to underscore the compilers' potential to handle many trainable parameters and large batch sizes. For community engagement and future enhancements, our package's source code is openly available at: //github.com/rezaakb/pinns-tf2.
Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework Two-Stage Generative Model (TSGM) that combines Cycle Generative Adversarial Network (CycleGAN) and Variance Exploding stochastic differential equation using joint probability (VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is trained on unpaired data to generate abnormal images from healthy images as data prior. Then VE-JP is implemented to reconstruct healthy images using synthetic paired abnormal images as a guide, which alters only pathological regions but not regions of healthy. Notably, our method directly learned the joint probability distribution for conditional generation. The residual between input and reconstructed images suggests the abnormalities and a thresholding method is subsequently applied to obtain segmentation results. Furthermore, the multimodal results are weighted with different weights to improve the segmentation accuracy further. We validated our method on three datasets, and compared with other unsupervised methods for anomaly detection and segmentation. The DSC score of 0.8590 in BraTs2020 dataset, 0.6226 in ITCS dataset and 0.7403 in In-house dataset show that our method achieves better segmentation performance and has better generalization.
Manipulating objects without grasping them is an essential component of human dexterity, referred to as non-prehensile manipulation. Non-prehensile manipulation may enable more complex interactions with the objects, but also presents challenges in reasoning about gripper-object interactions. In this work, we introduce Hybrid Actor-Critic Maps for Manipulation (HACMan), a reinforcement learning approach for 6D non-prehensile manipulation of objects using point cloud observations. HACMan proposes a temporally-abstracted and spatially-grounded object-centric action representation that consists of selecting a contact location from the object point cloud and a set of motion parameters describing how the robot will move after making contact. We modify an existing off-policy RL algorithm to learn in this hybrid discrete-continuous action representation. We evaluate HACMan on a 6D object pose alignment task in both simulation and in the real world. On the hardest version of our task, with randomized initial poses, randomized 6D goals, and diverse object categories, our policy demonstrates strong generalization to unseen object categories without a performance drop, achieving an 89% success rate on unseen objects in simulation and 50% success rate with zero-shot transfer in the real world. Compared to alternative action representations, HACMan achieves a success rate more than three times higher than the best baseline. With zero-shot sim2real transfer, our policy can successfully manipulate unseen objects in the real world for challenging non-planar goals, using dynamic and contact-rich non-prehensile skills. Videos can be found on the project website: //hacman-2023.github.io.
Classical analysis of convex and non-convex optimization methods often requires the Lipshitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
The vision of AI collaborators has long been a staple of stories and science fiction, where artificial agents understand nuances of collaboration and human communication. They assist their human partners and teams and have special talents. Government advisory groups and leaders in AI have advocated for years that AIs should be human compatible and effective collaborators. Nonetheless, robust AIs that collaborate like talented people remain out of reach. The simpler dream of effective information tools that augment human intelligence (IA) has its roots in the 1960s and helped to drive an information technology revolution. With the vast increase in hybrid and remote work since the COVID pandemic, the benefits and requirements for better coordination, collaboration, and communication are in focus for the workplace. Many factors (such as the costs of homes near work) are impeding a mass return to in-person work at the office. Are human-like AI teammates part of a solution? If we just need better tools for collaboration, how artificially intelligent (AI) could and should these tools be? This position paper reviews the arc of technology and calls by others for human-machine teaming. It draws on earlier research in psychology and the social sciences about what human-like collaboration actually requires. This paper argues that current mainstream AI cannot produce robust, intelligent, and human-compatible collaborators. It sets a context for a second paper that proposes exploring a radical shift in technology and methodology for creating resilient, intelligent, and human-compatible AIs (Stefik & Price, 2023). The aspirational goal is that such AIs would learn, share what they learn, and collaborate to achieve high standards.
This PhD thesis contains several contributions to the field of statistical causal modeling. Statistical causal models are statistical models embedded with causal assumptions that allow for the inference and reasoning about the behavior of stochastic systems affected by external manipulation (interventions). This thesis contributes to the research areas concerning the estimation of causal effects, causal structure learning, and distributionally robust (out-of-distribution generalizing) prediction methods. We present novel and consistent linear and non-linear causal effects estimators in instrumental variable settings that employ data-dependent mean squared prediction error regularization. Our proposed estimators show, in certain settings, mean squared error improvements compared to both canonical and state-of-the-art estimators. We show that recent research on distributionally robust prediction methods has connections to well-studied estimators from econometrics. This connection leads us to prove that general K-class estimators possess distributional robustness properties. We, furthermore, propose a general framework for distributional robustness with respect to intervention-induced distributions. In this framework, we derive sufficient conditions for the identifiability of distributionally robust prediction methods and present impossibility results that show the necessity of several of these conditions. We present a new structure learning method applicable in additive noise models with directed trees as causal graphs. We prove consistency in a vanishing identifiability setup and provide a method for testing substructure hypotheses with asymptotic family-wise error control that remains valid post-selection. Finally, we present heuristic ideas for learning summary graphs of nonlinear time-series models.
Ensembles over neural network weights trained from different random initialization, known as deep ensembles, achieve state-of-the-art accuracy and calibration. The recently introduced batch ensembles provide a drop-in replacement that is more parameter efficient. In this paper, we design ensembles not only over weights, but over hyperparameters to improve the state of the art in both settings. For best performance independent of budget, we propose hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations. Its strong performance highlights the benefit of combining models with both weight and hyperparameter diversity. We further propose a parameter efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks. The computational and memory costs of our method are notably lower than typical ensembles. On image classification tasks, with MLP, LeNet, and Wide ResNet 28-10 architectures, our methodology improves upon both deep and batch ensembles.
Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.