With the rapid advancements in autonomous driving and robot navigation, there is a growing demand for lifelong learning models capable of estimating metric (absolute) depth. Lifelong learning approaches potentially offer significant cost savings in terms of model training, data storage, and collection. However, the quality of RGB images and depth maps is sensor-dependent, and depth maps in the real world exhibit domain-specific characteristics, leading to variations in depth ranges. These challenges limit existing methods to lifelong learning scenarios with small domain gaps and relative depth map estimation. To facilitate lifelong metric depth learning, we identify three crucial technical challenges that require attention: i) developing a model capable of addressing the depth scale variation through scale-aware depth learning, ii) devising an effective learning strategy to handle significant domain gaps, and iii) creating an automated solution for domain-aware depth inference in practical applications. Based on the aforementioned considerations, in this paper, we present i) a lightweight multi-head framework that effectively tackles the depth scale imbalance, ii) an uncertainty-aware lifelong learning solution that adeptly handles significant domain gaps, and iii) an online domain-specific predictor selection method for real-time inference. Through extensive numerical studies, we show that the proposed method can achieve good efficiency, stability, and plasticity, leading the benchmarks by 8% to 15%.
Modern cyber attackers use advanced zero-day exploits, highly targeted spear phishing, and other social engineering techniques to gain access and also use evasion techniques to maintain a prolonged presence within the victim network while working gradually towards the objective. To minimize the damage, it is necessary to detect these Advanced Persistent Threats as early in the campaign as possible. This paper proposes, Prov2Vec, a system for the continuous monitoring of enterprise host's behavior to detect attackers' activities. It leverages the data provenance graph built using system event logs to get complete visibility into the execution state of an enterprise host and the causal relationship between system entities. It proposes a novel provenance graph kernel to obtain the canonical representation of the system behavior, which is compared against its historical behaviors and that of other hosts to detect the deviation from the normality. These representations are used in several machine learning models to evaluate their ability to capture the underlying behavior of an endpoint host. We have empirically demonstrated that the provenance graph kernel produces a much more compact representation compared to existing methods while improving prediction ability.
Motivated by humans' ability to adapt skills in the learning of new ones, this paper presents AdaptNet, an approach for modifying the latent space of existing policies to allow new behaviors to be quickly learned from like tasks in comparison to learning from scratch. Building on top of a given reinforcement learning controller, AdaptNet uses a two-tier hierarchy that augments the original state embedding to support modest changes in a behavior and further modifies the policy network layers to make more substantive changes. The technique is shown to be effective for adapting existing physics-based controllers to a wide range of new styles for locomotion, new task targets, changes in character morphology and extensive changes in environment. Furthermore, it exhibits significant increase in learning efficiency, as indicated by greatly reduced training times when compared to training from scratch or using other approaches that modify existing policies.
Modern automated surveillance techniques are heavily reliant on deep learning methods. Despite the superior performance, these learning systems are inherently vulnerable to adversarial attacks - maliciously crafted inputs that are designed to mislead, or trick, models into making incorrect predictions. An adversary can physically change their appearance by wearing adversarial t-shirts, glasses, or hats or by specific behavior, to potentially avoid various forms of detection, tracking and recognition of surveillance systems; and obtain unauthorized access to secure properties and assets. This poses a severe threat to the security and safety of modern surveillance systems. This paper reviews recent attempts and findings in learning and designing physical adversarial attacks for surveillance applications. In particular, we propose a framework to analyze physical adversarial attacks and provide a comprehensive survey of physical adversarial attacks on four key surveillance tasks: detection, identification, tracking, and action recognition under this framework. Furthermore, we review and analyze strategies to defend against the physical adversarial attacks and the methods for evaluating the strengths of the defense. The insights in this paper present an important step in building resilience within surveillance systems to physical adversarial attacks.
Reinforcement learning methods, while effective for learning robotic navigation strategies, are known to be highly sample inefficient. This sample inefficiency comes in part from not suitably balancing the explore-exploit dilemma, especially in the presence of non-stationarity, during policy optimization. To incorporate a balance of exploration-exploitation for sample efficiency, we propose Ada-NAV, an adaptive trajectory length scheme where the length grows as a policy's randomness, represented by its Shannon or differential entropy, decreases. Our adaptive trajectory length scheme emphasizes exploration at the beginning of training due to more frequent gradient updates and emphasizes exploitation later on with longer trajectories. In gridworld, simulated robotic environments, and real-world robotic experiments, we demonstrate the merits of the approach over constant and randomly sampled trajectory lengths in terms of performance and sample efficiency. For a fixed sample budget, Ada-NAV results in an 18% increase in navigation success rate, a 20-38% decrease in the navigation path length, and 9.32% decrease in the elevation cost compared to the policies obtained by the other methods. We also demonstrate that Ada-NAV can be transferred and integrated into a Clearpath Husky robot without significant performance degradation.
Human motion driven control (HMDC) is an effective approach for generating natural and compelling robot motions while preserving high-level semantics. However, establishing the correspondence between humans and robots with different body structures is not straightforward due to the mismatches in kinematics and dynamics properties, which causes intrinsic ambiguity to the problem. Many previous algorithms approach this motion retargeting problem with unsupervised learning, which requires the prerequisite skill sets. However, it will be extremely costly to learn all the skills without understanding the given human motions, particularly for high-dimensional robots. In this work, we introduce CrossLoco, a guided unsupervised reinforcement learning framework that simultaneously learns robot skills and their correspondence to human motions. Our key innovation is to introduce a cycle-consistency-based reward term designed to maximize the mutual information between human motions and robot states. We demonstrate that the proposed framework can generate compelling robot motions by translating diverse human motions, such as running, hopping, and dancing. We quantitatively compare our CrossLoco against the manually engineered and unsupervised baseline algorithms along with the ablated versions of our framework and demonstrate that our method translates human motions with better accuracy, diversity, and user preference. We also showcase its utility in other applications, such as synthesizing robot movements from language input and enabling interactive robot control.
We present MOTLEE, a distributed mobile multi-object tracking algorithm that enables a team of robots to collaboratively track moving objects in the presence of localization error. Existing approaches to distributed tracking make limiting assumptions regarding the relative spatial relationship of sensors, including assuming a static sensor network or that perfect localization is available. Instead, we develop an algorithm based on the Kalman-Consensus filter for distributed tracking that properly leverages localization uncertainty in collaborative tracking. Further, our method allows the team to maintain an accurate understanding of dynamic objects in the environment by realigning robot frames and incorporating frame alignment uncertainty into our object tracking formulation. We evaluate our method in hardware on a team of three mobile ground robots tracking four people. Compared to previous works that do not account for localization error, we show that MOTLEE is resilient to localization uncertainties, enabling accurate tracking in distributed, dynamic settings with mobile tracking sensors.
Despite the recent progress in deep learning, most approaches still go for a silo-like solution, focusing on learning each task in isolation: training a separate neural network for each individual task. Many real-world problems, however, call for a multi-modal approach and, therefore, for multi-tasking models. Multi-task learning (MTL) aims to leverage useful information across tasks to improve the generalization capability of a model. This thesis is concerned with multi-task learning in the context of computer vision. First, we review existing approaches for MTL. Next, we propose several methods that tackle important aspects of multi-task learning. The proposed methods are evaluated on various benchmarks. The results show several advances in the state-of-the-art of multi-task learning. Finally, we discuss several possibilities for future work.
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at //github.com/KaiyangZhou/CoOp.
Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to effectively execute them on some resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we firstly propose a novel transformer distillation method that is a specially designed knowledge distillation (KD) method for transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be well transferred to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves comparable results with BERT in GLUE datasets, while being 7.5x smaller and 9.4x faster on inference. TinyBERT is also significantly better than state-of-the-art baselines, even with only about 28% parameters and 31% inference time of baselines.
This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks as well new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.