Due to the limited availability of data, existing few-shot learning methods trained from scratch fail to achieve satisfactory performance. In contrast, large-scale pre-trained models such as CLIP demonstrate remarkable few-shot and zero-shot capabilities. To enhance the performance of pre-trained models on downstream tasks, fine-tuning the model on downstream data is frequently necessary. However, fine-tuning the pre-trained model reduces its generalizability under distribution shift, while the limited number of samples in few-shot learning makes the model highly susceptible to overfitting. Consequently, existing few-shot fine-tuning methods primarily focus on fine-tuning the model's classification head or introducing additional structure. In this paper, we introduce a fine-tuning approach termed Feature Discrimination Alignment (FD-Align). Our method aims to bolster the model's generalizability by preserving the consistency of spurious features across the fine-tuning process. Extensive experimental results validate the efficacy of our approach for both in-distribution (ID) and out-of-distribution (OOD) tasks. Once fine-tuned, the model can seamlessly integrate with existing methods, leading to performance improvements. Our code can be found at //github.com/skingorz/FD-Align.
Machine learning (ML) models are fundamentally shaped by data, and building inclusive ML systems requires significant considerations around how to design representative datasets. Yet, few novice-oriented ML modeling tools are designed to foster hands-on learning of dataset design practices, including how to design for data diversity and inspect for data quality. To this end, we outline a set of four data design practices (DDPs) for designing inclusive ML models and share how we designed a tablet-based application called Co-ML to foster learning of DDPs through a collaborative ML model building experience. With Co-ML, beginners can build image classifiers through a distributed experience where data is synchronized across multiple devices, enabling multiple users to iteratively refine ML datasets in discussion and coordination with their peers. We deployed Co-ML in a 2-week-long educational AIML Summer Camp, where youth ages 13-18 worked in groups to build custom ML-powered mobile applications. Our analysis reveals how multi-user model building with Co-ML, in the context of student-driven projects created during the summer camp, supported development of DDPs including incorporating data diversity, evaluating model performance, and inspecting for data quality. Additionally, we found that students' attempts to improve model performance often prioritized learnability over class balance. Through this work, we highlight how the combination of collaboration, model testing interfaces, and student-driven projects can empower learners to actively engage in exploring the role of data in ML systems.
This study presents a deep learning-based approach to the seismic velocity inversion problem, focusing on both noisy and noiseless training datasets of varying sizes. Our Seismic Velocity Inversion Network (SVInvNet) introduces a novel architecture that contains a multi-connection encoder-decoder structure enhanced with dense blocks. This design is specifically tuned to process complex information effectively, which is crucial for addressing the challenges of non-linear seismic velocity inversion. For training and testing, we created diverse seismic velocity models, including multi-layered, faulty, and salt dome categories. We also investigated how different kinds of ambient noise, both coherent and stochastic, and the size of the training dataset affect learning outcomes. SVInvNet is trained on datasets ranging from 750 to 6,000 samples and is tested using a large benchmark dataset of 12,000 samples. Despite having fewer parameters than the baseline, SVInvNet achieves superior performance on this dataset. The results of SVInvNet are additionally compared with those of the Full Waveform Inversion (FWI) method. The comparative analysis demonstrates the effectiveness of the proposed model.
This paper presents a novel solution to the challenges of achieving energy efficiency and cooperation for collision avoidance in UAV swarms. The proposed method combines Artificial Potential Field (APF) and Particle Swarm Optimization (PSO) techniques. APF provides environmental awareness and implicit coordination to UAVs, while PSO searches for collision-free and energy-efficient trajectories for each UAV in a decentralized manner under the implicit coordination. This decentralized approach is achieved by minimizing a novel cost function that leverages the advantages of the active contour model from image processing. Additionally, future trajectories are predicted by approximating the minima of the novel cost function using the calculus of variations, which enables proactive actions and defines the initial conditions for PSO. We propose a two-branch trajectory planning framework that ensures UAVs only change altitudes when necessary for energy considerations. Extensive experiments are conducted to evaluate the effectiveness and efficiency of our method in various situations.
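The interplay between a potential-field cost and a particle swarm search can be illustrated with a minimal sketch. It is not the cost function of the paper above (which builds on the active contour model); it assumes a textbook attractive/repulsive potential, a single 2D waypoint instead of a full trajectory, and hypothetical names (`potential_cost`, `pso`) and constants chosen only for illustration.

```python
import math
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def potential_cost(p: Point, goal: Point, obstacle: Point, influence: float = 2.0) -> float:
    """Textbook APF cost: quadratic attraction to the goal plus repulsion near the obstacle."""
    attract = 0.5 * math.dist(p, goal) ** 2
    d = math.dist(p, obstacle)
    if d >= influence:
        repel = 0.0  # outside the obstacle's influence radius
    else:
        repel = 0.5 * 10.0 * (1.0 / max(d, 1e-6) - 1.0 / influence) ** 2
    return attract + repel


def pso(cost: Callable[[Point], float], bounds: Tuple[float, float],
        particles: int = 20, iterations: int = 50, seed: int = 0):
    """Plain global-best PSO over 2D points; returns (best point, best cost, cost history)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi), rng.uniform(lo, hi)] for _ in range(particles)]
    vel = [[0.0, 0.0] for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [cost((p[0], p[1])) for p in pos]
    g = min(range(particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]
    history: List[float] = [g_cost]
    for _ in range(iterations):
        for i in range(particles):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward the particle's own best + pull toward the swarm best.
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (best_pos[i][d] - pos[i][d])
                             + 1.5 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost((pos[i][0], pos[i][1]))
            if c < best_cost[i]:
                best_cost[i], best_pos[i] = c, pos[i][:]
                if c < g_cost:
                    g_cost, g_pos = c, pos[i][:]
        history.append(g_cost)
    return (g_pos[0], g_pos[1]), g_cost, history


goal, obstacle = (5.0, 5.0), (4.0, 4.0)
waypoint, best, history = pso(lambda p: potential_cost(p, goal, obstacle), bounds=(0.0, 6.0))
```

The swarm's best cost never increases from one iteration to the next, so `history` can be inspected to check convergence. A decentralized variant would run one such search per UAV, with the other UAVs entering each cost as additional repulsive terms.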
This paper underscores the vital role of the chi-square test within political science research utilizing structural equation modeling (SEM). The ongoing debate regarding the inclusion of chi-square test statistics alongside fit indices in result presentations has sparked controversy. Despite the recognized limitations of relying solely on the chi-square test, its judicious application can enhance its effectiveness in evaluating model fit and specification. To exemplify this, we present three common scenarios pertinent to political science research where fit indices may inadequately address goodness-of-fit concerns, while the chi-square statistic can be effectively harnessed. Through Monte Carlo simulations, we examine strategies for enhancing chi-square tests within these scenarios, showcasing the potential of appropriately employed chi-square tests to provide a comprehensive model fit assessment. Our recommendation is to report both the chi-square test and fit indices, with a priority on precise model specification to ensure the trustworthiness of model fit indicators.
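For readers less familiar with the statistic discussed above, the following sketch shows how a model chi-square test is commonly computed under maximum-likelihood estimation: the statistic is (N - 1) times the minimized fit function, and the degrees of freedom are the number of unique variances and covariances minus the number of free parameters. The function names and the numbers in the usage lines are hypothetical; some software uses N rather than N - 1 as the multiplier.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def model_chi_square(n: int, f_min: float, observed: int, free_params: int) -> Tuple[float, int, float]:
    """Return (statistic, degrees of freedom, p-value) for an SEM model test."""
    statistic = (n - 1) * f_min
    df = observed * (observed + 1) // 2 - free_params
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical model: 500 cases, 6 observed variables, 13 free parameters.
statistic, df, p_value = model_chi_square(n=500, f_min=0.03, observed=6, free_params=13)
```

A small p-value indicates that the model-implied covariance matrix differs from the observed one by more than sampling error would explain, which is the information fit indices alone may obscure.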
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
This work presents an investigation and assessment framework which, supported by realistic data, aims at providing operators with in-depth insights into the consumer-perceived Quality-of-Experience (QoE) at public Electric Vehicle (EV) charging infrastructures. Given the unprecedented EV market growth, it is suspected that the existing charging infrastructure will soon no longer be capable of sustaining the rapidly growing charging demand. Moreover, the currently adopted ad hoc infrastructure expansion strategies seem far from offering service-quality sustainability solutions that tangibly reduce (and ultimately mitigate) the severity of this problem. Without suitable QoE metrics, operators today face considerable difficulty in assessing the performance of EV Charging Stations (EVCSs) in this regard. This paper aims at filling this gap through the formulation of novel critical QoE performance metrics that provide operators with visibility into the per-EVCS operational dynamics and allow for the optimization of these stations' utilization. These metrics are then used as inputs to a Machine Learning model tailored and trained using recent real-world data sets for the purpose of forecasting future long-term EVCS loads. This, in turn, allows for informed EV charging infrastructure expansions that can reliably cope with the rising EV charging demand while maintaining acceptable QoE levels. The model's accuracy has been tested, and extensive simulations have been conducted to evaluate the achieved performance in terms of the above metrics and to show the suitability of the recommended infrastructure expansions.
Supervised imitation learning, also known as behavioral cloning, suffers from distribution drift leading to failures during policy execution. One approach to mitigate this issue is to allow an expert to correct the agent's actions during task execution, based on the expert's determination that the agent has reached a "point of no return." The agent's policy is then retrained using this new corrective data. This approach alone can enable high-performance agents to be learned, but at a substantial cost: the expert must vigilantly observe execution until the policy reaches a specified level of success, and even at that point, there is no guarantee that the policy will always succeed. To address these limitations, we present FIRE (Failure Identification to Reduce Expert Burden in intervention-based learning), a system that can predict when a running policy will fail, halt its execution, and request a correction from the expert. Unlike existing approaches that learn only from expert data, our approach learns from both expert and non-expert data, akin to adversarial learning. We demonstrate experimentally for a series of challenging manipulation tasks that our method is able to recognize state-action pairs that lead to failures. This permits seamless integration into an intervention-based learning system, where we show an order-of-magnitude gain in sample efficiency compared with a state-of-the-art inverse reinforcement learning method and dramatically improved performance over an equivalent amount of data learned with behavioral cloning.
The rapid development of deep learning has led to great progress in segmentation, one of the fundamental tasks of computer vision. However, current segmentation algorithms mostly rely on the availability of pixel-level annotations, which are often expensive, tedious, and laborious to obtain. To alleviate this burden, recent years have witnessed increasing attention to building label-efficient, deep-learning-based segmentation algorithms. This paper offers a comprehensive review of label-efficient segmentation methods. To this end, we first develop a taxonomy that organizes these methods according to the supervision provided by different types of weak labels (including no supervision, coarse supervision, incomplete supervision and noisy supervision), supplemented by the types of segmentation problems (including semantic segmentation, instance segmentation and panoptic segmentation). Next, we summarize the existing label-efficient segmentation methods from a unified perspective that addresses an important question: how to bridge the gap between weak supervision and dense prediction. Current methods are mostly based on heuristic priors, such as cross-pixel similarity, cross-label constraint, cross-view consistency, and cross-image relation. Finally, we share our opinions about future research directions for label-efficient deep segmentation.
Learning with limited data is a key challenge for visual recognition. Few-shot learning methods address this challenge by learning an instance embedding function from seen classes and applying the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned to be optimally discriminative with respect to the unseen classes, even though discerning among them is the target task. In this paper, we propose a novel approach to adapt the embedding model to the target classification task, yielding embeddings that are task-specific and discriminative. To this end, we employ a self-attention mechanism, the Transformer, to transform the embeddings from task-agnostic to task-specific by relating test instances to training instances in both seen and unseen classes. Our approach also extends to both transductive and generalized few-shot classification, two important settings that have essential use cases. We verify the effectiveness of our model on two standard few-shot classification benchmarks, MiniImageNet and CUB, where our approach demonstrates state-of-the-art empirical performance.
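The idea of adapting a set of embeddings with self-attention can be sketched as follows. This is a simplified illustration rather than the model of the paper above: it assumes that queries, keys and values are the raw embeddings (a trained Transformer would first apply learned linear projections), uses a single attention head, and adds a residual connection. The name `adapt_embeddings` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_embeddings(embeddings: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention over a set of embeddings, plus a residual.

    Each output depends on every embedding in the set, so the same instance
    receives a different representation in a different task.
    """
    dim = len(embeddings[0])
    adapted = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        weights = softmax(scores)
        context = [sum(w * value[d] for w, value in zip(weights, embeddings))
                   for d in range(dim)]
        adapted.append([q + c for q, c in zip(query, context)])
    return adapted


# Three task-agnostic class prototypes in a 2D embedding space.
prototypes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
task_specific = adapt_embeddings(prototypes)
```

Because the attention weights are computed over the whole set, the adapted embeddings change whenever the set of classes in the task changes, which is the sense in which they become task-specific.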
State-of-the-art Convolutional Neural Networks (CNNs) benefit greatly from multi-task learning (MTL), which learns multiple related tasks simultaneously to obtain shared or mutually related representations for different tasks. The most widely used MTL CNN structure is based on an empirical or heuristic split at a specific layer (e.g., the last convolutional layer) to minimize different task-specific losses. However, this heuristic sharing/splitting strategy may be harmful to the final performance of one or multiple tasks. In this paper, we propose a novel CNN structure for MTL, which enables automatic feature fusing at every layer. Specifically, we first concatenate features from different tasks along their channel dimension, and then formulate the feature fusing problem as discriminative dimensionality reduction. We show that this discriminative dimensionality reduction can be done by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN, which we refer to as Neural Discriminative Dimensionality Reduction (NDDR). We perform a detailed ablation analysis of different configurations for training the network. Experiments carried out on different network structures and different task sets demonstrate the promising performance and desirable generalizability of our proposed method.
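The fusing step described above (channel-wise concatenation followed by a 1x1 convolution) can be sketched in a few lines. The sketch omits Batch Normalization and Weight Decay, uses plain Python lists instead of a deep learning framework, and the weighted-identity weights in `weighted_identity` are a hypothetical starting point chosen for illustration; in a trained network these weights would be learned.

```python
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][column]


def concat_channels(a: Feature, b: Feature) -> Feature:
    """Stack two feature maps of equal spatial size along the channel axis."""
    return a + b


def conv1x1(x: Feature, weight: List[List[float]]) -> Feature:
    """1x1 convolution: the same linear map over channels at every pixel."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(w_c * x[c][i][j] for c, w_c in enumerate(row)) for j in range(width)]
         for i in range(height)]
        for row in weight
    ]


def weighted_identity(channels: int, own: float, other: float, first: bool) -> List[List[float]]:
    """Hypothetical weights: output channel o mixes channel o of both tasks."""
    rows = []
    for o in range(channels):
        row = [0.0] * (2 * channels)
        row[o] = own if first else other             # weight on task A, channel o
        row[channels + o] = other if first else own  # weight on task B, channel o
        rows.append(row)
    return rows


# Two tasks, each with 2 channels on a 1x2 feature map.
task_a = [[[1.0, 2.0]], [[3.0, 4.0]]]
task_b = [[[10.0, 20.0]], [[30.0, 40.0]]]
stacked = concat_channels(task_a, task_b)  # 4 channels

# One 1x1 convolution per task reduces the 4 stacked channels back to 2.
fused_a = conv1x1(stacked, weighted_identity(2, 0.9, 0.1, first=True))
fused_b = conv1x1(stacked, weighted_identity(2, 0.9, 0.1, first=False))
```

Each task keeps its own 1x1 convolution, so the amount of cross-task sharing at a layer is determined by the weights rather than by a hand-picked split point.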