Self-supervised pre-trained speech models have strongly improved speech recognition, yet they are still sensitive to domain shifts and accented or atypical speech. Many of these models rely on quantisation or clustering to learn discrete acoustic units. We propose to correct the discovered discrete units for accented speech back to a standard pronunciation in an unsupervised manner. A masked language model is trained on discrete units from a standard accent and iteratively corrects an accented token sequence by masking unexpected cluster sequences and predicting their common variant. Small accent adapter blocks are inserted in the pre-trained model and fine-tuned by predicting the corrected clusters, which increases the robustness of the pre-trained model towards a target accent without supervision. We improve a state-of-the-art HuBERT Large model on a downstream accented speech recognition task by altering the training regime with the proposed method.
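As a rough illustration of the correction step described above, the sketch below masks units the language model finds unexpected and replaces them with its predictions; `unit_lm_probs`, the vocabulary size, and the threshold are placeholders for the actual masked language model and hyperparameters, which the abstract does not specify.

```python
# Hedged sketch of the iterative unit-correction loop described above.
# `unit_lm_probs` is a hypothetical stand-in for a masked language model
# trained on discrete units from standard-accent speech; any function that
# returns per-position probabilities over the unit vocabulary would do.
import numpy as np

VOCAB_SIZE = 100      # assumed size of the discrete-unit codebook
MASK_ID = VOCAB_SIZE  # assumed extra token id used for masking

def unit_lm_probs(units: np.ndarray) -> np.ndarray:
    """Placeholder masked LM: returns (T, VOCAB_SIZE) probabilities."""
    rng = np.random.default_rng(0)
    p = rng.random((len(units), VOCAB_SIZE))
    return p / p.sum(axis=1, keepdims=True)

def correct_units(units, threshold=0.05, n_iters=3):
    """Iteratively mask unexpected units and replace them with the LM's
    most likely 'standard' variant, as in the unsupervised correction step."""
    units = np.array(units)
    for _ in range(n_iters):
        probs = unit_lm_probs(units)
        # probability the LM assigns to the observed unit at each position
        observed_p = probs[np.arange(len(units)), units]
        unexpected = observed_p < threshold            # surprising positions
        if not unexpected.any():
            break
        masked = units.copy()
        masked[unexpected] = MASK_ID                   # mask surprising spans
        preds = unit_lm_probs(masked).argmax(axis=1)   # predict the common variant
        units[unexpected] = preds[unexpected]          # overwrite with the prediction
    return units

corrected = correct_units([3, 17, 42, 42, 9, 80])
```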
IPv6 is a fundamentally different Internet Protocol than IPv4, and IPv6-only networks cannot, by default, communicate with the IPv4 Internet. This lack of interoperability necessitates complex mechanisms for incremental deployment and for bridging networks so that non-dual-stack systems can interact with the whole Internet. NAT64 is one such bridging mechanism: a network allows IPv6-only clients to connect to the entire Internet by leveraging DNS to identify IPv4-only destinations, injecting IPv6 response addresses that point to an internal gateway, and seamlessly translating connections. To date, our understanding of NAT64 deployments is limited; what little information exists is largely qualitative, taken from mailing lists and informal discussions. In this work, we present a first look at active measurement of NAT64 deployment on the Internet, focusing on deployment prevalence, configuration, and security. We measure NAT64 via two distinct large-scale measurements: 1) open resolvers on the Internet, and 2) client measurements from RIPE Atlas. For both datasets, we broadly find that despite substantial anecdotal reports of NAT64 deployment, measurable deployments are exceedingly sparse. While our measurements do not preclude large-scale deployment of NAT64, they do point to substantial challenges in measuring deployments with our existing best-known methods. Finally, we also identify problems in NAT64 deployments, with gateways not following the RFC specification and posing potential security risks.
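The DNS64 behaviour such measurements rely on can be observed from a client with a short check in the spirit of RFC 7050: the name ipv4only.arpa has only A records, so receiving an AAAA answer for it indicates a synthesizing (DNS64) resolver, and addresses under the well-known prefix 64:ff9b::/96 point to a NAT64 gateway. The sketch below is not the paper's measurement pipeline, only a minimal client-side illustration.

```python
# Minimal sketch of a client-side NAT64/DNS64 check, following the idea in
# RFC 7050: 'ipv4only.arpa' only has A records, so receiving an IPv6 address
# for it means the resolver is synthesizing AAAA records (DNS64).
import socket

WELL_KNOWN_PREFIX = "64:ff9b:"   # RFC 6052 well-known NAT64 prefix

def detect_nat64(hostname: str = "ipv4only.arpa"):
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        return None  # no synthesized AAAA record -> no DNS64 observed
    addrs = sorted({info[4][0] for info in infos})
    for addr in addrs:
        prefix_note = ("well-known prefix" if addr.startswith(WELL_KNOWN_PREFIX)
                       else "network-specific prefix")
        print(f"synthesized AAAA: {addr} ({prefix_note})")
    return addrs

if __name__ == "__main__":
    detect_nat64()
```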
Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well under large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNNs) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. In our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.
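A minimal sketch of how a bank of illumination-specific models could be combined at matching time is given below; the feature extractors, Mahalanobis matrices, and fusion weights are hypothetical stand-ins, since the abstract does not specify how the SMB aggregates its members.

```python
# Hedged sketch of fusing a bank of (feature extractor, Mahalanobis metric)
# pairs at match time; the extractors here are random stand-ins for CNNs
# trained under different illumination conditions.
import numpy as np

def mahalanobis(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

def bank_distance(query_img, gallery_img, extractors, metrics, weights=None):
    """Fuse distances from several illumination-specific models (the 'bank')."""
    if weights is None:
        weights = [1.0 / len(extractors)] * len(extractors)
    total = 0.0
    for f, M, w in zip(extractors, metrics, weights):
        q, g = f(query_img), f(gallery_img)   # per-model embeddings
        total += w * mahalanobis(q, g, M)
    return total

# toy usage with random "extractors" and identity metrics
rng = np.random.default_rng(0)
extractors = [lambda img, W=rng.random((8, 16)): W @ img for _ in range(3)]
metrics = [np.eye(8) for _ in range(3)]
img_q, img_g = rng.random(16), rng.random(16)
print(bank_distance(img_q, img_g, extractors, metrics))
```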
People counting in crowded places is a practically useful application that can be accomplished in various ways, including many traditional sensor-based methods. For real-time scenarios, the algorithm adopted should be reliable and accurate. The people counting algorithm presented in this paper is centered on blob analysis and yields the count of people passing along a path together with their direction of traversal. The system is typically installed at the entrance of a building so that the total number of visitors can be recorded. The core premise of this work is to extract the count of people entering and leaving a particular area. The counts obtained can be used for statistical purposes in the event of a calamity in that zone: based on the recorded count, the population in the vicinity can be estimated so that appropriate measures can be taken to rescue people.
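A minimal sketch of such a blob-based counter is shown below, assuming OpenCV background subtraction and a virtual counting line at the entrance; the video path, line position, and area threshold are placeholders rather than the system's actual parameters.

```python
# Minimal blob-based in/out counting sketch with OpenCV (an assumption; the
# paper does not prescribe a library). A virtual line across the entrance is
# used: blobs whose centroids cross it top-to-bottom count as "in", and
# bottom-to-top as "out". 'entrance.mp4' is a placeholder path.
import cv2

LINE_Y = 240          # y-coordinate of the virtual counting line (assumed)
MIN_BLOB_AREA = 800   # ignore small noise blobs

cap = cv2.VideoCapture("entrance.mp4")
backsub = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
prev_centroids, count_in, count_out = [], 0, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = backsub.apply(frame)                       # foreground mask (blobs)
    fg = cv2.medianBlur(fg, 5)
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        if cv2.contourArea(c) < MIN_BLOB_AREA:
            continue
        m = cv2.moments(c)
        centroids.append((int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])))
    # naive nearest-previous-centroid association to get a direction of traversal
    for (x, y) in centroids:
        if not prev_centroids:
            continue
        px, py = min(prev_centroids, key=lambda p: abs(p[0] - x) + abs(p[1] - y))
        if py < LINE_Y <= y:
            count_in += 1
        elif py >= LINE_Y > y:
            count_out += 1
    prev_centroids = centroids

print("in:", count_in, "out:", count_out)
```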
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, large foundation models often incur high computational cost that can limit their deployment. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy, where the teacher model sees more context information with a lower masking ratio, while the student model keeps the high masking ratio of the original masked pre-training. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the ViT-B model, and 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the standard pre-training.
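The asymmetric masking idea can be sketched as follows in PyTorch: the teacher keeps a larger set of visible patches than the student, and the student's encoder features are aligned with the teacher's on the patches both see. The encoders, dimensions, and masking ratios here are illustrative assumptions, not the paper's configuration.

```python
# Hedged PyTorch sketch of asymmetric masked distillation: the teacher keeps
# more visible patches (low masking ratio) than the student (high masking
# ratio), and the student's features are aligned to the teacher's on the
# shared visible patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_PATCHES, DIM = 196, 192
teacher_ratio, student_ratio = 0.5, 0.75       # assumed masking ratios

enc_layer = lambda: nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
teacher = nn.TransformerEncoder(enc_layer(), num_layers=4).eval()
student = nn.TransformerEncoder(enc_layer(), num_layers=2)
proj = nn.Linear(DIM, DIM)                     # maps student features to the teacher space

tokens = torch.randn(8, N_PATCHES, DIM)        # patch embeddings of a batch

# asymmetric masking: student-visible indices are a subset of teacher-visible ones
perm = torch.randperm(N_PATCHES)
teacher_vis = perm[: int(N_PATCHES * (1 - teacher_ratio))]
student_vis = teacher_vis[: int(N_PATCHES * (1 - student_ratio))]

with torch.no_grad():
    t_feat = teacher(tokens[:, teacher_vis])   # teacher sees more context

s_feat = student(tokens[:, student_vis])       # student sees less

# feature alignment on the patches visible to both (the student's patches)
t_shared = t_feat[:, : student_vis.numel()]    # same order by construction
align_loss = F.mse_loss(proj(s_feat), t_shared)
align_loss.backward()
```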
The rapid growth of wearable sensor technologies holds substantial promise for personalized and context-aware Human Activity Recognition. Given the inherently decentralized nature of data sources in this domain, multi-agent systems, with their decentralization capabilities, present an opportunity to develop scalable, adaptable, and privacy-conscious methodologies. This paper introduces a collaborative distributed learning approach rooted in multi-agent principles, wherein individual users of sensor-equipped devices function as agents within a distributed network, collectively contributing to the process of learning and classifying human activities. In this methodology, the privacy of activity monitoring data is upheld for each individual, eliminating the need for an external server to oversee the learning process, and the system can also surmount the limitations of conventional centralized models and adapt to the unique attributes of each user. The proposed approach is empirically tested on two publicly accessible human activity recognition datasets, PAMAP2 and HARTH, across varying settings. The empirical results highlight the efficacy of inter-individual collaborative learning compared with centralized configurations, in terms of both local and global generalization.
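A minimal sketch of such serverless collaboration is given below: each agent trains a local classifier on its own data and periodically averages parameters with its neighbours only, so raw activity data never leaves a device. The peer graph, model, and synthetic data are placeholders for the actual PAMAP2/HARTH setup.

```python
# Hedged sketch of serverless collaborative learning among sensor agents:
# local training on private data, followed by decentralized parameter
# averaging with neighbours (no central server).
import torch
import torch.nn as nn

N_AGENTS, N_FEATURES, N_ACTIVITIES = 4, 24, 6
peers = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}     # assumed communication graph

agents = [nn.Linear(N_FEATURES, N_ACTIVITIES) for _ in range(N_AGENTS)]
opts = [torch.optim.SGD(a.parameters(), lr=0.1) for a in agents]
loss_fn = nn.CrossEntropyLoss()

# synthetic per-agent data standing in for PAMAP2/HARTH feature windows
data = [(torch.randn(64, N_FEATURES), torch.randint(0, N_ACTIVITIES, (64,)))
        for _ in range(N_AGENTS)]

for round_ in range(10):
    # 1) local training step on private data (data never leaves the agent)
    for model, opt, (x, y) in zip(agents, opts, data):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # 2) decentralized averaging with neighbours only
    with torch.no_grad():
        new_states = []
        for i, model in enumerate(agents):
            group = [agents[i]] + [agents[j] for j in peers[i]]
            avg = {k: torch.stack([m.state_dict()[k] for m in group]).mean(0)
                   for k in model.state_dict()}
            new_states.append(avg)
        for model, state in zip(agents, new_states):
            model.load_state_dict(state)
```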
Although autonomous functioning facilitates deployment of robotic systems in domains that admit limited human oversight, on our planet and beyond, finding the correspondence between task requirements and autonomous capability remains an open challenge. A number of methods for quantifying autonomy have been proposed over the last three decades, but to our knowledge none of them discerns sub-mode features of variation of autonomy, and some are based on metrics that violate Goodhart's law. This paper focuses on the fully autonomous mode and proposes a task-requirements-based autonomy assessment framework. The framework starts by establishing robot task characteristics, from which it derives three autonomy metrics, namely requisite capability, reliability, and responsiveness, as well as functions for determining autonomy as a two-part measure, namely the level of autonomy and the degree of autonomy. These characteristics are founded on the realization that robots ultimately replace skilled human workers, which suggests a mapping between human job characteristics and robot task characteristics. The distinction between level and degree of autonomy stems from the acknowledgment that autonomy is not just a question of the existence of requisite capability, but also of its performance. When continuously monitored, the proposed metrics provide a means of monitoring the integrity of a system. The framework is demonstrated on two case studies: an autonomous vehicle performing an on-road dynamic driving task, and an analysis of the DARPA SubT Challenge rules. The framework provides not only a tool for quantifying autonomy, but also a regulatory interface and common language for autonomous systems developers and users.
Automated reverse engineering of HTML/CSS code from UI screenshots is an important yet challenging problem with broad applications in website development and design. In this paper, we propose a novel vision-code transformer (ViCT) composed of a vision encoder that processes the screenshots and a language decoder that generates the code. They are initialized with pre-trained models such as ViT/DiT and GPT-2/LLaMA, but aligning the two modalities requires end-to-end fine-tuning, which aims to minimize the visual discrepancy between the code-rendered webpage and the original screenshot. However, rendering is non-differentiable and incurs costly overhead. We address this problem with actor-critic fine-tuning, where a visual critic without rendering (ViCR) is developed to predict the visual discrepancy given the original and generated code. To train and evaluate our models, we created two synthetic datasets of varying complexity, with over 75,000 unique (code, screenshot) pairs. We evaluate UI-to-Code performance using a combination of automated metrics such as MSE, BLEU, IoU, and a novel htmlBLEU score. ViCT outperforms a strong baseline model, DiT-GPT2, improving IoU from 0.64 to 0.79 and lowering MSE from 12.25 to 9.02. With much lower computational cost, it achieves performance comparable to using a larger decoder such as LLaMA.
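The render-free critic idea can be sketched as follows: a small model takes the reference code and the generated code and regresses the visual discrepancy that rendering would reveal, and its (negated) prediction then serves as a reward for a policy-gradient update of the decoder. The architectures and tokenization below are placeholders, not the ViCR/ViCT models themselves.

```python
# Hedged sketch of a render-free visual critic used for actor-critic
# fine-tuning: the critic scores (reference code, generated code) pairs, and
# its prediction replaces an expensive, non-differentiable rendering step.
import torch
import torch.nn as nn

VOCAB, DIM, MAX_LEN = 1000, 128, 64

class VisualCritic(nn.Module):
    """Predicts visual discrepancy from (reference code, generated code)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.enc = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(2 * DIM, 1)

    def encode(self, tokens):
        _, h = self.enc(self.emb(tokens))
        return h[-1]                                # (B, DIM) summary

    def forward(self, ref_code, gen_code):
        z = torch.cat([self.encode(ref_code), self.encode(gen_code)], dim=-1)
        return self.head(z).squeeze(-1)             # predicted discrepancy (e.g. MSE)

critic = VisualCritic()
decoder_logits = torch.randn(4, MAX_LEN, VOCAB, requires_grad=True)  # stand-in for decoder output

# sample generated code and score it with the critic instead of rendering
dist = torch.distributions.Categorical(logits=decoder_logits)
gen_code = dist.sample()                            # (B, MAX_LEN) token ids
ref_code = torch.randint(0, VOCAB, (4, MAX_LEN))    # ground-truth code tokens
with torch.no_grad():
    reward = -critic(ref_code, gen_code)            # lower discrepancy -> higher reward

# REINFORCE-style actor update using the render-free reward
log_prob = dist.log_prob(gen_code).sum(dim=1)
actor_loss = -(reward * log_prob).mean()
actor_loss.backward()
```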
While fine-tuning unleashes the potential of a pre-trained model on a specific task, it trades off the model's generalization capability on out-of-distribution (OOD) datasets. To mitigate this, robust fine-tuning aims to ensure performance on OOD datasets as well as on the in-distribution (ID) dataset for which the model is being tuned. However, another criterion for reliable machine learning (ML), confidence calibration, has been overlooked despite its increasing demand in real-world high-stakes ML applications (e.g., autonomous driving and medical diagnosis). For the first time, we raise concerns about the calibration of fine-tuned vision-language models (VLMs) under distribution shift by showing that naive fine-tuning and even state-of-the-art robust fine-tuning methods hurt the calibration of pre-trained VLMs, especially on OOD datasets. To address this, we provide a simple approach, calibrated robust fine-tuning (CaRot), that incentivizes calibration and robustness on both ID and OOD datasets. Empirical results on ImageNet-1K distribution shift evaluation verify the effectiveness of our method.
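The calibration criterion raised above is commonly quantified with the Expected Calibration Error (ECE); the following reference implementation (not the CaRot method itself) can be used to check a fine-tuned model's calibration on ID and OOD evaluation sets.

```python
# Standard Expected Calibration Error (ECE): average |accuracy - confidence|
# over equal-width confidence bins, weighted by the fraction of samples per bin.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    confidences = np.asarray(confidences)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        acc, conf = correct[in_bin].mean(), confidences[in_bin].mean()
        ece += in_bin.mean() * abs(acc - conf)
    return ece

# toy usage: an over-confident classifier shows a large gap between confidence and accuracy
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 1, 1], [1, 1, 1, 0]))
```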
Advancement in the field of machine learning is unavoidable, but a major concern is preserving the privacy of the users whose data is used to train these machine learning algorithms. Federated learning (FL) has emerged as a promising paradigm for training machine learning models in a distributed and privacy-preserving manner, enabling collaborators to train a global model without sharing local data. But starting this learning process on each device in the right way, called "model initialization", is critical. The choice of model initialization method plays a crucial role in the performance, convergence speed, communication efficiency, and privacy guarantees of federated learning systems. In this survey, we present a comprehensive study of model initialization techniques in FL. Unlike other studies, our research meticulously compares, categorizes, and delineates the merits and demerits of each technique, examining their applicability across diverse FL scenarios. We highlight how factors like client variability, data non-IIDness, model caliber, security considerations, and network restrictions influence FL model outcomes, and propose how strategic initialization can address and potentially rectify many such challenges. The motivation behind this survey is to highlight that the right start can help overcome challenges like varying data quality, security issues, and network problems. Our insights provide a foundation for practitioners looking to fully utilize FL while understanding the complexities of model initialization.
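To make concrete where initialization enters the picture, the sketch below shows a FedAvg-style loop whose behaviour hinges on the `init_global_model` choice; the survey's comparisons concern exactly this choice (random, pre-trained, and so on). The model, data, and hyperparameters are placeholders.

```python
# Hedged sketch of a FedAvg-style loop; the global model's initialization is
# the single point every client update departs from.
import copy
import torch
import torch.nn as nn

N_CLIENTS, ROUNDS = 5, 3

def init_global_model():
    """Initialization strategy under study; plain random init here stands in
    for the alternatives (pre-trained, meta-learned, ...) a survey would compare."""
    return nn.Linear(10, 2)

clients = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(N_CLIENTS)]
global_model = init_global_model()   # <- everything downstream depends on this choice

for _ in range(ROUNDS):
    states = []
    for x, y in clients:
        local = copy.deepcopy(global_model)          # each client starts from the global weights
        opt = torch.optim.SGD(local.parameters(), lr=0.1)
        for _ in range(5):
            opt.zero_grad()
            nn.functional.cross_entropy(local(x), y).backward()
            opt.step()
        states.append(local.state_dict())
    # server-side FedAvg of the client updates
    avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
    global_model.load_state_dict(avg)
```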
Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both global image structure and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model that performs a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector-quantized features, followed by coarse-to-fine gating for producing the image output. During this multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments on various unconditional and conditional image generation tasks, ranging from text-to-image synthesis and layout-to-image to scene-graph-to-image and label-to-image. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.
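A loose skeleton of the coarse-to-fine structure is sketched below: each scale of a feature pyramid is denoised in turn, conditioned on the already-denoised coarser scale. The networks, schedule, and the toy update standing in for a proper diffusion step are assumptions for illustration only, not Frido's actual components.

```python
# Hedged skeleton of a coarse-to-fine denoising pass over a feature pyramid:
# each scale is denoised conditioned on the already-denoised coarser scale.
import torch
import torch.nn as nn

scales = [(16, 16), (32, 32), (64, 64)]        # coarse -> fine feature maps
denoisers = nn.ModuleList(nn.Conv2d(8 + 8, 8, 3, padding=1) for _ in scales)
T = 50                                         # assumed number of denoising steps

coarser = torch.zeros(1, 8, *scales[0])        # no coarser context at the first scale
for (h, w), denoiser in zip(scales, denoisers):
    x = torch.randn(1, 8, h, w)                # start from noise at this scale
    cond = nn.functional.interpolate(coarser, size=(h, w))   # coarse-to-fine conditioning
    for t in range(T):
        eps = denoiser(torch.cat([x, cond], dim=1))          # predict a noise/residual term
        x = x - eps / T                        # toy update standing in for a DDPM step
    coarser = x                                # denoised features condition the next scale
```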