Robotic arms are common in many automation processes, such as manufacturing lines. However, these highly capable robots are usually relegated to simple repetitive tasks such as pick-and-place. On the other hand, designing an optimal robot for one specific task consumes considerable engineering time and cost. In this paper, we propose a novel concept for optimizing the fitness of a robotic arm to perform a specific task based on human demonstration. The fitness of a robot arm is a measure of its ability to follow recorded human arm and hand paths. The optimization is conducted using a variant of Particle Swarm Optimization adapted to the robot design problem. In the proposed approach, we generate an optimal robot design along with the path required to complete the task. The approach could reduce the time-to-market of robotic arms and enable the standardization of modular robotic parts, allowing novice users to easily apply a minimal robot arm to various tasks. Two test cases of common manufacturing tasks are presented, yielding optimal designs and reducing computational effort by up to 92%.
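Below is a minimal sketch of the kind of particle swarm search such a design optimization might use; the fitness function is a hypothetical placeholder, whereas the paper's fitness measures how well a candidate arm follows recorded human arm and hand paths.

```python
# Minimal Particle Swarm Optimization sketch for a robot-design search.
# The fitness below is a placeholder; in the paper it would score how well a
# candidate arm (e.g., link lengths, joint placements) tracks recorded paths.
import numpy as np

def fitness(design):
    # Hypothetical stand-in: squared deviation from an arbitrary reference.
    return np.sum((design - 0.5) ** 2)

def pso(dim=6, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, (n_particles, dim))         # candidate designs
    v = np.zeros_like(x)                                    # particle velocities
    pbest = x.copy()                                        # per-particle best designs
    pbest_f = np.array([fitness(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()                    # global best design
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0.0, 1.0)                        # keep designs in bounds
        f = np.array([fitness(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, pbest_f.min()

best_design, best_fitness = pso()
print(best_design, best_fitness)
```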
Graph alignment aims to find the vertex correspondence between two correlated graphs, a task that frequently arises in graph mining applications such as social network analysis. Attributed graph alignment is a variant of graph alignment in which publicly available side information, or attributes, is exploited to assist alignment. Existing studies on attributed graph alignment focus either on theoretical performance without computational constraints or on the empirical performance of efficient algorithms. This motivates us to investigate efficient algorithms with theoretical performance guarantees. In this paper, we propose two polynomial-time algorithms that exactly recover the vertex correspondence with high probability. The feasible region of the proposed algorithms is near optimal compared to the information-theoretic limits. When specialized to the seeded graph alignment problem under the seeded Erd\H{o}s--R\'{e}nyi graph pair model, the proposed algorithms extend the best known feasible region for exact alignment by polynomial-time algorithms.
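As a rough illustration of seeded alignment (not the paper's algorithm), the classic seed-witness heuristic matches each unseeded vertex to the candidate sharing the most seeded neighbors; the greedy assignment below is only a sketch of that idea.

```python
# Generic seed-witness counting sketch for seeded graph alignment.
# Not the paper's polynomial-time algorithms; just an illustrative baseline.
import numpy as np

def seeded_align(A1, A2, seeds):
    """A1, A2: n x n adjacency matrices; seeds: dict mapping G1 vertex -> G2 vertex."""
    n = A1.shape[0]
    unseeded1 = [i for i in range(n) if i not in seeds]
    unseeded2 = [j for j in range(n) if j not in seeds.values()]
    s1 = np.array(list(seeds.keys()))
    s2 = np.array(list(seeds.values()))
    # score[u, v] = number of seed pairs adjacent to u in G1 and to v in G2
    score = A1[np.ix_(unseeded1, s1)] @ A2[np.ix_(s2, unseeded2)]
    match, used = dict(seeds), set()
    # Greedily assign pairs in order of decreasing witness count.
    for flat in np.argsort(-score, axis=None):
        u, v = np.unravel_index(flat, score.shape)
        if unseeded1[u] not in match and unseeded2[v] not in used:
            match[unseeded1[u]] = unseeded2[v]
            used.add(unseeded2[v])
    return match
```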
Dynamic density estimation is ubiquitous in many applications, including computer vision and signal processing. One popular method to tackle this problem is the "sliding window" kernel density estimator. There exist various implementations of this method that use heuristically defined weight sequences for the observed data. The weight sequence, however, is a key aspect of the estimator that significantly affects tracking performance. In this work, we study the exact mean integrated squared error (MISE) of "sliding window" Gaussian kernel density estimators for evolving Gaussian densities. We provide a principled guide for choosing the optimal weight sequence by theoretically characterizing the exact MISE, which can be formulated as a constrained quadratic program. We present empirical evidence on synthetic datasets showing that our weighting scheme indeed improves tracking performance compared to heuristic approaches.
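The sketch below shows a weighted "sliding window" Gaussian KDE; the exponential weight sequence used here is exactly the kind of heuristic the paper argues should be replaced by the MISE-optimal weights obtained from the constrained quadratic program.

```python
# Weighted "sliding window" Gaussian KDE over the most recent W observations.
# The exponential decay weights are a common heuristic, standing in for the
# MISE-optimal weight sequence derived in the paper.
import numpy as np

def sliding_window_kde(samples, query, bandwidth=0.3, window=100, decay=0.95):
    x = np.asarray(samples[-window:])                 # most recent observations
    w = decay ** np.arange(len(x) - 1, -1, -1)         # newer samples weigh more
    w = w / w.sum()                                     # weights sum to one
    q = np.asarray(query)[:, None]
    kernels = np.exp(-0.5 * ((q - x[None, :]) / bandwidth) ** 2) \
              / (bandwidth * np.sqrt(2 * np.pi))
    return kernels @ w                                  # weighted mixture density

# Example: track a slowly drifting Gaussian stream.
rng = np.random.default_rng(0)
stream = [rng.normal(loc=t / 500.0, scale=1.0) for t in range(500)]
grid = np.linspace(-3.0, 4.0, 50)
print(sliding_window_kde(stream, grid)[:5])
```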
Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources for training. In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involves decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves training efficiency by more than 2x, without compromising generalization.
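A hedged sketch of the width-reduction idea for deep linear networks follows; the depth, widths, and rank r are illustrative choices, not the paper's prescribed values or its structured initialization.

```python
# Width reduction for a deep linear network: replace full-width intermediate
# layers with narrow ones of width r << d. Dimensions here are illustrative.
import torch.nn as nn

def deep_linear(dims):
    # Product of linear maps with no nonlinearity: x -> W_L ... W_1 x
    return nn.Sequential(*[nn.Linear(a, b, bias=False) for a, b in zip(dims[:-1], dims[1:])])

def n_params(model):
    return sum(p.numel() for p in model.parameters())

d, r, depth = 100, 5, 3
original = deep_linear([d] + [d] * (depth - 1) + [d])    # full-width intermediate layers
compressed = deep_linear([d] + [r] * (depth - 1) + [d])  # narrow intermediate layers

# The compressed factorization has far fewer parameters while still expressing
# low-rank end-to-end maps, which is the regime studied in the paper.
print(n_params(original), n_params(compressed))
```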
Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While conventional cosine scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirm the efficacy of the proposed method in handling uncertainty arising from embedding estimation, achieving 8.5% and 9.8% average reductions in EER and minDCF, respectively, compared to conventional cosine similarity. The method is also computationally efficient in practice.
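One simple way to fold per-dimension embedding variances into cosine scoring is sketched below; this particular propagation rule, which inflates the norm terms by the total variance, is an assumption for illustration and may differ from the paper's exact formulation.

```python
# Illustrative uncertainty-aware cosine scoring: embedding variances inflate
# the norm terms of the cosine similarity. This is one simple propagation rule,
# not necessarily the paper's exact back-end.
import numpy as np

def cosine(mu1, mu2):
    return mu1 @ mu2 / (np.linalg.norm(mu1) * np.linalg.norm(mu2))

def uncertain_cosine(mu1, var1, mu2, var2):
    denom = np.sqrt((mu1 @ mu1 + var1.sum()) * (mu2 @ mu2 + var2.sum()))
    return mu1 @ mu2 / denom                      # larger uncertainty -> smaller score

rng = np.random.default_rng(0)
mu_a, mu_b = rng.normal(size=192), rng.normal(size=192)      # speaker embeddings
var_a, var_b = rng.uniform(0.0, 0.1, 192), rng.uniform(0.0, 0.1, 192)  # per-dim variances
print(cosine(mu_a, mu_b), uncertain_cosine(mu_a, var_a, mu_b, var_b))
```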
Humanoids are versatile robotic platforms owing to their limbs with multiple degrees of freedom. Although humanoids can walk like humans, they are relatively slow and cannot run over large barriers. To address these limitations, we aim to achieve rapid terrestrial locomotion and simultaneously expand the locomotion domain to the air by utilizing thrust for propulsion. In this paper, we first describe an optimized construction method for a humanoid robot equipped with wheels and a flight unit to achieve these abilities. Then, we describe the integrated control framework of the proposed flying humanoid for each locomotion mode: aerial, legged, and wheeled. Finally, we conducted multimodal locomotion and aerial manipulation experiments using the proposed robot platform. To the best of our knowledge, this is the first time a single humanoid has achieved three different types of locomotion, including flight.
In the investigation of any causal mechanism, such as the brain's causal networks, the assumption of causal sufficiency plays a critical role. Notably, neglecting this assumption can result in significant errors, a fact that is often disregarded in the causal analysis of brain networks. In this study, we propose an algorithmic approach for identifying the essential exogenous nodes needed to satisfy causal sufficiency in such inquiries. Our approach consists of three main steps. First, following the essence of the Peter-Clark (PC) algorithm, we conduct independence tests for pairs of regions within a network, as well as for the same pairs conditioned on nodes from other networks. Next, we identify candidate confounders by analyzing the differences between the conditional and unconditional results, using the Kolmogorov-Smirnov test. Subsequently, we utilize Non-Factorized identifiable Variational Autoencoders (NF-iVAE) along with the Correlation Coefficient index (CCI) metric to identify the confounding variables within these candidate nodes. Applying our method to the Human Connectome Project (HCP) movie-watching task data, we demonstrate that while interactions exist between dorsal and ventral regions, only dorsal regions serve as confounders for the visual networks, and vice versa. These findings are consistent with the neuroscientific perspective. Finally, we show the reliability of our results through 30 independent runs of the NF-iVAE initialization.
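A hedged sketch of the screening step (independence tests with and without conditioning, compared via a two-sample Kolmogorov-Smirnov test) is given below; the partial-correlation test and the variable names are illustrative simplifications of the PC-style procedure described above.

```python
# Screening sketch: test within-network region pairs unconditionally and
# conditioned on a candidate node from another network, then compare the two
# populations of statistics with a KS test. Names and tests are illustrative.
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    # Correlation of x and y after linearly regressing out z.
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return stats.pearsonr(rx, ry)[0]

def screen_candidate(regions, candidate):
    """regions: (T, k) time series of one network; candidate: (T,) node from another network."""
    uncond, cond = [], []
    k = regions.shape[1]
    for i in range(k):
        for j in range(i + 1, k):
            uncond.append(stats.pearsonr(regions[:, i], regions[:, j])[0])
            cond.append(partial_corr(regions[:, i], regions[:, j], candidate))
    # A small KS p-value suggests conditioning changed the dependence structure,
    # flagging the candidate node as a potential confounder.
    return stats.ks_2samp(uncond, cond).pvalue
```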
We propose two extensions to existing importance-sampling-based methods for lossy compression. First, we introduce an importance-sampling-based compression scheme that is a variant of ordered random coding (Theis and Ahmed, 2022) and is amenable to direct evaluation of the achievable compression rate for a finite number of samples. Our second and major contribution is the importance matching lemma, a finite-proposal counterpart of the recently introduced Poisson matching lemma (Li and Anantharam, 2021). By integrating the lemma with deep learning, we provide a new coding scheme for distributed lossy compression with side information at the decoder. We demonstrate the effectiveness of the proposed scheme through experiments involving synthetic Gaussian sources, distributed image compression with MNIST, and vertical federated learning with CIFAR-10.
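The sketch below illustrates the general importance-sampling compression idea in the spirit of minimal/ordered random coding; the Gaussian source and proposal are assumptions for illustration, not the paper's schemes or experimental setup.

```python
# Importance-sampling compression sketch: encoder and decoder share a seeded
# prior, the encoder picks one of K prior samples with probability proportional
# to the importance weight q(z)/p(z), and only the index (~log2 K bits) is sent.
import numpy as np

def encode(q_mean, q_std, K=1024, seed=42):
    rng = np.random.default_rng(seed)                 # shared randomness with decoder
    z = rng.normal(0.0, 1.0, K)                       # candidates from the prior p = N(0, 1)
    # log importance weights: log q(z) - log p(z) (the 2*pi terms cancel)
    log_w = (-0.5 * ((z - q_mean) / q_std) ** 2 - np.log(q_std)) - (-0.5 * z ** 2)
    w = np.exp(log_w - log_w.max())
    return rng.choice(K, p=w / w.sum())               # importance-weighted index

def decode(idx, K=1024, seed=42):
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0, K)[idx]               # regenerate candidates, select index

i = encode(q_mean=1.5, q_std=0.3)
print(i, decode(i))                                   # reconstruction concentrates near q_mean
```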
Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed for or adaptable to reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into potential future directions for the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers to explore this field, stimulate further advances in reasoning with foundation models, and contribute to the development of AGI.
Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that both types of variance must be mitigated to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and achieves better generalization than existing methods.
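A hedged sketch of gradient-aware adaptive node sampling with inverse-probability reweighting is shown below; it illustrates the general idea of using (approximate) gradient information to choose nodes, not the paper's exact decoupled variance reduction strategy.

```python
# Adaptive node sampling sketch: sampling probabilities proportional to
# per-node (approximate) gradient norms, with inverse-probability weights so
# the mini-batch gradient estimator remains unbiased.
import numpy as np

def sample_nodes(grad_norms, batch_size, rng):
    p = grad_norms / grad_norms.sum()                  # probability ∝ gradient magnitude
    idx = rng.choice(len(p), size=batch_size, replace=True, p=p)
    weights = 1.0 / (len(p) * p[idx])                  # inverse-probability reweighting
    return idx, weights

rng = np.random.default_rng(0)
approx_grad_norms = rng.gamma(2.0, 1.0, size=10_000)   # stand-in for per-node gradient norms
idx, w = sample_nodes(approx_grad_norms, batch_size=256, rng=rng)
print(idx[:5], w[:5])
```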
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions that fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on ImageNet classification are remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
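A minimal sketch of a deep-feature perceptual distance in this spirit is shown below; the chosen VGG layers and the absence of learned per-channel weights are simplifications relative to the full metric studied in the paper.

```python
# Deep-feature perceptual distance sketch: compare two images by the squared
# difference of channel-normalized VGG activations, averaged over a few layers.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
LAYERS = {3, 8, 15, 22, 29}                 # ReLU outputs of the five conv blocks

def deep_feature_distance(x, y):
    """x, y: image batches of shape (N, 3, H, W)."""
    dist = 0.0
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in LAYERS:
            xn = x / (x.norm(dim=1, keepdim=True) + 1e-10)   # unit-normalize channels
            yn = y / (y.norm(dim=1, keepdim=True) + 1e-10)
            dist = dist + ((xn - yn) ** 2).mean()            # spatially averaged squared diff
    return dist

with torch.no_grad():
    a, b = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)  # random inputs for a shape check
    print(deep_feature_distance(a, b).item())
```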