Effective fusion of multi-scale features is crucial for improving speaker verification performance. While most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the performance. The local feature fusion (LFF) fuses the features within one single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the ERes2Net in speaker verification. Code has been made publicly available at //github.com/alibaba-damo-academy/3D-Speaker.
We present improved algorithms with worst-case regret guarantees for the stochastic linear bandit problem. The widely used "optimism in the face of uncertainty" principle reduces a stochastic bandit problem to the construction of a confidence sequence for the unknown reward function. The performance of the resulting bandit algorithm depends on the size of the confidence sequence, with smaller confidence sets yielding better empirical performance and stronger regret guarantees. In this work, we use a novel tail bound for adaptive martingale mixtures to construct confidence sequences which are suitable for stochastic bandits. These confidence sequences allow for efficient action selection via convex programming. We prove that a linear bandit algorithm based on our confidence sequences is guaranteed to achieve competitive worst-case regret. We show that our confidence sequences are tighter than competitors, both empirically and theoretically. Finally, we demonstrate that our tighter confidence sequences give improved performance in several hyperparameter tuning tasks.
A crucial task for a randomized controlled trial (RCT) is to specify a statistical method that can yield an efficient estimator and powerful test for the treatment effect. A novel and effective strategy to obtain efficient and powerful treatment effect inferences is to incorporate predictions from generative artificial intelligence (AI) algorithms into covariate adjustment for the regression analysis of a RCT. Training a generative AI algorithm on historical control data enables one to construct a digital twin generator (DTG) for RCT participants, which utilizes a participant's baseline covariates to generate a probability distribution for their potential control outcome. Summaries of the probability distribution from the DTG are highly predictive of the trial outcome, and adjusting for these features via regression can thus improve the quality of treatment effect inferences, while satisfying regulatory guidelines on statistical analyses, for a RCT. However, a critical assumption in this strategy is homoskedasticity, or constant variance of the outcome conditional on the covariates. In the case of heteroskedasticity, existing covariate adjustment methods yield inefficient estimators and underpowered tests. We propose to address heteroskedasticity via a weighted prognostic covariate adjustment methodology (Weighted PROCOVA) that adjusts for both the mean and variance of the regression model using information obtained from the DTG. We prove that our method yields unbiased treatment effect estimators, and demonstrate via comprehensive simulation studies and case studies from Alzheimer's disease that it can reduce the variance of the treatment effect estimator, maintain the Type I error rate, and increase the power of the test for the treatment effect from 80% to 85%~90% when the variances from the DTG can explain 5%~10% of the variation in the RCT participants' outcomes.
The field of quadrotor motion planning has experienced significant advancements over the last decade. Most successful approaches rely on two stages: a front-end that determines the best path by incorporating geometric (and in some cases kinematic or input) constraints, that effectively specify the homotopy class of the trajectory; and a back-end that optimizes the path with a suitable objective function, constrained by the robot's dynamics as well as state/input constraints. However, there is no systematic approach or design guidelines to design both the front and the back ends for a wide range of environments, and no literature evaluates the performance of the trajectory planning algorithm with varying degrees of environment complexity. In this paper, we propose a modular approach to designing the software planning stack and offer a parameterized set of environments to systematically evaluate the performance of two-stage planners. Our parametrized environments enable us to access different front and back-end planners as a function of environmental clutter and complexity. We use simulation and experimental results to demonstrate the performance of selected planning algorithms across a range of environments. Finally, we open source the planning/evaluation stack and parameterized environments to facilitate more in-depth studies of quadrotor motion planning, available at //github.com/KumarRobotics/kr_mp_design
Previous efforts on reconfigurable analog circuits mostly focused on specialized analog circuits, produced through careful co-design, or on highly reconfigurable, but relatively resource inefficient, accelerators that implement analog compute paradigms. This work deals with an intermediate point in the design space: Specialized reconfigurable circuits for analog compute paradigms. This class of circuits requires new methodologies for performing co-design, as prior techniques are typically highly specialized to conventional circuit classes (e.g., filters, ADCs). In this context, we present Ark, a programming language for describing analog compute paradigms. Ark enables progressive incorporation of analog behaviors into computations, and deploys a validator and dynamical system compiler for verifying and simulating computations. We use Ark to codify the design space for three different exemplary circuit design problems, and demonstrate that Ark helps exploring design trade-offs and evaluating the impact of nonidealities to the computation.
Most autonomous navigation systems assume wheeled robots are rigid bodies and their 2D planar workspaces can be divided into free spaces and obstacles. However, recent wheeled mobility research, showing that wheeled platforms have the potential of moving over vertically challenging terrain (e.g., rocky outcroppings, rugged boulders, and fallen tree trunks), invalidate both assumptions. Navigating off-road vehicle chassis with long suspension travel and low tire pressure in places where the boundary between obstacles and free spaces is blurry requires precise 3D modeling of the interaction between the chassis and the terrain, which is complicated by suspension and tire deformation, varying tire-terrain friction, vehicle weight distribution and momentum, etc. In this paper, we present a learning approach to model wheeled mobility, i.e., in terms of vehicle-terrain forward dynamics, and plan feasible, stable, and efficient motion to drive over vertically challenging terrain without rolling over or getting stuck. We present physical experiments on two wheeled robots and show that planning using our learned model can achieve up to 60% improvement in navigation success rate and 46% reduction in unstable chassis roll and pitch angles.
Graph Neural Networks (GNNs) have received considerable attention on graph-structured data learning for a wide variety of tasks. The well-designed propagation mechanism which has been demonstrated effective is the most fundamental part of GNNs. Although most of GNNs basically follow a message passing manner, litter effort has been made to discover and analyze their essential relations. In this paper, we establish a surprising connection between different propagation mechanisms with a unified optimization problem, showing that despite the proliferation of various GNNs, in fact, their proposed propagation mechanisms are the optimal solution optimizing a feature fitting function over a wide class of graph kernels with a graph regularization term. Our proposed unified optimization framework, summarizing the commonalities between several of the most representative GNNs, not only provides a macroscopic view on surveying the relations between different GNNs, but also further opens up new opportunities for flexibly designing new GNNs. With the proposed framework, we discover that existing works usually utilize naive graph convolutional kernels for feature fitting function, and we further develop two novel objective functions considering adjustable graph kernels showing low-pass or high-pass filtering capabilities respectively. Moreover, we provide the convergence proofs and expressive power comparisons for the proposed models. Extensive experiments on benchmark datasets clearly show that the proposed GNNs not only outperform the state-of-the-art methods but also have good ability to alleviate over-smoothing, and further verify the feasibility for designing GNNs with our unified optimization framework.
Translational distance-based knowledge graph embedding has shown progressive improvements on the link prediction task, from TransE to the latest state-of-the-art RotatE. However, N-1, 1-N and N-N predictions still remain challenging. In this work, we propose a novel translational distance-based approach for knowledge graph link prediction. The proposed method includes two-folds, first we extend the RotatE from 2D complex domain to high dimension space with orthogonal transforms to model relations for better modeling capacity. Second, the graph context is explicitly modeled via two directed context representations. These context representations are used as part of the distance scoring function to measure the plausibility of the triples during training and inference. The proposed approach effectively improves prediction accuracy on the difficult N-1, 1-N and N-N cases for knowledge graph link prediction task. The experimental results show that it achieves better performance on two benchmark data sets compared to the baseline RotatE, especially on data set (FB15k-237) with many high in-degree connection nodes.
Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. While many different methods have been proposed, there is a lack of a unifying framework that would lead to state-of-the-art results. Here we develop PathCon, a knowledge graph completion method that harnesses four novel insights to outperform existing methods. PathCon predicts relations between a pair of entities by: (1) Considering the Relational Context of each entity by capturing the relation types adjacent to the entity and modeled through a novel edge-based message passing scheme; (2) Considering the Relational Paths capturing all paths between the two entities; And, (3) adaptively integrating the Relational Context and Relational Path through a learnable attention mechanism. Importantly, (4) in contrast to conventional node-based representations, PathCon represents context and path only using the relation types, which makes it applicable in an inductive setting. Experimental results on knowledge graph benchmarks as well as our newly proposed dataset show that PathCon outperforms state-of-the-art knowledge graph completion methods by a large margin. Finally, PathCon is able to provide interpretable explanations by identifying relations that provide the context and paths that are important for a given predicted relation.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.