For the convolutional neural network (CNN) used for pattern classification, the training loss function is usually applied to the final output of the network, except for some regularization constraints on the network parameters. However, with the increasing of the number of network layers, the influence of the loss function on the network front layers gradually decreases, and the network parameters tend to fall into local optimization. At the same time, it is found that the trained network has significant information redundancy at all stages of features, which reduces the effectiveness of feature mapping at all stages and is not conducive to the change of the subsequent parameters of the network in the direction of optimality. Therefore, it is possible to obtain a more optimized solution of the network and further improve the classification accuracy of the network by designing a loss function for restraining the front stage features and eliminating the information redundancy of the front stage features .For CNN, this article proposes a multi-stage feature decorrelation loss (MFD Loss), which refines effective features and eliminates information redundancy by constraining the correlation of features at all stages. Considering that there are many layers in CNN, through experimental comparison and analysis, MFD Loss acts on multiple front layers of CNN, constrains the output features of each layer and each channel, and performs supervision training jointly with classification loss function during network training. Compared with the single Softmax Loss supervised learning, the experiments on several commonly used datasets on several typical CNNs prove that the classification performance of Softmax Loss+MFD Loss is significantly better. Meanwhile, the comparison experiments before and after the combination of MFD Loss and some other typical loss functions verify its good universality.
We consider several basic questions on distributed routing in directed graphs with multiple additive costs, or metrics, and multiple constraints. Distributed routing in this sense is used in several protocols, such as IS-IS and OSPF. A practical approach to the multi-constraint routing problem is to, first, combine the metrics into a single `composite' metric, and then apply one-to-all shortest path algorithms, e.g. Dijkstra, in order to find shortest path trees. We show that, in general, even if a feasible path exists and is known for every source and destination pair, it is impossible to guarantee a distributed routing under several constraints. We also study the question of choosing the optimal `composite' metric. We show that under certain mathematical assumptions we can efficiently find a convex combination of several metrics that maximizes the number of discovered feasible paths. Sometimes it can be done analytically, and is in general possible using what we call a 'smart iterative approach'. We illustrate these findings by extensive experiments on several typical network topologies.
Bayesian cross-validation (CV) is a popular method for predictive model assessment that is simple to implement and broadly applicable. A wide range of CV schemes is available for time series applications, including generic leave-one-out (LOO) and K-fold methods, as well as specialized approaches intended to deal with serial dependence such as leave-future-out (LFO), h-block, and hv-block. Existing large-sample results show that both specialized and generic methods are applicable to models of serially-dependent data. However, large sample consistency results overlook the impact of sampling variability on accuracy in finite samples. Moreover, the accuracy of a CV scheme depends on many aspects of the procedure. We show that poor design choices can lead to elevated rates of adverse selection. In this paper, we consider the problem of identifying the regression component of an important class of models of data with serial dependence, autoregressions of order p with q exogenous regressors (ARX(p,q)), under the logarithmic scoring rule. We show that when serial dependence is present, scores computed using the joint (multivariate) density have lower variance and better model selection accuracy than the popular pointwise estimator. In addition, we present a detailed case study of the special case of ARX models with fixed autoregressive structure and variance. For this class, we derive the finite-sample distribution of the CV estimators and the model selection statistic. We conclude with recommendations for practitioners.
Markov chain Monte Carlo (MCMC) is a commonly used method for approximating expectations with respect to probability distributions. Uncertainty assessment for MCMC estimators is essential in practical applications. Moreover, for multivariate functions of a Markov chain, it is important to estimate not only the auto-correlation for each component but also to estimate cross-correlations, in order to better assess sample quality, improve estimates of effective sample size, and use more effective stopping rules. Berg and Song [2022] introduced the moment least squares (momentLS) estimator, a shape-constrained estimator for the autocovariance sequence from a reversible Markov chain, for univariate functions of the Markov chain. Based on this sequence estimator, they proposed an estimator of the asymptotic variance of the sample mean from MCMC samples. In this study, we propose novel autocovariance sequence and asymptotic variance estimators for Markov chain functions with multiple components, based on the univariate momentLS estimators from Berg and Song [2022]. We demonstrate strong consistency of the proposed auto(cross)-covariance sequence and asymptotic variance matrix estimators. We conduct empirical comparisons of our method with other state-of-the-art approaches on simulated and real-data examples, using popular samplers including the random-walk Metropolis sampler and the No-U-Turn sampler from STAN.
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol that allows for the integration of task-specific tokens into the pre-trained model's tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.
Quantum adversarial machine learning is an emerging field that studies the vulnerability of quantum learning systems against adversarial perturbations and develops possible defense strategies. Quantum universal adversarial perturbations are small perturbations, which can make different input samples into adversarial examples that may deceive a given quantum classifier. This is a field that was rarely looked into but worthwhile investigating because universal perturbations might simplify malicious attacks to a large extent, causing unexpected devastation to quantum machine learning models. In this paper, we take a step forward and explore the quantum universal perturbations in the context of heterogeneous classification tasks. In particular, we find that quantum classifiers that achieve almost state-of-the-art accuracy on two different classification tasks can be both conclusively deceived by one carefully-crafted universal perturbation. This result is explicitly demonstrated with well-designed quantum continual learning models with elastic weight consolidation method to avoid catastrophic forgetting, as well as real-life heterogeneous datasets from hand-written digits and medical MRI images. Our results provide a simple and efficient way to generate universal perturbations on heterogeneous classification tasks and thus would provide valuable guidance for future quantum learning technologies.
In clinical trials of longitudinal continuous outcomes, reference based imputation (RBI) has commonly been applied to handle missing outcome data in settings where the estimand incorporates the effects of intercurrent events, e.g. treatment discontinuation. RBI was originally developed in the multiple imputation framework, however recently conditional mean imputation (CMI) combined with the jackknife estimator of the standard error was proposed as a way to obtain deterministic treatment effect estimates and correct frequentist inference. For both multiple and CMI, a mixed model for repeated measures (MMRM) is often used for the imputation model, but this can be computationally intensive to fit to multiple data sets (e.g. the jackknife samples) and lead to convergence issues with complex MMRM models with many parameters. Therefore, a step-wise approach based on sequential linear regression (SLR) of the outcomes at each visit was developed for the imputation model in the multiple imputation framework, but similar developments in the CMI framework are lacking. In this article, we fill this gap in the literature by proposing a SLR approach to implement RBI in the CMI framework, and justify its validity using theoretical results and simulations. We also illustrate our proposal on a real data application.
With observational data alone, causal structure learning is a challenging problem. The task becomes easier when having access to data collected from perturbations of the underlying system, even when the nature of these is unknown. Existing methods either do not allow for the presence of latent variables or assume that these remain unperturbed. However, these assumptions are hard to justify if the nature of the perturbations is unknown. We provide results that enable scoring causal structures in the setting with additive, but unknown interventions. Specifically, we propose a maximum-likelihood estimator in a structural equation model that exploits system-wide invariances to output an equivalence class of causal structures from perturbation data. Furthermore, under certain structural assumptions on the population model, we provide a simple graphical characterization of all the DAGs in the interventional equivalence class. We illustrate the utility of our framework on synthetic data as well as real data involving California reservoirs and protein expressions. The software implementation is available as the Python package \emph{utlvce}.
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
We address the fundamental limits of learning unknown parameters of any stochastic process from time-series data, and discover exact closed-form expressions for how optimal inference scales with observation length. Given a parametrized class of candidate models, the Fisher information of observed sequence probabilities lower-bounds the variance in model estimation from finite data. As sequence-length increases, the minimal variance scales as the square inverse of the length -- with constant coefficient given by the information rate. We discover a simple closed-form expression for this information rate, even in the case of infinite Markov order. We furthermore obtain the exact analytic lower bound on model variance from the observation-induced metadynamic among belief states. We discover ephemeral, exponential, and more general modes of convergence to the asymptotic information rate. Surprisingly, this myopic information rate converges to the asymptotic Fisher information rate with exactly the same relaxation timescales that appear in the myopic entropy rate as it converges to the Shannon entropy rate for the process. We illustrate these results with a sequence of examples that highlight qualitatively distinct features of stochastic processes that shape optimal learning.
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, with endeavours to extend this knowledge without targeting the original task resulting in a catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern 1) a taxonomy and extensive overview of the state-of-the-art, 2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner, 3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks, considering Tiny Imagenet and large-scale unbalanced iNaturalist and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time, and storage.