Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, significantly degrades on the High Dimension Low Sample Size (HDLSS) data for the ill-posed inverse problem. Existing approaches for HDLSS data classification typically assume the data in question are with Gaussian distribution and deal the HDLSS classification problem with regularization. However, these assumptions are too strict to hold in many emerging real-life applications, such as enabling personalized predictive analysis using Electronic Health Records (EHRs) data collected from an extremely limited number of patients who have been diagnosed with or without the target disease for prediction. In this paper, we revised the problem of predictive analysis of disease using personal EHR data and LDA classifier. To fill the gap, in this paper, we first studied an analytical model that understands the accuracy of LDA for classifying data with arbitrary distribution. The model gives a theoretical upper bound of LDA error rate that is controlled by two factors: (1) the statistical convergence rate of (inverse) covariance matrix estimators and (2) the divergence of the training/testing datasets to fitted distributions. To this end, we could lower the error rate by balancing the two factors for better classification performance. Hereby, we further proposed a novel LDA classifier De-Sparse that leverages De-sparsified Graphical Lasso to improve the estimation of LDA, which outperforms state-of-the-art LDA approaches developed for HDLSS data. Such advances and effectiveness are further demonstrated by both theoretical analysis and extensive experiments on EHR datasets.
Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models to encode documents fully, they are in general limited to English and few other high-resourced languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare input token number truncation, sentence averaging as well as some simple windowing and in some cases new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
The Naive Bayesian classifier is a popular classification method employing the Bayesian paradigm. The concept of having conditional dependence among input variables sounds good in theory but can lead to a majority vote style behaviour. Achieving conditional independence is often difficult, and they introduce decision biases in the estimates. In Naive Bayes, certain features are called independent features as they have no conditional correlation or dependency when predicting a classification. In this paper, we focus on the optimal partition of features by proposing a novel technique called the Comonotone-Independence Classifier (CIBer) which is able to overcome the challenges posed by the Naive Bayes method. For different datasets, we clearly demonstrate the efficacy of our technique, where we achieve lower error rates and higher or equivalent accuracy compared to models such as Random Forests and XGBoost.
Sparse principal component analysis (SPCA) is widely used for dimensionality reduction and feature extraction in high-dimensional data analysis. Despite many methodological and theoretical developments in the past two decades, the theoretical guarantees of the popular SPCA algorithm proposed by Zou, Hastie & Tibshirani (2006) are still unknown. This paper aims to address this critical gap. We first revisit the SPCA algorithm of Zou et al. (2006) and present our implementation. We also study a computationally more efficient variant of the SPCA algorithm in Zou et al. (2006) that can be considered as the limiting case of SPCA. We provide the guarantees of convergence to a stationary point for both algorithms and prove that, under a sparse spiked covariance model, both algorithms can recover the principal subspace consistently under mild regularity conditions. We show that their estimation error bounds match the best available bounds of existing works or the minimax rates up to some logarithmic factors. Moreover, we demonstrate the competitive numerical performance of both algorithms in numerical studies.
We present RETA (Relative Timing Analysis), a differential timing analysis technique to verify the impact of an update on the execution time of embedded software. Timing analysis is computationally expensive and labor intensive. Software updates render repeating the analysis from scratch a waste of resources and time, because their impact is inherently confined. To determine this boundary, in RETA we apply a slicing procedure that identifies all relevant code segments and a statement categorization that determines how to analyze each such line of code. We adapt a subset of RETA for integration into aiT, an industrial timing analysis tool, and also develop a complete implementation in a tool called DELTA. Based on staple benchmarks and realistic code updates from official repositories, we test the accuracy by analyzing the worst-case execution time (WCET) before and after an update, comparing the measures with the use of the unmodified aiT as well as real executions on embedded hardware. DELTA returns WCET information that ranges from exactly the WCET of real hardware to 148% of the new version's measured WCET. With the same benchmarks, the unmodified aiT estimates are 112% and 149% of the actual executions; therefore, even when DELTA is pessimistic, an industry-strength tool such as aiT cannot do better. Crucially, we also show that RETA decreases aiT's analysis time by 45% and its memory consumption by 8.9%, whereas removing RETA from DELTA, effectively rendering it a regular timing analysis tool, increases its analysis time by 27%.
We present a computationally-efficient strategy to initialise the hyperparameters of a Gaussian process (GP) avoiding the computation of the likelihood function. Our strategy can be used as a pretraining stage to find initial conditions for maximum-likelihood (ML) training, or as a standalone method to compute hyperparameters values to be plugged in directly into the GP model. Motivated by the fact that training a GP via ML is equivalent (on average) to minimising the KL-divergence between the true and learnt model, we set to explore different metrics/divergences among GPs that are computationally inexpensive and provide hyperparameter values that are close to those found via ML. In practice, we identify the GP hyperparameters by projecting the empirical covariance or (Fourier) power spectrum onto a parametric family, thus proposing and studying various measures of discrepancy operating on the temporal and frequency domains. Our contribution extends the variogram method developed by the geostatistics literature and, accordingly, it is referred to as the generalised variogram method (GVM). In addition to the theoretical presentation of GVM, we provide experimental validation in terms of accuracy, consistency with ML and computational complexity for different kernels using synthetic and real-world data.
Network structures underlie the dynamics of many complex phenomena, from gene regulation and foodwebs to power grids and social media. Yet, as they often cannot be observed directly, their connectivities must be inferred from observations of their emergent dynamics. In this work we present a powerful computational method to infer large network adjacency matrices from time series data using a neural network, in order to provide uncertainty quantification on the prediction in a manner that reflects both the non-convexity of the inference problem as well as the noise on the data. This is useful since network inference problems are typically underdetermined, and a feature that has hitherto been lacking from such methods. We demonstrate our method's capabilities by inferring line failure locations in the British power grid from its response to a power cut. Since the problem is underdetermined, many classical statistical tools (e.g. regression) will not be straightforwardly applicable. Our method, in contrast, provides probability densities on each edge, allowing the use of hypothesis testing to make meaningful probabilistic statements about the location of the power cut. We also demonstrate our method's ability to learn an entire cost matrix for a non-linear model of economic activity in Greater London. Our method outperforms OLS regression on noisy data in terms of both speed and prediction accuracy, and scales as $N^2$ where OLS is cubic. Not having been specifically engineered for network inference, our method represents a general parameter estimation scheme that is applicable to any parameter dimension.
High stakes classification refers to classification problems where erroneously predicting the wrong class is very bad, but assigning "unknown" is acceptable. We make the argument that these problems require us to give multiple unknown classes, to get the most information out of our analysis. With imperfect data we refer to covariates with a large number of missing values, large noise variance, and some errors in the data. The combination of high stakes classification and imperfect data is very common in practice, but it is very difficult to work on using current methods. We present a one-class classifier (OCC) to solve this problem, and we call it NBP. The classifier is based on Naive Bayes, simple to implement, and interpretable. We show that NBP gives both good predictive performance, and works for high stakes classification based on imperfect data. The model we present is quite simple; it is just an OCC based on density estimation. However, we have always felt a big gap between the applied classification problems we have worked on and the theory and models we use for classification, and this model closes that gap. Our main contribution is the motivation for why this model is a good approach, and we hope that this paper will inspire further development down this path.
Neural operator architectures approximate operators between infinite-dimensional Banach spaces of functions. They are gaining increased attention in computational science and engineering, due to their potential both to accelerate traditional numerical methods and to enable data-driven discovery. A popular variant of neural operators is the Fourier neural operator (FNO). Previous analysis proving universal operator approximation theorems for FNOs resorts to use of an unbounded number of Fourier modes and limits the basic form of the method to problems with periodic geometry. Prior work relies on intuition from traditional numerical methods, and interprets the FNO as a nonstandard and highly nonlinear spectral method. The present work challenges this point of view in two ways: (i) the work introduces a new broad class of operator approximators, termed nonlocal neural operators (NNOs), which allow for operator approximation between functions defined on arbitrary geometries, and includes the FNO as a special case; and (ii) analysis of the NNOs shows that, provided this architecture includes computation of a spatial average (corresponding to retaining only a single Fourier mode in the special case of the FNO) it benefits from universal approximation. It is demonstrated that this theoretical result unifies the analysis of a wide range of neural operator architectures. Furthermore, it sheds new light on the role of nonlocality, and its interaction with nonlinearity, thereby paving the way for a more systematic exploration of nonlocality, both through the development of new operator learning architectures and the analysis of existing and new architectures.
Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach the fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from an observation that energy-based models exhibit free energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically reveal that a simple yet efficient energy-based sampling strategy sheds light on selecting the most valuable target samples than existing approaches requiring particular architectures or computation of the distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of targe data that incorporate both domain characteristic and instance uncertainty into every selection round. Meanwhile, by aligning the free energy of target data compact around the source domain via a regularization term, domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at //github.com/BIT-DA/EADA.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.