Class imbalance arises in many classification problems, and because most classifiers are trained to maximize overall accuracy, imbalanced class distributions can lead to classification challenges in which the minority classes carry higher misclassification costs. The Backblaze dataset, a widely used hard-disk reliability dataset, contains a small amount of failure data and a large amount of healthy-drive data, exhibiting severe class imbalance. This paper provides a comprehensive overview of research on imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithm-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. We also discuss the challenges of imbalanced data classification and strategies to address them, so that researchers can choose the appropriate method for their needs.
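As a concrete illustration of the first two categories, here is a minimal sketch (not tied to the Backblaze data; all variable names are illustrative) of a data-level remedy, random oversampling of the minority class, and an algorithm-level remedy, cost-sensitive class weighting, using NumPy and scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced data: 1000 "healthy" samples vs. 20 "failure" samples.
X_maj = rng.normal(0.0, 1.0, size=(1000, 5))
X_min = rng.normal(1.5, 1.0, size=(20, 5))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 1000 + [1] * 20)

# Data-level method: random oversampling of the minority class until balanced.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=(y == 0).sum() - len(minority_idx), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Algorithm-level method: cost-sensitive learning via class weights.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)
clf_oversampled = LogisticRegression().fit(X_bal, y_bal)
```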
Achievability in information theory refers to demonstrating a coding strategy that accomplishes a prescribed performance benchmark for the underlying task. In quantum information theory, the Hayashi-Nagaoka operator inequality is an essential technique for proving a wealth of one-shot achievability bounds, since it effectively resembles a union bound in various problems. In this work, we show that the pretty-good measurement naturally plays the role of a union bound as well. A judicious application of it considerably simplifies the derivation of one-shot achievability for classical-quantum (c-q) channel coding via an elegant three-line proof. The proposed analysis enjoys the following favorable features. (i) The established one-shot bound admits a closed-form expression as in the celebrated Holevo-Helstrom Theorem. Namely, the error probability of sending $M$ messages through a c-q channel is upper bounded by the minimum error of distinguishing the joint channel input-output state against $(M-1)$ decoupled product states. (ii) Our bound directly yields asymptotic results in the large deviation, small deviation, and moderate deviation regimes in a unified manner. (iii) The coefficients incurred in applying the Hayashi-Nagaoka operator inequality are no longer needed. Hence, the derived one-shot bound sharpens existing results relying on the Hayashi-Nagaoka operator inequality. In particular, we obtain the tightest achievable $\epsilon$-one-shot capacity for c-q channel coding to date, improving the third-order coding rate in the asymptotic scenario. (iv) Our result holds for infinite-dimensional Hilbert spaces. (v) The proposed method applies to deriving one-shot achievability for classical data compression with quantum side information, entanglement-assisted classical communication over quantum channels, and various quantum network information-processing protocols.
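Schematically, and with notation assumed here rather than taken from the paper ($\rho_{XB}$ for the joint channel input-output state, $\rho_X \otimes \rho_B$ for its decoupled product, and $\varepsilon^*(M)$ for the optimal average error of sending $M$ messages), the closed-form bound in (i) can be read as the minimum error of a weighted binary hypothesis test:

\[
\varepsilon^*(M) \;\le\; \min_{0 \le \Pi \le \mathbb{1}} \Big\{ \operatorname{Tr}\!\big[(\mathbb{1}-\Pi)\,\rho_{XB}\big] \;+\; (M-1)\operatorname{Tr}\!\big[\Pi\,(\rho_X\otimes\rho_B)\big] \Big\},
\]

which, by the Holevo-Helstrom theorem, admits the closed form $\tfrac{1}{2}\big(M - \|\rho_{XB} - (M-1)\,\rho_X\otimes\rho_B\|_1\big)$.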
Although robust statistical estimators are less affected by outlying observations, their computation is usually more challenging. This is particularly the case in high-dimensional sparse settings. The availability of new optimization procedures, mainly developed in the computer science domain, offers new possibilities for the field of robust statistics. This paper investigates how such procedures can be used for robust sparse association estimators. The problem can be split into a robust estimation step followed by an optimization for the remaining decoupled, (bi-)convex problem. A combination of the augmented Lagrangian algorithm and adaptive gradient descent is implemented to also include suitable constraints for inducing sparsity. We provide results concerning the precision of the algorithm and show the advantages over existing algorithms in this context. High-dimensional empirical examples underline the usefulness of this procedure. Extensions to other robust sparse estimators are possible.
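The decoupled optimization step can be pictured with a generic sketch (not the paper's exact algorithm; all function and variable names are illustrative) of adaptive gradient descent combined with a soft-thresholding proximal step to induce sparsity:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of the L1 norm: shrinks entries toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_adagrad(grad_fn, x0, lam=0.1, lr=0.5, n_iter=500, eps=1e-8):
    """Adaptive gradient descent with an L1 proximal step (illustrative only)."""
    x = x0.copy()
    g2 = np.zeros_like(x)                 # running sum of squared gradients
    for _ in range(n_iter):
        g = grad_fn(x)
        g2 += g ** 2
        step = lr / (np.sqrt(g2) + eps)   # per-coordinate adaptive step size
        x = soft_threshold(x - step * g, step * lam)
    return x

# Toy usage: sparse least squares  min_x 0.5*||A x - b||^2 + lam*||x||_1
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.1 * rng.normal(size=50)
x_hat = sparse_adagrad(lambda x: A.T @ (A @ x - b), np.zeros(20), lam=2.0)
```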
Previously, non-autoregressive models were widely perceived as being superior in generation efficiency but inferior in generation quality due to the difficulties of modeling multiple target modalities. To enhance the multi-modality modeling ability, we propose the diffusion glancing transformer (DIFFGLAT), which employs a modality diffusion process and residual glancing sampling. The modality diffusion process is a discrete process that interpolates the multi-modal distribution along the decoding steps, while the residual glancing sampling approach guides the model to continuously learn the remaining modalities across the layers. Experimental results on various machine translation and text generation benchmarks demonstrate that DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed compared with both autoregressive and non-autoregressive models.
While data selection methods have been studied extensively in active learning, data pruning, and data augmentation settings, there is little evidence for the efficacy of these methods in industry-scale settings, particularly in low-resource languages. Our work presents ways of assessing prospective training examples in those settings for their "usefulness" or "difficulty". We also demonstrate how these measures can be used in selecting important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high-quality datasets from a large pool of \textit{Weak Signal Labeled} data, in which high-confidence no-defect hypotheses produced during inference are used as ground-truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and a 4%-7% decrease in domain classification error rate when compared to the baseline technique of random selection.
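For concreteness, here is a minimal sketch (names illustrative) of the two scoring functions, assuming EL2N is computed as the L2 norm of the difference between the model's softmax output and the one-hot label, and entropy as the predictive entropy of the softmax distribution:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_score(logits):
    """Predictive entropy per example: higher means more uncertain."""
    p = softmax(logits)
    return -np.sum(p * np.log(p + 1e-12), axis=1)

def el2n_score(logits, labels, num_classes):
    """Error L2-Norm: ||softmax(logits) - one_hot(label)||_2 per example."""
    p = softmax(logits)
    one_hot = np.eye(num_classes)[labels]
    return np.linalg.norm(p - one_hot, axis=1)

def select_top_k(scores, k):
    """Pick the k highest-scoring examples for training data augmentation."""
    return np.argsort(scores)[::-1][:k]
```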
When only a few data samples are accessible, utilizing structural prior knowledge is essential for estimating covariance matrices and their inverses. One prominent example is knowing the covariance matrix to be Toeplitz structured, which occurs when dealing with wide-sense stationary (WSS) processes. This work introduces a novel class of positive-definiteness-ensuring likelihood-based estimators for Toeplitz structured covariance matrices (CMs) and their inverses. To accomplish this, we derive positive-definiteness-enforcing constraint sets for the Gohberg-Semencul (GS) parameterization of inverse symmetric Toeplitz matrices. Motivated by the relationship between the GS parameterization and autoregressive (AR) processes, we propose hyperparameter tuning techniques that enable our estimators to combine advantages of state-of-the-art likelihood and non-parametric estimators. Moreover, we present a computationally cheap closed-form estimator derived by maximizing an approximate likelihood. Due to the ensured positive definiteness, our estimators perform well for the estimation of both the CM and the inverse covariance matrix (ICM). Extensive simulation results validate the proposed estimators' efficacy for several standard Toeplitz structured CMs commonly employed in a wide range of applications.
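For reference, one common Hermitian-case form of the GS parameterization (written here with assumed notation, linear-prediction/AR coefficients $a_1,\dots,a_{n-1}$ and prediction-error variance $\sigma^2$; the paper's exact parameterization may differ) expresses the inverse of an $n \times n$ Toeplitz covariance matrix $\mathbf{C}$ as

\[
\mathbf{C}^{-1} \;=\; \frac{1}{\sigma^2}\left(\mathbf{A}\mathbf{A}^{\mathrm{H}} - \mathbf{B}\mathbf{B}^{\mathrm{H}}\right),
\]

where $\mathbf{A}$ is the lower-triangular Toeplitz matrix with first column $(1, a_1, \dots, a_{n-1})^{\mathrm{T}}$ and $\mathbf{B}$ is the lower-triangular Toeplitz matrix with first column $(0, a_{n-1}^*, \dots, a_1^*)^{\mathrm{T}}$; positivity of $\sigma^2$ together with suitable constraints on the $a_k$ is what guarantees positive definiteness.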
I propose a new identification-robust test for the structural parameter in a heteroskedastic linear instrumental variables model. The proposed test statistic is similar in spirit to a jackknife version of the K-statistic and the resulting test has exact asymptotic size so long as an auxiliary parameter can be consistently estimated. This is possible under approximate sparsity even when the number of instruments is much larger than the sample size. As the number of instruments is allowed, but not required, to be large, the limiting behavior of the test statistic is difficult to examine via existing central limit theorems. Instead, I derive the asymptotic chi-squared distribution of the test statistic using a direct Gaussian approximation technique. To improve power against certain alternatives, I propose a simple combination with the sup-score statistic of Belloni et al. (2012) based on a thresholding rule. I demonstrate favorable size control and power properties in a simulation study and apply the new methods to revisit the effect of social spillovers in movie consumption.
The integration of experimental data into mathematical and computational models is crucial for enhancing their predictive power in real-world scenarios. However, the performance of data assimilation algorithms can be significantly degraded when measurements are corrupted by biased noise that alters the signal magnitude, or when the system dynamics lack smoothness, such as in the presence of fast oscillations or discontinuities. This paper focuses on variational state estimation using the so-called Parameterized Background Data Weak (PBDW) method, which combines a parameterized, reduced-order background model with a set of measurement constraints: the state estimate is obtained by solving a minimization problem over the reduced-order background, subject to the constraints imposed by the input measurements. To address biased noise in the observations, a modified formulation incorporating a correction mechanism is proposed; rapid oscillations are handled by treating them as slowly decaying modes, based on a two-scale splitting of the classical reconstruction algorithm. The effectiveness of the proposed algorithms is demonstrated through various examples, including discontinuous signals and simulated Doppler ultrasound data.
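As background, the standard PBDW estimation problem (on which the modified formulations build) can be stated schematically; the notation below is assumed here, with $\mathcal{Z}_N$ a reduced-order background space, $\ell_m$ the measurement functionals, and $y_m^{\mathrm{obs}}$ the observed data:

\[
(z^\star,\eta^\star) \;=\; \operatorname*{arg\,min}_{z\in\mathcal{Z}_N,\ \eta\in\mathcal{U}}\ \|\eta\|_{\mathcal{U}}^2
\quad\text{subject to}\quad \ell_m(z+\eta) = y_m^{\mathrm{obs}},\quad m=1,\dots,M,
\qquad u^\star = z^\star + \eta^\star,
\]

so that the background $z^\star$ captures the model content while the update $\eta^\star$ accommodates what the measurements observe but the background cannot represent.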
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
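As a concrete illustration of the basic primitive behind these low-precision formats, here is a minimal sketch (names illustrative, not tied to any particular framework) of uniform affine quantization to signed b-bit integers and the corresponding dequantization:

```python
import numpy as np

def quantize(x, num_bits=4):
    """Uniform affine quantization of a float array to signed b-bit integers."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    scale = max(scale, 1e-12)                       # guard against constant inputs
    zero_point = int(np.round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate real values."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(256).astype(np.float32)
q, s, z = quantize(w, num_bits=4)
w_hat = dequantize(q, s, z)
print("max abs quantization error:", np.abs(w - w_hat).max())
```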
Named entity recognition (NER) is the task of identifying text spans that mention named entities and classifying them into predefined categories such as person, location, and organization. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems were successful in producing decent recognition accuracy, they often required considerable human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods that apply recent deep learning techniques to new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.
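To make the three-axis taxonomy concrete, here is a toy PyTorch sketch (not taken from the survey; class and parameter names are illustrative) mapping each axis to a module: word embeddings as the distributed input representation, a bidirectional LSTM as the context encoder, and a per-token linear softmax layer as the tag decoder (a CRF would be a common alternative decoder):

```python
import torch
import torch.nn as nn

class ToyNERTagger(nn.Module):
    """Toy tagger illustrating the taxonomy: input representation,
    context encoder, and tag decoder."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        # Axis 1: distributed representation for input (word embeddings).
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Axis 2: context encoder (bidirectional LSTM).
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Axis 3: tag decoder (per-token linear layer; softmax/CRF on top).
        self.decoder = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (batch, seq, emb_dim)
        h, _ = self.encoder(x)         # (batch, seq, 2 * hidden_dim)
        return self.decoder(h)         # per-token tag logits

model = ToyNERTagger(vocab_size=10000, num_tags=9)
logits = model(torch.randint(0, 10000, (2, 12)))   # shape (2, 12, 9)
```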
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks, and develop a unified framework called Scientific Information Extractor (SciIE) with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.