Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
Due to the high human cost of annotation, it is non-trivial to curate a large-scale medical dataset that is fully labeled for all classes of interest. Instead, it would be convenient to collect multiple small partially labeled datasets from different matching sources, where the medical images may have only been annotated for a subset of classes of interest. This paper offers an empirical understanding of an under-explored problem, namely partially supervised multi-label classification (PSMLC), where a multi-label classifier is trained with only partially labeled medical images. In contrast to the fully supervised counterpart, the partial supervision caused by medical data scarcity has non-trivial negative impacts on the model performance. A potential remedy could be augmenting the partial labels. Though vicinal risk minimization (VRM) has been a promising solution to improve the generalization ability of the model, its application to PSMLC remains an open question. To bridge the methodological gap, we provide the first VRM-based solution to PSMLC. The empirical results also provide insights into future research directions on partially supervised learning under data scarcity.
The problem of Byzantine consensus has been key to designing secure distributed systems. However, it is particularly difficult, mainly due to the presence of Byzantine processes that act arbitrarily and the unknown message delays in general networks. Although it is well known that both safety and liveness are at risk as soon as $n/3$ Byzantine processes fail, very few works attempted to characterize precisely the faults that produce safety violations from the faults that produce termination violations. In this paper, we present a new lower bound on the solvability of the consensus problem by distinguishing deceitful faults violating safety and benign faults violating termination from the more general Byzantine faults, in what we call the Byzantine-deceitful-benign fault model. We show that one cannot solve consensus if $n\leq 3t+d+2q$ with $t$ Byzantine processes, $d$ deceitful processes, and $q$ benign processes. In addition, we show that this bound is tight by presenting the Basilic class of consensus protocols that solve consensus when $n > 3t+d+2q$. These protocols differ in the number of processes from which they wait to receive messages before progressing. Each of these protocols is thus better suited for some applications depending on the predominance of benign or deceitful faults. Finally, we study the fault tolerance of the Basilic class of consensus protocols in the context of blockchains that need to solve the weaker problem of eventual consensus. We demonstrate that Basilic solves this problem with only $n > 2t+d+q$, hence demonstrating how it can strengthen blockchain security.
The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data that may not be sparse; however, existing results depend on the independent setting of samples. In this study, we analyze a linear regression model with dependent time series data under over-parameterization settings. We consider an estimator via interpolation and developed a theory for excess risk of the estimator under multiple dependence types. This theory can treat infinite-dimensional data without sparsity and handle long-memory processes in a unified manner. Moreover, we bound the risk in our theory via the integrated covariance and nondegeneracy of autocorrelation matrices. The results show that the convergence rate of risks with short-memory processes is identical to that of cases with independent data, while long-memory processes slow the convergence rate. We also present several examples of specific dependent processes that can be applied to our setting.
We provide a decision theoretic analysis of bandit experiments. The setting corresponds to a dynamic programming problem, but solving this directly is typically infeasible. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for bandit experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation (PDE). Using a limit of experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distribution of the rewards. The approach further describes the state variables it is asymptotically sufficient to restrict attention to, and therefore suggests a practical strategy for dimension reduction. The upshot is that we can approximate the dynamic programming problem defining the bandit experiment with a PDE which can be efficiently solved using sparse matrix routines. We derive the optimal Bayes and minimax policies from the numerical solutions to these equations. The proposed policies substantially dominate existing methods such as Thompson sampling. The framework also allows for substantial generalizations to the bandit problem such as time discounting and pure exploration motives.
This paper proposes a numerical method based on the Adomian decomposition approach for the time discretization, applied to Euler equations. A recursive property is demonstrated that allows to formulate the method in an appropriate and efficient way. To obtain a fully numerical scheme, the space discretization is achieved using the classical DG techniques. The efficiency of the obtained numerical scheme is demonstrated through numerical tests by comparison to exact solution and the popular Runge-Kutta DG method results.
We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD) recently proposed for the approximate solution of empirical risk minimization problems (ERM). We focus on loss functions that are strongly convex and smooth with condition number $\sigma$. We additionally assume the loss function is $\eta$-H\"older continuous with respect to the data. The oracle complexity of LPI-GD is $\tilde{O}\left(\sigma m^d \log(1/\varepsilon)\right)$ for a desired accuracy $\varepsilon$, where $d$ is the dimension of the parameter space, and $m$ is the cardinality of an approximation grid. The factor $m^d$ can be shown to scale as $O((1/\varepsilon)^{d/2\eta})$. LPI-GD has been shown to have better oracle complexity than gradient descent (GD) and stochastic gradient descent (SGD) for certain parameter regimes. We propose two accelerated methods for the ERM problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma} m^d \log(1/\varepsilon)\right)$. Moreover, we provide the first empirical study on local polynomial interpolation-based gradient methods and corroborate that LPI-GD has better performance than GD and SGD in some scenarios, and the proposed methods achieve acceleration.
One of the most important problems in system identification and statistics is how to estimate the unknown parameters of a given model. Optimization methods and specialized procedures, such as Empirical Minimization (EM) can be used in case the likelihood function can be computed. For situations where one can only simulate from a parametric model, but the likelihood is difficult or impossible to evaluate, a technique known as the Two-Stage (TS) Approach can be applied to obtain reliable parametric estimates. Unfortunately, there is currently a lack of theoretical justification for TS. In this paper, we propose a statistical decision-theoretical derivation of TS, which leads to Bayesian and Minimax estimators. We also show how to apply the TS approach on models for independent and identically distributed samples, by computing quantiles of the data as a first step, and using a linear function as the second stage. The proposed method is illustrated via numerical simulations.
There are many important high dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust in the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs $(\mathcal{A},\mathcal{T})$, such that if the distribution on examples in the data passes the tester $\mathcal{T}$ then one can safely trust the output of the agnostic learner $\mathcal{A}$ on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of $n^{\tilde{O}(1/\epsilon^4)}$. This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite sample Gaussian distribution testers do not exist for the $L_1$ and EMD distance measures. A key step in the analysis is a novel characterization of concentration and anti-concentration properties of a distribution whose low-degree moments approximately match those of a Gaussian. We also use tools from polynomial approximation theory. In contrast, we show strong lower bounds on the combined run-times of tester-learner pairs for the problems of agnostically learning convex sets under the Gaussian distribution and for monotone Boolean functions under the uniform distribution over $\{0,1\}^n$. Through these lower bounds we exhibit natural problems where there is a dramatic gap between standard agnostic learning run-time and the run-time of the best tester-learner pair.
Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data over regularizing the objective to limiting the amount data used to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divided more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).
Time Series Classification (TSC) is an important and challenging problem in data mining. With the increase of time series data availability, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) to perform this task. This is surprising as deep learning has seen very successful applications in the last years. DNNs have indeed revolutionized the field of computer vision especially with the advent of novel deeper architectures such as Residual and Convolutional Neural Networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community where we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date.