One of the most important problems in system identification and statistics is how to estimate the unknown parameters of a given model. Optimization methods and specialized procedures, such as Empirical Minimization (EM) can be used in case the likelihood function can be computed. For situations where one can only simulate from a parametric model, but the likelihood is difficult or impossible to evaluate, a technique known as the Two-Stage (TS) Approach can be applied to obtain reliable parametric estimates. Unfortunately, there is currently a lack of theoretical justification for TS. In this paper, we propose a statistical decision-theoretical derivation of TS, which leads to Bayesian and Minimax estimators. We also show how to apply the TS approach on models for independent and identically distributed samples, by computing quantiles of the data as a first step, and using a linear function as the second stage. The proposed method is illustrated via numerical simulations.
Bilevel optimization has arisen as a powerful tool in modern machine learning. However, due to the nested structure of bilevel optimization, even gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations, which can be costly and unscalable in practice. Recently, Hessian-free bilevel schemes have been proposed to resolve this issue, where the general idea is to use zeroth- or first-order methods to approximate the full hypergradient of the bilevel problem. However, we empirically observe that such approximation can lead to large variance and unstable training, but estimating only the response Jacobian matrix as a partial component of the hypergradient turns out to be extremely effective. To this end, we propose a new Hessian-free method, which adopts the zeroth-order-like method to approximate the response Jacobian matrix via taking difference between two optimization paths. Theoretically, we provide the convergence rate analysis for the proposed algorithms, where our key challenge is to characterize the approximation and smoothness properties of the trajectory-dependent estimator, which can be of independent interest. This is the first known convergence rate result for this type of Hessian-free bilevel algorithms. Experimentally, we demonstrate that the proposed algorithms outperform baseline bilevel optimizers on various bilevel problems. Particularly, in our experiment on few-shot meta-learning with ResNet-12 network over the miniImageNet dataset, we show that our algorithm outperforms baseline meta-learning algorithms, while other baseline bilevel optimizers do not solve such meta-learning problems within a comparable time frame.
Search engines and recommendation systems attempt to continually improve the quality of the experience they afford to their users. Refining the ranker that produces the lists displayed in response to user requests is an important component of this process. A common practice is for the service providers to make changes (e.g. new ranking features, different ranking models) and A/B test them on a fraction of their users to establish the value of the change. An alternative approach estimates the effectiveness of the proposed changes offline, utilising previously collected clickthrough data on the old ranker to posit what the user behaviour on ranked lists produced by the new ranker would have been. A majority of offline evaluation approaches invoke the well studied inverse propensity weighting to adjust for biases inherent in logged data. In this paper, we propose the use of parametric estimates for these propensities. Specifically, by leveraging well known learning-to-rank methods as subroutines, we show how accurate offline evaluation can be achieved when the new rankings to be evaluated differ from the logged ones.
We introduce a new type of Krasnoselskii's result. Using a simple differentiability condition, we relax the nonexpansive condition in Krasnoselskii's theorem. More clearly, we analyze the convergence of the sequence $x_{n+1}=\frac{x_n+g(x_n)}{2}$ based on some differentiability condition of $g$ and present some fixed point results. We introduce some iterative sequences that for any real differentiable function $g$ and any starting point $x_0\in \mathbb [a,b]$ converge monotonically to the nearest root of $g$ in $[a,b]$ that lay to the right or left side of $x_0$. Based on this approach, we present an efficient and novel method for finding the real roots of real functions. We prove that no root will be missed in our method. It is worth mentioning that our iterative method is free from the derivative evaluation which can be regarded as an advantage of this method in comparison with many other methods. Finally, we illustrate our results with some numerical examples.
In this paper, we study the trace regression when a matrix of parameters B* is estimated via convex relaxation of a rank-penalized regression or via non-convex optimization. It is known that these estimators satisfy near-optimal error bounds under assumptions on rank, coherence, or spikiness of B*. We start by introducing a general notion of spikiness for B* that provides a generic recipe to prove restricted strong convexity for the sampling operator of the trace regression and obtain near-optimal and non-asymptotic error bounds for the estimation error. Similar to the existing literature, these results require the penalty parameter to be above a certain theory-inspired threshold that depends on the observation noise and the sampling operator which may be unknown in practice. Next, we extend the error bounds to the cases when the regularization parameter is chosen via cross-validation. This result is significant in that existing theoretical results on cross-validated estimators do not apply to our setting since the estimators we study are not known to satisfy their required notion of stability. Finally, using simulations on synthetic and real data, we show that the cross-validated estimator selects a nearly-optimal penalty parameter and outperforms the theory-inspired approach of selecting the parameter.
The ability to accurately predict human behavior is central to the safety and efficiency of robot autonomy in interactive settings. Unfortunately, robots often lack access to key information on which these predictions may hinge, such as people's goals, attention, and willingness to cooperate. Dual control theory addresses this challenge by treating unknown parameters of a predictive model as stochastic hidden states and inferring their values at runtime using information gathered during system operation. While able to optimally and automatically trade off exploration and exploitation, dual control is computationally intractable for general interactive motion planning, mainly due to the fundamental coupling between robot trajectory optimization and human intent inference. In this paper, we present a novel algorithmic approach to enable active uncertainty reduction for interactive motion planning based on the implicit dual control paradigm. Our approach relies on sampling-based approximation of stochastic dynamic programming, leading to a model predictive control problem that can be readily solved by real-time gradient-based optimization methods. The resulting policy is shown to preserve the dual control effect for a broad class of predictive human models with both continuous and categorical uncertainty. The efficacy of our approach is demonstrated with simulated driving examples.
The availability of massive image databases resulted in the development of scalable machine learning methods such as convolutional neural network (CNNs) filtering and processing these data. While the very recent theoretical work on CNNs focuses on standard nonparametric denoising problems, the variability in image classification datasets does, however, not originate from additive noise but from variation of the shape and other characteristics of the same object across different images. To address this problem, we consider a simple supervised classification problem for object detection on grayscale images. While from the function estimation point of view, every pixel is a variable and large images lead to high-dimensional function recovery tasks suffering from the curse of dimensionality, increasing the number of pixels in our image deformation model enhances the image resolution and makes the object classification problem easier. We propose and theoretically analyze two different procedures. The first method estimates the image deformation by support alignment. Under a minimal separation condition, it is shown that perfect classification is possible. The second method fits a CNN to the data. We derive a rate for the misclassification error depending on the sample size and the number of pixels. Both classifiers are empirically compared on images generated from the MNIST handwritten digit database. The obtained results corroborate the theoretical findings.
Evolutionary algorithms are bio-inspired algorithms that can easily adapt to changing environments. Recent results in the area of runtime analysis have pointed out that algorithms such as the (1+1)~EA and Global SEMO can efficiently reoptimize linear functions under a dynamic uniform constraint. Motivated by this study, we investigate single- and multi-objective baseline evolutionary algorithms for the classical knapsack problem where the capacity of the knapsack varies over time. We establish different benchmark scenarios where the capacity changes every $\tau$ iterations according to a uniform or normal distribution. Our experimental investigations analyze the behavior of our algorithms in terms of the magnitude of changes determined by parameters of the chosen distribution, the frequency determined by $\tau$, and the class of knapsack instance under consideration. Our results show that the multi-objective approaches using a population that caters for dynamic changes have a clear advantage on many benchmarks scenarios when the frequency of changes is not too high. Furthermore, we demonstrate that the diversity mechanisms used in popular evolutionary multi-objective algorithms such as NSGA-II and SPEA2 do not necessarily result in better performance and even lead to inferior results compared to our simple multi-objective approaches.
This paper presents a multi-scale method for convection-dominated diffusion problems in the regime of large P\'eclet numbers. The application of the solution operator to piecewise constant right-hand sides on some arbitrary coarse mesh defines a finite-dimensional coarse ansatz space with favorable approximation properties. For some relevant error measures, including the $L^2$-norm, the Galerkin projection onto this generalized finite element space even yields $\varepsilon$-independent error bounds, $\varepsilon$ being the singular perturbation parameter. By constructing an approximate local basis, the approach becomes a novel multi-scale method in the spirit of the Super-Localized Orthogonal Decomposition (SLOD). The error caused by basis localization can be estimated in an a-posteriori way. In contrast to existing multi-scale methods, numerical experiments indicate $\varepsilon$-independent convergence without preasymptotic effects even in the under-resolved regime of large mesh P\'eclet numbers.
Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wu, Zhang and Liu, 2010; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. Though not being most efficient in computation, the OVA offers the best estimation accuracy among all the procedures under comparison. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
Covariance matrix estimation is a fundamental statistical task in many applications, but the sample covariance matrix is sub-optimal when the sample size is comparable to or less than the number of features. Such high-dimensional settings are common in modern genomics, where covariance matrix estimation is frequently employed as a method for inferring gene networks. To achieve estimation accuracy in these settings, existing methods typically either assume that the population covariance matrix has some particular structure, for example sparsity, or apply shrinkage to better estimate the population eigenvalues. In this paper, we study a new approach to estimating high-dimensional covariance matrices. We first frame covariance matrix estimation as a compound decision problem. This motivates defining a class of decision rules and using a nonparametric empirical Bayes g-modeling approach to estimate the optimal rule in the class. Simulation results and gene network inference in an RNA-seq experiment in mouse show that our approach is comparable to or can outperform a number of state-of-the-art proposals.