Asymptotic properties of three estimators of probability density function of sample maximum $f_{(m)}:=mfF^{m-1}$ are derived, where $m$ is a function of sample size $n$. One of the estimators is the parametrically fitted by the approximating generalized extreme value density function. However, the parametric fitting is misspecified in finite $m$ cases. The misspecification comes from mainly the following two: the difference $m$ and the selected block size $k$, and the poor approximation $f_{(m)}$ to the generalized extreme value density which depends on the magnitude of $m$ and the extreme index $\gamma$. The convergence rate of the approximation gets slower as $\gamma$ tends to zero. As alternatives two nonparametric density estimators are proposed which are free from the misspecification. The first is a plug-in type of kernel density estimator and the second is a block-maxima-based kernel density estimator. Theoretical study clarifies the asymptotic convergence rate of the plug-in type estimator is faster than the block-maxima-based estimator when $\gamma> -1$. A numerical comparative study on the bandwidth selection shows the performances of a plug-in approach and cross-validation approach depend on $\gamma$ and are totally comparable. Numerical study demonstrates that the plug-in nonparametric estimator with the estimated bandwidth by either approach overtakes the parametrically fitting estimator especially for distributions with $\gamma$ close to zero as $m$ gets large.
We propose an estimator of the kernel-based conditional mean dependence measure obtained from an appropriate modification of a naive estimator based on usual empirical estimators. We then get asymptotic normality of this estimator both under conditional mean independence hypothesis and under the alternative hypothesis. A new test for conditional mean independence of random variables valued into Hilbert spaces is then introduced.
This paper presents some results on the maximum likelihood (ML) estimation from incomplete data. Finite sample properties of conditional observed information matrices are established. They possess positive definiteness and the same Loewner partial ordering as the expected information matrices do. An explicit form of the observed Fisher information (OFI) is derived for the calculation of standard errors of the ML estimates. It simplifies Louis (1982) general formula for the OFI matrix. To prevent from getting an incorrect inverse of the OFI matrix, which may be attributed by the lack of sparsity and large size of the matrix, a monotone convergent recursive equation for the inverse matrix is developed which in turn generalizes the algorithm of Hero and Fessler (1994) for the Cram\'er-Rao lower bound. To improve the estimation, in particular when applying repeated sampling to incomplete data, a robust M-estimator is introduced. A closed form sandwich estimator of covariance matrix is proposed to provide the standard errors of the M-estimator. By the resulting loss of information presented in finite-sample incomplete data, the sandwich estimator produces smaller standard errors for the M-estimator than the ML estimates. In the case of complete information or absence of re-sampling, the M-estimator coincides with the ML estimates. Application to parameter estimation of a regime switching conditional Markov jump process is discussed to verify the results. The simulation study confirms the accuracy and asymptotic properties of the M-estimator.
We develop a new approach to drifting games, a class of two-person games with many applications to boosting and online learning settings, including Prediction with Expert Advice and the Hedge game. Our approach involves (a) guessing an asymptotically optimal potential by solving an associated partial differential equation (PDE); then (b) justifying the guess, by proving upper and lower bounds on the final-time loss whose difference scales like a negative power of the number of time steps. The proofs of our potential-based upper bounds are elementary, using little more than Taylor expansion. The proofs of our potential-based lower bounds are also rather elementary, combining Taylor expansion with probabilistic or combinatorial arguments. Most previous work on asymptotically optimal strategies has used potentials obtained by solving a discrete dynamic programming principle; the arguments are complicated by their discrete nature. Our approach is facilitated by the fact that the potentials we use are explicit solutions of PDEs; the arguments are based on basic calculus. Not only is our approach more elementary, but we give new potentials and derive corresponding upper and lower bounds that match each other in the asymptotic regime.
While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
LU and Cholesky matrix factorization algorithms are core subroutines used to solve systems of linear equations (SLEs) encountered while solving an optimization problem. Standard factorization algorithms are highly efficient but remain susceptible to the accumulation of roundoff errors, which can lead solvers to return feasibility and optimality claims that are actually invalid. This paper introduces a novel approach for solving sequences of closely related SLEs encountered in nonlinear programming efficiently and without roundoff errors. Specifically, it introduces rank-one update algorithms for the roundoff-error-free (REF) factorization framework, a toolset built on integer-preserving arithmetic that has led to the development and implementation of fail-proof SLE solution subroutines for linear programming. The formal guarantees of the proposed algorithms are established through the derivation of theoretical insights. Their advantages are supported with computational experiments, which demonstrate upwards of 75x-improvements over exact factorization run-times on fully dense matrices with over one million entries. A significant advantage of the methodology is that the length of any coefficient calculated via the proposed algorithms is bounded polynomially in the size of the inputs without having to resort to greatest common divisor operations, which are required by and thereby hinder an efficient implementation of exact rational arithmetic approaches.
In this paper, we study a sequential decision making problem faced by e-commerce carriers related to when to send out a vehicle from the central depot to serve customer requests, and in which order to provide the service, under the assumption that the time at which parcels arrive at the depot is stochastic and dynamic. The objective is to maximize the number of parcels that can be delivered during the service hours. We propose two reinforcement learning approaches for solving this problem, one based on a policy function approximation (PFA) and the second on a value function approximation (VFA). Both methods are combined with a look-ahead strategy, in which future release dates are sampled in a Monte-Carlo fashion and a tailored batch approach is used to approximate the value of future states. Our PFA and VFA make a good use of branch-and-cut-based exact methods to improve the quality of decisions. We also establish sufficient conditions for partial characterization of optimal policy and integrate them into PFA/VFA. In an empirical study based on 720 benchmark instances, we conduct a competitive analysis using upper bounds with perfect information and we show that PFA and VFA greatly outperform two alternative myopic approaches. Overall, PFA provides best solutions, while VFA (which benefits from a two-stage stochastic optimization model) achieves a better tradeoff between solution quality and computing time.
Shape-constrained density estimation is an important topic in mathematical statistics. We focus on densities on $\mathbb{R}^d$ that are log-concave, and we study geometric properties of the maximum likelihood estimator (MLE) for weighted samples. Cule, Samworth, and Stewart showed that the logarithm of the optimal log-concave density is piecewise linear and supported on a regular subdivision of the samples. This defines a map from the space of weights to the set of regular subdivisions of the samples, i.e. the face poset of their secondary polytope. We prove that this map is surjective. In fact, every regular subdivision arises in the MLE for some set of weights with positive probability, but coarser subdivisions appear to be more likely to arise than finer ones. To quantify these results, we introduce a continuous version of the secondary polytope, whose dual we name the Samworth body. This article establishes a new link between geometric combinatorics and nonparametric statistics, and it suggests numerous open problems.
The hypergraph Moore bound is an elegant statement that characterizes the extremal trade-off between the girth - the number of hyperedges in the smallest cycle or even cover (a subhypergraph with all degrees even) and size - the number of hyperedges in a hypergraph. For graphs (i.e., $2$-uniform hypergraphs), a bound tight up to the leading constant was proven in a classical work of Alon, Hoory and Linial [AHL02]. For hypergraphs of uniformity $k>2$, an appropriate generalization was conjectured by Feige [Fei08]. The conjecture was settled up to an additional $\log^{4k+1} n$ factor in the size in a recent work of Guruswami, Kothari and Manohar [GKM21]. Their argument relies on a connection between the existence of short even covers and the spectrum of a certain randomly signed Kikuchi matrix. Their analysis, especially for the case of odd $k$, is significantly complicated. In this work, we present a substantially simpler and shorter proof of the hypergraph Moore bound. Our key idea is the use of a new reweighted Kikuchi matrix and an edge deletion step that allows us to drop several involved steps in [GKM21]'s analysis such as combinatorial bucketing of rows of the Kikuchi matrix and the use of the Schudy-Sviridenko polynomial concentration. Our simpler proof also obtains tighter parameters: in particular, the argument gives a new proof of the classical Moore bound of [AHL02] with no loss (the proof in [GKM21] loses a $\log^3 n$ factor), and loses only a single logarithmic factor for all $k>2$-uniform hypergraphs. As in [GKM21], our ideas naturally extend to yield a simpler proof of the full trade-off for strongly refuting smoothed instances of constraint satisfaction problems with similarly improved parameters.
We study the problem of learning nonparametric distributions in a finite mixture, and establish tight bounds on the sample complexity for learning the component distributions in such models. Namely, we are given i.i.d. samples from a pdf $f$ where $$ f=\sum_{i=1}^k w_i f_i, \quad\sum_{i=1}^k w_i=1, \quad w_i>0 $$ and we are interested in learning each component $f_i$. Without any assumptions on $f_i$, this problem is ill-posed. In order to identify the components $f_i$, we assume that each $f_i$ can be written as a convolution of a Gaussian and a compactly supported density $\nu_i$ with $\text{supp}(\nu_i)\cap \text{supp}(\nu_j)=\emptyset$. Our main result shows that $(\frac{1}{\varepsilon})^{\Omega(\log\log \frac{1}{\varepsilon})}$ samples are required for estimating each $f_i$. Unlike parametric mixtures, the difficulty does not arise from the order $k$ or small weights $w_i$, and unlike nonparametric density estimation it does not arise from the curse of dimensionality, irregularity, or inhomogeneity. The proof relies on a fast rate for approximation with Gaussians, which may be of independent interest. To show this is tight, we also propose an algorithm that uses $(\frac{1}{\varepsilon})^{O(\log\log \frac{1}{\varepsilon})}$ samples to estimate each $f_i$. Unlike existing approaches to learning latent variable models based on moment-matching and tensor methods, our proof instead involves a delicate analysis of an ill-conditioned linear system via orthogonal functions. Combining these bounds, we conclude that the optimal sample complexity of this problem properly lies in between polynomial and exponential, which is not common in learning theory.
Sampling methods (e.g., node-wise, layer-wise, or subgraph) has become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage that necessities mitigating both types of variance to obtain faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails a better generalization compared to the existing methods.