亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Modal regression, a widely used regression protocol, has been extensively investigated in statistical and machine learning communities due to its robustness to outliers and heavy-tailed noises. Understanding modal regression's theoretical behavior can be fundamental in learning theory. Despite significant progress in characterizing its statistical property, the majority of the results are based on the assumption that samples are independent and identical distributed (i.i.d.), which is too restrictive for real-world applications. This paper concerns the statistical property of regularized modal regression (RMR) within an important dependence structure - Markov dependent. Specifically, we establish the upper bound for RMR estimator under moderate conditions and give an explicit learning rate. Our results show that the Markov dependence impacts on the generalization error in the way that sample size would be discounted by a multiplicative factor depending on the spectral gap of underlying Markov chain. This result shed a new light on characterizing the theoretical underpinning for robust regression.

相關內容

We study entropy-regularized constrained Markov decision processes (CMDPs) under the soft-max parameterization, in which an agent aims to maximize the entropy-regularized value function while satisfying constraints on the expected total utility. By leveraging the entropy regularization, our theoretical analysis shows that its Lagrangian dual function is smooth and the Lagrangian duality gap can be decomposed into the primal optimality gap and the constraint violation. Furthermore, we propose an accelerated dual-descent method for entropy-regularized CMDPs. We prove that our method achieves the global convergence rate $\widetilde{\mathcal{O}}(1/T)$ for both the optimality gap and the constraint violation for entropy-regularized CMDPs. A discussion about a linear convergence rate for CMDPs with a single constraint is also provided.

We study regression adjustments with additional covariates in randomized experiments under covariate-adaptive randomizations (CARs) when subject compliance is imperfect. We develop a regression-adjusted local average treatment effect (LATE) estimator that is proven to improve efficiency in the estimation of LATEs under CARs. Our adjustments can be parametric in linear and nonlinear forms, nonparametric, and high-dimensional. Even when the adjustments are misspecified, our proposed estimator is still consistent and asymptotically normal, and their inference method still achieves the exact asymptotic size under the null. When the adjustments are correctly specified, our estimator achieves the minimum asymptotic variance. When the adjustments are parametrically misspecified, we construct a new estimator which is weakly more efficient than linearly and nonlinearly adjusted estimators, as well as the one without any adjustments. Simulation evidence and empirical application confirm efficiency gains achieved by regression adjustments relative to both the estimator without adjustment and the standard two-stage least squares estimator.

Kernel regression is an important nonparametric learning algorithm with an equivalence to neural networks in the infinite-width limit. Understanding its generalization behavior is thus an important task for machine learning theory. In this work, we provide a theory of the inductive bias and generalization of kernel regression using a new measure characterizing the "learnability" of a given target function. We prove that a kernel's inductive bias can be characterized as a fixed budget of learnability, allocated to its eigenmodes, that can only be increased with the addition of more training data. We then use this rule to derive expressions for the mean and covariance of the predicted function and gain insight into the overfitting and adversarial robustness of kernel regression and the hardness of the classic parity problem. We show agreement between our theoretical results and both kernel regression and wide finite networks on real and synthetic learning tasks.

In high-dimensional regression, we attempt to estimate a parameter vector $\beta_0\in\mathbb{R}^p$ from $n\lesssim p$ observations $\{(y_i,x_i)\}_{i\leq n}$ where $x_i\in\mathbb{R}^p$ is a vector of predictors and $y_i$ is a response variable. A well-established approach uses convex regularizers to promote specific structures (e.g. sparsity) of the estimate $\widehat{\beta}$, while allowing for practical algorithms. Theoretical analysis implies that convex penalization schemes have nearly optimal estimation properties in certain settings. However, in general the gaps between statistically optimal estimation (with unbounded computational resources) and convex methods are poorly understood. We show that when the statistican has very simple structural information about the distribution of the entries of $\beta_0$, a large gap frequently exists between the best performance achieved by any convex regularizer satisfying a mild technical condition and either (i) the optimal statistical error or (ii) the statistical error achieved by optimal approximate message passing algorithms. Remarkably, a gap occurs at high enough signal-to-noise ratio if and only if the distribution of the coordinates of $\beta_0$ is not log-concave. These conclusions follow from an analysis of standard Gaussian designs. Our lower bounds are expected to be generally tight, and we prove tightness under certain conditions.

Functional quantile regression (FQR) is a useful alternative to mean regression for functional data as it provides a comprehensive understanding of how scalar predictors influence the conditional distribution of functional responses. In this article, we study the FQR model for densely sampled, high-dimensional functional data without relying on parametric error or independent stochastic process assumptions, with the focus being on statistical inference under this challenging regime along with scalable implementation. This is achieved by a simple but powerful distributed strategy, in which we first perform separate quantile regression to compute $M$-estimators at each sampling location, and then carry out estimation and inference for the entire coefficient functions by properly exploiting the uncertainty quantification and dependence structures of $M$-estimators. We derive a uniform Bahadur representation and a strong Gaussian approximation result for the $M$-estimators on the discrete sampling grid, leading to dimension reduction and serving as the basis for inference. An interpolation-based estimator with minimax optimality and a Bayesian alternative to improve upon finite sample performance are discussed. Large sample properties for point and simultaneous interval estimators are established. The obtained minimax optimal rate under the FQR model shows an interesting phase transition phenomenon that has been previously observed in functional mean regression. The proposed methods are illustrated via simulations and an application to a mass spectrometry proteomics dataset.

Models defined by moment conditions are at the center of structural econometric estimation, but economic theory is mostly agnostic about moment selection. While a large pool of valid moments can potentially improve estimation efficiency, in the meantime a few invalid ones may undermine consistency. This paper investigates the empirical likelihood estimation of these moment-defined models in high-dimensional settings. We propose a penalized empirical likelihood (PEL) estimation and establish its oracle property with consistent detection of invalid moments. The PEL estimator is asymptotically normally distributed, and a projected PEL procedure further eliminates its asymptotic bias and provides more accurate normal approximation to the finite sample behavior. Simulation exercises demonstrate excellent numerical performance of these methods in estimation and inference.

Generalization beyond a training dataset is a main goal of machine learning, but theoretical understanding of generalization remains an open problem for many models. The need for a new theory is exacerbated by recent observations in deep neural networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. In this paper, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also includes infinitely overparameterized neural networks trained with gradient descent. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel or data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep neural networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with "simple functions", which are identified by solving a kernel eigenfunction problem on the data distribution. This notion of simplicity allows us to characterize whether a kernel is compatible with a learning task, facilitating good generalization performance from a small number of training examples. We show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks. To further understand these phenomena, we turn to the broad class of rotation invariant kernels, which is relevant to training deep neural networks in the infinite-width limit, and present a detailed mathematical analysis of them when data is drawn from a spherically symmetric distribution and the number of input dimensions is large.

Distributed learning has become a necessity for training ever-growing models. In a distributed setting, the task is shared among several devices. Typically, the learning process is monitored by a server. Also, some of the devices can be faulty, deliberately or not, and the usual distributed SGD algorithm cannot defend itself from omniscient adversaries. Therefore, we need to devise a fault-tolerant gradient descent algorithm. We based our article on the SignSGD algorithm, which relies on the sharing of gradients signs between the devices and the server. We provide a theoretical upper bound for the convergence rate of SignSGD to extend the results of the original paper. Our theoretical results estimate the convergence rate of SignSGD against a proportion of general adversaries, such as Byzantine adversaries. We implemented the algorithm along with Byzantine strategies in order to try to crush the learning process. Therefore, we provide empirical observations from our experiments to support our theory. Our code is available on GitHub and our experiments are reproducible by using the provided parameters.

We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

北京阿比特科技有限公司