
Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing not at random (MNAR) data. In this setting, we show that using standard DRE methods leads to biased results while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance on both synthetic data and real-world data with simulated missingness.
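
As a rough illustration of the estimator the paper builds on, below is a minimal KLIEP sketch: the density ratio is modeled as a nonnegative combination of Gaussian kernels, fit by maximizing the numerator-sample log-ratio subject to a normalization constraint on the denominator sample. The optional per-sample weights mark where an inverse-observation-probability correction for missingness could enter; the kernel choice, hyperparameters, and weighting scheme are illustrative assumptions, not the paper's M-KLIEP.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, centers, sigma):
    # Pairwise squared distances -> Gaussian kernel matrix.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kliep(x_nu, x_de, sigma=0.5, n_centers=50, w_nu=None, seed=0):
    """Estimate the density ratio p_nu / p_de with the KLIEP objective.
    w_nu are optional per-sample weights on the numerator log term, e.g.
    inverse observation probabilities under an assumed missingness model."""
    rng = np.random.default_rng(seed)
    centers = x_nu[rng.choice(len(x_nu), min(n_centers, len(x_nu)), replace=False)]
    K_nu = gaussian_kernel(x_nu, centers, sigma)   # (n_nu, b)
    K_de = gaussian_kernel(x_de, centers, sigma)   # (n_de, b)
    w_nu = np.ones(len(x_nu)) if w_nu is None else w_nu

    def neg_obj(alpha):                            # weighted numerator log-ratio
        return -np.sum(w_nu * np.log(np.clip(K_nu @ alpha, 1e-12, None)))

    cons = {"type": "eq", "fun": lambda a: K_de.mean(axis=0) @ a - 1.0}
    res = minimize(neg_obj, np.full(K_nu.shape[1], 1.0 / K_nu.shape[1]),
                   bounds=[(0.0, None)] * K_nu.shape[1], constraints=cons)
    return lambda x: gaussian_kernel(x, centers, sigma) @ res.x

# Toy usage: ratio of two shifted one-dimensional Gaussians.
rng = np.random.default_rng(0)
x_nu = rng.normal(0.5, 1.0, size=(500, 1))
x_de = rng.normal(0.0, 1.0, size=(500, 1))
r_hat = kliep(x_nu, x_de)
print(np.round(r_hat(np.array([[0.0], [1.0]])), 3))
```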

Related Content

Despite excellent average-case performance of many image classifiers, their performance can substantially deteriorate on semantically coherent subgroups of the data that were under-represented in the training data. These systematic errors can impact both fairness for demographic minority groups as well as robustness and safety under domain shift. A major challenge is to identify such subgroups with subpar performance when the subgroups are not annotated and their occurrence is very rare. We leverage recent advances in text-to-image models and search in the space of textual descriptions of subgroups ("prompts") for subgroups where the target model has low performance on the prompt-conditioned synthesized data. To tackle the exponentially growing number of subgroups, we employ combinatorial testing. We denote this procedure as PromptAttack as it can be interpreted as an adversarial attack in a prompt space. We study subgroup coverage and identifiability with PromptAttack in a controlled setting and find that it identifies systematic errors with high accuracy. Thereupon, we apply PromptAttack to ImageNet classifiers and identify novel systematic errors on rare subgroups.
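
A minimal sketch of the kind of prompt-space search described above, with the text-to-image generator and target classifier replaced by a simulated accuracy function so the loop runs end to end; the prompt template, attribute slots, and pairwise (combinatorial-testing-style) enumeration are illustrative assumptions, not the paper's PromptAttack implementation.

```python
import itertools
import random

def subgroup_accuracy(prompt, label):
    """Stand-in for: synthesize images from `prompt` with a text-to-image
    model, run the target classifier on them, and return its accuracy.
    Simulated here with a prompt-seeded random number so the loop runs."""
    return random.Random(f"{label}|{prompt}").uniform(0.2, 1.0)

def prompt_search(label, attributes, threshold=0.5):
    """Flag prompt-described subgroups whose (simulated) accuracy falls
    below `threshold`. Testing attribute values two slots at a time keeps
    the number of subgroups manageable, in the spirit of combinatorial
    testing."""
    weak = []
    for s1, s2 in itertools.combinations(sorted(attributes), 2):
        for v1, v2 in itertools.product(attributes[s1], attributes[s2]):
            prompt = f"a photo of a {label}, {v1}, {v2}"
            acc = subgroup_accuracy(prompt, label)
            if acc < threshold:
                weak.append((prompt, acc))
    return sorted(weak, key=lambda t: t[1])

# Hypothetical attribute slots; real runs would use richer, task-specific prompts.
attributes = {"background": ["in snow", "at night", "underwater"],
              "style": ["close-up", "blurry", "as a cartoon drawing"]}
for prompt, acc in prompt_search("fire truck", attributes)[:3]:
    print(f"{acc:.2f}  {prompt}")
```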

Statistical optimality benchmarking is crucial for analyzing and designing time series classification (TSC) algorithms. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is an optimal classifier by the Neyman-Pearson lemma. The LRT benchmarks are computationally efficient because the LRT does not need training, and the diffusion processes can be efficiently simulated and are flexible enough to reflect the specific features of real-world applications. We demonstrate the benchmarking with three widely-used TSC algorithms: random forest, ResNet, and ROCKET. These algorithms can achieve the LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying high-dimensional nonlinear multivariate time series. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series.
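
A minimal sketch of an LRT benchmark in this spirit: two Ornstein-Uhlenbeck processes with different drift parameters are simulated by Euler-Maruyama, and each path is classified by comparing its Gaussian transition log-likelihoods under the two candidate models. The specific processes and parameters are illustrative assumptions, not the paper's benchmark suite.

```python
import numpy as np

def simulate_ou(theta, sigma, T=1.0, n=100, x0=0.0, rng=None):
    """Euler-Maruyama simulation of dX = -theta * X dt + sigma dW."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        x[i + 1] = x[i] - theta * x[i] * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

def euler_loglik(x, theta, sigma, dt):
    """Gaussian transition log-likelihood under the Euler discretization."""
    mean = x[:-1] - theta * x[:-1] * dt
    var = sigma ** 2 * dt
    return -0.5 * np.sum((x[1:] - mean) ** 2 / var + np.log(2 * np.pi * var))

# LRT benchmark: label each simulated path by the larger log-likelihood.
rng = np.random.default_rng(0)
theta0, theta1, sigma, T, n = 1.0, 2.0, 0.5, 1.0, 100
dt = T / n
paths = [(simulate_ou(theta0, sigma, T, n, rng=rng), 0) for _ in range(200)] \
      + [(simulate_ou(theta1, sigma, T, n, rng=rng), 1) for _ in range(200)]
preds = [int(euler_loglik(x, theta1, sigma, dt) > euler_loglik(x, theta0, sigma, dt))
         for x, _ in paths]
print("LRT benchmark accuracy:", np.mean([p == y for p, (_, y) in zip(preds, paths)]))
```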

This paper addresses the deconvolution problem of estimating a square-integrable probability density from observations contaminated with additive measurement errors having a known density. The estimator begins with a density estimate of the contaminated observations and minimizes a reconstruction error penalized by an integrated squared $m$-th derivative. Theory for deconvolution has mainly focused on kernel- or wavelet-based techniques, but other methods including spline-based techniques and this smoothness-penalized estimator have been found to outperform kernel methods in simulation studies. This paper fills in some of these gaps by establishing asymptotic guarantees for the smoothness-penalized approach. Consistency is established in mean integrated squared error, and rates of convergence are derived for Gaussian, Cauchy, and Laplace error densities, attaining some lower bounds already in the literature. The assumptions are weak for most results; the estimator can be used with a broader class of error densities than the deconvoluting kernel. Our application example estimates the density of the mean cytotoxicity of certain bacterial isolates under random sampling; this mean cytotoxicity can only be measured experimentally with additive error, leading to the deconvolution problem. We also describe a method for approximating the solution by a cubic spline, which reduces to a quadratic program.
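
A discretized sketch of a smoothness-penalized deconvolution estimator of this flavor: a kernel estimate of the contaminated density is deconvolved by penalized least squares, with an m-th order finite-difference matrix standing in for the integrated squared m-th derivative. The grid, bandwidth, and penalty weight are illustrative assumptions, and this linear-algebra shortcut omits the paper's constrained quadratic-program formulation.

```python
import numpy as np

def penalized_deconvolution(y, err_density, grid, m=2, lam=1e-3, bandwidth=0.2):
    """Minimize ||K f - g_hat||^2 + lam * ||D^m f||^2 on a grid, where g_hat
    is a kernel estimate of the contaminated density, K applies the known
    error density by discrete convolution, and D^m is an m-th order finite
    difference standing in for the m-th derivative."""
    dx = grid[1] - grid[0]
    # Kernel density estimate of the contaminated observations on the grid.
    g_hat = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / bandwidth) ** 2)
    g_hat = g_hat.sum(axis=1) / (len(y) * bandwidth * np.sqrt(2 * np.pi))
    # Discrete convolution operator built from the known error density.
    K = err_density(grid[:, None] - grid[None, :]) * dx
    # m-th order finite-difference matrix used as the roughness penalty.
    D = np.eye(len(grid))
    for _ in range(m):
        D = np.diff(D, axis=0) / dx
    # Normal equations of the penalized least-squares problem.
    f_hat = np.linalg.solve(K.T @ K + lam * D.T @ D, K.T @ g_hat)
    return np.clip(f_hat, 0.0, None)

# Toy usage: recover a N(0, 1) density observed with Laplace(0, 0.3) errors.
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 2000) + rng.laplace(0.0, 0.3, 2000)
grid = np.linspace(-5, 5, 201)
laplace_pdf = lambda u: np.exp(-np.abs(u) / 0.3) / (2 * 0.3)
f_hat = penalized_deconvolution(y, laplace_pdf, grid)
```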

Mixture distributions with dynamic weights are an efficient way of modeling loss data characterized by heavy tails. However, maximum likelihood estimation of this family of models is difficult, mostly because of the need to evaluate numerically an intractable normalizing constant. In such a setup, simulation-based estimation methods are an appealing alternative. The approximate maximum likelihood estimation (AMLE) approach is employed. It is a general method that can be applied to mixtures with any component densities, as long as simulation is feasible. The focus is on the dynamic lognormal-generalized Pareto distribution, and the Cramér–von Mises distance is used to measure the discrepancy between observed and simulated samples. After deriving the theoretical properties of the estimators, a hybrid procedure is developed, where standard maximum likelihood is first employed to determine the bounds of the uniform priors required as input for AMLE. Simulation experiments and two real-data applications suggest that this approach yields a major improvement with respect to standard maximum likelihood estimation.
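
A minimal accept-reject sketch of AMLE with the Cramér–von Mises distance: parameters are drawn from uniform priors, a sample is simulated for each draw, and the draws closest to the observed sample are retained. A plain lognormal stands in for the dynamic lognormal-generalized Pareto mixture; the priors, retention fraction, and point estimate are illustrative assumptions.

```python
import numpy as np
from scipy.stats import cramervonmises_2samp

def amle(y_obs, simulate, prior_bounds, n_draws=2000, keep_frac=0.01, seed=0):
    """Draw parameters from uniform priors, simulate a sample for each draw,
    and keep the draws whose Cramér-von Mises distance to the observed
    sample is smallest; their mean serves as a crude point estimate."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(prior_bounds, dtype=float).T
    thetas = rng.uniform(lo, hi, size=(n_draws, len(lo)))
    dists = np.array([cramervonmises_2samp(y_obs, simulate(t, len(y_obs), rng)).statistic
                      for t in thetas])
    keep = dists.argsort()[: max(1, int(keep_frac * n_draws))]
    return thetas[keep].mean(axis=0), thetas[keep]

# Toy usage with a lognormal stand-in for the heavy-tailed loss model; the
# paper's dynamic lognormal-GPD mixture would slot in as `simulate`.
rng = np.random.default_rng(2)
y_obs = rng.lognormal(mean=1.0, sigma=0.5, size=500)
simulate = lambda theta, n, rng: rng.lognormal(theta[0], theta[1], n)
point, kept = amle(y_obs, simulate, prior_bounds=[(0.0, 2.0), (0.1, 1.5)])
print("AMLE point estimate (mu, sigma):", np.round(point, 3))
```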

Traditional methods for inference in change point detection often rely on a large number of observed data points and can be inaccurate in non-asymptotic settings. With the rise of mobile health and digital phenotyping studies, where patients are monitored through the use of smartphones or other digital devices, change point detection is needed in non-asymptotic settings where it may be important to identify behavioral changes that occur just days before an adverse event such as relapse or suicide. Furthermore, analytical and computationally efficient means of inference are necessary for the monitoring and online analysis of large-scale digital phenotyping cohorts. We extend the result for asymptotic tail probabilities of the likelihood ratio test to the multivariate change point detection setting, and demonstrate through simulation its inaccuracy when the number of observed data points is not large. We propose a non-asymptotic approach for inference on the likelihood ratio test, and compare the efficiency of this estimated p-value to the popular empirical p-value obtained through simulation of the null distribution. The accuracy and power of this approach relative to competing methods is demonstrated through simulation and through the detection of a change point in the behavior of a patient with schizophrenia in the week prior to relapse.
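
A minimal sketch of the Gaussian mean-shift likelihood ratio scan and the simulation-based empirical p-value it is compared against; the paper's non-asymptotic approximation is not reproduced here, and the known-unit-variance model, univariate series, and sample sizes are illustrative assumptions.

```python
import numpy as np

def lrt_scan_statistic(x):
    """Max over candidate change points of the Gaussian mean-shift
    log-likelihood-ratio statistic (variance assumed known and equal to 1)."""
    n = len(x)
    stats = []
    for k in range(1, n):
        m1, m2, m = x[:k].mean(), x[k:].mean(), x.mean()
        ll_alt = -0.5 * (np.sum((x[:k] - m1) ** 2) + np.sum((x[k:] - m2) ** 2))
        ll_null = -0.5 * np.sum((x - m) ** 2)
        stats.append(2 * (ll_alt - ll_null))
    return max(stats), int(np.argmax(stats)) + 1

def empirical_p_value(stat_obs, n, n_sim=2000, seed=0):
    """Monte Carlo p-value of the scan statistic under the no-change null."""
    rng = np.random.default_rng(seed)
    null_stats = [lrt_scan_statistic(rng.normal(size=n))[0] for _ in range(n_sim)]
    return float(np.mean(np.array(null_stats) >= stat_obs))

# Toy usage: a mean shift of 1.0 after the 20th of 30 observations.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(1.0, 1.0, 10)])
stat, k_hat = lrt_scan_statistic(x)
print(f"estimated change point: {k_hat}, "
      f"empirical p-value: {empirical_p_value(stat, len(x)):.3f}")
```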

Penalized logistic regression is extremely useful for binary classification with a large number of covariates (larger than the sample size), and has several real-life applications, including genomic disease classification. However, the existing methods based on the likelihood loss function are sensitive to data contamination and other noise and, hence, robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models utilizing the popular density power divergence based loss function and the general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through their influence function and also derive their oracle properties and asymptotic distribution. With extensive empirical illustrations, we demonstrate the significantly improved performance of our proposed estimators over the existing ones, with particular gains in robustness. Our proposal is finally applied to analyse four different real datasets for cancer classification, obtaining robust and accurate models that simultaneously perform gene selection and patient classification.
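
A minimal sketch of a density-power-divergence loss for logistic regression combined with an adaptively weighted L1 penalty, fit by proximal gradient descent (ISTA); the tuning parameter alpha, penalty weight, step size, and unit adaptive weights are illustrative assumptions rather than the paper's estimator or algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def dpd_loss_grad(beta, X, y, alpha):
    """Density power divergence loss for logistic regression and its gradient.
    As alpha -> 0 it approaches the negative log-likelihood; larger alpha
    downweights observations that the model fits poorly."""
    p = np.clip(sigmoid(X @ beta), 1e-10, 1 - 1e-10)
    f_obs = np.where(y == 1, p, 1 - p)          # model probability of the observed label
    loss = np.mean(p ** (1 + alpha) + (1 - p) ** (1 + alpha)
                   - (1 + 1 / alpha) * f_obs ** alpha)
    dloss_dp = ((1 + alpha) * (p ** alpha - (1 - p) ** alpha)
                - (1 + alpha) * f_obs ** (alpha - 1) * np.where(y == 1, 1.0, -1.0))
    grad = X.T @ (dloss_dp * p * (1 - p)) / len(y)
    return loss, grad

def dpd_adaptive_lasso(X, y, alpha=0.3, lam=0.05, weights=None, step=0.1, n_iter=3000):
    """Proximal gradient (ISTA): gradient step on the DPD loss followed by
    soft-thresholding with the adaptively weighted L1 penalty."""
    d = X.shape[1]
    w = np.ones(d) if weights is None else weights
    beta = np.zeros(d)
    for _ in range(n_iter):
        _, g = dpd_loss_grad(beta, X, y, alpha)
        b = beta - step * g
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam * w, 0.0)
    return beta

# Toy usage: a sparse signal with a handful of flipped (contaminated) labels.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = rng.binomial(1, sigmoid(X @ beta_true))
y[:10] = 1 - y[:10]
print(np.round(dpd_adaptive_lasso(X, y), 2))
```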

Modern datasets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully-observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modeling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.
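
For intuition about copula-based imputation in general (not the paper's Bayesian mixture copula or its marginal-free estimation strategy), here is a sketch of plain rank-based Gaussian-copula imputation for continuous columns; the correlation estimate and back-transform are illustrative simplifications.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_impute(X):
    """Rank-based Gaussian-copula imputation for continuous columns: margins
    via empirical ranks, latent correlation via pairwise-complete estimates,
    missing cells via conditional-normal means mapped back through the
    observed empirical quantiles."""
    n, d = X.shape
    Z = np.full_like(X, np.nan, dtype=float)
    for j in range(d):                       # margins -> latent normal scores
        obs = ~np.isnan(X[:, j])
        ranks = np.argsort(np.argsort(X[obs, j])) + 1
        Z[obs, j] = norm.ppf(ranks / (obs.sum() + 1))
    R = np.array([[1.0 if i == j else
                   np.corrcoef(Z[~np.isnan(Z[:, i]) & ~np.isnan(Z[:, j])][:, [i, j]].T)[0, 1]
                   for j in range(d)] for i in range(d)])
    X_imp = X.copy()
    for i in range(n):
        miss, obs = np.isnan(X[i]), ~np.isnan(X[i])
        if not miss.any() or not obs.any():
            continue
        # Conditional mean of the latent Gaussian given the observed scores.
        z_cond = R[np.ix_(miss, obs)] @ np.linalg.solve(R[np.ix_(obs, obs)], Z[i, obs])
        for k, j in enumerate(np.where(miss)[0]):
            col = X[~np.isnan(X[:, j]), j]   # back-transform via empirical quantiles
            X_imp[i, j] = np.quantile(col, norm.cdf(z_cond[k]))
    return X_imp

# Toy usage: three correlated continuous variables with 20% of cells missing.
rng = np.random.default_rng(5)
X_full = rng.multivariate_normal([0, 0, 0], [[1, .7, .3], [.7, 1, .5], [.3, .5, 1]], 400)
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.2] = np.nan
X_hat = gaussian_copula_impute(X_miss)
```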

Making causal inferences from observational studies can be challenging when confounders are missing not at random. In such cases, identifying causal effects is often not guaranteed. Motivated by a real example, we consider a treatment-independent missingness assumption under which we establish the identification of causal effects when confounders are missing not at random. We propose a weighted estimating equation (WEE) approach for estimating model parameters and introduce three estimators for the average causal effect, based on regression, propensity score weighting, and doubly robust estimation. We evaluate the performance of these estimators through simulations, and provide a real data analysis to illustrate our proposed method.
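
The three estimators the proposal builds on are standard in the complete-data case; below is a sketch of the regression, inverse-propensity-weighting, and doubly robust (AIPW) estimates of the average causal effect, omitting the paper's additional weighting for confounders that are missing not at random. The linear and logistic working models are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def ate_estimators(X, a, y):
    """Regression, inverse-propensity-weighted, and doubly robust (AIPW)
    estimates of the average causal effect with fully observed confounders
    X, binary treatment a, and outcome y."""
    # Outcome regressions fit separately within each treatment arm.
    mu1 = LinearRegression().fit(X[a == 1], y[a == 1]).predict(X)
    mu0 = LinearRegression().fit(X[a == 0], y[a == 0]).predict(X)
    # Propensity score model, clipped away from 0 and 1 for stability.
    ps = np.clip(LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1],
                 0.01, 0.99)
    reg = np.mean(mu1 - mu0)
    ipw = np.mean(a * y / ps) - np.mean((1 - a) * y / (1 - ps))
    dr = np.mean(a * (y - mu1) / ps + mu1) - np.mean((1 - a) * (y - mu0) / (1 - ps) + mu0)
    return reg, ipw, dr

# Toy usage: the true average causal effect is 2.0.
rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 3))
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
y = 2.0 * a + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=2000)
print([round(v, 2) for v in ate_estimators(X, a, y)])
```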

Causal inference has been a critical research topic across many domains, such as statistics, computer science, education, public policy, and economics, for decades. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and low budget requirements, compared with randomized controlled trials. Fueled by the rapid development of machine learning, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well-known causal inference frameworks. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine, and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are summarized, which can help researchers and practitioners explore, evaluate, and apply causal inference methods.

Time Series Classification (TSC) is an important and challenging problem in data mining. With the increase of time series data availability, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) to perform this task. This is surprising as deep learning has seen very successful applications in recent years. DNNs have indeed revolutionized the field of computer vision especially with the advent of novel deeper architectures such as Residual and Convolutional Neural Networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community where we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date.
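
A minimal fully convolutional network for univariate TSC in PyTorch, in the spirit of the convolutional architectures such studies benchmark; the layer widths, kernel sizes, and global-average-pooling head are illustrative assumptions, not the reviewed configurations.

```python
import torch
import torch.nn as nn

class FCN(nn.Module):
    """Fully convolutional network for univariate time series classification:
    stacked Conv1d-BatchNorm-ReLU blocks followed by global average pooling
    and a linear classifier."""
    def __init__(self, n_classes, in_channels=1):
        super().__init__()
        def block(c_in, c_out, k):
            return nn.Sequential(nn.Conv1d(c_in, c_out, k, padding=k // 2),
                                 nn.BatchNorm1d(c_out), nn.ReLU())
        self.features = nn.Sequential(block(in_channels, 128, 8),
                                      block(128, 256, 5),
                                      block(256, 128, 3))
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: (batch, channels, length)
        z = self.features(x).mean(dim=-1)    # global average pooling
        return self.head(z)

# Toy usage: 32 random series of length 100, 3 classes.
model = FCN(n_classes=3)
logits = model(torch.randn(32, 1, 100))
print(logits.shape)                          # torch.Size([32, 3])
```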
