
Statistical analysis of large datasets is challenging because of limited device memory and excessive computation time. The divide-and-conquer (DC) algorithm is an effective solution, but it has some limitations. Empirical likelihood is an important semiparametric and nonparametric statistical method for parameter estimation and statistical inference, and estimating equations build a bridge between empirical likelihood and traditional statistical methods, which makes empirical likelihood widely applicable to traditional statistical models. In this paper, we propose a novel approach to the challenges posed by empirical likelihood with massive data, which we call split sample mean empirical likelihood (SSMEL). We show that the SSMEL estimator has the same estimation efficiency as the empirical likelihood estimator based on the full dataset, and that it preserves Wilks' theorem, so the proposed approach can be used for statistical inference. The effectiveness of the proposed approach is illustrated using simulation studies and real data analysis.
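To make the idea concrete, here is a minimal sketch for the simplest case, inference on a scalar mean. It assumes that "split sample mean" refers to averaging the estimating function g(x, θ) = x − θ within blocks and then applying ordinary empirical likelihood to the block means; the abstract does not spell out the construction, so this reading, the function names `block_means` and `el_log_ratio`, and the bisection solver are all illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def block_means(xs, k):
    """Split xs into k contiguous blocks of equal size and return each block's mean.

    Assumption for this sketch: trailing observations that do not fill a
    whole block are dropped.
    """
    size = len(xs) // k
    return [sum(xs[j * size:(j + 1) * size]) / size for j in range(k)]


def el_log_ratio(zs, theta, iters=200):
    """Return -2 * log empirical likelihood ratio for the mean of zs at theta."""
    d = [z - theta for z in zs]
    d_min, d_max = min(d), max(d)
    if not (d_min < 0 < d_max):
        # theta lies outside the convex hull of the data: the ratio is zero.
        return math.inf
    # The Lagrange multiplier lam must keep every weight positive,
    # i.e. 1 + lam * d_j > 0, which confines lam to this open interval.
    lo, hi = -1.0 / d_max, -1.0 / d_min
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps
    # sum_j d_j / (1 + lam * d_j) is strictly decreasing in lam, so bisect.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(dj / (1.0 + mid * dj) for dj in d) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log(1.0 + lam * dj) for dj in d)


# Illustrative run: 10,000 simulated observations reduced to 50 block means.
random.seed(0)
data = [random.gauss(1.0, 2.0) for _ in range(10_000)]
stat = el_log_ratio(block_means(data, 50), theta=1.0)
print(stat)
```

Under Wilks' theorem the statistic is asymptotically chi-square with one degree of freedom, so in this scalar setting a value above roughly 3.84 would reject the hypothesized mean at the 5% level. The point of the sketch is that the optimization runs over 50 block means instead of 10,000 raw observations.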

Related content

When an exposure of interest is confounded by unmeasured factors, an instrumental variable (IV) can be used to identify and estimate certain causal contrasts. Identification of the marginal average treatment effect (ATE) from IVs relies on strong untestable structural assumptions. When one is unwilling to assert such structure, IVs can nonetheless be used to construct bounds on the ATE. Famously, Balke and Pearl (1997) proved tight bounds on the ATE for a binary outcome, in a randomized trial with noncompliance and no covariate information. We demonstrate how these bounds remain useful in observational settings with baseline confounders of the IV, as well as randomized trials with measured baseline covariates. The resulting bounds on the ATE are non-smooth functionals, and thus standard nonparametric efficiency theory is not immediately applicable. To remedy this, we propose (1) under a novel margin condition, influence function-based estimators of the bounds that can attain parametric convergence rates when the nuisance functions are modeled flexibly, and (2) estimators of smooth approximations of these bounds. We propose extensions to continuous outcomes, explore finite sample properties in simulations, and illustrate the proposed estimators in a randomized experiment studying the effects of vaccination encouragement on flu-related hospital visits.

Courcelle's theorem and its adaptations to cliquewidth have shaped the field of exact parameterized algorithms and are widely considered the archetype of algorithmic meta-theorems. In the past decade, there has been growing interest in developing parameterized approximation algorithms for problems which are not captured by Courcelle's theorem and, in particular, are considered not fixed-parameter tractable under the associated widths. We develop a generalization of Courcelle's theorem that yields efficient approximation schemes for any problem that can be captured by an expanded logic we call Blocked CMSO, capable of making logical statements about the sizes of set variables via so-called weight comparisons. The logic controls weight comparisons via the quantifier-alternation depth of the involved variables, allowing full comparisons for zero-alternation variables and limited comparisons for one-alternation variables. We show that the developed framework threads the very needle of tractability: on one hand it can describe a broad range of approximable problems, while on the other hand we show that the restrictions of our logic cannot be relaxed under well-established complexity assumptions. The running time of our approximation scheme is polynomial in $1/\varepsilon$, allowing us to fully interpolate between faster approximate algorithms and slower exact algorithms. This provides a unified framework to explain the tractability landscape of graph problems parameterized by treewidth and cliquewidth, as well as classical non-graph problems such as Subset Sum and Knapsack.

Clustering is at the very core of machine learning, and its applications proliferate with the increasing availability of data. However, as datasets grow, comparing clusterings with an adjustment for chance becomes computationally difficult, preventing unbiased ground-truth comparisons and solution selection. We propose FastAMI, a Monte Carlo-based method to efficiently approximate the Adjusted Mutual Information (AMI) and extend it to the Standardized Mutual Information (SMI). The approach is compared with the exact calculation and a recently developed variant of the AMI based on pairwise permutations, using both synthetic and real data. In contrast to the exact calculation our method is fast enough to enable these adjusted information-theoretic comparisons for large datasets while maintaining considerably more accurate results than the pairwise approach.
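As a rough illustration of what adjusting for chance by Monte Carlo can look like, the sketch below estimates the expected mutual information by shuffling one label vector, which preserves both sets of cluster sizes, and plugs the estimate into the AMI formula. This is a naive permutation scheme written for illustration, not the FastAMI algorithm itself; the function names, the arithmetic-mean normalization (one of several conventions), and the handling of the degenerate case are assumptions.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label vector."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    """Mutual information (in nats) between two label vectors of equal length."""
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        c / n * math.log(c * n / (cu[a] * cv[b])) for (a, b), c in joint.items()
    )


def monte_carlo_ami(u, v, samples=200, seed=0):
    """Approximate AMI, estimating the expected MI from random permutations of v."""
    rng = random.Random(seed)
    v_perm = list(v)
    total = 0.0
    for _ in range(samples):
        rng.shuffle(v_perm)  # keeps cluster sizes fixed, breaks any association
        total += mutual_info(u, v_perm)
    emi = total / samples
    denom = 0.5 * (entropy(u) + entropy(v)) - emi
    if denom <= 0:
        # Degenerate case (e.g. both clusterings trivial); convention chosen
        # here purely for illustration.
        return 0.0
    return (mutual_info(u, v) - emi) / denom


labels = [0, 0, 1, 1, 2, 2] * 5
print(monte_carlo_ami(labels, labels))
```

Each permutation costs time linear in the number of points, so the total cost grows with the number of samples drawn; identical clusterings should score close to 1 and unrelated ones close to 0.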

An old problem in multivariate statistics is that linear Gaussian models are often unidentifiable, i.e. some parameters cannot be uniquely estimated. In factor (component) analysis, an orthogonal rotation of the factors is unidentifiable, while in linear regression, the direction of effect cannot be identified. For such linear models, non-Gaussianity of the (latent) variables has been shown to provide identifiability. In the case of factor analysis, this leads to independent component analysis, while in the case of the direction of effect, non-Gaussian versions of structural equation modelling solve the problem. More recently, we have shown how even general nonparametric nonlinear versions of such models can be estimated. Non-Gaussianity is not enough in this case, but assuming we have time series, or that the distributions are suitably modulated by some observed auxiliary variables, the models are identifiable. This paper reviews the identifiability theory for the linear and nonlinear cases, considering both factor analytic models and structural equation models.

Empirical likelihood enables a nonparametric, likelihood-driven style of inference without restrictive assumptions routinely made in parametric models. We develop a framework for applying empirical likelihood to the analysis of experimental designs, addressing issues that arise from blocking and multiple hypothesis testing. In addition to popular designs such as balanced incomplete block designs, our approach allows for highly unbalanced, incomplete block designs. We derive an asymptotic multivariate chi-square distribution for a set of empirical likelihood test statistics and propose two single-step multiple testing procedures: asymptotic Monte Carlo and nonparametric bootstrap. Both procedures asymptotically control the generalised family-wise error rate and efficiently construct simultaneous confidence intervals for comparisons of interest without explicitly considering the underlying covariance structure. A simulation study demonstrates that the performance of the procedures is robust to violations of standard assumptions of linear mixed models. We also present an application to experiments on a pesticide.

Geostatistical analysis of health data is increasingly used to model spatial variation in malaria prevalence, burden, and other metrics. Traditional inference methods for geostatistical modelling are notoriously computationally intensive, motivating the development of newer, approximate methods. The appeal of faster methods is particularly great as the size of the region and the number of spatial locations being modelled increase. Methods: We present an applied comparison of four proposed `fast' geostatistical modelling methods and the software provided to implement them -- Integrated Nested Laplace Approximation (INLA), tree boosting with Gaussian processes and mixed effect models (GPBoost), Fixed Rank Kriging (FRK) and Spatial Random Forests (SpRF). We illustrate the four methods by estimating malaria prevalence on two different spatial scales -- country and continent. We compare the performance of the four methods on these data in terms of accuracy, computation time, and ease of implementation. Results: Two of these methods -- SpRF and GPBoost -- do not scale well as the data size increases, and so are likely to be infeasible for larger-scale analysis problems. The two remaining methods -- INLA and FRK -- do scale well computationally; however, the resulting model fits are very sensitive to the user's modelling assumptions and parameter choices. Conclusions: INLA and FRK both enable scalable geostatistical modelling of malaria prevalence data. However, care must be taken with both methods to assess the fit of the model to the data and the plausibility of predictions, in order to select appropriate model assumptions and approximation parameters.

Although there is substantial literature on identifying structural changes for continuous spatio-temporal processes, the same is not true for categorical spatio-temporal data. This work bridges that gap and proposes a novel spatio-temporal model to identify changepoints in ordered categorical data. The model leverages an additive mean structure with separable Gaussian space-time processes for the latent variable. Our proposed methodology can detect significant changes in the mean structure as well as in the spatio-temporal covariance structures. We implement the model through a Bayesian framework that gives a computational edge over conventional approaches. From an application perspective, our approach's capability to handle ordinal categorical data provides an added advantage in real applications. This is illustrated using county-wise COVID-19 data (converted to categories according to CDC guidelines) from the state of New York in the USA. Our model identifies three changepoints in the transmission levels of COVID-19, which are indeed aligned with the ``waves'' due to specific variants encountered during the pandemic. The findings also provide interesting insights into the effects of vaccination and the extent of spatial and temporal dependence in different phases of the pandemic.

Training machine learning models requires large datasets. However, collecting, curating, and operating large and complex sets of real world data poses problems of costs, ethical and legal issues, and data availability. Here we propose a novel algorithm to generate large artificial datasets to train machine learning models in conditions of extreme scarcity of real world data. The algorithm is based on a genetic algorithm, which mutates randomly generated datasets subsequently used for training a neural network. After training, the performance of the neural network on a batch of real world data is considered a surrogate for the fitness of the generated dataset used for its training. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases through generations. The performance of the data generation algorithm was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic dataset. In conditions of real world data abundance, mean accuracy of machine learning models trained on generated data was comparable to mean accuracy of models trained on real world data (0.956 in both cases on the Iris dataset, p = 0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189). In conditions of simulated extreme scarcity of real world data, mean accuracy of machine learning models trained on generated data was significantly higher than mean accuracy of comparable models trained on scarce real world data (0.9533 versus 0.9067 on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer dataset, p = 0.0091). In conclusion, this novel algorithm can generate large artificial datasets to train machine learning models, in conditions of extreme scarcity of real world data, or when cost or data sensitivity prevent the collection of large real world datasets.

Behaviors of the synthetic characters in current military simulations are limited since they are generally generated by rule-based and reactive computational models with minimal intelligence. Such computational models cannot adapt to reflect the experience of the characters, resulting in brittle intelligence for even the most effective behavior models devised via costly and labor-intensive processes. Observation-based behavior model adaptation that leverages machine learning and the experience of synthetic entities in combination with appropriate prior knowledge can address the issues in the existing computational behavior models to create a better training experience in military training simulations. In this paper, we introduce a framework that aims to create autonomous synthetic characters that can perform coherent sequences of believable behavior while being aware of human trainees and their needs within a training simulation. This framework brings together three mutually complementary components. The first component is a Unity-based simulation environment - Rapid Integration and Development Environment (RIDE) - supporting One World Terrain (OWT) models and capable of running and supporting machine learning experiments. The second is Shiva, a novel multi-agent reinforcement and imitation learning framework that can interface with a variety of simulation environments, and that can additionally utilize a variety of learning algorithms. The final component is the Sigma Cognitive Architecture that will augment the behavior models with symbolic and probabilistic reasoning capabilities. We have successfully created proof-of-concept behavior models leveraging this framework on realistic terrain as an essential step towards bringing machine learning into military simulations.

Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and achieves better generalization than existing methods.
