
Genomic datasets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes) and thus give rise to dense latent variation, which presents both challenges and opportunities for classification. Some of these latent variables may be partially correlated with the phenotype of interest and therefore helpful, while others may be uncorrelated and thus merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. We propose the cross-residualization classifier to better account for the latent variables in genomic data. Through an adjustment and ensemble procedure, the cross-residualization classifier essentially estimates the latent variables and residualizes out their effects, trains a classifier on the residuals, and then re-integrates the latent variables in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information that they may contribute. We apply the method to simulated data as well as a variety of genomic datasets from multiple platforms. In general, we find that the cross-residualization classifier performs well relative to existing classifiers and sometimes offers substantial gains.
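The estimate, residualize, train, and re-integrate steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the latent variables are approximated by the top singular vectors of the centered data, a nearest-centroid rule stands in for the classifier, and the ensemble is a plain sum of rescaled scores. The "cross" adjustment step of the actual method is not reproduced here, and every function name below is hypothetical.

```python
import numpy as np

def fit_latent(X, k):
    """Estimate k dense latent directions as the top-k right singular vectors of centered X."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k]                      # shapes (p,) and (k, p)

def split(X, mu, V):
    """Return (latent scores, residuals) for the rows of X under directions V."""
    Xc = X - mu
    Z = Xc @ V.T                           # estimated latent variables, shape (n, k)
    return Z, Xc - Z @ V                   # residuals with the latent part projected out

def centroid_fit(F, y):
    """Stand-in classifier (assumption): store one mean vector per class."""
    return F[y == 0].mean(axis=0), F[y == 1].mean(axis=0)

def centroid_score(F, centroids):
    """Positive score means the row is closer to the class-1 centroid."""
    c0, c1 = centroids
    return ((F - c0) ** 2).sum(axis=1) - ((F - c1) ** 2).sum(axis=1)

def fit_ensemble(X, y, k=2):
    """Train one classifier on residuals and one on latent scores."""
    mu, V = fit_latent(X, k)
    Z, R = split(X, mu, V)
    model = {"mu": mu, "V": V,
             "res": centroid_fit(R, y), "lat": centroid_fit(Z, y)}
    # Rescale each component score by its training spread so neither dominates.
    model["res_sd"] = centroid_score(R, model["res"]).std() + 1e-12
    model["lat_sd"] = centroid_score(Z, model["lat"]).std() + 1e-12
    return model

def predict(model, X):
    """Ensemble step: add the rescaled residual and latent scores, then threshold."""
    Z, R = split(X, model["mu"], model["V"])
    s = (centroid_score(R, model["res"]) / model["res_sd"]
         + centroid_score(Z, model["lat"]) / model["lat_sd"])
    return (s > 0).astype(int)
```

Calling `fit_ensemble(X_train, y_train, k=2)` followed by `predict(model, X_test)` returns 0/1 labels. Keeping the latent-score classifier in the ensemble mirrors the point that the latent variables are adjusted for without being discarded.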

Related content

Latent variables / hidden variables


Decomposed · MoDELS · Better · Vine · Model selection

January 2, 2022

Factor tree copula models for item response data

Sayed H. Kadhem,Aristidis K. Nikoloulopoulos

Factor copula models for item response data are more interpretable and fit better than (truncated) vine copula models when dependence can be explained through latent variables, but are not robust to violations of conditional independence. To circumvent these issues, truncated vines and factor copula models for item response data are joined to define a combined model, the so-called factor tree copula model, with individual benefits from each of the two approaches. Rather than adding factors and causing computational problems and difficulties in interpretation and identification, a truncated vine structure is assumed on the residuals conditional on one or two latent variables. This structure can be better explained as a conditional dependence given a few interpretable latent variables. On the one hand, the parsimonious feature of factor models remains intact and any residual dependencies are being taken into account on the other. We discuss estimation along with model selection. In particular, we propose model selection algorithms to choose a plausible factor tree copula model to capture the (residual) dependencies among the item responses. Our general methodology is demonstrated with an extensive simulation study and illustrated by analyzing Post Traumatic Stress Disorder.

Estimation/estimator · Importance sampling · Integration · Markov chain · Divergence

January 1, 2022

Estimating Cross-validatory Predictive P-values with Integrated Importance Sampling for Disease Mapping Models

Longhai Li,Cindy X. Feng,Shi Qiu

from arxiv, 18 pages. Accepted version

An important statistical task in disease mapping problems is to identify divergent regions with unusually high or low risk of disease. Leave-one-out cross-validatory (LOOCV) model assessment is the gold standard for estimating predictive p-values that can flag such divergent regions. However, actual LOOCV is time-consuming because one needs to rerun a Markov chain Monte Carlo analysis for each posterior distribution in which an observation is held out as a test case. This paper introduces a new method, called integrated importance sampling (iIS), for estimating LOOCV predictive p-values with only Markov chain samples drawn from the posterior based on a full data set. The key step in iIS is that we integrate away the latent variables associated with the test observation with respect to their conditional distribution without reference to the actual observation. By following the general theory for importance sampling, the formula used by iIS can be proved to be equivalent to the LOOCV predictive p-value. We compare iIS with three other existing methods from the literature on two disease mapping datasets. Our empirical results show that the predictive p-values estimated with iIS are almost identical to the predictive p-values estimated with actual LOOCV, and outperform those given by the three existing methods, namely posterior predictive checking, ordinary importance sampling, and the ghosting method of Marshall and Spiegelhalter (2003).
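For orientation, the ordinary importance sampling baseline named in the abstract rests on a standard identity. The display below is that baseline and not the iIS formula itself. The notation is assumed here rather than taken from the paper: $\theta$ collects all parameters and latent variables, $y_{-i}$ is the data with observation $i$ held out, and $\theta^{(s)}$, $s = 1, \dots, S$, are draws from the full-data posterior. The identity also assumes that $y_i$ is conditionally independent of $y_{-i}$ given $\theta$.

```latex
\[
  p(\theta \mid y_{-i}) \;\propto\; \frac{p(\theta \mid y)}{p(y_i \mid \theta)}
  \quad\Longrightarrow\quad
  \mathbb{E}\bigl[f(\theta) \mid y_{-i}\bigr]
  = \frac{\mathbb{E}\bigl[f(\theta)/p(y_i \mid \theta) \,\bigm|\, y\bigr]}
         {\mathbb{E}\bigl[1/p(y_i \mid \theta) \,\bigm|\, y\bigr]}
  \;\approx\;
  \frac{\sum_{s=1}^{S} f(\theta^{(s)}) / p(y_i \mid \theta^{(s)})}
       {\sum_{s=1}^{S} 1 / p(y_i \mid \theta^{(s)})}.
\]
```

Taking $f(\theta)$ to be a tail probability of a replicated observation given $\theta$ yields one common form of LOOCV predictive p-value estimate from a single full-data MCMC run. The exact tail convention varies between authors.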

binary · Linear · Weight · Optimizer · Sphering

December 31, 2021

Quaternary linear codes and related binary subfield codes

Yansheng Wu,Chengju Li,Fu Xiao

from arxiv, 24 pages, to appear in IEEE TIT

In this paper, we mainly study quaternary linear codes and their binary subfield codes. First we obtain a general explicit relationship between quaternary linear codes and their binary subfield codes in terms of generator matrices and defining sets. Second, we construct quaternary linear codes via simplicial complexes and determine the weight distributions of these codes. Third, the weight distributions of the binary subfield codes of these quaternary codes are also computed by employing the general characterization. Furthermore, we present two infinite families of optimal linear codes with respect to the Griesmer Bound, and a class of binary almost optimal codes with respect to the Sphere Packing Bound. We also need to emphasize that we obtain at least 9 new quaternary linear codes.
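As a small aid for reading the optimality claims, the Griesmer bound states that an $[n, k, d]$ linear code over GF($q$) satisfies $n \ge \sum_{i=0}^{k-1} \lceil d / q^i \rceil$. The sketch below evaluates that sum. Treating "optimal with respect to the Griesmer bound" as meeting it with equality is an assumption about the paper's usage, the function names are hypothetical, and the example parameters are well-known textbook codes rather than codes from the paper.

```python
def griesmer_bound(k, d, q):
    """Smallest length n permitted by the Griesmer bound for an [n, k, d] code over GF(q)."""
    # -(-d // m) is integer ceiling division, which avoids floating-point error.
    return sum(-(-d // q**i) for i in range(k))

def meets_griesmer(n, k, d, q):
    """True when the code length equals the Griesmer lower bound."""
    return n == griesmer_bound(k, d, q)

# Binary Hamming code [7, 4, 3]: 3 + 2 + 1 + 1 = 7, so the bound is met.
print(meets_griesmer(7, 4, 3, 2))   # True
# Quaternary hexacode [6, 3, 4] over GF(4): 4 + 1 + 1 = 6, so the bound is met.
print(meets_griesmer(6, 3, 4, 4))   # True
```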

Estimation/estimator · Identifiable · INFORMS · MoDELS · Cluster

December 31, 2021

An empirical Bayes approach to estimating dynamic models of co-regulated gene expression

Sara Venkatraman,Sumanta Basu,Andrew G. Clark,Sofie Delbare,Myung Hee Lee,Martin T. Wells

Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag $R^2$ (LLR2). Importantly, the calculation of the LLR2 leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model's parameters. As a result, the LLR2 is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein's unbiased risk estimate that optimally balance the ODE model's fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.

Estimation/estimator · Variance · Optimizer · Statistic · Mean

December 30, 2021

Optimal Difference-based Variance Estimators in Time Series: A General Framework

Kin Wai Chan

from arxiv, To appear in Annals of Statistics

Variance estimation is important for statistical inference. It becomes non-trivial when observations are masked by serial dependence structures and time-varying mean structures. Existing methods either ignore or sub-optimally handle these nuisance structures. This paper develops a general framework for the estimation of the long-run variance for time series with non-constant means. The building blocks are difference statistics. The proposed class of estimators is general enough to cover many existing estimators. Necessary and sufficient conditions for consistency are investigated. The first asymptotically optimal estimator is derived. Our proposed estimator is theoretically proven to be invariant to arbitrary mean structures, which may include trends and a possibly divergent number of discontinuities.

Complete data · Likelihood · Statistic · Estimation/estimator · Extensibility

December 30, 2021

Multiple Improvements of Multiple Imputation Likelihood Ratio Tests

Kin Wai Chan,Xiao-Li Meng

from arxiv, To appear in Statistica Sinica

Multiple imputation (MI) inference handles missing data by imputing the missing values $m$ times, and then combining the results from the $m$ complete-data analyses. However, the existing method for combining likelihood ratio tests (LRTs) has multiple defects: (i) the combined test statistic can be negative, but its null distribution is approximated by an $F$-distribution; (ii) it is not invariant to re-parametrization; (iii) it fails to ensure monotonic power owing to its use of an inconsistent estimator of the fraction of missing information (FMI) under the alternative hypothesis; and (iv) it requires nontrivial access to the LRT statistic as a function of parameters instead of data sets. We show, using both theoretical derivations and empirical investigations, that essentially all of these problems can be straightforwardly addressed if we are willing to perform an additional LRT by stacking the $m$ completed data sets as one big completed data set. This enables users to implement the MI LRT without modifying the complete-data procedure. A particularly intriguing finding is that the FMI can be estimated consistently by an LRT statistic for testing whether the $m$ completed data sets can be regarded effectively as samples coming from a common model. Practical guidelines are provided based on an extensive comparison of existing MI tests. Issues related to nuisance parameters are also discussed.

Estimation/estimator · Integration · Linear · Functional · ASSETS

December 30, 2021

Bayesian Quantile Regression with Multiple Proxy Variables

Dongyoung Go,Jongho Im,Ick Hoon Jin

Data integration has become more challenging with the emerging availability of multiple data sources. This paper considers Bayesian quantile regression estimation when the key covariate is not directly observed, but the unobserved covariate has multiple proxies. In a unified estimation procedure, the proposed method incorporates these multiple proxies, which have various relationships with the unobserved covariate. The proposed approach allows the inference of both the quantile function and unobserved covariate. Moreover, it requires no linearity of the quantile function or parametric assumptions on the regression error distribution and simultaneously accommodates both linear and nonlinear proxies. The simulation studies show that this methodology successfully integrates multiple proxies and reveals the quantile relationship for a wide range of nonlinear data. The proposed method is applied to the administrative data obtained from the Survey of Household Finances and Living Conditions provided by Statistics Korea. The proposed Bayesian quantile regression is implemented to specify the relationship between assets and salary income in the presence of multiple income records.

Graph · Example · Scaling · SimPLe · Attention mechanism

December 26, 2021

K-Core Decomposition on Super Large Graphs with Limited Resources

Shicheng Gao,Jie Xu,Xiaosen Li,Fangcheng Fu,Wentao Zhang,Wen Ouyang,Yangyu Tao,Bin Cui

from arxiv, 10 pages, 11 figures

K-core decomposition is a commonly used metric to analyze graph structure or study the relative importance of nodes in complex graphs. Recent years have seen rapid growth in the scale of graphs, especially in industrial settings. For example, our industrial partner runs popular social applications with billions of users and is able to gather a rich set of user data. As a result, applying K-core decomposition to large graphs has attracted more and more attention from academia and industry. A simple but effective way to deal with large graphs is to process them in a distributed setting, and several distributed K-core decomposition algorithms have been proposed. Despite their effectiveness, we observe experimentally and theoretically that these algorithms consume too many resources and become unstable on super-large-scale graphs, especially when the given resources are limited. In this paper, we deal with those super-large-scale graphs and propose a divide-and-conquer strategy on top of the distributed K-core decomposition algorithm. We evaluate our approach on three large graphs. The experimental results show that resource consumption can be significantly reduced, and that computation on large-scale graphs becomes more stable than with existing methods. For example, with our divide-and-conquer technique the distributed K-core decomposition algorithm can scale to a large graph with 136 billion edges without losing correctness.
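For readers unfamiliar with the underlying quantity, here is a minimal single-machine sketch of K-core decomposition by repeated minimum-degree peeling. It is the textbook procedure, not the distributed or divide-and-conquer algorithm from the paper. The function name is hypothetical, and the quadratic-time vertex selection is chosen for brevity rather than scale.

```python
from collections import defaultdict

def core_numbers(edges):
    """Return {vertex: core number} for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        # The core number never decreases as peeling proceeds.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core

# A triangle a-b-c with a pendant vertex d attached to c.
print(core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
# d has core number 1; a, b and c have core number 2
```

A production implementation would replace the `min` scan with bucketed degrees to reach linear time in the number of edges.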

Reward function · Functional · Performer · Scenario · Comprehensibility

November 1, 2021

On the Expressivity of Markov Reward

David Abel,Will Dabney,Anna Harutyunyan,Mark K. Ho,Michael L. Littman,Doina Precup,Satinder Singh

from arxiv, Accepted to NeurIPS 2021

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.

Identifiable · INFORMS · EG · entity · Performer

March 1, 2015

From Data Fusion to Knowledge Fusion

Xin Luna Dong,Evgeniy Gabrilovich,Geremy Heitz,Wilko Horn,Kevin Murphy,Shaohua Sun,Wei Zhang

from arxiv, VLDB'2014

The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [LDL+12] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which only focuses on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show great promise of the data fusion approaches in solving the knowledge fusion problem, and suggest interesting research directions through a detailed error analysis of the methods.