Non-parametric maximum likelihood estimation encompasses a group of classic methods to estimate distribution-associated functions from potentially censored and truncated data, with extensive applications in survival analysis. These methods, including the Kaplan-Meier estimator and Turnbull's method, often result in overfitting, especially when the sample size is small. We propose an improvement to these methods by applying kernel smoothing to their raw estimates, based on a BIC-type loss function that balances model fit against model complexity. In the context of a longitudinal study with repeated observations, we detail our proposed smoothing procedure and optimization algorithm. With extensive simulation studies over multiple realistic scenarios, we demonstrate that our smoothing-based procedure provides better overall accuracy in both survival function estimation and individual-level time-to-event prediction by reducing overfitting. Our smoothing procedure decreases the discrepancy between the estimated and true simulated survival function using interval-censored data by up to 49% compared to the raw unsmoothed estimate, with similar improvements of up to 41% and 23% in within-sample and out-of-sample prediction, respectively. Finally, we apply our method to real data on censored breast cancer diagnosis, which similarly shows improvement when compared to empirical survival estimates from uncensored data. We provide an R package, SISE, for implementing our penalized likelihood method.
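As a rough illustration of the smoothing step (not the SISE implementation), the sketch below builds a raw Kaplan-Meier estimate from simulated right-censored data, applies Gaussian-kernel smoothing, and selects the bandwidth by a BIC-style score; the simulated data, the complexity proxy, and the penalty weight are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative only: kernel smoothing of a raw Kaplan-Meier estimate with a
# BIC-style bandwidth selection.
import numpy as np

rng = np.random.default_rng(0)
t_event = rng.exponential(scale=5.0, size=60)     # true event times
t_cens = rng.exponential(scale=8.0, size=60)      # censoring times
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

# Raw (unsmoothed) Kaplan-Meier estimate evaluated on a grid.
order = np.argsort(time)
time, event = time[order], event[order]
at_risk = np.arange(len(time), 0, -1)
km_surv = np.cumprod(np.where(event == 1, 1.0 - 1.0 / at_risk, 1.0))
grid = np.linspace(0.0, time.max(), 200)
idx = np.searchsorted(time, grid, side="right") - 1
raw = np.where(idx >= 0, km_surv[np.clip(idx, 0, None)], 1.0)

def smooth(raw, grid, h):
    """Gaussian-kernel (Nadaraya-Watson) smoothing with bandwidth h."""
    w = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / h) ** 2)
    return (w @ raw) / w.sum(axis=1)

def bic_score(raw, smoothed, h, n):
    """Illustrative BIC-type score: lack of fit plus log(n) times a roughness proxy."""
    fit = n * np.log(np.mean((raw - smoothed) ** 2) + 1e-12)
    complexity = (grid[-1] - grid[0]) / h   # smaller bandwidth = more complex fit
    return fit + np.log(n) * complexity

candidates = np.linspace(0.2, 3.0, 30)
scores = [bic_score(raw, smooth(raw, grid, h), h, len(time)) for h in candidates]
h_best = candidates[int(np.argmin(scores))]
smoothed = smooth(raw, grid, h_best)
print("selected bandwidth:", round(float(h_best), 3))
```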
Background: Survival analysis concerns the study of timeline data where the event of interest may remain unobserved (i.e., censored). Studies commonly record more than one type of event, but conventional survival techniques focus on a single event type. We set out to integrate multiple independently censored time-to-event variables as well as missing observations. Methods: An energy-based approach is taken with a bipartite structure between latent and visible states, commonly known as harmoniums (or restricted Boltzmann machines). Results: The present harmonium is shown, both theoretically and experimentally, to capture non-linear patterns between distinct time recordings. We illustrate on real-world data that, for a single time-to-event variable, our model is on par with established methods. In addition, we demonstrate that discriminative predictions improve by leveraging an extra time-to-event variable. Conclusions: Multiple time-to-event variables can be successfully captured within the harmonium paradigm.
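For readers unfamiliar with the harmonium paradigm, the following minimal sketch trains a Bernoulli harmonium (restricted Boltzmann machine) with one-step contrastive divergence on toy binary data; the paper's handling of censored time-to-event variables and missing observations requires additional machinery that is not reproduced here.

```python
# Minimal Bernoulli harmonium (RBM) trained with CD-1 on toy binary data.
import numpy as np

rng = np.random.default_rng(1)
n, n_vis, n_hid = 500, 12, 6
data = (rng.random((n, n_vis)) < 0.3).astype(float)   # toy binary visibles

W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for epoch in range(50):
    # Positive phase: hidden activations given the data.
    p_hid = sigmoid(data @ W + b_hid)
    h = (rng.random(p_hid.shape) < p_hid).astype(float)
    # Negative phase: one Gibbs step back to the visibles and up again.
    p_vis_neg = sigmoid(h @ W.T + b_vis)
    v_neg = (rng.random(p_vis_neg.shape) < p_vis_neg).astype(float)
    p_hid_neg = sigmoid(v_neg @ W + b_hid)
    # CD-1 gradient estimates for weights and biases.
    W += lr * (data.T @ p_hid - v_neg.T @ p_hid_neg) / n
    b_vis += lr * (data - v_neg).mean(axis=0)
    b_hid += lr * (p_hid - p_hid_neg).mean(axis=0)

print("reconstruction error:", np.mean((data - p_vis_neg) ** 2).round(4))
```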
Given functional data from a survival process with time-dependent covariates, we derive a smooth convex representation for its nonparametric log-likelihood functional and obtain its functional gradient. From this, we devise a generic gradient boosting procedure for estimating the hazard function nonparametrically. An illustrative implementation of the procedure using regression trees is described to show how to recover the unknown hazard. The generic estimator is consistent if the model is correctly specified; alternatively, an oracle inequality can be demonstrated for tree-based models. To avoid overfitting, boosting employs several regularization devices. One of them is step-size restriction, but the rationale for this is somewhat mysterious from the viewpoint of consistency. Our work brings some clarity to this issue by revealing that step-size restriction is a mechanism for preventing the curvature of the risk from derailing convergence.
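The role of step-size restriction can be seen in a generic functional gradient boosting sketch with regression-tree base learners, where the shrinkage factor nu damps each boosting step. For brevity the toy example below boosts a squared-error loss on a synthetic regression target rather than the paper's nonparametric survival log-likelihood functional.

```python
# Generic gradient boosting with regression trees and step-size restriction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(300)

nu = 0.1                      # step-size restriction (shrinkage)
F = np.zeros_like(y)          # current fitted function values
trees = []
for m in range(200):
    residual = y - F          # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=m)
    tree.fit(X, residual)
    F += nu * tree.predict(X) # damped functional gradient step
    trees.append(tree)

def predict(X_new):
    """Sum of shrunken tree contributions."""
    return nu * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - F) ** 2).round(4))
```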
Contemporary time series analysis increasingly involves tensor-valued data from many fields. For example, stocks can be grouped according to Size, Book-to-Market ratio, and Operating Profitability, leading to a 3-way tensor observation each month. We propose an autoregressive model for tensor-valued time series, with autoregressive terms depending on multi-linear coefficient matrices. Compared with the traditional approach of vectorizing the tensor observations and then applying a vector autoregressive model, the tensor autoregressive model preserves the tensor structure and admits corresponding interpretations. We introduce three estimators based on projection, least squares, and maximum likelihood. Our analysis considers both fixed-dimensional and high-dimensional settings. For the former we establish central limit theorems for the estimators, and for the latter we focus on convergence rates and model selection. The performance of the model is demonstrated with simulated and real examples.
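A minimal sketch of the model being described, simulating the tensor autoregressive recursion X_t = X_{t-1} x_1 A_1 x_2 A_2 x_3 A_3 + E_t (x_k denoting the mode-k product) and comparing parameter counts with a vectorized VAR(1); the dimensions, noise level, and stationarity scaling are illustrative assumptions, and the paper's projection, least squares, and maximum likelihood estimators are not reproduced.

```python
# Simulate a 3-way tensor AR(1) process and compare parameter counts.
import numpy as np

rng = np.random.default_rng(3)
d1, d2, d3, T = 4, 3, 2, 200

def stable(d):
    """Random coefficient matrix scaled to have spectral radius < 1."""
    A = rng.standard_normal((d, d))
    return 0.8 * A / np.max(np.abs(np.linalg.eigvals(A)))

A1, A2, A3 = stable(d1), stable(d2), stable(d3)
X = np.zeros((T, d1, d2, d3))
for t in range(1, T):
    prev = X[t - 1]
    # Multi-linear (mode-wise) products along each of the three modes.
    step = np.einsum("ia,ajk->ijk", A1, prev)
    step = np.einsum("jb,ibk->ijk", A2, step)
    step = np.einsum("kc,ijc->ijk", A3, step)
    X[t] = step + 0.1 * rng.standard_normal((d1, d2, d3))

# Parameter counts: tensor AR vs. vectorizing and fitting a VAR(1).
p_tensor = d1**2 + d2**2 + d3**2
p_var = (d1 * d2 * d3) ** 2
print("tensor AR parameters:", p_tensor, "| vectorized VAR parameters:", p_var)
```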
Distributional data analysis, concerned with the statistical analysis and modeling of data objects consisting of random probability density functions (PDFs) in the framework of functional data analysis (FDA), has received considerable interest in recent years. However, many important aspects remain unexplored, such as outlier detection and robustness. Existing functional outlier detection methods are mainly designed for ordinary functional data and usually perform poorly when applied to PDFs. To fill this gap, this study focuses on PDF-valued outlier detection, as well as its application in robust distributional regression. As with ordinary functional data, the major challenge in PDF-outlier detection is detecting shape outliers masked by the "curve net" formed by the bulk of the PDFs. To this end, we propose a tree-structured transformation system that extracts features and converts shape outliers into easily detectable magnitude outliers, and we design outlier detectors tailored to the transformed data. A multiple-detection strategy is also proposed to account for detection uncertainties and to combine different detectors into a more reliable detection tool. Moreover, we propose a distributional-regression-based approach for detecting abnormal associations in pairs of PDF-valued observations. As a specific application, the proposed outlier detection methods are used to robustify a distribution-to-distribution regression method, and we develop a robust estimator of the regression operator by downweighting the detected outliers. The proposed methods are validated and evaluated through extensive simulation studies and real data applications. Comparative studies demonstrate the superiority of the developed outlier detection method over competing approaches to distributional outlier detection.
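To convey why a transformation helps, the toy sketch below maps distributions to empirical quantile functions, where a bimodal shape outlier hidden in the "curve net" of Gaussian samples becomes a magnitude outlier that a simple robust distance rule can flag. This is a generic stand-in for illustration only, not the paper's tree-structured transformation system or its multiple-detection strategy.

```python
# Toy shape-outlier detection after transforming distributions to quantiles.
import numpy as np
from scipy.stats import norm

grid = np.linspace(0.01, 0.99, 99)          # probability grid for quantiles
samples = [norm(loc=0.0, scale=1.0).rvs(size=400, random_state=i)
           for i in range(20)]
# A shape outlier: bimodal, but with mean and range similar to the bulk.
outlier = np.concatenate([norm(-1.2, 0.3).rvs(200, random_state=99),
                          norm(1.2, 0.3).rvs(200, random_state=100)])
samples.append(outlier)

# Transform each distribution into its empirical quantile function.
Q = np.array([np.quantile(s, grid) for s in samples])

# Magnitude rule on the transformed curves: distance to the cross-sectional
# median, flagged against a robust cutoff (median + 3 * MAD).
median_curve = np.median(Q, axis=0)
dist = np.sqrt(np.mean((Q - median_curve) ** 2, axis=1))
mad = 1.4826 * np.median(np.abs(dist - np.median(dist)))
cutoff = np.median(dist) + 3 * mad
print("flagged indices:", np.where(dist > cutoff)[0])
```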
Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate the three estimators and find that the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE.
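As context for expected validation performance, the Monte Carlo sketch below targets E[max of k validation scores] and compares a high-variance estimator that uses only k of the n observed scores with an order-statistic estimator that reuses all n scores; whether these correspond to the three estimators analyzed in the paper is not assumed, and the Beta-distributed scores are purely synthetic.

```python
# Bias/variance/MSE of two estimators of expected validation performance.
import numpy as np
from math import comb

rng = np.random.default_rng(4)
n, k, reps = 50, 10, 5000

def unbiased_expected_max(scores, k):
    """Expected max over a random size-k subset, computed via order statistics."""
    v = np.sort(scores)                              # ascending
    n = len(v)
    # P(subset max = v[i]) = C(i, k-1) / C(n, k), i = 0..n-1 (0-indexed).
    weights = np.array([comb(i, k - 1) / comb(n, k) for i in range(n)])
    return float(np.dot(weights, v))

subsample, orderstat = [], []
for _ in range(reps):
    scores = rng.beta(2, 5, size=n)                  # synthetic validation accuracies
    subsample.append(scores[:k].max())               # uses only k of the n scores
    orderstat.append(unbiased_expected_max(scores, k))

truth = rng.beta(2, 5, size=(200_000, k)).max(axis=1).mean()
for name, est in [("subsample-max", np.array(subsample)),
                  ("order-statistic", np.array(orderstat))]:
    bias = est.mean() - truth
    print(f"{name:16s} bias={bias:+.4f}  var={est.var():.5f}  "
          f"mse={bias**2 + est.var():.5f}")
```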
A new family of methods involving complex coefficients for the numerical integration of differential equations is presented and analyzed. They are constructed as linear combinations of symmetric-conjugate compositions obtained from a basic time-symmetric integrator of order 2n (n $\ge$ 1). The new integrators are of order 2(n + k), k = 1, 2, ..., and preserve time-symmetry up to order 4n + 3 when applied to differential equations with real vector fields. If in addition the system is Hamiltonian and the basic scheme is symplectic, then they also preserve symplecticity up to order 4n + 3. We show that these integrators are well suited to parallel implementation, which further improves their efficiency. Methods up to order 10 based on a 4th-order integrator are built and tested against other standard procedures for raising the order of a basic scheme.
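For orientation, the sketch below shows the classical real-coefficient route to raising the order of a basic time-symmetric scheme: Yoshida's triple-jump composition applied to a Störmer-Verlet step for the harmonic oscillator. The paper's methods instead use complex coefficients and symmetric-conjugate compositions, which are not reproduced here.

```python
# Classical triple-jump composition raising an order-2 scheme by composition.
import numpy as np

def verlet(q, p, h):
    """One Stoermer-Verlet step for the harmonic oscillator H = (p^2 + q^2)/2."""
    p -= 0.5 * h * q
    q += h * p
    p -= 0.5 * h * q
    return q, p

# Triple-jump coefficients (Yoshida), real-valued.
w1 = 1.0 / (2.0 - 2.0 ** (1.0 / 3.0))
w0 = 1.0 - 2.0 * w1

def composed(q, p, h):
    for w in (w1, w0, w1):
        q, p = verlet(q, p, w * h)
    return q, p

def global_error(stepper, h, t_end=10.0):
    q, p = 1.0, 0.0
    for _ in range(int(round(t_end / h))):
        q, p = stepper(q, p, h)
    return abs(q - np.cos(t_end))            # exact solution is q(t) = cos(t)

for h in (0.1, 0.05):
    e_base, e_comp = global_error(verlet, h), global_error(composed, h)
    print(f"h={h}: base-scheme error {e_base:.2e}, composed error {e_comp:.2e}")
```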
In collaborative learning, multiple parties contribute their datasets to jointly train global machine learning models for numerous predictive tasks. Despite its efficacy, this learning paradigm fails to encompass critical application domains that involve highly sensitive data, such as healthcare and security analytics, where privacy risks limit entities to individually training models using only their own datasets. In this work, we target privacy-preserving collaborative hierarchical clustering. We introduce a formal security definition that aims to balance utility and privacy, and present a two-party protocol that provably satisfies it. We then extend our protocol with: (i) an optimized version for single-linkage clustering, and (ii) scalable approximation variants. We implement all our schemes and experimentally evaluate their performance and accuracy on synthetic and real datasets, obtaining very encouraging results. For example, end-to-end execution of our secure approximate protocol for over 1M 10-dimensional data samples requires 35 seconds of computation and achieves 97.09% accuracy.
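The plaintext functionality that such a protocol emulates can be stated in a few lines: single-linkage hierarchical clustering over the pooled records of both parties. The sketch below runs entirely in the clear on synthetic data; the security definition, the two-party protocol, and the scalable approximations are not shown.

```python
# Plaintext single-linkage clustering over two parties' pooled records.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
party_a = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
party_b = rng.normal(loc=3.0, scale=1.0, size=(100, 10))
pooled = np.vstack([party_a, party_b])       # never revealed in the secure setting

dendrogram = linkage(pooled, method="single")       # single-linkage merges
labels = fcluster(dendrogram, t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```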
Implicit probabilistic models are defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.
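To make the setting concrete, the sketch below defines an implicit model purely through its sampler and recovers a parameter by matching simulated samples to observed data over a grid (average distance to the nearest simulated point). This is a generic likelihood-free illustration of the setting, not the estimator proposed in the paper, and the simulator and loss are assumptions made for the example.

```python
# Likelihood-free parameter recovery for a sampler-defined (implicit) model.
import numpy as np

def simulator(theta, size, rng):
    """Implicit model: defined through its sampling procedure only;
    no density is evaluated anywhere below."""
    return theta + rng.standard_normal(size) * np.abs(rng.standard_normal(size))

observed = simulator(2.0, 500, np.random.default_rng(7))   # true theta = 2.0

def match_loss(theta, observed, m=2000, seed=8):
    """Average distance from each observation to its nearest simulated sample."""
    sims = simulator(theta, m, np.random.default_rng(seed))
    return np.min(np.abs(observed[:, None] - sims[None, :]), axis=1).mean()

grid = np.linspace(0.0, 4.0, 41)
losses = [match_loss(th, observed) for th in grid]
print("estimated theta:", float(grid[int(np.argmin(losses))]))
```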
This paper addresses the problem of viewpoint estimation of an object in a given image. It presents five key insights that should be taken into consideration when designing a CNN that solves the problem. Based on these insights, the paper proposes a network in which (i) the architecture jointly solves detection, classification, and viewpoint estimation; (ii) new types of data are added to the training set; and (iii) a novel loss function, which takes into account both the geometry of the problem and the new types of data, is proposed. Our network improves the state-of-the-art results for this problem by 9.8%.
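A schematic version of a jointly trained network, written in PyTorch: a shared backbone feeds separate classification and viewpoint heads and the two cross-entropy losses are summed. The layer sizes, the number of viewpoint bins, and the loss are placeholders; the paper's detection branch and geometry-aware loss are not reproduced.

```python
# Schematic multi-head CNN for joint classification and viewpoint estimation.
import torch
import torch.nn as nn

class JointViewpointNet(nn.Module):
    def __init__(self, n_classes=12, n_view_bins=360):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.class_head = nn.Linear(64, n_classes)        # object category
        self.view_head = nn.Linear(64, n_view_bins)       # discretized azimuth

    def forward(self, x):
        feat = self.backbone(x)
        return self.class_head(feat), self.view_head(feat)

model = JointViewpointNet()
images = torch.randn(8, 3, 128, 128)
class_labels = torch.randint(0, 12, (8,))
view_labels = torch.randint(0, 360, (8,))

class_logits, view_logits = model(images)
loss = nn.functional.cross_entropy(class_logits, class_labels) \
     + nn.functional.cross_entropy(view_logits, view_labels)
loss.backward()
print("joint loss:", float(loss))
```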
We propose a new estimation method for topic models that is not a variation on existing simplex-finding algorithms and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, even though it starts from a computational and theoretical disadvantage: it does not know the correct number of topics K, while the competing methods are supplied with the correct value in our simulations.
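For reference, the data-generating side of a topic model: a word-topic matrix A (p x K) with columns on the probability simplex, per-document topic weights, and multinomial word counts of length N_i. Only the generative sketch is shown; the paper's estimator of A and of K is not reproduced, and the Dirichlet parameters are illustrative.

```python
# Generate a synthetic corpus from a topic model with p words and K topics.
import numpy as np

rng = np.random.default_rng(9)
p, K, n, N_i = 200, 5, 100, 300       # dictionary size, topics, documents, doc length

A = rng.dirichlet(alpha=np.full(p, 0.05), size=K).T     # p x K, columns sum to 1
W = rng.dirichlet(alpha=np.full(K, 0.3), size=n).T      # K x n topic weights per document

docs = np.stack([rng.multinomial(N_i, A @ W[:, i]) for i in range(n)])  # n x p counts
print("corpus shape:", docs.shape, "| first column sums of A:", A.sum(axis=0)[:3])
```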