In the context of large samples, a small number of individuals might spoil basic statistical indicators like the mean. It is difficult to detect these atypical individuals automatically, and an alternative strategy is to use robust approaches. This paper focuses on estimating the geometric median of a random variable, which is a robust indicator of central tendency. In order to deal with large samples of data arriving sequentially, online stochastic Newton algorithms for estimating the geometric median are introduced and their rates of convergence are given. Since estimates of the median and those of the Hessian matrix can be recursively updated, we also determine confidence intervals for the median in any designated direction and perform online statistical tests.
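To make the recursive flavour of such estimators concrete, the following is a minimal sketch of a simpler relative of the algorithms described above: an averaged stochastic gradient recursion for the geometric median, not the stochastic Newton scheme of the paper. The step-size constants `c` and `alpha`, the helper names, and the contaminated test sample are all assumptions made for illustration.

```python
import math
import random


def online_geometric_median(stream, c=1.0, alpha=0.66):
    """Averaged stochastic gradient estimate of the geometric median.

    `c` and `alpha` are illustrative step-size choices, not values from the paper.
    """
    it = iter(stream)
    m = list(next(it))          # current iterate, started at the first point
    avg = list(m)               # running average of the iterates
    for n, x in enumerate(it, start=1):
        diff = [xi - mi for xi, mi in zip(x, m)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            step = c / n ** alpha
            # Move a bounded distance toward the new point: the unit vector
            # (x - m) / ||x - m|| caps the influence of any single outlier.
            m = [mi + step * d / norm for mi, d in zip(m, diff)]
        # Recursive update of the average of m_0, ..., m_n.
        avg = [a + (mi - a) / (n + 1) for a, mi in zip(avg, m)]
    return avg


def make_contaminated_sample(n=5000, seed=0):
    """2-D standard Gaussian cloud with every 20th point replaced by an outlier."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        if i % 20 == 19:
            data.append((100.0, 100.0))
        else:
            data.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
    return data


data = make_contaminated_sample()
median_est = online_geometric_median(data)
mean_est = [sum(col) / len(data) for col in zip(*data)]
```

On this hypothetical sample, the 5% of outliers drag the mean far from the bulk of the data, while the median estimate stays near the origin. Each observation is used once and then discarded, which is what makes the recursion suitable for data arriving sequentially.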
We developed a new method, PROTES, for black-box optimization, based on probabilistic sampling from a probability density function given in the low-parametric tensor train format. We tested it on complex multidimensional arrays and discretized multivariable functions taken, among others, from real-world applications, including unconstrained binary optimization and optimal control problems, for which the possible number of elements is up to $2^{100}$. In numerical experiments, both on analytic model functions and on complex problems, PROTES outperforms existing popular discrete optimization methods (Particle Swarm Optimization, Covariance Matrix Adaptation, Differential Evolution, and others).
The quality of the inferences we make from pathogen sequence data is determined by the number and composition of pathogen sequences that make up the sample used to drive that inference. However, there remains limited guidance on how best to structure and power studies when the end goal is phylogenetic inference. One question that we can attempt to answer with molecular data is whether some people are more likely to transmit a pathogen than others. Here we present an estimator to quantify differential transmission, as measured by the ratio of reproductive numbers between people with different characteristics, using transmission pairs linked by molecular data, along with a sample size calculation for this estimator. We also provide extensions to our method to correct for imperfect identification of transmission-linked pairs, overdispersion in the transmission process, and group imbalance. We validate this method via simulation and provide tools to implement it in an R package, phylosamp.
The model-X knockoffs framework provides a flexible tool for achieving finite-sample false discovery rate (FDR) control in variable selection in arbitrary dimensions without assuming any dependence structure of the response on covariates. It also completely bypasses the use of conventional p-values, making it especially appealing in high-dimensional nonlinear models. Existing works have focused on the setting of independent and identically distributed observations. Yet time series data is prevalent in practical applications in various fields such as economics and social sciences. This motivates the study of model-X knockoffs inference for time series data. In this paper, we make an initial attempt to establish the theoretical and methodological foundation for model-X knockoffs inference for time series data. We suggest the method of time series knockoffs inference (TSKI), which exploits the ideas of subsampling and e-values to address the difficulty caused by serial dependence. We also generalize robust knockoffs inference to the time series setting and relax the assumption of a known covariate distribution required by model-X knockoffs, because such an assumption is overly stringent for time series data. We establish sufficient conditions under which TSKI achieves asymptotic FDR control. Our technical analysis reveals the effects of serial dependence and unknown covariate distribution on the FDR control. We conduct a power analysis of TSKI using the Lasso coefficient difference knockoff statistic under linear time series models. The finite-sample performance of TSKI is illustrated with several simulation examples and an economic inflation study.
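The abstract does not spell out how TSKI turns e-values into a selected set. As a generic illustration of FDR control with e-values, the sketch below implements the standard e-BH procedure: with $m$ hypotheses and target level $\alpha$, reject the $k^*$ hypotheses with the largest e-values, where $k^*$ is the largest $k$ such that the $k$-th largest e-value is at least $m/(\alpha k)$. The function name `e_bh` and the input values are hypothetical, and this should not be read as the authors' exact method.

```python
def e_bh(e_values, alpha=0.1):
    """Return indices rejected by the e-BH procedure at target FDR level alpha."""
    m = len(e_values)
    # Sort hypothesis indices by e-value, largest (strongest evidence) first.
    order = sorted(range(m), key=lambda j: e_values[j], reverse=True)
    k_star = 0
    for k, j in enumerate(order, start=1):
        # The k-th largest e-value must clear the threshold m / (alpha * k).
        if e_values[j] >= m / (alpha * k):
            k_star = k
    return sorted(order[:k_star])


# Made-up e-values for five candidate variables.
rejected = e_bh([100.0, 50.0, 1.0, 0.5, 30.0], alpha=0.1)
print(rejected)  # [0, 1, 4]
```

Here the thresholds are 50, 25, 16.7, 12.5 and 10 for $k = 1, \dots, 5$, so the three largest e-values (100, 50, 30) pass and the remaining two do not.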
Accurate and efficient estimation of rare event probabilities is of significant importance, since the occurrences of such events often have widespread impacts. The focus in this work is on precisely quantifying these probabilities, often encountered in reliability analysis of complex engineering systems, based on an introduced framework termed Approximate Sampling Target with Post-processing Adjustment (ASTPA), which herein is integrated with and supported by gradient-based Hamiltonian Markov Chain Monte Carlo (HMCMC) methods. The techniques developed in this paper are applicable from low- to high-dimensional stochastic spaces, and the basic idea is to construct a relevant target distribution by weighting the original random variable space through a one-dimensional output likelihood model, using the limit-state function. To sample from this target distribution, we exploit HMCMC algorithms, a family of MCMC methods that adopts physical system dynamics, rather than solely using a proposal probability distribution, to generate distant sequential samples, and we develop a new Quasi-Newton mass preconditioned HMCMC scheme (QNp-HMCMC), which is particularly efficient and suitable for high-dimensional spaces. To eventually compute the rare event probability, an original post-sampling step is devised using an inverse importance sampling procedure based on the already obtained samples. The statistical properties of the estimator are analyzed as well, and the performance of the proposed methodology is examined in detail and compared against Subset Simulation in a series of challenging low- and high-dimensional problems.
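To see why sampling from a target concentrated near the failure region helps, consider a one-dimensional toy problem: estimating $P(Z > 4)$ for a standard normal $Z$. The sketch below uses plain importance sampling with a hand-picked shifted Gaussian proposal. It is not ASTPA, its inverse importance sampling step, or HMCMC; the threshold, sample size, and proposal are assumptions chosen only to illustrate the reweighting idea.

```python
import math
import random


def rare_event_is(threshold=4.0, n=20000, seed=0):
    """Estimate P(Z > threshold), Z ~ N(0, 1), with a shifted Gaussian proposal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)        # proposal q = N(threshold, 1)
        if x > threshold:                    # indicator of the rare event
            # Likelihood ratio p(x) / q(x) for two unit-variance Gaussians.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


estimate = rare_event_is()
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # closed form, about 3.2e-5
```

With crude Monte Carlo, 20000 draws from $N(0, 1)$ would on average contain fewer than one sample beyond the threshold. The shifted proposal places about half of its samples there and corrects for the shift through the likelihood ratio.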
The goal of multi-objective optimization is to identify a collection of points which describe the best possible trade-offs between the multiple objectives. In order to solve this vector-valued optimization problem, practitioners often appeal to the use of scalarization functions in order to transform the multi-objective problem into a collection of single-objective problems. This set of scalarized problems can then be solved using traditional single-objective optimization techniques. In this work, we formalise this convention into a general mathematical framework. We show how this strategy effectively recasts the original multi-objective optimization problem into a single-objective optimization problem defined over sets. An appropriate class of objective functions for this new problem is the R2 utility function, which is defined as a weighted integral over the scalarized optimization problems. We show that this utility function is a monotone and submodular set function, which can be optimised effectively using greedy optimization algorithms. We analyse the performance of these greedy algorithms both theoretically and empirically. Our analysis largely focusses on Bayesian optimization, which is a popular probabilistic framework for black-box optimization.
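The set-based view can be sketched on a finite pool of candidate outcomes. In the sketch below the integral over scalarizations is approximated by an average over sampled weight vectors, and a set is built greedily by marginal gain. The Chebyshev-style scalarization, the weight distribution, and the randomly generated candidates are assumptions made for illustration, not choices taken from the paper.

```python
import itertools
import random


def scalarize(y, w):
    """Chebyshev-style scalarization for maximization (assumes y > 0, w > 0)."""
    return min(yi / wi for yi, wi in zip(y, w))


def r2_utility(subset, weights):
    """Monte Carlo R2 utility: average over weights of the best scalarized value."""
    if not subset:
        return 0.0
    return sum(max(scalarize(y, w) for y in subset) for w in weights) / len(weights)


def greedy_select(candidates, weights, budget):
    """Add, one at a time, the point with the largest marginal utility gain."""
    chosen = []
    remaining = list(candidates)
    for _ in range(budget):
        best = max(remaining, key=lambda y: r2_utility(chosen + [y], weights))
        chosen.append(best)
        remaining.remove(best)
    return chosen


rng = random.Random(0)
# Hypothetical two-objective outcomes and weight vectors, for illustration only.
candidates = [(rng.uniform(0.1, 1.0), rng.uniform(0.1, 1.0)) for _ in range(12)]
weights = []
for _ in range(64):
    a = rng.uniform(0.05, 0.95)
    weights.append((a, 1.0 - a))

greedy_set = greedy_select(candidates, weights, budget=3)
greedy_value = r2_utility(greedy_set, weights)
# Exhaustive search over all 3-subsets, feasible only because the pool is tiny.
best_value = max(
    r2_utility(list(s), weights) for s in itertools.combinations(candidates, 3)
)
```

Because this utility is an average of maxima of nonnegative values, it is monotone and submodular, so the classical guarantee applies: the greedy value is at least $(1 - 1/e)$ times the best value over all sets of the same size.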
A nonlinear sea-ice problem is considered in a least-squares finite element setting. The corresponding variational formulation, which approximates the stress tensor and the velocity simultaneously, is analysed. In particular, the least-squares functional is coercive and continuous in an appropriate solution space, which proves the well-posedness of the problem. As the method does not require a compatibility condition between the finite element spaces, the formulation allows the use of piecewise polynomial spaces of the same approximation order for both the stress and the velocity approximations. A Newton-type iterative method is used to linearize the problem and numerical tests are provided to illustrate the theory.
Temporal data such as time series can be viewed as discretized measurements of an underlying function. To build a generative model for such data we have to model the stochastic process that governs it. We propose a solution by defining the denoising diffusion model in the function space, which also allows us to naturally handle irregularly sampled observations. The forward process gradually adds noise to functions, preserving their continuity, while the learned reverse process removes the noise and returns functions as new samples. To this end, we define suitable noise sources and introduce novel denoising and score-matching models. We show how our method can be used for multivariate probabilistic forecasting and imputation, and how our model can be interpreted as a neural process.
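One way to picture a forward process that keeps noisy functions continuous is to replace independent noise with noise that is correlated in time. The sketch below samples a stationary Ornstein-Uhlenbeck process exactly at irregular timestamps and mixes it into a signal with the usual closed-form step $x_s = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\varepsilon$. The choice of Ornstein-Uhlenbeck noise, this closed form, and all names and parameters are assumptions for illustration; the abstract does not specify the noise sources or the parameterization.

```python
import math
import random


def ou_noise(times, length_scale=1.0, rng=random):
    """Sample stationary Ornstein-Uhlenbeck noise (unit variance) at sorted times."""
    eps = [rng.gauss(0.0, 1.0)]
    for t_prev, t_next in zip(times, times[1:]):
        rho = math.exp(-(t_next - t_prev) / length_scale)  # correlation across the gap
        eps.append(rho * eps[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return eps


def forward_noising(x0, times, alpha_bar, length_scale=1.0, rng=random):
    """One-shot forward step: shrink the signal and mix in time-correlated noise."""
    eps = ou_noise(times, length_scale, rng)
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


rng = random.Random(0)
times = sorted(rng.uniform(0.0, 10.0) for _ in range(50))   # irregular timestamps
x0 = [math.sin(t) for t in times]
x_noisy = forward_noising(x0, times, alpha_bar=0.5, rng=rng)
x_clean = forward_noising(x0, times, alpha_bar=1.0, rng=rng)  # no noise mixed in
```

Because the correlation between neighbouring noise values depends only on the gap between their timestamps, the same code works for any sampling grid, which is the property that makes irregular observations easy to handle.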
Disentangled representation learning is a challenging task that involves separating multiple factors of variation in complex data. Although various metrics for learning and evaluating disentangled representations have been proposed, it remains unclear what these metrics truly quantify and how to compare them. In this work, we study the definitions of disentanglement given by first-order equational predicates and introduce a systematic approach for transforming an equational definition into a compatible quantitative metric based on enriched category theory. Specifically, we show how to replace (i) equality with metric or divergence, (ii) logical connectives with order operations, (iii) universal quantifier with aggregation, and (iv) existential quantifier with the best approximation. Using this approach, we derive metrics for measuring the desired properties of a disentangled representation extractor and demonstrate their effectiveness on synthetic data. Our proposed approach provides practical guidance for researchers in selecting appropriate evaluation metrics and designing effective learning algorithms for disentangled representation learning.
A proper fusion of complex data is of interest to many researchers in diverse fields, including computational statistics, computational geometry, bioinformatics, machine learning, pattern recognition, quality management, engineering, statistics, finance, and economics. It plays a crucial role in the synthetic description of data processes or whole domains, creation of rule bases for approximate reasoning tasks, reaching consensus and selection of the optimal strategy in decision support systems, imputation of missing values, data deduplication and consolidation, record linkage across heterogeneous databases, and clustering. This open-access research monograph integrates the spread-out results from different domains using the methodology of the well-established classical aggregation framework, introduces researchers and practitioners to Aggregation 2.0, and points out the challenges and interesting directions for further research.
Since deep neural networks were developed, they have made huge contributions to everyday lives. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks are still challenging and unpredictable procedures. To lower the technical thresholds for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academic and industrial areas. This paper provides a review of the most essential topics on HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods to define the value range. Then, the research focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy especially for deep learning networks. This study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art searching algorithms, feasibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with problems that exist when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.