Zero-inflated continuous data appear ubiquitously in many fields: a large proportion of observations are exactly zero, while the rest are continuously distributed. Because the resulting distribution mixes discrete and continuous components, statistical analysis is challenging, especially in the multivariate case. In this paper, we propose two copula-based density estimation models that can cope with multivariate correlation among zero-inflated continuous variables. To overcome the difficulty that ties in zero-inflated data pose for copulas, we propose a new type of copula, the rectified Gaussian copula, and present efficient methods for parameter estimation and likelihood computation. Numerical experiments demonstrate the superiority of our proposals over conventional density estimation methods.
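As a minimal illustration of the rectification idea only, the sketch below samples a correlated latent Gaussian vector and sets its negative part to exact zero, producing correlated zero-inflated margins; the correlation matrix and the shift are assumed values chosen for illustration, and the paper's copula estimation procedure is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent correlation among d zero-inflated variables (assumed, for illustration).
d = 3
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

# Sample correlated latent Gaussians and rectify: negative values map to exact zeros,
# positive values stay continuous, yielding correlated zero-inflated margins.
n = 10_000
L = np.linalg.cholesky(R)
z = rng.standard_normal((n, d)) @ L.T
x = np.maximum(z - 0.5, 0.0)   # the shift controls the zero-inflation rate per margin

print("proportion of exact zeros per margin:", (x == 0.0).mean(axis=0))
print("empirical correlation of the rectified data:\n", np.corrcoef(x, rowvar=False))
```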
We are interested in the nonparametric estimation of the probability density of price returns, using the kernel approach. The output of the method heavily relies on the selection of a bandwidth parameter, and many selection methods have been proposed in the statistical literature. We put forward an alternative selection method based on a criterion coming from information theory and the physics of complex systems: the selected bandwidth maximizes a new measure of complexity, with the aim of avoiding both overfitting and underfitting. We review existing methods of bandwidth selection and show that they lead to contradictory conclusions regarding the complexity of the probability distribution of price returns. This also has striking consequences for the evaluation of the relevance of the efficient market hypothesis. We apply these methods to real financial data, focusing on Bitcoin.
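The abstract does not spell out the complexity measure, so the sketch below only illustrates the generic recipe of scanning a bandwidth grid and keeping the maximizer of a selection criterion; cross-validated log-likelihood is used here purely as a placeholder criterion, and the "returns" are synthetic heavy-tailed data standing in for real price returns.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Synthetic heavy-tailed data standing in for log-returns.
returns = rng.standard_t(df=3, size=2000).reshape(-1, 1)

# Scan a grid of bandwidths and keep the one maximizing a selection criterion.
# The criterion here (cross-validated log-likelihood) is only a placeholder for
# the complexity measure proposed in the paper.
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-2, 0.5, 30)},
    cv=5,
)
grid.fit(returns)
best_h = grid.best_params_["bandwidth"]
print(f"selected bandwidth: {best_h:.4f}")

# Density estimate on a grid with the selected bandwidth.
xs = np.linspace(returns.min(), returns.max(), 500).reshape(-1, 1)
log_dens = grid.best_estimator_.score_samples(xs)
```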
We develop PROTES, a new method for black-box optimization based on probabilistic sampling from a probability density function given in the low-parametric tensor train format. We tested it on complex multidimensional arrays and discretized multivariable functions taken, among others, from real-world applications, including unconstrained binary optimization and optimal control problems, for which the number of elements can be as large as $2^{100}$. In numerical experiments, both on analytic model functions and on complex problems, PROTES outperforms popular existing discrete optimization methods (Particle Swarm Optimization, Covariance Matrix Adaptation, Differential Evolution, and others).
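PROTES parameterizes the sampling distribution in the tensor train format; the toy sketch below replaces it with independent Bernoulli factors in a cross-entropy-style loop, purely to illustrate the sample-and-update principle on a binary objective. The objective function and all constants are invented for illustration and this is not the actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Toy binary objective (to minimize): number of bit flips relative to a hidden target.
    target = np.arange(x.shape[-1]) % 2
    return (x != target).sum(axis=-1)

d, n_samples, n_elite, n_iters = 50, 200, 20, 100
p = np.full(d, 0.5)                      # independent Bernoulli factors (not a tensor train)

for _ in range(n_iters):
    x = (rng.random((n_samples, d)) < p).astype(int)   # sample candidate bit strings
    elite = x[np.argsort(f(x))[:n_elite]]              # keep the best candidates
    p = 0.7 * p + 0.3 * elite.mean(axis=0)             # move the distribution toward them

best = (p > 0.5).astype(int)
print("best objective found:", f(best))
```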
We consider relative model comparison for the parametric coefficients of a semiparametric ergodic L\'{e}vy driven model observed at high frequency. Our asymptotics are based on the fully explicit two-stage Gaussian quasi-likelihood function (GQLF) of the Euler-approximation type. For selection of the scale and drift coefficients, we propose explicit Gaussian quasi-AIC (GQAIC) and Gaussian quasi-BIC (GQBIC) statistics through the stepwise inference procedure. In particular, we show that the mixed-rates structure of the joint GQLF, which does not emerge in the case of diffusions, gives rise to non-standard forms of the regularization terms in the selection of the scale coefficient, quantitatively clarifying the relation between estimation precision and sampling frequency. Numerical experiments are given to illustrate our theoretical findings.
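A minimal numerical sketch, assuming a one-dimensional model with user-supplied drift and scale parametrizations, of the Euler-type Gaussian quasi-log-likelihood underlying the approach; the two-stage procedure and the GQAIC/GQBIC correction terms derived in the paper are not reproduced.

```python
import numpy as np

def gqlf(x, h, drift, scale, alpha, gamma):
    """Euler-type Gaussian quasi-log-likelihood for increments of a 1-D process.

    x: observations at times 0, h, 2h, ...; drift(x, alpha) and scale(x, gamma) are
    user-supplied coefficient functions (an assumed parametrization, for illustration).
    """
    dx = np.diff(x)
    m = h * drift(x[:-1], alpha)          # conditional mean of each increment
    v = h * scale(x[:-1], gamma) ** 2     # conditional variance of each increment
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (dx - m) ** 2 / v)

# Example: Ornstein-Uhlenbeck-type coefficients, simulated by the Euler scheme.
rng = np.random.default_rng(3)
h, n = 0.01, 5000
x = np.empty(n + 1); x[0] = 0.0
for i in range(n):
    x[i + 1] = x[i] + h * (-1.0 * x[i]) + np.sqrt(h) * 0.5 * rng.standard_normal()

print(gqlf(x, h, drift=lambda y, a: -a * y, scale=lambda y, g: g, alpha=1.0, gamma=0.5))
```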
Empirical best prediction (EBP) is a well-known method for producing reliable proportion estimates when the primary data source provides only a small sample, or none at all, from the finite populations of interest. There are at least two potential challenges in implementing the existing EBP methodology. First, one must accurately link the sample to the finite population frame; this may be difficult or even impossible because of the absence of identifiers that can be used to link the sample and the frame. Second, the finite population frame typically contains limited auxiliary variables, which may not be adequate for building a reasonable working predictive model. We propose a data linkage approach in which we replace the finite population frame by a big sample that does not contain the binary outcome variable of interest but has a large set of auxiliary variables. Our proposed method calls for fitting the assumed model using data from the smaller sample, imputing the outcome variable for all units of the big sample, and then using these imputed values to obtain a standard weighted proportion from the big sample. We develop a new adjusted maximum likelihood method to avoid the boundary estimates of the model variance encountered with the commonly used maximum likelihood estimation method. We propose an estimator of the mean squared prediction error (MSPE) using a parametric bootstrap method and address computational issues by developing an efficient EM algorithm. We illustrate the proposed methodology in the context of election projection for small areas.
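A minimal sketch of the impute-and-aggregate step only, with a logistic working model on fully synthetic data standing in for the assumed predictive model; the adjusted maximum likelihood estimation, the EM algorithm, and the parametric bootstrap MSPE are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Small sample with the binary outcome and auxiliaries; big sample with auxiliaries
# and survey weights only (all synthetic, for illustration).
X_small = rng.normal(size=(300, 4))
y_small = rng.binomial(1, 1 / (1 + np.exp(-(X_small[:, 0] - 0.5 * X_small[:, 1]))))
X_big = rng.normal(size=(50_000, 4))
w_big = rng.uniform(0.5, 2.0, size=50_000)     # survey weights for the big sample

# Fit the working predictive model on the small sample.
model = LogisticRegression().fit(X_small, y_small)

# Impute outcome probabilities for every unit of the big sample and aggregate
# them into a weighted proportion estimate.
p_hat = model.predict_proba(X_big)[:, 1]
weighted_proportion = np.sum(w_big * p_hat) / np.sum(w_big)
print(f"estimated proportion: {weighted_proportion:.3f}")
```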
Markov switching models have had increasing success in time series analysis due to their ability to capture the existence of unobserved discrete states in the dynamics of the variables under study. This result is generally obtained through inference on the states derived from the so-called Hamilton filter. One of the open problems in this framework is the identification of the number of states, which is generally fixed a priori; classical tests cannot be applied because of nuisance parameters that are present only under the alternative hypothesis. In this work we show, by Monte Carlo simulations, that fuzzy clustering is able to reproduce the parametric state inference derived from the Hamilton filter, and that the indices typically used in clustering to determine the number of groups can be used to identify the number of states in this framework. The procedure is very simple to apply, considering that it is performed (in a nonparametric way) independently of the data generating process and that the indices we use are available in most statistical packages. A final application to real data completes the analysis.
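A minimal sketch of the idea, assuming simple synthetic regime features: run fuzzy c-means for several candidate numbers of clusters and score each with a standard validity index (here the fuzzy partition coefficient). This is not the paper's Monte Carlo comparison with the Hamilton filter, and the feature construction is invented for illustration.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means on feature vectors X; returns the n x c membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U

def partition_coefficient(U):
    """Fuzzy partition coefficient: values closer to 1 indicate a crisper partition."""
    return float(np.mean(np.sum(U ** 2, axis=1)))

# Synthetic two-regime features (e.g., local means and volatilities of a series).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(4.0, 1.5, size=(200, 2))])

for c in (2, 3, 4, 5):
    print(c, round(partition_coefficient(fuzzy_cmeans(X, c)), 3))
```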
Temporally indexed data are essential in a wide range of fields and of interest to machine learning researchers. Time series data, however, are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations and the application of existing and new data-intensive ML methods. A possible solution to this bottleneck is to generate synthetic data. In this work, we introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series. TSGM includes a broad repertoire of machine learning methods: generative, probabilistic, and simulator-based approaches. The framework enables users to evaluate the quality of the produced data from different angles: similarity, downstream effectiveness, predictive consistency, diversity, and privacy. The framework is extensible, which allows researchers to rapidly implement their own methods and compare them in a shareable environment. TSGM was tested on open datasets and in production and proved to be beneficial in both cases. In addition to the library, the project provides command-line interfaces for synthetic data generation, which lowers the entry threshold for users without a programming background.
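The snippet below does not use TSGM's API; it only illustrates, with generic scikit-learn tooling, one of the evaluation angles mentioned above, downstream effectiveness, as a "train on synthetic, test on real" check. The data, windowing, and labels are synthetic stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(6)

def make_windows(series, labels, width=24):
    """Slice a labelled multivariate series into fixed-width, flattened windows."""
    idx = np.arange(0, len(series) - width, width)
    X = np.stack([series[i:i + width].ravel() for i in idx])
    return X, labels[idx]

# Stand-ins for a real series and a synthetic series produced by some generator.
real = rng.normal(size=(5000, 3))
synthetic = real + rng.normal(scale=0.3, size=real.shape)
labels = (real[:, 0] > 0).astype(int)

X_real, y_real = make_windows(real, labels)
X_syn, y_syn = make_windows(synthetic, labels)

# "Train on synthetic, test on real": a simple downstream-effectiveness check.
clf = RandomForestClassifier(random_state=0).fit(X_syn, y_syn)
print("downstream accuracy on real data:", accuracy_score(y_real, clf.predict(X_real)))
```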
Matrix decomposition is a very important mathematical tool in numerical linear algebra for data processing. In this paper, we introduce a new randomized matrix decomposition algorithm, called randomized approximate SVD based on QR decomposition (RCSVD-QR). Our method utilizes random sampling and the QR decomposition to address a serious bottleneck associated with the classical SVD. RCSVD-QR gives satisfactory convergence speed as well as accuracy compared to state-of-the-art algorithms. In addition, we provide an estimate of the expected approximation error in the Frobenius norm. Numerical experiments verify these claims.
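A generic sketch of the random-sampling-plus-QR idea (Gaussian test matrix, QR-based range finder, small dense SVD); this illustrates the principle, not necessarily the exact RCSVD-QR algorithm of the paper, and the test matrix below is synthetic.

```python
import numpy as np

def randomized_svd_qr(A, rank, oversample=10, seed=0):
    """Rank-`rank` approximate SVD via random sampling and a QR-based range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))   # random Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)                        # orthonormal basis for the range
    B = Q.T @ A                                           # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)     # cheap dense SVD of B
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]

# Quick check on a matrix with fast-decaying singular values.
rng = np.random.default_rng(7)
A = rng.standard_normal((2000, 500)) @ np.diag(0.9 ** np.arange(500)) @ rng.standard_normal((500, 500))
U, s, Vt = randomized_svd_qr(A, rank=20)
print("relative Frobenius error:", np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```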
Bayesian inference is a powerful tool for combining information in complex settings, a task of increasing importance in modern applications. However, Bayesian inference with a flawed model can produce unreliable conclusions. This review discusses approaches to performing Bayesian inference when the model is misspecified, where by misspecified we mean that the analyst is unwilling to act as if the model is correct. Much has been written about this topic, and in most cases we do not believe that a conventional Bayesian analysis is meaningful when there is serious model misspecification. Nevertheless, in some cases it is possible to use a well-specified model to give meaning to a Bayesian analysis of a misspecified model, and we will focus on such cases. Three main classes of methods are discussed: restricted likelihood methods, which use a model based on a non-sufficient summary of the original data; modular inference methods, which use a model constructed from coupled submodels, some of which are correctly specified; and the use of a reference model to construct a projected posterior or predictive distribution for a simplified model considered to be useful for prediction or interpretation.
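As a toy illustration of the modular-inference class, the conjugate-normal sketch below computes a cut-type posterior: the shared parameter is learned from the trusted module only and then propagated into the suspect module. The two-module model, the priors, and all constants are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Module 1 (trusted): y1_i ~ N(phi, 1).  Module 2 (suspect): y2_j ~ N(phi + theta, 1).
phi_true, theta_true = 1.0, 0.5
y1 = rng.normal(phi_true, 1.0, size=100)
y2 = rng.normal(phi_true + theta_true, 1.0, size=50)
tau2 = 100.0                                  # N(0, tau2) priors on phi and theta

# Stage 1: posterior for phi using module 1 only (conjugate normal update).
v1 = 1.0 / (len(y1) + 1.0 / tau2)
m1 = v1 * y1.sum()
phi_draws = rng.normal(m1, np.sqrt(v1), size=5000)

# Stage 2: for each phi draw, sample theta from its conditional posterior under module 2,
# so module 2 cannot feed back into the inference for phi.
v2 = 1.0 / (len(y2) + 1.0 / tau2)
m2 = v2 * (y2.sum() - len(y2) * phi_draws)
theta_draws = rng.normal(m2, np.sqrt(v2))

print("cut-posterior mean of theta:", theta_draws.mean())
```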
Converting text into the structured query language (Text2SQL) is a research hotspot in the field of natural language processing (NLP) with broad application prospects. In the era of big data, databases are used in all walks of life, and the collected data are large in scale, diverse in variety, and wide in scope, which makes querying cumbersome and inefficient and places higher requirements on Text2SQL models. In practical applications, the current mainstream end-to-end Text2SQL models are not only difficult to build, owing to their complex structure and high requirements on training data, but also difficult to adjust because of their massive number of parameters; moreover, their accuracy often falls short of the desired result. To address these issues, this paper proposes a pipelined Text2SQL method, SPSQL. The method decomposes the Text2SQL task into four subtasks: table selection, column selection, SQL generation, and value filling, which can be cast as a text classification problem, a sequence labeling problem, and two text generation problems, respectively. We then construct data formats for the different subtasks based on existing data and improve the accuracy of the overall model by improving the accuracy of each submodel. We also use a named entity recognition module and data augmentation to optimize the overall model. We construct the dataset based on the marketing business data of the State Grid Corporation of China. Experiments demonstrate that our proposed method achieves the best performance compared with the end-to-end method and other pipeline methods.
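A schematic of the four-stage decomposition with trivial keyword-matching stand-ins for the trained submodels, shown only to illustrate how the stage outputs compose into a SQL string; the schema, the example question, and the WHERE template are invented for illustration and are not SPSQL's actual components.

```python
# Schematic four-stage Text2SQL pipeline (table selection, column selection,
# SQL generation, value filling) with trivial keyword-matching stand-ins for the
# trained submodels described in the paper.
SCHEMA = {"customers": ["name", "city", "balance"],
          "invoices": ["invoice_id", "amount", "due_date"]}

def select_table(question):
    # Stand-in for the text-classification submodel.
    return max(SCHEMA, key=lambda t: sum(w in question.lower() for w in (t, t[:-1])))

def select_columns(question, table):
    # Stand-in for the sequence-labeling submodel.
    cols = [c for c in SCHEMA[table] if c in question.lower()]
    return cols or ["*"]

def generate_sql(table, columns):
    # Stand-in for the SQL-skeleton generation submodel.
    return f"SELECT {', '.join(columns)} FROM {table} WHERE city = :value"

def fill_values(sql, question):
    # Stand-in for the value-filling submodel.
    value = question.rstrip("?").split()[-1]
    return sql.replace(":value", repr(value))

question = "What is the balance of customers in Beijing?"
table = select_table(question)
sql = fill_values(generate_sql(table, select_columns(question, table)), question)
print(sql)
```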
Time series forecasting is widely used in business intelligence, e.g., to forecast stock prices and sales and to support the analysis of data trends. Most time series of interest are macroscopic time series that are aggregated from microscopic data. However, little of the literature has studied forecasting macroscopic time series by leveraging data at the microscopic level rather than modeling the macroscopic series directly. In this paper, we assume that the microscopic time series follow some unknown mixture of probabilistic distributions. We show theoretically that, as we identify the ground-truth latent mixture components, the estimation of the time series from each component improves because of lower variance, thereby also benefiting the estimation of the macroscopic time series. Inspired by the power of Seq2seq and its variants in modeling time series data, we propose Mixture of Seq2seq (MixSeq), an end-to-end mixture model that clusters microscopic time series, where all the components come from a family of Seq2seq models parameterized by different parameters. Extensive experiments on both synthetic and real-world data show the superiority of our approach.
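A toy sketch of the mixture idea, with tiny Gaussian AR(1) models standing in for the Seq2seq components: an EM loop softly assigns microscopic series to components, and the macroscopic forecast aggregates the responsibility-weighted component forecasts. The data, the AR(1) stand-ins, and all constants are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)

# Microscopic series drawn from two hidden regimes with different AR(1) dynamics.
def simulate(a, n_series, T=60):
    x = np.zeros((n_series, T))
    for t in range(1, T):
        x[:, t] = a * x[:, t - 1] + 0.3 * rng.standard_normal(n_series)
    return x

X = np.vstack([simulate(0.9, 100), simulate(-0.5, 100)])
K = 2

# EM for a mixture whose components are Gaussian AR(1) models
# (stand-ins for the Seq2seq components of MixSeq).
a = rng.uniform(-0.2, 0.2, K); sigma2 = np.ones(K); pi = np.full(K, 1.0 / K)
prev, cur = X[:, :-1], X[:, 1:]
for _ in range(50):
    # E-step: per-series log-likelihood under each component, then responsibilities.
    resid2 = (cur[:, None, :] - a[None, :, None] * prev[:, None, :]) ** 2
    loglik = -0.5 * (np.log(2 * np.pi * sigma2)[None, :, None] + resid2 / sigma2[None, :, None]).sum(axis=2)
    logr = np.log(pi)[None, :] + loglik
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr); r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted AR(1) coefficients, noise variances, and mixture weights.
    num = (r[:, :, None] * prev[:, None, :] * cur[:, None, :]).sum(axis=(0, 2))
    den = (r[:, :, None] * prev[:, None, :] ** 2).sum(axis=(0, 2))
    a = num / den
    sigma2 = (r[:, :, None] * (cur[:, None, :] - a[None, :, None] * prev[:, None, :]) ** 2).sum(axis=(0, 2)) / (r.sum(axis=0) * prev.shape[1])
    pi = r.mean(axis=0)

# Macroscopic forecast: sum of responsibility-weighted one-step component forecasts.
macro_forecast = float(np.sum(r * (a[None, :] * X[:, -1][:, None])))
print("estimated AR coefficients:", np.round(np.sort(a), 2))
print("one-step macroscopic forecast:", round(macro_forecast, 2))
```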