The proliferation of mobile devices has led to the collection of large amounts of population data, prompting the need to put this rich, multidimensional data to practical use. In response to this trend, we integrate functional data analysis (FDA) and factor analysis to address the challenge of predicting hourly population changes across various districts in Tokyo. Specifically, by assuming a Gaussian process, we avoid estimating the large covariance matrix parameter of a multivariate normal distribution. In addition, the data exhibit both temporal dependence and spatial dependence between districts. To capture these characteristics, we introduce a Bayesian factor model, which describes the time series of a small number of common factors and expresses the spatial structure through the factor loading matrices. Furthermore, the factor loading matrices are made identifiable and sparse to ensure the interpretability of the model. We also propose a Bayesian shrinkage method as a systematic approach to factor selection. Through numerical experiments and data analysis, we investigate the predictive accuracy and interpretability of the proposed method, and we conclude that the flexibility of the method allows additional time-series features to be incorporated, thereby improving its accuracy.
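To illustrate the kind of structure such a model exploits (a minimal sketch of a generic sparse factor model, not the authors' exact specification; all dimensions and names here are illustrative assumptions), consider a small number of common factors driving the districts:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 168, 12, 3  # hours in a week, districts, common factors

# Sparse loading matrix: each district loads on only a few factors.
Lambda = rng.normal(0.0, 1.0, size=(d, k))
Lambda[rng.random((d, k)) < 0.5] = 0.0

# Latent factor time series with smooth AR(1) dynamics,
# a cheap stand-in for the Gaussian-process assumption.
phi = 0.9
f = np.zeros((T, k))
for t in range(1, T):
    f[t] = phi * f[t - 1] + rng.normal(0.0, 0.3, size=k)

# Observed hourly populations: shared factors plus idiosyncratic noise.
y = f @ Lambda.T + rng.normal(0.0, 0.1, size=(T, d))
```

In the Bayesian treatment described above, `Lambda` and the factor dynamics would carry priors, with sparsity and identifiability constraints imposed on the loadings rather than being fixed as here.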
The regression of a functional response on a set of scalar predictors can be a challenging task, especially when there are many predictors or the relationship between those predictors and the response is nonlinear. In this work, we propose a solution to this problem: a feed-forward neural network (NN) designed to predict a functional response from scalar inputs. First, we transform the functional response into a finite-dimensional representation and construct an NN that outputs this representation. Then, we propose modifying the NN's output through the objective function, and we introduce different objective functions for network training. The proposed models suit both regularly and irregularly spaced data, and a roughness penalty can further be applied to control the smoothness of the predicted curve. The difficulty in implementing both of these features lies in defining objective functions that can be back-propagated. In our experiments, we demonstrate that our model outperforms the conventional function-on-scalar regression model in multiple scenarios while scaling better computationally with the dimension of the predictors.
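As a rough sketch of this construction (the architecture, basis, and names are illustrative assumptions, not the paper's), an NN can map scalar predictors to basis coefficients, with a curve-level loss plus a differentiable second-difference roughness penalty:

```python
import torch
import torch.nn as nn

p, n_basis, n_grid = 5, 8, 50
grid = torch.linspace(0, 1, n_grid)
# Illustrative fixed basis evaluated on the grid (rows = basis functions).
B = torch.stack([torch.sin((j + 1) * torch.pi * grid) for j in range(n_basis)])

net = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, n_basis))

def loss_fn(x, y_curves, lam=1e-2):
    coefs = net(x)                    # (batch, n_basis) coefficients
    y_hat = coefs @ B                 # predicted curves on the grid
    mse = ((y_hat - y_curves) ** 2).mean()
    # Second differences approximate curvature; penalizing them
    # controls the smoothness of the predicted curve.
    d2 = torch.diff(torch.diff(y_hat, dim=1), dim=1)
    return mse + lam * (d2 ** 2).mean()
```

Because both terms are differentiable functions of the network weights, the roughness penalty back-propagates like any other loss component.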
Understanding how and why certain communities bear a disproportionate burden of disease is challenging due to the scarcity of data on these communities. Surveys provide a useful avenue for accessing hard-to-reach populations, as many surveys deliberately oversample understudied and vulnerable populations. When survey data are used for analysis, it is important to account for the complex survey design that gave rise to the data in order to avoid biased conclusions. The field of Bayesian survey statistics aims to account for such survey designs while leveraging the advantages of Bayesian models, which can flexibly handle sparsity through borrowing of information and provide a coherent inferential framework in which variances are easily obtained for complex models and data types. For these reasons, Bayesian survey methods seem uniquely well positioned for health disparities research, where heterogeneity and sparsity are frequent considerations. This review discusses three main approaches found in the Bayesian survey methodology literature: 1) multilevel regression and post-stratification, 2) weighted pseudolikelihood-based methods, and 3) synthetic population generation. We discuss the advantages and disadvantages of each approach, examine recent applications and extensions, and consider how these approaches may be leveraged to improve research in population health equity.
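To make the second of these approaches concrete (a generic textbook illustration, not an example taken from the review), a weighted pseudolikelihood weights each unit's log-likelihood contribution by its survey weight; for a Bernoulli proportion, the resulting pseudo-MLE has a closed form:

```python
import numpy as np

def weighted_pseudo_mle(y, w):
    """Survey-weighted (pseudo-ML) estimate of a Bernoulli proportion.

    Maximizes sum_i w_i * [y_i log p + (1 - y_i) log(1 - p)],
    which has the closed-form solution below.
    """
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)

# An oversampled subgroup is down-weighted by its design weights.
y = np.array([1, 0, 1, 1, 0])
w = np.array([0.5, 0.5, 2.0, 2.0, 2.0])
print(weighted_pseudo_mle(y, w))
```

In the Bayesian versions discussed in the review, such a weighted likelihood is combined with a prior, and posterior variances typically require design-based adjustment.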
Gaussian graphical models are nowadays commonly applied to the comparison of groups sharing the same variables, by jointly learning their independence structures. We consider the case where there are exactly two dependent groups, and the association structure is represented by a family of coloured Gaussian graphical models suited to paired data problems. To learn the two dependent graphs, together with their across-graph association structure, we implement a fused graphical lasso penalty. We carry out a comprehensive analysis of this approach, with special attention to the role played by some relevant submodel classes. In this way, we provide a broad set of tools for applying Gaussian graphical models to paired data problems. These include results useful for specifying penalty values that yield a path of lasso solutions, as well as an ADMM algorithm that solves the fused graphical lasso optimization problem. Finally, we present an application of our method to cancer genomics, where it is of interest to compare cancer cells with a control sample from histologically normal tissue adjacent to the tumor. All the methods described in this article are implemented in the $\texttt{R}$ package $\texttt{pdglasso}$, available at https://github.com/savranciati/pdglasso.
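For orientation, the fused graphical lasso criterion that this approach builds on (written here in the standard two-group notation of the penalized-likelihood literature, with concentration matrices $\Theta^{(1)},\Theta^{(2)}$, sample covariance matrices $S^{(1)},S^{(2)}$, and sample sizes $n_1,n_2$) is

\[
\min_{\Theta^{(1)},\,\Theta^{(2)} \succ 0}\;
\sum_{k=1}^{2} n_k\left[\operatorname{tr}\!\big(S^{(k)}\Theta^{(k)}\big) - \log\det\Theta^{(k)}\right]
+ \lambda_1 \sum_{k=1}^{2}\sum_{i\neq j}\big|\theta^{(k)}_{ij}\big|
+ \lambda_2 \sum_{i,j}\big|\theta^{(1)}_{ij}-\theta^{(2)}_{ij}\big|,
\]

where $\lambda_1$ controls within-graph sparsity and $\lambda_2$ shrinks the two concentration matrices toward each other; the coloured models for paired data considered here adapt this baseline penalty.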
Three variants of the statistical complexity function, used as a criterion for detecting a useful signal in a signal-noise mixture, are considered. The probability distributions maximizing each variant are obtained analytically, and conclusions are drawn about the efficiency of each variant for the detection problem. The considered information characteristics are compared, and the analytical results are illustrated on synthesized signals. A method is proposed for selecting the threshold of the information criterion for use in a decision rule for detecting a useful signal in a signal-noise mixture; the choice of threshold relies on the analytically obtained maximum values, which are known a priori. As a result, the complexity based on total variation demonstrates the best ability to detect a useful signal.
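As a schematic of how such a criterion can enter a decision rule (this construction is ours; the paper derives the exact maximum values analytically), consider a complexity defined as normalized entropy times a total-variation disequilibrium:

```python
import numpy as np

def tv_complexity(x, bins=32):
    """Statistical complexity = normalized entropy times the total-variation
    distance between the empirical distribution and the uniform one."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    nz = p[p > 0]
    H = -np.sum(nz * np.log(nz)) / np.log(bins)   # normalized entropy
    D = 0.5 * np.sum(np.abs(p - 1.0 / bins))      # total variation
    return H * D

def detect(x, c_max, alpha=0.5):
    # Declare "signal present" when the complexity exceeds a fraction
    # of the analytically known maximum value c_max.
    return tv_complexity(x) > alpha * c_max
```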
Publishing streaming data in a privacy-preserving manner has been a key research focus for many years. The problem is particularly challenging because of the correlations prevalent within the data stream. Existing approaches either fail to exploit these correlations effectively, leading to a suboptimal utility-privacy tradeoff, or rely on complex mechanism designs whose computational complexity grows with the sequence length. In this paper, we introduce Sequence Information Privacy (SIP), a new privacy notion designed to guarantee privacy for an entire data stream while taking the intrinsic data correlations into account. We show that SIP provides a privacy guarantee comparable to local differential privacy (LDP) while admitting a lightweight, modular mechanism design. We further study two online data release models (instantaneous and batched) and propose corresponding privacy-preserving data perturbation mechanisms. We provide a numerical evaluation of how correlations influence noise addition in data streams. Lastly, we conduct experiments on real-world data to compare the utility-privacy tradeoff of our approaches with those in the existing literature. The results show that our mechanisms offer utility improvements of more than a factor of two over LDP-based mechanisms.
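For reference, the LDP-style baseline that the proposed mechanisms are compared against (a minimal sketch, not the SIP mechanisms themselves) perturbs each stream element independently with Laplace noise:

```python
import numpy as np

def ldp_perturb_stream(stream, sensitivity, epsilon, rng=None):
    """Baseline LDP release: add independent Laplace noise to every
    element, ignoring correlations across the stream."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return [x + rng.laplace(0.0, scale) for x in stream]

noisy = ldp_perturb_stream([5.0, 5.2, 5.1, 4.9], sensitivity=1.0, epsilon=1.0)
```

Correlation-aware mechanisms improve on this by accounting for dependence across stream elements, which is where the reported utility gains arise.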
In this work, we consider the problem of distributed computing of functions of structured sources, focusing on the classical setting of two correlated sources and one user who seeks the outcome of the function while benefiting from low-rate side information provided by a helper node. For sources jointly distributed according to a very general mixture model, we provide an achievable coding scheme that substantially reduces the communication cost of distributed computing by exploiting the nature of the joint distribution of the sources, the side information, and the symmetry enjoyed by the desired functions. Our scheme -- which readily applies in a variety of real-life scenarios, including learning, combinatorics, and graph neural network applications -- is shown to provide substantial reductions in communication cost while simultaneously providing computational savings, reducing the exponential complexity of joint decoding techniques to one that is merely linear.
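To give a flavor of why symmetry reduces communication cost (an illustrative toy, not the paper's coding scheme): a symmetric function of a sequence over a finite alphabet depends only on the empirical histogram, so transmitting the histogram in place of the full sequence already shortens the message:

```python
from collections import Counter

def symmetric_eval(histogram):
    """Any permutation-invariant function is a function of the histogram
    alone; here, the number of distinct symbols seen at least twice."""
    return sum(1 for count in histogram.values() if count >= 2)

sequence = ["a", "b", "a", "c", "a", "b"]   # length-n message
histogram = Counter(sequence)               # O(|alphabet|) message suffices
assert symmetric_eval(histogram) == symmetric_eval(Counter(sequence[::-1]))
```

The scheme above exploits this kind of structure jointly with the mixture distribution of the sources and the helper's side information.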
The big data era of science and technology motivates statistical modeling of matrix-valued data using a low-rank representation that simultaneously summarizes key characteristics of the data and enables dimension reduction for data compression and storage. Low-rank representations such as singular value decomposition factor the original data into the product of orthonormal basis functions and weights, where each basis function represents an independent feature of the data. However, the basis functions in these factorizations are typically computed using algorithmic methods that cannot quantify uncertainty or account for explicit structure beyond what is implicitly specified via data correlation. We propose a flexible prior distribution for orthonormal matrices that can explicitly model structure in the basis functions. The prior is used within a general probabilistic model for singular value decomposition to conduct posterior inference on the basis functions while accounting for measurement error and fixed effects. To contextualize the proposed prior and model, we discuss how the prior specification can be used for various scenarios and relate the model to its deterministic counterpart. We demonstrate favorable model properties through synthetic data examples and apply our method to sea surface temperature data from the northern Pacific, enhancing our understanding of the ocean's internal variability.
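For context, the deterministic counterpart that the proposed model is related to is the algorithmic truncated SVD, e.g.:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 40))            # matrix-valued data

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
r = 5                                      # retained rank
Y_r = U[:, :r] * s[:r] @ Vt[:r]            # rank-r reconstruction

# The columns of U are orthonormal basis functions and s holds the
# weights; the Bayesian model instead places a structured prior on
# such orthonormal matrices and infers them with uncertainty.
print(np.linalg.norm(Y - Y_r) / np.linalg.norm(Y))
```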
Background: Biomedical data are usually collections of longitudinal data assessed at certain points in time. Clinical observations assess the presence and severity of symptoms, which form the basis for describing and modeling disease progression. Deciphering the potential underlying unknowns solely from the distinct observations would substantially improve the understanding of pathological cascades. Hidden Markov Models (HMMs) have been successfully applied to the processing of possibly noisy continuous signals. Our aim was to improve the application of HMMs to multivariate time series of categorically distributed data. Here, we used HMMs to investigate the prediction of the loss of the ability to walk freely, a major clinical deterioration in the most common autosomal-dominantly inherited ataxia disorder worldwide. Results: We present a prediction pipeline that processes data paired with a configuration file, enabling the construction, validation, and querying of a fully parameterized HMM-based model. In particular, we provide a theoretical and practical framework for multivariate time-series inference based on HMMs, which includes constructing multiple HMMs, each predicting a particular observable variable. We analyze random data as well as biomedical data from Spinocerebellar ataxia type 3 disease. Conclusions: HMMs are a promising approach for studying biomedical data that are naturally represented as multivariate time series. Our implementation of an HMM framework is publicly available and can easily be adapted for further applications.
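As a minimal illustration of the core modeling ingredient (using hmmlearn's CategoricalHMM on made-up severity codes, not the authors' pipeline), fitting an HMM to categorically coded symptom sequences and decoding the hidden states could look like:

```python
import numpy as np
from hmmlearn import hmm

# Categorically coded severity scores (0 = none, ..., 3 = severe),
# two visit sequences from two hypothetical patients.
seq1 = np.array([0, 0, 1, 1, 2, 2, 3]).reshape(-1, 1)
seq2 = np.array([0, 1, 1, 2, 3, 3]).reshape(-1, 1)
X = np.concatenate([seq1, seq2])
lengths = [len(seq1), len(seq2)]

model = hmm.CategoricalHMM(n_components=3, random_state=0)
model.fit(X, lengths)

hidden = model.predict(seq1)   # decoded latent disease-stage trajectory
print(hidden)
```

The framework described above constructs one such HMM per observable variable in the multivariate time series.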
The aim of this paper is to develop estimation and inference methods for the drift parameters of multivariate L\'evy-driven continuous-time autoregressive processes of order $p\in\mathbb{N}$. Starting from a continuous-time observation of the process, we develop consistent and asymptotically normal maximum likelihood estimators. We then relax the unrealistic assumption of continuous-time observation by considering natural discretizations based on a combination of Riemann-sum, finite difference, and thresholding approximations. The resulting estimators are also proven to be consistent and asymptotically normal under a general set of conditions, allowing for both finite and infinite jump activity in the driving L\'evy process. When discretizing the estimators, allowing for irregularly spaced observations is of great practical importance. In this respect, CAR($p$) models are not just relevant for "true" continuous-time processes: a CAR($p$) specification provides a natural continuous-time interpolation for modeling irregularly spaced data -- even if the observed process is inherently discrete. As a practically relevant application, we consider the setting where the multivariate observation is known to possess a graphical structure. We refer to such a process as GrCAR and discuss the corresponding drift estimators and their properties. The finite-sample behavior of all theoretical asymptotic results is empirically assessed by extensive simulation experiments.
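To make the discretization concrete in the simplest special case (our sketch: a Brownian-driven CAR(1), i.e. an Ornstein-Uhlenbeck process $dX_t = -aX_t\,dt + dW_t$, observed on an equally spaced grid), the Riemann-sum/finite-difference analogue of the continuous-time likelihood estimator of the drift is:

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, dt, n = 1.5, 0.01, 100_000

# Euler simulation of dX = -a X dt + dW (Brownian-driven CAR(1)).
X = np.zeros(n)
for i in range(1, n):
    X[i] = X[i - 1] - a_true * X[i - 1] * dt + np.sqrt(dt) * rng.normal()

# Riemann-sum / finite-difference analogue of the continuous-time MLE:
# a_hat = - (sum of X dX) / (sum of X^2 dt).
dX = np.diff(X)
a_hat = -np.sum(X[:-1] * dX) / (dt * np.sum(X[:-1] ** 2))
print(a_hat)   # should be close to a_true
```

The thresholding step mentioned above would additionally discard increments dominated by jumps when the driving L\'evy process has a jump component.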
Large Language Models (LLMs) have emerged as powerful tools in Natural Language Processing (NLP) and have recently gained significant attention in the domain of Recommendation Systems (RS). Trained on massive amounts of data via self-supervised learning, these models have demonstrated remarkable success in learning universal representations and have the potential to enhance various aspects of recommendation systems through effective transfer techniques such as fine-tuning and prompt tuning. The crucial aspect of harnessing the power of language models to improve recommendation quality is the use of their high-quality representations of textual features and their extensive coverage of external knowledge to establish correlations between items and users. To provide a comprehensive understanding of existing LLM-based recommendation systems, this survey presents a taxonomy that categorizes these models into two major paradigms, namely Discriminative LLMs for Recommendation (DLLM4Rec) and Generative LLMs for Recommendation (GLLM4Rec), with the latter surveyed systematically for the first time. Furthermore, we systematically review and analyze existing LLM-based recommendation systems within each paradigm, providing insights into their methodologies, techniques, and performance. Additionally, we identify key challenges and several valuable findings to provide researchers and practitioners with inspiration.