Synthetic data generation is a promising technique for facilitating the use of sensitive data while mitigating the risk of privacy breaches. However, for synthetic data to be useful in downstream analysis tasks, it must be of sufficient quality. Various methods have been proposed to measure the utility of synthetic data, but their results are often incomplete or even misleading. In this paper, we propose using density ratio estimation to improve quality evaluation for synthetic data, and thereby the quality of synthesized datasets. We show how this framework relates to and builds on existing measures, yielding global and local utility measures that are informative and easy to interpret. We develop an estimator that requires little to no manual tuning, owing to automatic selection of a nonparametric density ratio model. Through simulations, we find that density ratio estimation yields more accurate estimates of global utility than established procedures. A real-world data application demonstrates how the density ratio can guide refinements of synthesis models and can be used to improve downstream analyses. We conclude that density ratio estimation is a valuable tool in synthetic data generation workflows and provide these methods in the accessible open-source R package densityratio.
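As a rough illustration of the density-ratio idea (a sketch, not the paper's nonparametric estimator or the densityratio package; the classifier-based estimator below is a common stand-in), one can estimate r(x) = p_real(x)/p_synth(x) with a probabilistic classifier and summarize utility from the estimated ratios:

```python
# Minimal sketch of classifier-based density ratio estimation for
# synthetic-data utility checking. All names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio(real, synth):
    """Estimate r(x) = p_real(x) / p_synth(x) via probabilistic classification."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X)[:, 1]
    # Bayes' rule: r(x) = (p / (1 - p)) * (n_synth / n_real)
    return (p / (1 - p)) * (len(synth) / len(real))

rng = np.random.default_rng(1)
real  = rng.normal(0.0, 1.0, size=(500, 2))   # "sensitive" data
synth = rng.normal(0.2, 1.2, size=(500, 2))   # imperfect synthetic data
r = density_ratio(real, synth)
# Mean log ratio over the real points approximates KL(p_real || p_synth),
# one possible global utility summary; per-point ratios give local utility.
print("global utility proxy:", np.mean(np.log(r[:500])))
```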
We develop a new, unsupervised symmetry learning method that starts with raw data and yields the minimal (discrete) generator of an underlying Lie group of symmetries, together with a symmetry-equivariant representation of the data. The method is able to learn the pixel translation operator from a dataset with only an approximate translation symmetry, and can learn, equally well, quite different types of symmetries that are not apparent to the naked eye. The method is based on an information-theoretic loss function that measures both the degree to which the dataset is symmetric under a given candidate symmetry and the degree of locality of the samples in the dataset with respect to this symmetry. We demonstrate that this coupling between symmetry and locality, together with a special optimization technique developed for entropy estimation, results in a highly stable system that gives reproducible results. The symmetry actions we consider are group representations; however, we believe the approach has the potential to be generalized to more general, nonlinear actions of non-commutative Lie groups.
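A toy sketch of the underlying idea, under heavy simplification: score a candidate generator G by how nearly the one-parameter action x -> exp(tG)x leaves the sample distribution invariant. The crude moment-based discrepancy below stands in for the paper's information-theoretic loss (which additionally rewards locality); everything here is an assumption for illustration only.

```python
# Toy sketch: rank candidate linear symmetry generators by distributional
# invariance of the action x -> expm(t*G) x. Not the paper's loss.
import numpy as np
from scipy.linalg import expm

def symmetry_score(X, G, t=0.5):
    """Smaller score = the candidate action leaves the sample distribution more invariant."""
    Y = X @ expm(t * G).T
    # Crude two-sample discrepancy: difference of means and covariances.
    dm = np.linalg.norm(X.mean(0) - Y.mean(0))
    dc = np.linalg.norm(np.cov(X.T) - np.cov(Y.T))
    return dm + dc

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))          # rotationally symmetric data
rot = np.array([[0., -1.], [1., 0.]])   # generator of rotations
shear = np.array([[0., 1.], [0., 0.]])  # generator of shears
print(symmetry_score(X, rot), symmetry_score(X, shear))  # rotation scores lower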
We investigate notions of complete representation by partial functions, where the operations in the signature include antidomain restriction and may include composition, intersection, update, preferential union, domain, antidomain, and set difference. When the signature includes both antidomain restriction and intersection, the join-complete and meet-complete representations coincide. Otherwise, for the signatures we consider, meet-complete is strictly stronger than join-complete. A necessary condition for an algebra to be meet-completely representable is that its atoms are separating. For the signatures we consider, this condition is sufficient if and only if composition is not in the signature. For each of the signatures we consider, the class of (meet-)completely representable algebras is not axiomatisable by any existential-universal-existential first-order theory. For 14 expressively distinct signatures, we show, by giving an explicit representation, that the (meet-)completely representable algebras form a basic elementary class, axiomatisable by a universal-existential-universal first-order sentence. The signatures we axiomatise are those containing antidomain restriction and any of intersection, update, and preferential union, and also those containing antidomain restriction, composition, and intersection, together with any of update, preferential union, domain, and antidomain.
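To make the signature concrete, here is one plausible reading of the operations on partial functions, modeled as Python dicts over a fixed base set. Conventions (e.g., the order of composition and the exact definition of antidomain restriction) vary across the literature, so treat these definitions as illustrative, not as the paper's.

```python
# Partial functions as dicts over a base set X; one plausible reading of
# the signature's operations (conventions here are assumptions).
X = {0, 1, 2, 3}
f = {0: 1, 1: 2}        # partial function with domain {0, 1}
g = {1: 3, 2: 0}

compose = lambda f, g: {x: g[f[x]] for x in f if f[x] in g}             # f then g
meet    = lambda f, g: {x: f[x] for x in f if x in g and g[x] == f[x]}  # intersection
domain  = lambda f: {x: x for x in f}                                   # D(f): identity on dom f
antidom = lambda f: {x: x for x in X if x not in f}                     # A(f): identity off dom f
pref    = lambda f, g: {**g, **f}                                       # preferential union: f wins
adres   = lambda f, g: {x: g[x] for x in g if x not in f}               # g restricted off dom f

print(compose(f, g))   # {0: 3, 1: 0}
print(pref(f, g))      # {1: 2, 2: 0, 0: 1}
print(antidom(f))      # {2: 2, 3: 3}
print(adres(f, g))     # {2: 0}
```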
We consider the task of data-driven identification of dynamical systems, specifically systems whose behavior at large frequencies is non-standard, as encoded by a non-trivial relative degree of the transfer function or, alternatively, a non-trivial index of a corresponding realization as a descriptor system. We develop novel surrogate modeling strategies that allow state-of-the-art rational approximation algorithms (e.g., AAA and vector fitting) to better handle data coming from such systems with non-trivial relative degree. Our contribution is twofold. On the one hand, we describe a strategy for building rational surrogate models with prescribed relative degree, with the objective of mirroring the high-frequency behavior of the high-fidelity problem, when known. The desired degree of the surrogate model is achieved through constraints on its barycentric coefficients, rather than through ad hoc modifications of the rational form. On the other hand, we present a degree-identification routine that allows one to estimate the unknown relative degree of a system from low-frequency data. By identifying the degree of the system that generated the data, we can build a surrogate model that, in addition to matching the data well (at low frequencies), has enhanced extrapolation capabilities (at high frequencies). We showcase the effectiveness and robustness of the proposed method through a suite of numerical tests.
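One plausible mechanism behind such barycentric constraints (a sketch, not the authors' exact algorithm): in the barycentric form r(z) = sum_i w_i f_i/(z - z_i) / sum_i w_i/(z - z_i), expanding at z = infinity shows that the linear "moment" conditions sum_i w_i f_i z_i^k = 0 for k = 0, ..., d-1 force r to decay like z^(-d).

```python
# Sketch: enforcing relative degree d via linear constraints on the
# barycentric weights w. Illustrative only.
import numpy as np

z = np.array([1.0, 2.0, 3.0, 4.0])         # support points
f = 1.0 / z**2                              # samples of H(z) = 1/z^2 (rel. degree 2)

d = 2                                       # desired relative degree
C = np.array([f * z**k for k in range(d)])  # rows encode sum_i w_i f_i z_i^k = 0
_, _, Vt = np.linalg.svd(C)
w = Vt[-1]                                  # weights in the null space of C

r = lambda s: (w * f / (s - z)).sum() / (w / (s - z)).sum()
for s in [10.0, 100.0, 1000.0]:
    print(s, r(s) * s**2)                   # roughly constant: r decays like s^(-2)
```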
This work focuses on quantum methods for the cryptanalysis of schemes based on the integer factorization problem and the discrete logarithm problem. We demonstrate how to solve, in practice, the largest instances of the factorization problem by improving an approach that combines quantum and classical computations, assuming the use of the best publicly available special-class quantum computer: the quantum annealer. We achieve new experimental results by solving the largest instance of the factorization problem ever announced as solved using quantum annealing, with a size of 29 bits. The core idea of the improved approach is to leverage a known sub-exponential classical method to break the problem down into many smaller computations and to perform the most critical ones on a quantum computer. This approach does not reduce the complexity class, but it assesses the pragmatic capabilities of an attacker. It also marks a step forward in the development of hybrid methods, which in practice may surpass classical methods in efficiency sooner than purely quantum computations will.
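To convey the optimization view that annealing-based factoring rests on (a toy: exhaustive search stands in for the annealer, and the classical sub-exponential preprocessing is omitted), factoring N can be posed as minimizing the energy (N - pq)^2 over bit-encoded odd factor candidates, which is what gets compiled into a QUBO/Ising problem in real pipelines.

```python
# Toy illustration of factoring as energy minimization. Purely illustrative;
# not the paper's hybrid pipeline.
from itertools import product

N = 493          # = 17 * 29, a toy instance
nbits = 5        # bits per odd-factor candidate: p = 1 + 2*m, m < 2**nbits

def energy(p, q):            # the cost an annealer would minimize (as a QUBO)
    return (N - p * q) ** 2

best = min(((1 + 2 * m, 1 + 2 * k)
            for m, k in product(range(2**nbits), repeat=2)),
           key=lambda pq: energy(*pq))
print(best, energy(*best))   # (17, 29) 0
```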
A numerical ADER-DG method with a local DG predictor for solving DAE systems has been developed, based on the formulation of ADER-DG methods with a local DG predictor for solving ODE and PDE systems. The basis functions were chosen as Lagrange interpolation polynomials with nodal points at the roots of the Radau polynomials, which differs from the classical formulations of the ADER-DG method, where the roots of Legendre polynomials are customarily used. It was shown that using this basis leads to A-stability and L1-stability when the DAE solver is used as an ODE solver. The ADER-DG method yields a highly accurate numerical solution even on very coarse grids, with a step greater than the main characteristic scale of variation of the solution. The local discrete time solution can be used as a numerical solution of the DAE system between grid nodes, thereby providing subgrid resolution even on very coarse grids. Classical test examples were solved with the developed ADER-DG method. With increasing index of the DAE system, a decrease in the empirical convergence orders p is observed. An unexpected result was obtained in the numerical solution of a stiff DAE system: the empirical convergence orders of the numerical solution obtained with the developed method turned out to be significantly higher than the values expected for this method for stiff problems. It turns out that Lagrange interpolation polynomials with nodal points at the roots of the Radau polynomials are much better suited to solving stiff problems. Estimates showed that the computational costs of the ADER-DG method are approximately comparable to those of implicit Runge-Kutta methods used to solve DAE systems. Methods to reduce the computational costs of the ADER-DG method were proposed.
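For concreteness, a small sketch of the nodal ingredient described above: the right Radau nodes on [0, 1] are the roots of P_s - P_(s-1) mapped from [-1, 1] (they include the right endpoint), and the Lagrange basis is built on them. For s = 2 the nodes are 1/3 and 1, the Radau IIA collocation points. Illustrative only; the surrounding ADER-DG predictor-corrector machinery is omitted.

```python
# Radau nodes and the Lagrange nodal basis built on them.
import numpy as np
from numpy.polynomial import legendre as L

def radau_nodes(s):
    """Roots of P_s - P_{s-1} on [-1, 1], mapped to [0, 1]; includes x = 1."""
    c = np.zeros(s + 1)
    c[s], c[s - 1] = 1.0, -1.0            # Legendre series for P_s - P_{s-1}
    return np.sort((L.legroots(c) + 1.0) / 2.0)

def lagrange_basis(nodes, x):
    """Evaluate all Lagrange basis polynomials for `nodes` at points x."""
    B = np.ones((len(nodes), len(x)))
    for i, xi in enumerate(nodes):
        for j, xj in enumerate(nodes):
            if i != j:
                B[i] *= (x - xj) / (xi - xj)
    return B

print(radau_nodes(2))                             # [0.3333... 1.0]
x = np.linspace(0, 1, 5)
print(lagrange_basis(radau_nodes(2), x).sum(0))   # partition of unity: all ones
```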
In many applied sciences, a popular analysis strategy for high-dimensional data is to fit many multivariate generalized linear models in parallel. This paper presents a novel approach to the resulting multiple testing problem by combining a recently developed sign-flip test with permutation-based multiple-testing procedures. Our method builds upon the univariate standardized flip-scores test, which offers robustness against misspecified variances in generalized linear models, a crucial feature in high-dimensional settings where comprehensive model validation is particularly challenging. We extend this approach to the multivariate setting, enabling adaptation to unknown response correlation structures. The approach yields marked power improvements over conventional multiple-testing methods when correlation is present.
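A simplified caricature of the multivariate idea (not the authors' exact flip-scores procedure): flipping the sign of each observation's score contribution simultaneously across all hypotheses preserves the correlation structure, and a single-step max-statistic cutoff then controls the family-wise error rate while adapting to that correlation.

```python
# Sign-flip max-T multiple testing, simplified for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50                               # observations, hypotheses
scores = rng.normal(size=(n, m))             # per-observation score contributions
scores[:, :5] += 0.4                         # five non-null hypotheses

T_obs = np.abs(scores.mean(0)) * np.sqrt(n) / scores.std(0, ddof=1)

B = 2000
T_max = np.empty(B)
for b in range(B):
    s = rng.choice([-1.0, 1.0], size=n)[:, None]   # one flip per observation,
    flipped = s * scores                           # shared across all m tests
    T_max[b] = np.max(np.abs(flipped.mean(0)) * np.sqrt(n)
                      / flipped.std(0, ddof=1))

reject = T_obs > np.quantile(T_max, 0.95)    # FWER-controlling single-step max-T
print(reject[:5], reject[5:].sum())
```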
Uncertainty around predictions from a model due to the finite size of the development sample has traditionally been approached using classical inferential techniques. The finite sample size results in a discrepancy between the final model and the correct model that maps covariates to predicted risks. From a decision-theoretic perspective, this discrepancy might affect subsequent treatment decisions and is thus associated with a loss of utility. From this perspective, procuring more development data is associated with an expected gain in the utility of using the model. In this work, we define the Expected Value of Sample Information (EVSI) as the expected gain in clinical utility, defined in net benefit (NB) terms in units of net true positives, from procuring a further development sample of a given size. We propose a bootstrap-based algorithm for EVSI computation and show its feasibility and face validity in a case study. Decision-theoretic metrics can complement classical inferential methods when designing studies aimed at developing risk prediction models.
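A heavily simplified sketch of a bootstrap EVSI-style calculation (our caricature, not the paper's algorithm): treat the current sample as a stand-in population, refit the risk model on bootstrap "future" development samples of increasing size, and record the expected gain in net benefit NB = TP/n - (z/(1-z)) * FP/n at threshold z.

```python
# Caricature of bootstrap EVSI for a risk prediction model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, z = 150, 0.2                                   # current sample size, threshold
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ [1, -1, .5, 0])))).astype(int)

def nb(model, X, y, z):
    treat = model.predict_proba(X)[:, 1] >= z
    tp, fp = np.mean(treat & (y == 1)), np.mean(treat & (y == 0))
    return tp - fp * z / (1 - z)                  # net true positive units

nb_now = nb(LogisticRegression(max_iter=1000).fit(X, y), X, y, z)
for n_new in [150, 600]:
    gains = []
    for _ in range(200):
        idx = rng.integers(0, n, size=n + n_new)  # bootstrap "future" sample
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        gains.append(nb(m, X, y, z) - nb_now)
    print(n_new, np.mean(gains))                  # EVSI-style expected NB gain
```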
Functional principal component analysis has been shown to be invaluable for revealing modes of variation in longitudinal outcomes, which serve as important building blocks for forecasting and model building. Decades of research have advanced methods for functional principal component analysis, often assuming independence between observation times and longitudinal outcomes. Yet such assumptions are fragile in real-world settings where observation times may be driven by outcome-related reasons. Rather than ignoring the informative observation-time process, we explicitly model the observation times by a counting process dependent on time-varying prognostic factors. Identification of the mean, the covariance function, and the functional principal components proceeds via inverse intensity weighting. We propose weighted penalized splines for estimation and establish consistency and convergence rates for the weighted estimators. Simulation studies demonstrate that the proposed estimators are substantially more accurate than existing ones in the presence of correlation between the observation-time process and the longitudinal outcome process. We further examine the finite-sample performance of the proposed method using data from the Acute Infection and Early Disease Research Program study.
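The correction at the heart of inverse intensity weighting can be seen in a toy example: a binned weighted average stands in for the paper's weighted penalized splines, and the observation intensity is taken as known rather than estimated from prognostic factors.

```python
# Toy inverse intensity weighting (IIW) for an outcome-dependent
# observation process. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
mu = lambda t: np.sin(2 * np.pi * t)

# A prognostic factor z drives both the outcome and the chance of being
# observed, so naively averaged outcomes are biased upward.
t = rng.random(20000)
z = rng.normal(size=t.size)
y = mu(t) + 0.8 * z + rng.normal(0, 0.2, size=t.size)
p_obs = 1 / (1 + np.exp(-2 * z))               # observation intensity
seen = rng.random(t.size) < p_obs
w = 1 / p_obs[seen]                            # inverse intensity weights

bins = np.digitize(t[seen], np.linspace(0, 1, 11)) - 1
naive = np.array([y[seen][bins == b].mean() for b in range(10)])
iiw = np.array([np.average(y[seen][bins == b], weights=w[bins == b])
                for b in range(10)])
truth = mu(np.linspace(0.05, 0.95, 10))
print(np.abs(naive - truth).mean(), np.abs(iiw - truth).mean())  # IIW ~ unbiased
```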
Graph traversals are used to solve many problems on graphs; the usual variants are depth-first search and breadth-first search. Implementing a graph traversal, we successively reach all vertices of the graph that belong to a connected component. Breadth-first search is the usual choice when constructing efficient algorithms for finding the connected components of a graph. Simple-iteration methods for solving systems of linear equations with modified graph adjacency matrices, and with a properly specified right-hand side, can be considered graph traversal algorithms. These traversal algorithms are, generally speaking, equivalent neither to depth-first search nor to breadth-first search. An example of such a traversal algorithm is the one associated with the Gauss-Seidel method. For an arbitrary connected graph, this algorithm requires no more iterations to visit all vertices than breadth-first search does; for many problem instances, it requires fewer.
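A small sketch of this linear-algebra view (our illustration; the sign thresholding stands in for the exact iteration in the paper): with B = I + A, where A is the adjacency matrix, and a starting vector e_s, a Jacobi-style update reaches vertices level by level like BFS, while a Gauss-Seidel-style sweep reuses entries updated earlier in the same iteration and can cover the graph in fewer iterations.

```python
# Traversal as simple iteration on B = I + A; Jacobi vs Gauss-Seidel order.
import numpy as np

A = np.array([[0,1,0,0,0],      # path graph 0-1-2-3-4
              [1,0,1,0,0],
              [0,1,0,1,0],
              [0,0,1,0,1],
              [0,0,0,1,0]])
n = A.shape[0]
B = A + np.eye(n, dtype=int)

def jacobi_like(B, s):          # all updates use the previous iterate: BFS levels
    x, its = np.zeros(n), 0
    x[s] = 1
    while not x.all():
        x = np.sign(B @ x); its += 1
    return its

def gauss_seidel_like(B, s):    # updates within a sweep are reused immediately
    x, its = np.zeros(n), 0
    x[s] = 1
    while not x.all():
        for i in range(n):
            x[i] = np.sign(B[i] @ x)
        its += 1
    return its

print(jacobi_like(B, 0), gauss_seidel_like(B, 0))   # 4 vs 1 on the path graph
```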
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real-world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic across learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention-based graph neural network called Hyper-SAGNN, applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms state-of-the-art methods on traditional tasks while also achieving strong performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
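A minimal numpy caricature of static/dynamic self-attention scoring (random weights, untrained, and only loosely following the Hyper-SAGNN design) shows how a single scorer can handle hyperedge candidates of any size:

```python
# Caricature of self-attention scoring for a variable-sized hyperedge:
# attention-refined ("dynamic") embeddings are compared with
# position-independent ("static") ones. Illustrative, not the trained model.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv, Ws = (rng.normal(size=(d, d)) * 0.3 for _ in range(4))

def hyperedge_score(E):            # E: (k, d) embeddings of a size-k candidate
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    att = np.exp(Q @ K.T / np.sqrt(d))
    np.fill_diagonal(att, 0.0)     # each node attends to the other nodes only
    att /= att.sum(1, keepdims=True)
    dynamic = np.tanh(att @ V)
    static = np.tanh(E @ Ws)
    # probability from the squared difference of dynamic vs static embeddings
    return 1 / (1 + np.exp(((dynamic - static) ** 2).mean()))

print(hyperedge_score(rng.normal(size=(3, d))))   # works for any hyperedge size
print(hyperedge_score(rng.normal(size=(5, d))))
```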