We present an efficient matrix-free point spread function (PSF) method for approximating operators that have locally supported non-negative integral kernels. The method computes impulse responses of the operator at scattered points, and interpolates these impulse responses to approximate integral kernel entries. Impulse responses are computed by applying the operator to Dirac comb batches of point sources, which are chosen by solving an ellipsoid packing problem. Evaluation of kernel entries allows us to construct a hierarchical matrix (H-matrix) approximation of the operator. Further matrix computations are performed with H-matrix methods. We use the method to build preconditioners for the Hessian operator in two inverse problems governed by partial differential equations (PDEs): inversion for the basal friction coefficient in an ice sheet flow problem and for the initial condition in an advective-diffusive transport problem. For many ill-posed inverse problems the Hessian of the data misfit term exhibits low-rank structure, so a low-rank approximation is suitable; for many problems of practical interest, however, the numerical rank of the Hessian is still large. Hessian impulse responses typically become more local as the numerical rank increases, which benefits the PSF method. Numerical results reveal that the PSF preconditioner clusters the spectrum of the preconditioned Hessian near one, yielding roughly 5x-10x reductions in the required number of PDE solves, as compared to regularization preconditioning and no preconditioning. We also present a numerical study of the influence of various parameters (that control the shape of the impulse responses) on the effectiveness of the advection-diffusion Hessian approximation. The results show that the PSF-based preconditioners are able to form good approximations of high-rank Hessians using a small number of operator applications.
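The batching idea, recovering several impulse responses from a single operator application, can be illustrated with a minimal 1D sketch. Everything here is an assumption made for illustration: the grid size `N`, the support radius `HALF_WIDTH`, the toy `kernel`, and the fixed source spacing, which stands in for the ellipsoid packing step described above. The interpolation of impulse responses and the H-matrix construction are not shown.

```python
import math

N = 60            # number of grid points (hypothetical 1D discretization)
HALF_WIDTH = 4    # kernel entry (i, j) is zero when |i - j| > HALF_WIDTH


def kernel(i, j):
    """Hypothetical non-negative, locally supported kernel whose width drifts with j."""
    r = abs(i - j)
    if r > HALF_WIDTH:
        return 0.0
    sigma = 1.0 + j / N
    return math.exp(-0.5 * (r / sigma) ** 2)


def apply_operator(x):
    """Matrix-free action y = A x; the sketch treats this as a black box."""
    return [sum(kernel(i, j) * x[j] for j in range(N)) for i in range(N)]


# Dirac comb: point sources spaced so that their impulse responses cannot overlap.
# (Fixed spacing is an assumption; it replaces the packing problem in the text.)
sources = list(range(HALF_WIDTH, N, 2 * HALF_WIDTH + 1))
comb = [1.0 if j in sources else 0.0 for j in range(N)]
response = apply_operator(comb)   # one operator application for the whole batch

# Split the batched response into one impulse response per source.
impulse_responses = {
    p: {i: response[i]
        for i in range(max(0, p - HALF_WIDTH), min(N, p + HALF_WIDTH + 1))}
    for p in sources
}
```

In this toy setting each recovered window coincides with the corresponding kernel column, because the supports are disjoint by construction.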
The first linear programming bound of McEliece, Rodemich, Rumsey, and Welch is the best known asymptotic upper bound for binary codes, for a certain subrange of distances. Starting from the work of Friedman and Tillich, there are, by now, some arguably easier and more direct arguments for this bound. We show that this more recent line of argument runs into certain difficulties if one tries to go beyond this bound (say, towards the second linear programming bound of McEliece, Rodemich, Rumsey, and Welch).
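For reference, the first linear programming bound is usually stated as follows, where R(δ) denotes the largest asymptotic rate of binary codes with relative distance δ and h is the binary entropy function. The notation is ours and is not taken from the abstract.

```latex
R(\delta) \;\le\; h\!\left(\tfrac{1}{2} - \sqrt{\delta(1-\delta)}\right),
\qquad
h(x) = -x\log_2 x - (1-x)\log_2(1-x),
\qquad 0 \le \delta \le \tfrac{1}{2}.
```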
Widely used pipelines for the analysis of high-dimensional data utilize two-dimensional visualizations. These are created, e.g., via t-distributed stochastic neighbor embedding (t-SNE). When applied to large data sets, these visualization techniques produce suboptimal embeddings, as the hyperparameters are not suitable for large data. Simply increasing these parameters usually does not work, as the computations become too expensive for practical workflows. In this paper, we argue that a sampling-based embedding approach can circumvent these problems. We show that hyperparameters must be chosen carefully, depending on the sampling rate and the intended final embedding. Further, we show how this approach speeds up the computation and increases the quality of the embeddings.
Context: Software model optimization is a process that automatically generates design alternatives, typically to enhance quantifiable non-functional properties of software systems, such as performance and reliability. Multi-objective evolutionary algorithms have been shown to be effective in this context for assisting the designer in identifying trade-offs between the desired non-functional properties. Objective: In this work, we investigate the effects of imposing a time budget to limit the search for design alternatives, which inevitably affects the quality of the resulting alternatives. Method: The effects of time budgets are analyzed by investigating both the quality of the generated design alternatives and their structural features when varying the budget and the genetic algorithm (NSGA-II, PESA2, SPEA2). This is achieved by employing multi-objective quality indicators and a tree-based representation of the search space. Results: The study reveals that the time budget significantly affects the quality of Pareto fronts, especially for performance and reliability. NSGA-II is the fastest algorithm, while PESA2 generates the highest-quality solutions. The imposition of a time budget results in structurally distinct models compared to those obtained without a budget, indicating that the search process is influenced by both the budget and algorithm selection. Conclusions: In software model optimization, imposing a time budget can be effective in saving optimization time, but designers should carefully consider the trade-off between time and solution quality in the Pareto front, along with the structural characteristics of the generated models. By making informed choices about the specific genetic algorithm, designers can achieve different trade-offs.
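As a reminder of what a Pareto front is in this setting, the following minimal sketch filters non-dominated design alternatives when every objective is minimized. The objective pairs (response time, failure probability) are hypothetical values chosen for illustration, and the function names are ours; this is not one of the algorithms studied (NSGA-II, PESA2, SPEA2).

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (response_time, failure_probability) pairs for five design alternatives.
candidates = [(2.0, 0.10), (1.5, 0.20), (3.0, 0.05), (2.5, 0.15), (1.5, 0.25)]
front = pareto_front(candidates)
```

Here `(2.5, 0.15)` is dominated by `(2.0, 0.10)` and `(1.5, 0.25)` by `(1.5, 0.20)`, so three alternatives remain on the front.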
Introduction: Oblique Target-rotation in the context of exploratory factor analysis is a relevant method for the investigation of the oblique independent clusters model. It has been argued that minimizing single cross-loadings by means of Target-rotation may lead to large effects of sampling error on the Target-rotated factor solutions. Method: In order to minimize effects of sampling error on the results of Target-rotation, we propose to compute the mean cross-loadings for each block of salient loadings of the independent clusters model and to perform Target-rotation for the block-wise mean cross-loadings. The resulting transformation matrix is then applied to the complete unrotated loading matrix in order to produce mean Target-rotated factors. Results: A simulation study based on correlated independent factor models revealed that mean oblique Target-rotation resulted in smaller negative bias of factor inter-correlations than conventional Target-rotation based on single loadings, especially when sample size was small and when the number of factors was large. An empirical example revealed that the similarity of Target-rotated factors computed for small subsamples with Target-rotated factors of the total sample was more pronounced for mean Target-rotation than for conventional Target-rotation. Discussion: Mean Target-rotation can be recommended in the context of oblique independent factor models, especially for small samples. An R-script and an SPSS-script for this form of Target-rotation are provided in the Appendix.
This study addresses a class of mixed-integer linear programming (MILP) problems that involve uncertainty in the objective function parameters. The parameters are assumed to form a random vector, whose probability distribution can only be observed through a finite training data set. Unlike most of the related studies in the literature, we also consider uncertainty in the underlying data set. The data uncertainty is described by a set of linear constraints for each random sample, and the uncertainty in the distribution (for a fixed realization of data) is defined using a type-1 Wasserstein ball centered at the empirical distribution of the data. The overall problem is formulated as a three-level distributionally robust optimization (DRO) problem. First, we prove that the three-level problem admits a single-level MILP reformulation if the class of loss functions is restricted to biaffine functions. Second, we show that for several particular forms of data uncertainty, the outlined problem can be solved reasonably fast by leveraging the nominal MILP problem. Finally, we conduct a computational study, where the out-of-sample performance of our model and the computational complexity of the proposed MILP reformulation are explored numerically for several application domains.
Advances in next-generation sequencing technology have enabled the high-throughput profiling of metagenomes and accelerated microbiome research. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co-occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high-dimensional and compositional, suffering from uneven sampling depth, over-dispersion, and zero-inflation. These characteristics often challenge the reliability of current methods for microbiome community detection. To this end, we propose a Bayesian stochastic block model to study the microbiome co-occurrence network based on the recently developed modified centered-log ratio transformation tailored for microbiome data analysis. Our model allows us to incorporate taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non-informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women, marking the first time the urinary microbiome co-occurrence network structure has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies.
We describe an efficient method for the approximation of functions using radial basis functions (RBFs), and extend this to a solver for boundary value problems on irregular domains. The method is based on RBFs with centers on a regular grid defined on a bounding box, with some of the centers outside the computational domain. The equation is discretized using collocation with oversampling, with collocation points inside the domain only, resulting in a rectangular linear system to be solved in a least squares sense. The goal of this paper is the efficient solution of that rectangular system. We show that the least squares problem splits into a regular part, which can be expedited with the FFT, and a low rank perturbation, which is treated separately with a direct solver. The rank of the perturbation is influenced by the irregular shape of the domain and by the weak enforcement of boundary conditions at points along the boundary. The solver extends the AZ algorithm which was previously proposed for function approximation involving frames and other overcomplete sets. The solver has near optimal log-linear complexity for univariate problems, and loses optimality for higher-dimensional problems but remains faster than a direct solver.
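The discretization described above (centers on a regular grid over a bounding box, oversampled collocation points inside the domain only, and a rectangular least squares system) can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions: the Gaussian basis, its width, the grid sizes, and the test function are all our choices, and the system is solved with a dense `numpy.linalg.lstsq` call, not with the FFT-based AZ-type solver that is the subject of the paper.

```python
import numpy as np


def gaussian_rbf_matrix(points, centers, width):
    """Collocation matrix with entries exp(-((x_i - c_j) / width)^2)."""
    return np.exp(-((points[:, None] - centers[None, :]) / width) ** 2)


def target(x):
    """Hypothetical smooth function to approximate."""
    return np.exp(x) * np.cos(3.0 * x)


# Regular grid of centers on the bounding box [-1, 1]; the domain is [-0.5, 0.5],
# so roughly half of the centers lie outside the computational domain.
centers = np.linspace(-1.0, 1.0, 40)
width = 2.0 * (centers[1] - centers[0])   # assumed shape parameter

# Oversampling: 120 collocation points, all inside the domain, for 40 unknowns.
colloc = np.linspace(-0.5, 0.5, 120)
A = gaussian_rbf_matrix(colloc, centers, width)          # rectangular, 120 x 40
coef, *_ = np.linalg.lstsq(A, target(colloc), rcond=None)

# Measure the approximation error on a finer grid inside the domain.
check = np.linspace(-0.5, 0.5, 401)
max_err = np.max(np.abs(gaussian_rbf_matrix(check, centers, width) @ coef - target(check)))
```

The dense solve costs far more than the structured solver the abstract refers to; the sketch is only meant to show where the rectangular system comes from.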
The distribution regression problem encompasses many important statistics and machine learning tasks, and arises in a large range of applications. Among the various existing approaches to this problem, kernel methods have become a method of choice. Indeed, kernel distribution regression is both computationally favorable and supported by recent learning theory. This theory also covers the two-stage sampling setting, where only samples from the input distributions are available. In this paper, we improve the learning theory of kernel distribution regression. We address kernels based on Hilbertian embeddings, which encompass most, if not all, of the existing approaches. We introduce a novel near-unbiased condition on the Hilbertian embeddings, which enables us to provide new error bounds on the effect of the two-stage sampling, thanks to a new analysis. We show that this near-unbiased condition holds for three important classes of kernels, based on optimal transport and mean embedding. As a consequence, we strictly improve the existing convergence rates for these kernels. Our setting and results are illustrated by numerical experiments.
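To make the two-stage sampling setting concrete, the following minimal sketch runs kernel ridge regression on bags of samples, using a Gaussian kernel built on empirical kernel mean embeddings. All specifics are assumptions chosen for illustration: the Gaussian input distributions, the bandwidths `ell` and `sigma`, the ridge parameter `lam`, and the regression target (the mean of each distribution). It illustrates only the mean-embedding family of kernels and makes no claim about the error bounds discussed above.

```python
import numpy as np


def embedding_inner(bags_a, bags_b, ell=1.0):
    """Inner products of empirical mean embeddings; bags have shape (n_bags, n_samples)."""
    diff = bags_a[:, None, :, None] - bags_b[None, :, None, :]
    return np.exp(-0.5 * (diff / ell) ** 2).mean(axis=(2, 3))


def self_inner(bags, ell=1.0):
    """Squared RKHS norm of each bag's empirical mean embedding."""
    diff = bags[:, :, None] - bags[:, None, :]
    return np.exp(-0.5 * (diff / ell) ** 2).mean(axis=(1, 2))


def distribution_kernel(bags_a, bags_b, ell=1.0, sigma=0.5):
    """Gaussian kernel on the distance between mean embeddings (squared MMD)."""
    mmd2 = (self_inner(bags_a, ell)[:, None] + self_inner(bags_b, ell)[None, :]
            - 2.0 * embedding_inner(bags_a, bags_b, ell))
    return np.exp(-np.maximum(mmd2, 0.0) / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)

# Stage 1: 20 input distributions N(m, 0.5^2); the label is the mean m.
means = np.linspace(-2.0, 2.0, 20)
# Stage 2: only 100 samples from each distribution are observed.
train_bags = means[:, None] + 0.5 * rng.standard_normal((20, 100))

K = distribution_kernel(train_bags, train_bags)
lam = 1e-3   # assumed ridge parameter
alpha = np.linalg.solve(K + lam * len(means) * np.eye(len(means)), means)

# Predict the label of an unseen distribution N(0.5, 0.5^2) from its samples.
test_bag = 0.5 + 0.5 * rng.standard_normal((1, 100))
prediction = (distribution_kernel(test_bag, train_bags) @ alpha)[0]
```

The second sampling stage enters only through the empirical averages in `embedding_inner` and `self_inner`; with larger bags these approach the exact embeddings of the underlying distributions.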
We introduce a formulation of the optimal transport problem for distributions on function spaces, where the stochastic map between functional domains can be partially represented in terms of an (infinite-dimensional) Hilbert-Schmidt operator mapping a Hilbert space of functions to another. For numerous machine learning tasks, data can be naturally viewed as samples drawn from spaces of functions, such as curves and surfaces, in high dimensions. Optimal transport for functional data analysis provides a useful framework for treating such domains. Since probability measures in infinite-dimensional spaces generally lack absolute continuity (that is, with respect to non-degenerate Gaussian measures), the Monge map in the standard optimal transport theory for finite-dimensional spaces may not exist. Our approach to the optimal transport problem in infinite dimensions relies on a suitable regularization technique: we restrict the class of transport maps to a Hilbert-Schmidt space of operators. To this end, we develop an efficient algorithm for finding the stochastic transport map between functional domains and provide theoretical guarantees on the existence, uniqueness, and consistency of our estimate for the Hilbert-Schmidt operator. We validate our method on synthetic datasets and examine the functional properties of the transport map. Experiments on real-world datasets of robot arm trajectories further demonstrate the effectiveness of our method on applications in domain adaptation.
The numerical solution of continuum damage mechanics (CDM) problems suffers from critical points during the material softening stage, and consequently existing iterative solvers are subject to a trade-off between computational expense and solution accuracy. Displacement-controlled arc-length methods were developed to address these challenges, but are currently applicable only to geometrically non-linear problems. In this work, we present a novel displacement-controlled arc-length (DAL) method for CDM problems in both local damage and non-local gradient damage versions. The analytical tangent matrix is derived for the DAL solver in both the local and the non-local models. In addition, several consistent and non-consistent implementation algorithms are proposed, implemented, and evaluated. Unlike existing force-controlled arc-length solvers that monolithically scale the external force vector, the proposed method treats the external force vector as an independent variable and determines the position of the system on the equilibrium path based on all the nodal variations of the external force vector. This flexibility makes the proposed solver substantially more efficient and versatile than existing solvers used in CDM problems. The advantages of the proposed DAL algorithm are demonstrated on several benchmark 1D problems with sharp snap-backs and on 2D examples with various boundary conditions and loading scenarios, where the proposed method drastically outperforms existing conventional approaches in terms of accuracy, computational efficiency, and the ability to predict the complete equilibrium path including all critical points.