We characterize structures such as monotonicity, convexity, and modality in smooth regression curves using persistent homology. Persistent homology is a key tool in topological data analysis that detects topological features such as connected components and holes (cycles or loops) in the data; in other words, it is a multiscale version of homology that characterizes sets through their connected components and holes. We use super-level sets of functions to extract geometric features via persistent homology. In particular, we explore structures in regression curves via the persistent homology of super-level sets of a function, where the function of interest is the first derivative of the regression function. In the course of this study, we extend an existing procedure for estimating the persistent homology of the first derivative of a regression function and establish its consistency. Moreover, as an application of the proposed methodology, we demonstrate that the persistent homology of the derivative of a function can reveal hidden structures in the function that are not visible from the persistent homology of the function itself. In addition, we illustrate that the proposed procedure can be used to compare the shapes of two or more regression curves, which is not possible from the persistent homology of the functions alone.
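To make the super-level-set construction concrete, the sketch below (our illustration, not the paper's estimator) computes the 0-dimensional super-level-set persistence pairs of a function sampled on a grid; in the setting above, the input values would be estimates of the first derivative of the regression function, e.g. obtained by differentiating a smoothed fit.

\begin{verbatim}
import numpy as np

def superlevel_persistence_0d(y):
    """0-dimensional persistence of the super-level sets {x : y(x) >= t}
    for a function sampled on a 1D grid (values y[0], ..., y[n-1]).
    Returns (birth, death) pairs with birth >= death."""
    n = len(y)
    parent = list(range(n))
    birth = [None] * n          # peak value of each component
    processed = [False] * n
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # sweep the threshold from the largest to the smallest sampled value
    for i in sorted(range(n), key=lambda k: y[k], reverse=True):
        processed[i] = True
        birth[i] = y[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and processed[j]:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # elder rule: the component with the lower peak dies here
                    if birth[ri] < birth[rj]:
                        ri, rj = rj, ri
                    pairs.append((birth[rj], y[i]))
                    parent[rj] = ri
    # convention: the surviving component dies at the smallest sampled value
    root = find(int(np.argmax(y)))
    pairs.append((birth[root], float(np.min(y))))
    return pairs
\end{verbatim}

The persistence of each pair (birth minus death) measures how prominent the corresponding local maximum of the derivative is, which is the kind of information that relates to monotonicity and modality of the regression curve.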
Graph convexity has been used as an important tool to better understand the structure of classes of graphs. Many studies are devoted to determining whether a graph equipped with a convexity is a {\em convex geometry}. In this work we survey results on characterizations of well-known classes of graphs via convex geometries, and we also offer some contributions of our own to this subject.
Object data analysis is concerned with statistical methodology for datasets whose elements reside in an arbitrary, unspecified metric space. In this work we propose the object shape, a novel measure of shape/symmetry for object data. The object shape is easy to compute and interpret, owing to its natural reading as an interpolation between two extreme forms of symmetry. As one major part of this work, we apply the object shape in various metric spaces and show that it unifies several pre-existing, classical forms of symmetry. We also propose a new visualization tool called the peeling plot, which allows the object shape to be used for outlier detection and principal component analysis of object data.
Seismic imaging is the numerical process of creating a volumetric representation of subsurface geological structures from elastic waves recorded at the surface of the Earth. As such, it is widely utilized in the energy and construction sectors for applications ranging from oil and gas prospecting, to geothermal production and carbon capture and storage monitoring, to geotechnical assessment of infrastructure. Extracting quantitative information from seismic recordings, such as an acoustic impedance model, is, however, a highly ill-posed inverse problem, due to the band-limited and noisy nature of the data. This paper introduces IntraSeismic, a novel hybrid seismic inversion method that seamlessly combines coordinate-based learning with the physics of the post-stack modeling operator. Key features of IntraSeismic are i) unparalleled performance in 2D and 3D post-stack seismic inversion, ii) rapid convergence rates, iii) the ability to seamlessly include hard constraints (i.e., well data) and perform uncertainty quantification, and iv) potential data compression and fast randomized access to portions of the inverted model. Synthetic and field data applications of IntraSeismic are presented to validate the effectiveness of the proposed method.
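For readers unfamiliar with the physics referenced above, the post-stack modeling operator is commonly written as (generic notation; the paper's exact discretization may differ)
\[
d(t,\mathbf{x}) \;=\; w(t) * \tfrac{1}{2}\,\frac{\partial}{\partial t}\,\ln m(t,\mathbf{x}),
\]
where $m$ is the acoustic impedance model, $w$ the seismic wavelet, and $*$ denotes convolution along the time axis. Recovering $m$ from band-limited, noisy data $d$ through this convolutional relation is the ill-posed inverse problem addressed here.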
Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex optimization that sequentially minimizes a majorizing surrogate of the objective function in each block coordinate while the other block coordinates are held fixed. We consider a family of BMM algorithms for minimizing smooth nonconvex objectives, where each parameter block is constrained within a subset of a Riemannian manifold. We establish that this algorithm converges asymptotically to the set of stationary points, and attains an $\epsilon$-stationary point within $\widetilde{O}(\epsilon^{-2})$ iterations. In particular, the assumptions for our complexity results are completely Euclidean when the underlying manifold is a product of Euclidean or Stiefel manifolds, although our analysis makes explicit use of the Riemannian geometry. Our general analysis applies to a wide range of algorithms with Riemannian constraints: Riemannian MM, block projected gradient descent, optimistic likelihood estimation, geodesically constrained subspace tracking, robust PCA, and Riemannian CP-dictionary-learning. We experimentally validate that our algorithm converges faster than standard Euclidean algorithms applied to the Riemannian setting.
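In generic notation (ours, for orientation), one BMM iteration updates block $i$ by minimizing a surrogate $g_i$ that upper-bounds the objective in that block and is tight at the current iterate:
\[
g_i\big(\theta_i \,\big|\, \theta^{(k)}\big) \;\ge\; f\big(\theta_1^{(k+1)},\ldots,\theta_{i-1}^{(k+1)},\theta_i,\theta_{i+1}^{(k)},\ldots,\theta_m^{(k)}\big)
\quad\text{with equality at } \theta_i=\theta_i^{(k)},
\qquad
\theta_i^{(k+1)} \in \operatorname*{arg\,min}_{\theta_i \in \mathcal{M}_i} \, g_i\big(\theta_i \,\big|\, \theta^{(k)}\big),
\]
where $\mathcal{M}_i$ is the constraint set for block $i$, here a subset of a Riemannian manifold.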
We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting, and that this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up, leading to $\Omega(M)$ sample complexity even for adaptive algorithms.
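For reference, the unconstrained benchmark alluded to above is the following standard characterization (constants suppressed):
\[
H^2(P,Q) \;=\; \tfrac{1}{2}\sum_x \big(\sqrt{P(x)}-\sqrt{Q(x)}\big)^2,
\qquad
n^{*}(P,Q) \;=\; \Theta\!\big(1/H^2(P,Q)\big),
\]
i.e., the number of i.i.d. samples needed to distinguish $P$ from $Q$ with constant error probability scales inversely with the squared Hellinger distance.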
Spatial data can come in a variety of different forms, but two of the most common generating models for such observations are random fields and point processes. Whilst it is known that spectral analysis can unify these two data forms, specific methodology for the related estimation has yet to be developed. In this paper, we solve this problem by extending multitaper estimation to estimate the spectral density matrix function for multivariate spatial data, where the processes can be any combination of point processes and random fields. We discuss finite-sample and asymptotic theory for the proposed estimators, as well as specific details of the implementation, including how to perform estimation on non-rectangular domains and how to correctly implement multitapering for processes sampled in different ways, e.g., continuously versus on a regular grid.
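For orientation, the classical multitaper estimator for a single mean-zero random field $Z$ observed on a domain $D$ (the paper extends this to multivariate combinations of random fields and point processes) takes the form
\[
\widehat{S}_K(\boldsymbol{\omega}) \;=\; \frac{1}{K}\sum_{k=1}^{K}\Big|\int_{D} h_k(\mathbf{x})\,Z(\mathbf{x})\,e^{-i\,\boldsymbol{\omega}^{\top}\mathbf{x}}\,\mathrm{d}\mathbf{x}\Big|^{2},
\]
where $h_1,\ldots,h_K$ are orthonormal tapers supported on $D$; for a point process, the tapered transform is instead a weighted sum over the observed points.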
Biomechanical and orthopaedic studies frequently encounter complex datasets that encompass both circular and linear variables. In most cases, (i) the circular and linear variables are considered in isolation, with the dependency between variables neglected, and (ii) the cyclicity of the circular variables is disregarded, resulting in erroneous decision making. Given the inherent characteristics of circular variables, it is imperative to adopt methods that integrate directional statistics to achieve precise modelling. This paper is motivated by the modelling of biomechanical data, namely fracture displacements, used as a measure in external fixator comparisons. We focus on a data set, based on an Ilizarov ring fixator, comprising six variables. A modelling framework applicable to the 6D joint distribution of circular-linear data based on vine copulas is proposed. The pair-copula decomposition concept of vine copulas represents the dependence structure as a combination of circular-linear, circular-circular and linear-linear pairs modelled by their respective copulas. This framework allows us to assess the dependencies in the joint distribution as well as account for the cyclicity of the circular variables. Thus, a new approach for accurate modelling of the mechanical behaviour of Ilizarov ring fixators and other data of this nature is provided.
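For readers unfamiliar with vine copulas, the pair-copula decomposition in the trivariate case reads (generic notation; the paper works with a 6D circular-linear version)
\[
f(x_1,x_2,x_3) \;=\; f_1(x_1)\,f_2(x_2)\,f_3(x_3)\;
c_{12}\big(F_1(x_1),F_2(x_2)\big)\;
c_{23}\big(F_2(x_2),F_3(x_3)\big)\;
c_{13|2}\big(F_{1|2}(x_1\,|\,x_2),\,F_{3|2}(x_3\,|\,x_2)\big),
\]
where each pair copula can be chosen from a circular-linear, circular-circular or linear-linear family according to the types of the two variables it couples.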
Recently defined expectile regions capture the idea of centrality with respect to a multivariate distribution, but fail to describe tail behavior; moreover, it is not at all clear what should be understood by the tail of a multivariate distribution. Therefore, cone expectile sets are introduced, which take into account a vector preorder on the multi-dimensional data points. This provides a way of describing and clustering a multivariate distribution/data cloud with respect to an order relation. Fundamental properties of cone expectiles, including dual representations of both expectile regions and cone expectile sets, are established. It is shown that set-valued sublinear risk measures can be constructed from cone expectile sets in the same way as in the univariate case. Inverse functions of cone expectiles are defined, which should be considered rank functions rather than depth functions. Finally, expectile orders for random vectors are introduced and characterized via expectile rank functions.
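Recall the standard univariate expectile underlying these constructions (the cone and set-valued versions are developed in the paper):
\[
e_\tau(X) \;=\; \operatorname*{arg\,min}_{m\in\mathbb{R}} \; \mathbb{E}\Big[\tau\big((X-m)_+\big)^2 + (1-\tau)\big((m-X)_+\big)^2\Big], \qquad \tau\in(0,1),
\]
with $(z)_+=\max(z,0)$; the asymmetric weighting of the squared loss is what allows expectiles, unlike the mean, to describe off-center regions of a distribution.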
With the increasing demand for intelligent systems capable of operating in different contexts (e.g., users on the move), correctly interpreting the user's needs has become crucial for such systems to give consistent answers to user questions. The most effective applications addressing this task are in the fields of natural language processing and semantic expansion of terms. These techniques aim to estimate the goal of an input query by reformulating it as an intent, commonly relying on textual resources built by exploiting different semantic relations such as \emph{synonymy}, \emph{antonymy} and many others. The aim of this paper is to generate such resources using the labels of a given taxonomy as the source of information. The obtained resources are integrated into a plain classifier that reformulates a set of input queries as intents, and the effect of each relation is tracked in order to quantify the impact of each semantic relation on the classification. As an extension to this, the best trade-off between improvement and noise introduction when combining such relations is evaluated. The assessment is carried out by generating the resources and their combinations and using them to tune the classifier, which is then used to reformulate the user questions as labels. The evaluation employs a wide and varied taxonomy as a use case, exploiting its labels as the basis for the semantic expansion and producing several corpora with the purpose of enhancing the pseudo-query estimation.
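As a purely illustrative sketch (the paper derives its resources from the taxonomy labels themselves; WordNet is used below only as a stand-in source of synonymy and antonymy relations, and the function name is hypothetical), per-relation expansion of a single-word label could look like:

\begin{verbatim}
# Hypothetical per-relation expansion of a taxonomy label.
# WordNet here merely stands in for whatever relation source is used.
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def expand_label(label, relation="synonymy"):
    """Return a set of expansion terms for a single-word taxonomy label."""
    terms = set()
    for synset in wn.synsets(label):
        for lemma in synset.lemmas():
            if relation == "synonymy":
                terms.add(lemma.name().replace("_", " "))
            elif relation == "antonymy":
                terms.update(a.name().replace("_", " ")
                             for a in lemma.antonyms())
    terms.discard(label)
    return terms

# e.g. expand_label("vehicle") yields synonym terms that can be added as
# extra features when classifying a user query into taxonomy labels.
\end{verbatim}

Keeping the expansions separated by relation, as above, is what makes it possible to measure the contribution (or noise) of each semantic relation individually.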
We hypothesize that, due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain in accuracy when the model has access to that modality in addition to another one. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
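In the notation suggested by the description above (ours), the conditional utilization rate of modality $m_1$ given modality $m_2$ can be written as
\[
u(m_1 \mid m_2) \;=\; A(\{m_1, m_2\}) - A(\{m_2\}),
\]
where $A(S)$ denotes the accuracy of the model when given access to the modalities in $S$; greedy learning manifests as a large gap between $u(m_1 \mid m_2)$ and $u(m_2 \mid m_1)$.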