This paper studies distribution-free inference in settings where the data set has a hierarchical structure -- for example, groups of observations, or repeated measurements. In such settings, standard notions of exchangeability may not hold. To address this challenge, a hierarchical form of exchangeability is derived, facilitating extensions of distribution-free methods, including conformal prediction and jackknife+. While the standard theoretical guarantee obtained by the conformal prediction framework is a marginal predictive coverage guarantee, in the special case of independent repeated measurements, it is possible to achieve a stronger form of coverage -- the "second-moment coverage" property -- to provide better control of conditional miscoverage rates, and distribution-free prediction sets that achieve this property are constructed. Simulations illustrate that this guarantee indeed leads to uniformly small conditional miscoverage rates. Empirically, this stronger guarantee comes at the cost of a larger width of the prediction set in scenarios where the fitted model is poorly calibrated, but this cost is very mild in cases where the fitted model is accurate.
A growing number of scholars and data scientists are conducting randomized experiments to analyze causal relationships in network settings where units influence one another. A dominant methodology for analyzing these network experiments has been design-based, leveraging randomization of treatment assignment as the basis for inference. In this paper, we generalize this design-based approach so that it can be applied to more complex experiments with a variety of causal estimands with different target populations. An important special case of such generalized network experiments is a bipartite network experiment, in which the treatment assignment is randomized among one set of units and the outcome is measured for a separate set of units. We propose a broad class of causal estimands based on stochastic intervention for generalized network experiments. Using a design-based approach, we show how to estimate the proposed causal quantities without bias, and develop conservative variance estimators. We apply our methodology to a randomized experiment in education where a group of selected students in middle schools are eligible for the anti-conflict promotion program, and the program participation is randomized within this group. In particular, our analysis estimates the causal effects of treating each student or his/her close friends, for different target populations in the network. We find that while the treatment improves the overall awareness against conflict among students, it does not significantly reduce the total number of conflicts.
This paper explores an iterative coupling approach to solve linear thermo-poroelasticity problems, with its application as a high-fidelity discretization utilizing finite elements during the training of projection-based reduced order models. One of the main challenges in addressing coupled multi-physics problems is the complexity and computational expenses involved. In this study, we introduce a decoupled iterative solution approach, integrated with reduced order modeling, aimed at augmenting the efficiency of the computational algorithm. The iterative coupling technique we employ builds upon the established fixed-stress splitting scheme that has been extensively investigated for Biot's poroelasticity. By leveraging solutions derived from this coupled iterative scheme, the reduced order model employs an additional Galerkin projection onto a reduced basis space formed by a small number of modes obtained through proper orthogonal decomposition. The effectiveness of the proposed algorithm is demonstrated through numerical experiments, showcasing its computational prowess.
This paper introduces panoptica, a versatile and performance-optimized package designed for computing instance-wise segmentation quality metrics from 2D and 3D segmentation maps. panoptica addresses the limitations of existing metrics and provides a modular framework that complements the original intersection over union-based panoptic quality with other metrics, such as the distance metric Average Symmetric Surface Distance. The package is open-source, implemented in Python, and accompanied by comprehensive documentation and tutorials. panoptica employs a three-step metrics computation process to cover diverse use cases. The efficacy of panoptica is demonstrated on various real-world biomedical datasets, where an instance-wise evaluation is instrumental for an accurate representation of the underlying clinical task. Overall, we envision panoptica as a valuable tool facilitating in-depth evaluation of segmentation methods.
This paper deals with surrogate modelling of a computer code output in a hierarchical multi-fidelity context, i.e., when the output can be evaluated at different levels of accuracy and computational cost. Using observations of the output at low- and high-fidelity levels, we propose a method that combines Gaussian process (GP) regression and Bayesian neural network (BNN), in a method called GPBNN. The low-fidelity output is treated as a single-fidelity code using classical GP regression. The high-fidelity output is approximated by a BNN that incorporates, in addition to the high-fidelity observations, well-chosen realisations of the low-fidelity output emulator. The predictive uncertainty of the final surrogate model is then quantified by a complete characterisation of the uncertainties of the different models and their interaction. GPBNN is compared with most of the multi-fidelity regression methods allowing to quantify the prediction uncertainty.
We propose an implementable, feedforward neural network-based structure preserving probabilistic numerical approximation for a generalized obstacle problem describing the value of a zero-sum differential game of optimal stopping with asymmetric information. The target solution depends on three variables: the time, the spatial (or state) variable, and a variable from a standard $(I-1)$-simplex which represents the probabilities with which the $I$ possible configurations of the game are played. The proposed numerical approximation preserves the convexity of the continuous solution as well as the lower and upper obstacle bounds. We show convergence of the fully-discrete scheme to the unique viscosity solution of the continuous problem and present a range of numerical studies to demonstrate its applicability.
In this paper, two novel classes of implicit exponential Runge-Kutta (ERK) methods are studied for solving highly oscillatory systems. First of all, we analyze the symplectic conditions of two kinds of exponential integrators, and present a first-order symplectic method. In order to solve highly oscillatory problems, the highly accurate implicit ERK integrators (up to order four) are formulated by comparing the Taylor expansions of numerical and exact solutions, it is shown that the order conditions of two new kinds of exponential methods are identical to the order conditions of classical Runge-Kutta (RK) methods. Moreover, we investigate the linear stability properties of these exponential methods. Finally, numerical results not only present the long time energy preservation of the first-order symplectic method, but also illustrate the accuracy and efficiency of these formulated methods in comparison with standard ERK methods.
We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.
We propose an innovative and generic methodology to analyse individual and collective behaviour through individual trajectory data. The work is motivated by the analysis of GPS trajectories of fishing vessels collected from regulatory tracking data in the context of marine biodiversity conservation and ecosystem-based fisheries management. We build a low-dimensional latent representation of trajectories using convolutional neural networks as non-linear mapping. This is done by training a conditional variational auto-encoder taking into account covariates. The posterior distributions of the latent representations can be linked to the characteristics of the actual trajectories. The latent distributions of the trajectories are compared with the Bhattacharyya coefficient, which is well-suited for comparing distributions. Using this coefficient, we analyse the variation of the individual behaviour of each vessel during time. For collective behaviour analysis, we build proximity graphs and use an extension of the stochastic block model for multiple networks. This model results in a clustering of the individuals based on their set of trajectories. The application to French fishing vessels enables us to obtain groups of vessels whose individual and collective behaviours exhibit spatio-temporal patterns over the period 2014-2018.
We study the statistical capacity of the classical binary perceptrons with general thresholds $\kappa$. After recognizing the connection between the capacity and the bilinearly indexed (bli) random processes, we utilize a recent progress in studying such processes to characterize the capacity. In particular, we rely on \emph{fully lifted} random duality theory (fl RDT) established in \cite{Stojnicflrdt23} to create a general framework for studying the perceptrons' capacities. Successful underlying numerical evaluations are required for the framework (and ultimately the entire fl RDT machinery) to become fully practically operational. We present results obtained in that directions and uncover that the capacity characterizations are achieved on the second (first non-trivial) level of \emph{stationarized} full lifting. The obtained results \emph{exactly} match the replica symmetry breaking predictions obtained through statistical physics replica methods in \cite{KraMez89}. Most notably, for the famous zero-threshold scenario, $\kappa=0$, we uncover the well known $\alpha\approx0.8330786$ scaled capacity.
Hashing has been widely used in approximate nearest search for large-scale database retrieval for its computation and storage efficiency. Deep hashing, which devises convolutional neural network architecture to exploit and extract the semantic information or feature of images, has received increasing attention recently. In this survey, several deep supervised hashing methods for image retrieval are evaluated and I conclude three main different directions for deep supervised hashing methods. Several comments are made at the end. Moreover, to break through the bottleneck of the existing hashing methods, I propose a Shadow Recurrent Hashing(SRH) method as a try. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function to encourage similar images projected close. To this end, I propose a concept: shadow of the CNN output. During optimization process, the CNN output and its shadow are guiding each other so as to achieve the optimal solution as much as possible. Several experiments on dataset CIFAR-10 show the satisfying performance of SRH.