In this contribution, we deal with seed-based information retrieval in networks of research publications. Using systematic reviews as a baseline, and publication data from the NIH Open Citation Collection, we compare the performance of three citation-based approaches, namely direct citation, co-citation, and bibliographic coupling, with respect to recall and precision measures. In addition, we include the PubMed Related Article score as well as combined approaches in the comparison. We also provide a fairly comprehensive review of earlier research in which citation relations have been used for information retrieval purposes. The results show an advantage for co-citation over bibliographic coupling and direct citation. However, combining the three approaches outperforms the exclusive use of co-citation in the study. The results further indicate, in line with previous research, that combining citation-based approaches with textual approaches enhances the performance of seed-based information retrieval. The results of the study may therefore guide the choice of citation similarity measures in methods that combine citation-based and textual approaches. We suggest that future research use more structured approaches to evaluate methods for seed-based retrieval of publications, including comparative evaluations as well as the elaboration of common data sets and baselines for evaluation.
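As a minimal illustration of the three citation-based similarity measures compared above (the data structures and function names are hypothetical and not tied to the NIH Open Citation Collection format), the following Python sketch computes direct citation, bibliographic coupling, and co-citation from a toy citation graph:

    # Illustrative sketch (not from the paper): the three citation-based
    # similarities between two publications, given a citation graph stored
    # as {paper_id: set of cited paper_ids}.

    def direct_citation(cites, a, b):
        # 1 if either paper cites the other, else 0.
        return int(b in cites.get(a, set()) or a in cites.get(b, set()))

    def bibliographic_coupling(cites, a, b):
        # Number of references shared by the two papers.
        return len(cites.get(a, set()) & cites.get(b, set()))

    def co_citation(cited_by, a, b):
        # Number of later papers that cite both a and b.
        return len(cited_by.get(a, set()) & cited_by.get(b, set()))

    # Toy graph: p1 and p2 share a reference (p3) and are co-cited by p4.
    cites = {"p1": {"p3"}, "p2": {"p3"}, "p4": {"p1", "p2"}}
    cited_by = {}
    for src, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(src)

    print(bibliographic_coupling(cites, "p1", "p2"))  # 1
    print(co_citation(cited_by, "p1", "p2"))          # 1
    print(direct_citation(cites, "p1", "p2"))         # 0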
We introduce a new conjecture on the computational hardness of detecting random lifts of graphs: we claim that there is no polynomial-time algorithm that can distinguish between a large random $d$-regular graph and a large random lift of a Ramanujan $d$-regular base graph (provided that the lift is corrupted by a small amount of extra noise), and likewise for bipartite random graphs and lifts of bipartite Ramanujan graphs. We give evidence for this conjecture by proving lower bounds against the local statistics hierarchy of hypothesis testing semidefinite programs. We then explore the consequences of this conjecture for the hardness of certifying bounds on numerous functions of random regular graphs, expanding on a direction initiated by Bandeira, Banks, Kunisky, Moore, and Wein (2021). Conditional on this conjecture, we show that no polynomial-time algorithm can certify tight bounds on the maximum cut of random 3- or 4-regular graphs, the maximum independent set of random 3- or 4-regular graphs, or the chromatic number of random 7-regular graphs. We show similar gaps asymptotically for large degree for the maximum independent set and for any degree for the minimum dominating set, finding that naive spectral and combinatorial bounds are optimal among all polynomial-time certificates. Likewise, for small-set vertex and edge expansion in the limit of very small sets, we show that the spectral bounds of Kahale (1995) are optimal among all polynomial-time certificates.
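For concreteness, one schematic way to state the detection problem behind the conjecture (the notation here is illustrative rather than taken from the paper) is as a hypothesis test \[ H_0 : G \sim \mathcal{G}_{n,d} \qquad \text{versus} \qquad H_1 : G \sim \widetilde{\mathrm{Lift}}_n(B), \] where $\mathcal{G}_{n,d}$ is the uniform distribution over $d$-regular graphs on $n$ vertices, $B$ is a fixed Ramanujan $d$-regular base graph, $\mathrm{Lift}_n(B)$ denotes a uniformly random $n$-fold lift of $B$, and the tilde indicates a small amount of added noise (e.g., a small fraction of edges resampled); the conjecture asserts that no polynomial-time test distinguishes $H_0$ from $H_1$ with probability $1 - o(1)$.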
Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
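To make the notion of correlated test prompts concrete, here is a small Python sketch (toy data and illustrative variable names, not the paper's analysis code) that estimates the prompt-by-prompt correlation of pass/fail outcomes across a set of models:

    # Illustrative sketch: correlation structure of per-prompt performance
    # across models. `scores` is a (num_models x num_prompts) matrix of
    # pass/fail outcomes; here it is filled with toy random data.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.integers(0, 2, size=(20, 100)).astype(float)

    # Correlation between prompts, computed across models: entry (i, j) is
    # high when the same models tend to succeed (or fail) on prompts i and j.
    prompt_corr = np.corrcoef(scores.T)

    # A simple summary to compare against the "random sample" assumption:
    # the mean absolute off-diagonal correlation between prompts.
    off_diag = prompt_corr[~np.eye(prompt_corr.shape[0], dtype=bool)]
    print("mean |corr| between prompts:", np.abs(off_diag).mean())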
In human neuroimaging studies, atlas registration enables mapping MRI scans to a common coordinate frame, which is necessary to aggregate data from multiple subjects. Machine learning registration methods have achieved excellent speed and accuracy but lack interpretability. More recently, keypoint-based methods have been proposed to tackle this issue, but their accuracy is still subpar, particularly when fitting nonlinear transforms. Here we propose Registration by Regression (RbR), a novel atlas registration framework that is highly robust and flexible, conceptually simple, and can be trained with cheaply obtained data. RbR predicts the (x,y,z) atlas coordinates for every voxel of the input scan (i.e., every voxel is a keypoint), and then uses closed-form expressions to quickly fit transforms using a wide array of possible deformation models, including affine and nonlinear ones (e.g., B-spline, Demons, or invertible diffeomorphic models). Robustness is provided by the large number of voxels informing the registration and can be further increased by robust estimators like RANSAC. Experiments on independent public datasets show that RbR yields more accurate registration than competing keypoint approaches, while providing full control of the deformation model.
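As a sketch of the closed-form fitting step (our assumptions, not the authors' released code), the affine case reduces to an ordinary least-squares solve from voxel coordinates to predicted atlas coordinates; a robust estimator such as RANSAC could be wrapped around the same solve to downweight poorly predicted voxels:

    # Minimal sketch: fit an affine transform in closed form from per-voxel
    # predicted atlas coordinates.
    import numpy as np

    def fit_affine(voxel_xyz, predicted_atlas_xyz):
        """voxel_xyz, predicted_atlas_xyz: (N, 3) arrays of corresponding points."""
        N = voxel_xyz.shape[0]
        X = np.hstack([voxel_xyz, np.ones((N, 1))])   # homogeneous coords, (N, 4)
        # Solve X @ A = Y for A (4 x 3) in the least-squares sense.
        A, *_ = np.linalg.lstsq(X, predicted_atlas_xyz, rcond=None)
        return A  # apply with np.hstack([pts, ones]) @ A

    # Toy example: recover a known affine map from noisy "predictions".
    rng = np.random.default_rng(0)
    pts = rng.uniform(0, 100, size=(1000, 3))
    true_A = np.vstack([np.diag([1.1, 0.9, 1.0]), [5.0, -3.0, 2.0]])   # (4, 3)
    preds = np.hstack([pts, np.ones((1000, 1))]) @ true_A + rng.normal(0, 0.1, (1000, 3))
    print(np.allclose(fit_affine(pts, preds), true_A, atol=0.1))       # True (approximately)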
Histogramming is often taken for granted, but the power and compactness of partially aggregated, multidimensional summary statistics, and their fundamental connection to differential and integral calculus make them formidable statistical objects, especially when very large data volumes are involved. But expressing these concepts robustly and efficiently in high-dimensional parameter spaces and for large data samples is a highly non-trivial challenge -- doubly so if the resulting library is to remain usable by scientists as opposed to software engineers. In this paper we summarise the core principles required for consistent generalised histogramming, and use them to motivate the design principles and implementation mechanics of the re-engineered YODA histogramming library, a key component of physics data-model comparison and statistical interpretation in collider physics.
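To illustrate what "partially aggregated summary statistics" means in practice (this is generic Python, not the YODA API), a 1D bin can carry the sum of weights, of squared weights, and of weighted fill values, which is enough to reconstruct bin heights, statistical errors, and in-bin means, and to merge histograms from separate runs:

    # Illustrative sketch of the per-bin moments a weighted histogram can carry.
    class Bin1D:
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
            self.sumw = 0.0    # sum of weights      -> bin height
            self.sumw2 = 0.0   # sum of weights^2    -> statistical error
            self.sumwx = 0.0   # weighted sum of x   -> in-bin mean

        def fill(self, x, w=1.0):
            self.sumw += w
            self.sumw2 += w * w
            self.sumwx += w * x

        @property
        def error(self):
            return self.sumw2 ** 0.5

    bins = [Bin1D(i, i + 1) for i in range(10)]
    for x, w in [(0.3, 1.0), (0.7, 0.5), (4.2, 2.0)]:
        bins[int(x)].fill(x, w)        # toy binning: unit-width bins from 0
    print(bins[0].sumw, bins[0].error)  # 1.5, sqrt(1.25)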
In Bayesian theory, the role of information is central. The influence exerted by prior information on posterior outcomes often jeopardizes Bayesian studies, due to the potentially subjective nature of the prior choice. In settings where a priori knowledge is lacking, reference prior theory emerges as a well-suited tool. Based on the criterion of mutual information, this theory makes it possible to construct a non-informative prior whose choice can be qualified as objective. In this paper, we contribute to the enrichment of reference prior theory. Indeed, we unveil an original analogy between reference prior theory and Global Sensitivity Analysis, from which we propose a natural generalization of the definition of mutual information. Leveraging dissimilarity measures between probability distributions, such as f-divergences, we provide a formalized framework for what we term generalized reference priors. Our main result establishes a limiting form of the mutual information, which simplifies the definition of reference priors as its maximizing arguments. This approach opens a new path that facilitates the theoretical derivation of reference priors under constraints or within specific classes. In the absence of constraints, we further prove that the Jeffreys prior maximizes the generalized mutual information considered.
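As a reminder of the standard construction being generalized (stated here in textbook form, not necessarily the paper's notation), the mutual information of a prior $\pi$ is \[ I(\pi) = \int p(x) \int \pi(\theta \mid x) \log \frac{\pi(\theta \mid x)}{\pi(\theta)} \, d\theta \, dx = \mathbb{E}_X\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid X) \,\|\, \pi \right) \right], \] and a reference prior is, up to the asymptotic limit in the number of observations required by the formal definition, a maximizing argument $\pi^\star \in \arg\max_\pi I(\pi)$; the generalization discussed here replaces the Kullback-Leibler divergence by another dissimilarity measure, such as an f-divergence, between the posterior $\pi(\cdot \mid X)$ and the prior $\pi$.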
With the advent of massive data sets, much of the computational science and engineering community has moved toward data-intensive approaches in regression and classification. However, these present significant challenges due to the increasing size, complexity and dimensionality of the problems. In particular, covariance matrices are in many cases numerically unstable, and it is well known from numerical analysis that such ill-conditioned matrices often cannot be inverted accurately on a finite-precision computer. A common ad hoc approach to stabilizing a matrix is the application of a so-called nugget. However, this can change the model and introduce error relative to the original solution. In this paper we develop a multilevel computational method that scales well with the number of observations and dimensions. A multilevel basis is constructed that is adapted to a kd-tree partitioning of the observations. Numerically unstable covariance matrices with large condition numbers can be transformed into well-conditioned multilevel ones without compromising accuracy. Moreover, it is shown that the multilevel prediction exactly solves the Best Linear Unbiased Predictor (BLUP) and Generalized Least Squares (GLS) model, but is numerically stable. The multilevel method is tested on numerically unstable problems of up to 25 dimensions. Numerical results show speedups of up to 42,050 times for solving the BLUP problem, with the same accuracy as the traditional iterative approach. For very ill-conditioned cases, where the traditional approach fails, the speedup is effectively infinite. In addition, decay estimates for the multilevel covariance matrices are derived based on high-dimensional interpolation techniques from the field of numerical analysis. This work lies at the intersection of statistics, uncertainty quantification, high performance computing and computational applied mathematics.
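For reference, in standard kriging notation (not necessarily that of the paper), the GLS estimate and BLUP that the multilevel method reproduces are \[ \hat{\beta} = (X^\top C^{-1} X)^{-1} X^\top C^{-1} y, \qquad \hat{y}(s_0) = x_0^\top \hat{\beta} + c_0^\top C^{-1} (y - X \hat{\beta}), \] where $C$ is the covariance matrix of the observations, $x_0$ collects the regression covariates at the prediction site $s_0$, and $c_0$ is the vector of covariances between the prediction site and the observations; the nugget approach replaces $C$ by $C + \tau^2 I$ for some small $\tau^2 > 0$, which stabilizes the solves at the cost of altering the model.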
In ecology we may find scenarios where the same phenomenon (species occurrence, species abundance, etc.) is observed using two different types of samplers. For instance, species data can be collected from scientific sampling with a completely random sample pattern, but also from opportunistic sampling (e.g., whale or bird watching, or commercial fishery vessels), in which observers tend to look for a specific species in areas where they expect to find it. Species Distribution Models (SDMs) are a widely used tool for analyzing this kind of ecological data. Specifically, we have two models available for the above data: a geostatistical model (GM) for the data coming from a completely random sampler, and a preferential model (PM) for data from opportunistic sampling. Integration of information coming from different sources can be handled via expert elicitation and integrated models. We focus here on a sequential Bayesian procedure that connects the two models through the updating of prior distributions. Implementation of the Bayesian paradigm is done through the integrated nested Laplace approximation (INLA) methodology, a good option for inference and prediction in spatial models with high performance and low computational costs. This sequential approach has been evaluated by simulating several scenarios and comparing the results of sharing information from one model to the other using different criteria. The procedure has also been exemplified with a real dataset. Our main results imply that, in general, it is better to share information from the independent (completely random) model to the preferential model than the other way around. However, this depends on factors such as the spatial range or the spatial arrangement of the sampling locations.
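A schematic way to write the sequential update described above (the notation is illustrative, not the paper's) is \[ \pi_{\mathrm{PM}}(\psi) = p_{\mathrm{GM}}(\psi \mid y_{\mathrm{GM}}), \] i.e., the posterior distribution of the parameters $\psi$ shared by the two models (e.g., the spatial range and marginal standard deviation), obtained by fitting the geostatistical model to the completely random sample, is reused as the prior for those parameters in the preferential model; sharing in the opposite direction swaps the roles of GM and PM, and in practice the posterior is typically summarized by a parametric approximation before being passed on.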
A key requirement for the success of supervised deep learning is a large labeled dataset - a condition that is difficult to meet in medical image analysis. Self-supervised learning (SSL) can help in this regard by providing a strategy to pre-train a neural network with unlabeled data, followed by fine-tuning for a downstream task with limited annotations. Contrastive learning, a particular variant of SSL, is a powerful technique for learning image-level representations. In this work, we propose strategies for extending the contrastive learning framework for segmentation of volumetric medical images in the semi-supervised setting with limited annotations, by leveraging domain-specific and problem-specific cues. Specifically, we propose (1) novel contrasting strategies that leverage structural similarity across volumetric medical images (domain-specific cue) and (2) a local version of the contrastive loss to learn distinctive representations of local regions that are useful for per-pixel segmentation (problem-specific cue). We carry out an extensive evaluation on three Magnetic Resonance Imaging (MRI) datasets. In the limited annotation setting, the proposed method yields substantial improvements compared to other self-supervision and semi-supervised learning techniques. When combined with a simple data augmentation technique, the proposed method reaches within 8% of benchmark performance using only two labeled MRI volumes for training, corresponding to only 4% (for ACDC) of the training data used to train the benchmark.
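For orientation, the global contrasting strategies build on losses of the NT-Xent family; the following PyTorch sketch shows such a loss in its generic form (the temperature and the way positive pairs are formed across volumes are illustrative choices, not the paper's exact recipe):

    # Generic NT-Xent-style contrastive loss for two "views" of the same batch.
    import torch
    import torch.nn.functional as F

    def nt_xent(z1, z2, temperature=0.1):
        """z1, z2: (B, D) embeddings; rows i of z1 and z2 form a positive pair."""
        B = z1.shape[0]
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D)
        sim = z @ z.t() / temperature                        # scaled cosine similarities
        sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float("-inf"))
        # For row i < B the positive is i + B, and vice versa.
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(B)])
        return F.cross_entropy(sim, targets)

    loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
    print(loss.item())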
Graph representation learning for hypergraphs can be used to extract patterns among the higher-order interactions that are critically important in many real-world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic across learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN that is applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms state-of-the-art methods on traditional tasks while also achieving strong performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
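As a simplified illustration of how self-attention can score a variable-sized candidate hyperedge (a generic sketch, not the exact Hyper-SAGNN architecture), each node can be given a position-independent "static" embedding and an attention-derived "dynamic" embedding, with their discrepancy mapped to a membership probability:

    # Sketch: scoring a candidate hyperedge of arbitrary size with self-attention.
    import torch
    import torch.nn as nn

    class HyperedgeScorer(nn.Module):
        def __init__(self, in_dim, hid_dim=64, heads=4):
            super().__init__()
            self.static = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
            self.proj = nn.Linear(in_dim, hid_dim)
            self.attn = nn.MultiheadAttention(hid_dim, heads, batch_first=True)
            self.head = nn.Linear(hid_dim, 1)

        def forward(self, node_feats):                      # (k, in_dim), k = hyperedge size
            x = self.proj(node_feats).unsqueeze(0)          # (1, k, hid_dim)
            dyn, _ = self.attn(x, x, x)                     # "dynamic" embeddings
            stat = self.static(node_feats).unsqueeze(0)     # "static" embeddings
            per_node = self.head((dyn - stat) ** 2)         # (1, k, 1) per-node scores
            return torch.sigmoid(per_node.mean())           # probability the tuple is a hyperedge

    scorer = HyperedgeScorer(in_dim=16)
    print(scorer(torch.randn(3, 16)).item())  # works for any hyperedge size k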
Recent advances in 3D fully convolutional networks (FCN) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from the large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafting features or training class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that will first use a 3D FCN to roughly define a candidate region, which will then be used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital that includes 150 CT scans, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5% to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve a significantly higher performance in small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download at https://github.com/holgerroth/3Dunet_abdomen_cascade.
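The coarse-to-fine idea can be sketched as follows (illustrative NumPy code, not the released implementation): the first-stage mask defines a padded bounding box, and only the cropped sub-volume is passed to the second-stage FCN:

    # Sketch of the cascade's cropping step: derive a candidate region from the
    # coarse first-stage mask so the second-stage FCN sees far fewer voxels.
    import numpy as np

    def candidate_bbox(coarse_mask, margin=8):
        """coarse_mask: (D, H, W) boolean array from the first-stage FCN."""
        idx = np.argwhere(coarse_mask)
        lo = np.maximum(idx.min(axis=0) - margin, 0)
        hi = np.minimum(idx.max(axis=0) + margin + 1, coarse_mask.shape)
        return tuple(slice(a, b) for a, b in zip(lo, hi))

    # Toy usage: pretend the first stage flagged a small blob inside a big volume.
    volume = np.zeros((128, 128, 128), dtype=np.float32)
    coarse = np.zeros_like(volume, dtype=bool)
    coarse[40:60, 50:70, 30:55] = True
    crop = volume[candidate_bbox(coarse)]        # fed to the second-stage FCN
    print(crop.shape, crop.size / volume.size)   # much smaller than the full volume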