The diffusion of AI and big data is reshaping decision-making processes by increasing the amount of information that supports decisions while reducing direct interaction with data and empirical evidence. This paradigm shift introduces new sources of uncertainty, as limited data observability results in ambiguity and a lack of interpretability. The need for the proper analysis of data-driven strategies motivates the search for new models that can describe this type of bounded access to knowledge. This contribution presents a novel theoretical model for uncertainty in knowledge representation and its transfer mediated by agents. We provide a dynamical description of knowledge states by endowing our model with a structure to compare and combine them. Specifically, an update is represented through combinations, and its explainability is based on its consistency in different dimensional representations. We look at inequivalent knowledge representations in terms of multiplicity of inferences, preference relations, and information measures. Furthermore, we define a formal analogy with two scenarios that illustrate non-classical uncertainty in terms of ambiguity (Ellsberg's model) and reasoning about knowledge mediated by other agents observing data (Wigner's friend). Finally, we discuss some implications of the proposed model for data-driven strategies, with special attention to reasoning under uncertainty about business value dimensions and the design of measurement tools for their assessment.
This survey is concerned with the power of random information for approximation in the (deterministic) worst-case setting, with special emphasis on information consisting of functionals selected independently and identically distributed (iid) at random on a class of admissible information functionals. We present a general result based on a weighted least squares method and derive consequences for special cases. Improvements are available if the information is ``Gaussian'' or if we consider iid function values for Sobolev spaces. We include open questions to guide future research on the power of random information in the context of information-based complexity.
Within Bayesian nonparametrics, dependent Dirichlet process mixture models provide a highly flexible approach for conducting inference about the conditional density function. However, several formulations of this class make either rather restrictive modelling assumptions or involve intricate algorithms for posterior inference, thus preventing their widespread use. In response to these challenges, we present a flexible, versatile, and computationally tractable model for density regression based on a single-weights dependent Dirichlet process mixture of normal distributions model for univariate continuous responses. We assume an additive structure for the mean of each mixture component and incorporate the effects of continuous covariates through smooth nonlinear functions. The key components of our modelling approach are penalised B-splines and their bivariate tensor product extension. Our proposed method also seamlessly accommodates parametric effects of categorical covariates, linear effects of continuous covariates, interactions between categorical and/or continuous covariates, varying coefficient terms, and random effects, which is why we refer our model as a Dirichlet process mixture of normal structured additive regression models. A noteworthy feature of our method is its efficiency in posterior simulation through Gibbs sampling, as closed-form full conditional distributions for all model parameters are available. Results from a simulation study demonstrate that our approach successfully recovers true conditional densities and other regression functionals in various challenging scenarios. Applications to a toxicology, disease diagnosis, and agricultural study are provided and further underpin the broad applicability of our modelling framework. An R package, \texttt{DDPstar}, implementing the proposed method is publicly available at \url{//bitbucket.org/mxrodriguez/ddpstar}.
The theory of generalized locally Toeplitz (GLT) sequences is a powerful apparatus for computing the asymptotic spectral distribution of matrices $A_n$ arising from numerical discretizations of differential equations. Indeed, when the mesh fineness parameter $n$ tends to infinity, these matrices $A_n$ give rise to a sequence $\{A_n\}_n$, which often turns out to be a GLT sequence. In this paper, we extend the theory of GLT sequences in several directions: we show that every GLT sequence enjoys a normal form, we identify the spectral symbol of every GLT sequence formed by normal matrices, and we prove that, for every GLT sequence $\{A_n\}_n$ formed by normal matrices and every continuous function $f:\mathbb C\to\mathbb C$, the sequence $\{f(A_n)\}_n$ is again a GLT sequence whose spectral symbol is $f(\kappa)$, where $\kappa$ is the spectral symbol of $\{A_n\}_n$. In addition, using the theory of GLT sequences, we prove a spectral distribution result for perturbed normal matrices.
Two-sample spiked model is an important issue in multivariate statistical inference. This paper focuses on testing the number of spikes in a high-dimensional generalized two-sample spiked model, which is free of Gaussian population assumption and the diagonal or block-wise diagonal restriction of population covariance matrix, and the spiked eigenvalues are not necessary required to be bounded. In order to determine the number of spikes, we first propose a general test, which relies on the partial linear spectral statistics. We establish its asymptotic normality under the null hypothesis. Then we apply the conclusion to two statistical problem, variable selection in large-dimensional linear regression and change point detection when change points and additive outliers exist simultaneously. Simulations and empirical analysis are conducted to illustrate the good performance of our methods.
In many medical image classification problems, labeled data is scarce while unlabeled data is more available. Semi-supervised learning and self-supervised learning are two different research directions that can improve accuracy by learning from extra unlabeled data. Recent methods from both directions have reported significant gains on traditional benchmarks. Yet past benchmarks do not focus on medical tasks and rarely compare self- and semi- methods together on equal footing. Furthermore, past benchmarks often handle hyperparameter tuning suboptimally. First, they may not tune hyperparameters at all, leading to underfitting. Second, when tuning does occur, it often unrealistically uses a labeled validation set much larger than the train set. Both cases make previously published rankings of methods difficult to translate to practical settings. This study contributes a systematic evaluation of self- and semi- methods with a unified experimental protocol intended to guide a practitioner with scarce overall labeled data and a limited compute budget. We answer two key questions: Can hyperparameter tuning be effective with realistic-sized validation sets? If so, when all methods are tuned well, which self- or semi-supervised methods reach the best accuracy? Our study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets. From 20000+ total GPU hours of computation, we provide valuable best practices to resource-constrained, results-focused practitioners.
Scientific discoveries are increasingly constrained by limited storage space and I/O capacities. For time-series simulations and experiments, their data often need to be decimated over timesteps to accommodate storage and I/O limitations. In this paper, we propose a technique that addresses storage costs while improving post-analysis accuracy through spatiotemporal adaptive, error-controlled lossy compression. We investigate the trade-off between data precision and temporal output rates, revealing that reducing data precision and increasing timestep frequency lead to more accurate analysis outcomes. Additionally, we integrate spatiotemporal feature detection with data compression and demonstrate that performing adaptive error-bounded compression in higher dimensional space enables greater compression ratios, leveraging the error propagation theory of a transformation-based compressor. To evaluate our approach, we conduct experiments using the well-known E3SM climate simulation code and apply our method to compress variables used for cyclone tracking. Our results show a significant reduction in storage size while enhancing the quality of cyclone tracking analysis, both quantitatively and qualitatively, in comparison to the prevalent timestep decimation approach. Compared to three state-of-the-art lossy compressors lacking feature preservation capabilities, our adaptive compression framework improves perfectly matched cases in TC tracking by 26.4-51.3% at medium compression ratios and by 77.3-571.1% at large compression ratios, with a merely 5-11% computational overhead.
Fiber-reinforced composites (FRC) provide structural systems with unique features that appeal to various civilian and military sectors. Often, one needs to modulate the temperature field to achieve the intended functionalities (e.g., self-healing) in these lightweight structures. Vascular-based active cooling offers one efficient way of thermal regulation in such material systems. However, the thermophysical properties (e.g., thermal conductivity, specific heat capacity) of FRC and their base constituents depend on temperature, and such structures are often subject to a broad spectrum of temperatures. Notably, prior active cooling modeling studies did not account for such temperature dependence. Thus, the primary aim of this paper is to reveal the effect of temperature-dependent material properties -- obtained via material characterization -- on the qualitative and quantitative behaviors of active cooling. By applying mathematical analysis and conducting numerical simulations, we show this dependence does not affect qualitative attributes, such as minimum and maximum principles (in the same spirit as \textsc{Hopf}'s results for elliptic partial differential equations). However, the dependence slightly affects quantitative results, such as the mean surface temperature and thermal efficiency. The import of our study is that it provides a deeper understanding of thermal regulation systems under practical scenarios and can guide researchers and practitioners in perfecting associated designs.
Histopathological image classification is an important task in medical image analysis. Recent approaches generally rely on weakly supervised learning due to the ease of acquiring case-level labels from pathology reports. However, patch-level classification is preferable in applications where only a limited number of cases are available or when local prediction accuracy is critical. On the other hand, acquiring extensive datasets with localized labels for training is not feasible. In this paper, we propose a semi-supervised patch-level histopathological image classification model, named CLASS-M, that does not require extensively labeled datasets. CLASS-M is formed by two main parts: a contrastive learning module that uses separated Hematoxylin and Eosin images generated through an adaptive stain separation process, and a module with pseudo-labels using MixUp. We compare our model with other state-of-the-art models on two clear cell renal cell carcinoma datasets. We demonstrate that our CLASS-M model has the best performance on both datasets. Our code is available at github.com/BzhangURU/Paper_CLASS-M/tree/main
Hashing has been widely used in approximate nearest search for large-scale database retrieval for its computation and storage efficiency. Deep hashing, which devises convolutional neural network architecture to exploit and extract the semantic information or feature of images, has received increasing attention recently. In this survey, several deep supervised hashing methods for image retrieval are evaluated and I conclude three main different directions for deep supervised hashing methods. Several comments are made at the end. Moreover, to break through the bottleneck of the existing hashing methods, I propose a Shadow Recurrent Hashing(SRH) method as a try. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function to encourage similar images projected close. To this end, I propose a concept: shadow of the CNN output. During optimization process, the CNN output and its shadow are guiding each other so as to achieve the optimal solution as much as possible. Several experiments on dataset CIFAR-10 show the satisfying performance of SRH.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.