Deep neural networks (DNNs) trained for image denoising are able to generate high-quality samples with score-based reverse diffusion algorithms. These impressive capabilities seem to imply an escape from the curse of dimensionality, but recent reports of memorization of the training set raise the question of whether these networks are learning the "true" continuous density of the data. Here, we show that two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, when the number of training images is large enough. In this regime of strong generalization, diffusion-generated images are distinct from the training set, and are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density. We analyze the learned denoising functions and show that the inductive biases give rise to a shrinkage operation in a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous regions. We demonstrate that trained denoisers are inductively biased towards these geometry-adaptive harmonic bases since they arise not only when the network is trained on photographic images, but also when it is trained on image classes supported on low-dimensional manifolds for which the harmonic basis is suboptimal. Finally, we show that when trained on regular image classes for which the optimal basis is known to be geometry-adaptive and harmonic, the denoising performance of the networks is near-optimal.
Probabilistic graphical models are widely used to model complex systems under uncertainty. Traditionally, Gaussian directed graphical models are applied for analysis of large networks with continuous variables as they can provide conditional and marginal distributions in closed form simplifying the inferential task. The Gaussianity and linearity assumptions are often adequate, yet can lead to poor performance when dealing with some practical applications. In this paper, we model each variable in graph G as a polynomial regression of its parents to capture complex relationships between individual variables and with a utility function of polynomial form. We develop a message-passing algorithm to propagate information throughout the network solely using moments which enables the expected utility scores to be calculated exactly. Our propagation method scales up well and enables to perform inference in terms of a finite number of expectations. We illustrate how the proposed methodology works with examples and in an application to decision problems in energy planning and for real-time clinical decision support.
This research conducts a thorough reevaluation of seismic fragility curves by utilizing ordinal regression models, moving away from the commonly used log-normal distribution function known for its simplicity. It explores the nuanced differences and interrelations among various ordinal regression approaches, including Cumulative, Sequential, and Adjacent Category models, alongside their enhanced versions that incorporate category-specific effects and variance heterogeneity. The study applies these methodologies to empirical bridge damage data from the 2008 Wenchuan earthquake, using both frequentist and Bayesian inference methods, and conducts model diagnostics using surrogate residuals. The analysis covers eleven models, from basic to those with heteroscedastic extensions and category-specific effects. Through rigorous leave-one-out cross-validation, the Sequential model with category-specific effects emerges as the most effective. The findings underscore a notable divergence in damage probability predictions between this model and conventional Cumulative probit models, advocating for a substantial transition towards more adaptable fragility curve modeling techniques that enhance the precision of seismic risk assessments. In conclusion, this research not only readdresses the challenge of fitting seismic fragility curves but also advances methodological standards and expands the scope of seismic fragility analysis. It advocates for ongoing innovation and critical reevaluation of conventional methods to advance the predictive accuracy and applicability of seismic fragility models within the performance-based earthquake engineering domain.
We present a new technique for visualizing high-dimensional data called cluster MDS (cl-MDS), which addresses a common difficulty of dimensionality reduction methods: preserving both local and global structures of the original sample in a single 2-dimensional visualization. Its algorithm combines the well-known multidimensional scaling (MDS) tool with the $k$-medoids data clustering technique, and enables hierarchical embedding, sparsification and estimation of 2-dimensional coordinates for additional points. While cl-MDS is a generally applicable tool, we also include specific recipes for atomic structure applications. We apply this method to non-linear data of increasing complexity where different layers of locality are relevant, showing a clear improvement in their retrieval and visualization quality.
Bipartite graphs are a prevalent modeling tool for real-world networks, capturing interactions between vertices of two different types. Within this framework, bicliques emerge as crucial structures when studying dense subgraphs: they are sets of vertices such that all vertices of the first type interact with all vertices of the second type. Therefore, they allow identifying groups of closely related vertices of the network, such as individuals with similar interests or webpages with similar contents. This article introduces a new algorithm designed for the exhaustive enumeration of maximal bicliques within a bipartite graph. This algorithm, called BBK for Bipartite Bron-Kerbosch, is a new extension to the bipartite case of the Bron-Kerbosch algorithm, which enumerates the maximal cliques in standard (non-bipartite) graphs. It is faster than the state-of-the-art algorithms and allows the enumeration on massive bipartite graphs that are not manageable with existing implementations. We analyze it theoretically to establish two complexity formulas: one as a function of the input and one as a function of the output characteristics of the algorithm. We also provide an open-access implementation of BBK in C++, which we use to experiment and validate its efficiency on massive real-world datasets and show that its execution time is shorter in practice than state-of-the art algorithms. These experiments also show that the order in which the vertices are processed, as well as the choice of one of the two types of vertices on which to initiate the enumeration have an impact on the computation time.
Statistical modeling of high-dimensional matrix-valued data motivates the use of a low-rank representation that simultaneously summarizes key characteristics of the data and enables dimension reduction. Low-rank representations commonly factor the original data into the product of orthonormal basis functions and weights, where each basis function represents an independent feature of the data. However, the basis functions in these factorizations are typically computed using algorithmic methods that cannot quantify uncertainty or account for basis function correlation structure a priori. While there exist Bayesian methods that allow for a common correlation structure across basis functions, empirical examples motivate the need for basis function-specific dependence structure. We propose a prior distribution for orthonormal matrices that can explicitly model basis function-specific structure. The prior is used within a general probabilistic model for singular value decomposition to conduct posterior inference on the basis functions while accounting for measurement error and fixed effects. We discuss how the prior specification can be used for various scenarios and demonstrate favorable model properties through synthetic data examples. Finally, we apply our method to two-meter air temperature data from the Pacific Northwest, enhancing our understanding of the Earth system's internal variability.
Graph representation learning (GRL) is critical for extracting insights from complex network structures, but it also raises security concerns due to potential privacy vulnerabilities in these representations. This paper investigates the structural vulnerabilities in graph neural models where sensitive topological information can be inferred through edge reconstruction attacks. Our research primarily addresses the theoretical underpinnings of similarity-based edge reconstruction attacks (SERA), furnishing a non-asymptotic analysis of their reconstruction capacities. Moreover, we present empirical corroboration indicating that such attacks can perfectly reconstruct sparse graphs as graph size increases. Conversely, we establish that sparsity is a critical factor for SERA's effectiveness, as demonstrated through analysis and experiments on (dense) stochastic block models. Finally, we explore the resilience of private graph representations produced via noisy aggregation (NAG) mechanism against SERA. Through theoretical analysis and empirical assessments, we affirm the mitigation of SERA using NAG . In parallel, we also empirically delineate instances wherein SERA demonstrates both efficacy and deficiency in its capacity to function as an instrument for elucidating the trade-off between privacy and utility.
Digital credentials represent a cornerstone of digital identity on the Internet. To achieve privacy, certain functionalities in credentials should be implemented. One is selective disclosure, which allows users to disclose only the claims or attributes they want. This paper presents a novel approach to selective disclosure that combines Merkle hash trees and Boneh-Lynn-Shacham (BLS) signatures. Combining these approaches, we achieve selective disclosure of claims in a single credential and creation of a verifiable presentation containing selectively disclosed claims from multiple credentials signed by different parties. Besides selective disclosure, we enable issuing credentials signed by multiple issuers using this approach.
Understanding how genetically encoded rules drive and guide complex neuronal growth processes is essential to comprehending the brain's architecture, and agent-based models (ABMs) offer a powerful simulation approach to further develop this understanding. However, accurately calibrating these models remains a challenge. Here, we present a novel application of Approximate Bayesian Computation (ABC) to address this issue. ABMs are based on parametrized stochastic rules that describe the time evolution of small components -- the so-called agents -- discretizing the system, leading to stochastic simulations that require appropriate treatment. Mathematically, the calibration defines a stochastic inverse problem. We propose to address it in a Bayesian setting using ABC. We facilitate the repeated comparison between data and simulations by quantifying the morphological information of single neurons with so-called morphometrics and resort to statistical distances to measure discrepancies between populations thereof. We conduct experiments on synthetic as well as experimental data. We find that ABC utilizing Sequential Monte Carlo sampling and the Wasserstein distance finds accurate posterior parameter distributions for representative ABMs. We further demonstrate that these ABMs capture specific features of pyramidal cells of the hippocampus (CA1). Overall, this work establishes a robust framework for calibrating agent-based neuronal growth models and opens the door for future investigations using Bayesian techniques for model building, verification, and adequacy assessment.
Predicting potential long-time contributors (LTCs) early allows project maintainers to effectively allocate resources and mentoring to enhance their development and retention. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predicting LTCs do not consider programming language skills. This paper reports an empirical study on the usage of knowledge units (KUs) of the Java programming language to predict LTCs. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We build a prediction model called KULTC, which leverages KU-based features along five different dimensions. We detect and analyze KUs from the studied 75 Java projects (353K commits and 168K pull requests) as well as 4,219 other Java projects in which the studied developers previously worked (1.7M commits). We compare the performance of KULTC with the state-of-the-art model, which we call BAOLTC. Even though KULTC focuses exclusively on the programming language perspective, KULTC achieves a median AUC of at least 0.75 and significantly outperforms BAOLTC. Combining the features of KULTC with the features of BAOLTC results in an enhanced model (KULTC+BAOLTC) that significantly outperforms BAOLTC with a normalized AUC improvement of 16.5%. Our feature importance analysis with SHAP reveals that developer expertise in the studied project is the most influential feature dimension for predicting LTCs. Finally, we develop a cost-effective model (KULTC_DEV_EXP+BAOLTC) that significantly outperforms BAOLTC. These encouraging results can be helpful to researchers who wish to further study the developers' engagement/retention to FLOSS projects or build models for predicting LTCs.
Mobile devices and the Internet of Things (IoT) devices nowadays generate a large amount of heterogeneous spatial-temporal data. It remains a challenging problem to model the spatial-temporal dynamics under privacy concern. Federated learning (FL) has been proposed as a framework to enable model training across distributed devices without sharing original data which reduce privacy concern. Personalized federated learning (PFL) methods further address data heterogenous problem. However, these methods don't consider natural spatial relations among nodes. For the sake of modeling spatial relations, Graph Neural Netowork (GNN) based FL approach have been proposed. But dynamic spatial-temporal relations among edge nodes are not taken into account. Several approaches model spatial-temporal dynamics in a centralized environment, while less effort has been made under federated setting. To overcome these challeges, we propose a novel Federated Adaptive Spatial-Temporal Attention (FedASTA) framework to model the dynamic spatial-temporal relations. On the client node, FedASTA extracts temporal relations and trend patterns from the decomposed terms of original time series. Then, on the server node, FedASTA utilize trend patterns from clients to construct adaptive temporal-spatial aware graph which captures dynamic correlation between clients. Besides, we design a masked spatial attention module with both static graph and constructed adaptive graph to model spatial dependencies among clients. Extensive experiments on five real-world public traffic flow datasets demonstrate that our method achieves state-of-art performance in federated scenario. In addition, the experiments made in centralized setting show the effectiveness of our novel adaptive graph construction approach compared with other popular dynamic spatial-temporal aware methods.