Analyses of voting algorithms often overlook informational externalities shaping individual votes. For example, pre-polling information often skews voters towards candidates who may not be their top choice, but who they believe would be a worthwhile recipient of their vote. In this work, we aim to understand the role of external information in voting outcomes. We study this by analyzing (1) the probability that voting outcomes align with external information, and (2) the effect of external information on the total utility across voters, or social welfare. In practice, voting mechanisms elicit coarse information about voter utilities, such as ordinal preferences, which initially prevents us from directly analyzing the effect of informational externalities with standard voting mechanisms. To overcome this, we present an intermediary mechanism for learning how preferences change with external information which does not require eliciting full cardinal preferences. With this tool in hand, we find that voting mechanisms are generally more likely to select the alternative most favored by the external information, and when external information reflects the population's true preferences, social welfare increases in expectation.
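The first quantity, the probability that the outcome aligns with the external information, can be illustrated with a toy Monte Carlo sketch. This is our own hypothetical model, not the paper's mechanism: the parameter `alpha` and the uniform-favourite assumption below are illustrative only. Each voter follows an external signal favouring alternative 0 with probability `alpha`, and otherwise votes for a uniformly random true favourite.

```python
import random

def plurality_alignment(n_voters=101, n_alts=3, alpha=0.5, trials=300, seed=0):
    """Fraction of trials in which the plurality winner is the externally
    favoured alternative (alternative 0, by convention in this sketch)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = [0] * n_alts
        for _ in range(n_voters):
            top = rng.randrange(n_alts)               # voter's true favourite
            vote = 0 if rng.random() < alpha else top  # follow signal w.p. alpha
            counts[vote] += 1
        # ties are broken towards the lowest index, i.e. the signalled alternative
        if max(range(n_alts), key=lambda a: counts[a]) == 0:
            hits += 1
    return hits / trials
```

With `alpha = 1` every vote follows the signal and the alignment probability is exactly 1; intermediate values interpolate between signal-driven and preference-driven outcomes.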
Random probabilities are a key component of many nonparametric methods in Statistics and Machine Learning. To quantify comparisons between different laws of random probabilities, several works have started to use the elegant Wasserstein over Wasserstein distance. In this paper we prove that the infinite dimensionality of the space of probabilities drastically deteriorates its sample complexity: the empirical distance converges more slowly than any polynomial rate in the sample size. We propose a new distance that preserves many desirable properties of the former while achieving a parametric rate of convergence. In particular, our distance 1) metrizes weak convergence; 2) can be estimated numerically from samples with low complexity; 3) can be bounded analytically from above and below. The main ingredients are integral probability metrics, which lead to the name hierarchical IPM.
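To see why the Wasserstein over Wasserstein distance is natural yet expensive to estimate, here is a minimal one-dimensional sketch (our own illustration, not the paper's estimator): each "observation" is itself a sample from a random probability, inner distances are 1-D Wasserstein-1 distances between empirical measures, and the outer Wasserstein distance over those is brute-forced as an assignment problem, which is feasible only for tiny collections.

```python
from itertools import permutations

def w1(x, y):
    """1-D Wasserstein-1 distance between equal-size empirical measures."""
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)

def wasserstein_over_wasserstein(ps, qs):
    """Outer W1 between two equal-size collections of inner samples,
    with inner W1 as ground cost; uniform weights make the outer problem
    an assignment, brute-forced here over all permutations."""
    cost = [[w1(p, q) for q in qs] for p in ps]
    n = len(ps)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return sum(cost[i][best[i]] for i in range(n)) / n
```

Even this toy version must compare every pair of inner samples, hinting at why the nested construction suffers from poor sample complexity.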
Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods encounter practical limitations due to their vulnerability to data variations and hyperparameter dependence. Here we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. The algorithm scales to thousands of features while requiring minimal hyperparameter tuning. Manually curated datasets for validating alignment algorithms are scarce in the field of untargeted metabolomics, so we develop a dataset split procedure that generates pairs of validation datasets to test the alignments produced by GromovMatcher and other methods. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
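The idea of matching features by their correlation structure can be miniaturized as follows. This is our own brute-force illustration of a Gromov-Wasserstein-style objective, searching for the feature permutation that best aligns two correlation matrices; the actual algorithm solves a relaxed optimal transport problem and scales to thousands of features.

```python
from itertools import permutations

def match_by_correlation_structure(c1, c2):
    """Brute-force the permutation aligning two feature-correlation
    matrices by minimizing sum_{i,k} |c1[i][k] - c2[perm(i)][perm(k)]|.
    Exponential in the number of features: tiny inputs only."""
    n = len(c1)
    def cost(perm):
        return sum(abs(c1[i][k] - c2[perm[i]][perm[k]])
                   for i in range(n) for k in range(n))
    return min(permutations(range(n)), key=cost)
```

Note that the objective compares pairwise relations within each dataset rather than features directly, which is what makes alignment possible without shared coordinates.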
We present a new technique for visualizing high-dimensional data called cluster MDS (cl-MDS), which addresses a common difficulty of dimensionality reduction methods: preserving both local and global structures of the original sample in a single 2-dimensional visualization. Its algorithm combines the well-known multidimensional scaling (MDS) tool with the $k$-medoids data clustering technique, and enables hierarchical embedding, sparsification and estimation of 2-dimensional coordinates for additional points. While cl-MDS is a generally applicable tool, we also include specific recipes for atomic-structure applications. We apply this method to non-linear data of increasing complexity where different layers of locality are relevant, showing a clear improvement in retrieval and visualization quality.
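A minimal sketch of the two-stage idea: cluster with a naive k-medoids, lay out the medoids globally with classical (Torgerson) MDS, then embed each cluster locally and anchor it at its medoid's global position. This is our own simplified reading, not the published algorithm, which additionally provides hierarchical embedding, sparsification, and out-of-sample estimation.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                 # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]          # top `dim` eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def k_medoids(D, k, iters=50, seed=0):
    """Naive PAM-style k-medoids on a distance matrix (distinct points)."""
    rng = np.random.default_rng(seed)
    med = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, med], axis=1)   # nearest-medoid assignment
        new = med.copy()
        for c in range(k):
            members = np.where(labels == c)[0]  # non-empty: contains medoid c
            new[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, med):
            break
        med = new
    return med, np.argmin(D[:, med], axis=1)

def cl_mds(D, k, dim=2):
    """Global MDS layout of medoids + per-cluster local MDS anchored on them."""
    med, labels = k_medoids(D, k)
    anchors = classical_mds(D[np.ix_(med, med)], dim)
    X = np.zeros((D.shape[0], dim))
    for c in range(k):
        members = np.where(labels == c)[0]
        local = classical_mds(D[np.ix_(members, members)], dim)
        pos = int(np.where(members == med[c])[0][0])
        X[members] = anchors[c] + (local - local[pos])  # centre on medoid
    return X
```

The anchoring step is what lets one embedding show both scales: inter-cluster geometry comes from the medoid layout, intra-cluster geometry from the local embeddings.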
Digital credentials represent a cornerstone of digital identity on the Internet. To preserve privacy, credentials should implement certain functionalities. One is selective disclosure, which allows users to disclose only the claims or attributes they want. This paper presents a novel approach to selective disclosure that combines Merkle hash trees and Boneh-Lynn-Shacham (BLS) signatures. With this combination, we achieve selective disclosure of claims in a single credential and creation of a verifiable presentation containing selectively disclosed claims from multiple credentials signed by different parties. Beyond selective disclosure, this approach also enables issuing credentials signed by multiple issuers.
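The Merkle-tree half of this construction can be sketched directly: hash each claim into a leaf, commit to the root, and disclose a claim by revealing only its sibling path. The BLS signature over the root and the multi-issuer aggregation are omitted here because they require a pairing library; the claim strings and domain-separation bytes below are illustrative assumptions.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hash(claim: str) -> bytes:
    return h(b"\x00" + claim.encode())          # domain-separate leaves

def build_tree(leaves):
    """Return the list of levels: level 0 = leaf hashes, last = [root]."""
    level = list(leaves)
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                       # duplicate last node if odd
            level = level + [level[-1]]
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, index):
    """Sibling path for the leaf at `index` (the selective-disclosure proof)."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2))  # (sibling, am-I-right-child)
        index //= 2
    return path

def verify(root, claim, path):
    node = leaf_hash(claim)
    for sib, is_right in path:
        node = h(b"\x01" + sib + node) if is_right else h(b"\x01" + node + sib)
    return node == root
```

A verifier holding only the (signed) root can check one disclosed claim without learning any of the others, which is exactly the selective-disclosure property.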
Data assimilation algorithms integrate prior information from numerical model simulations with observed data. Ensemble-based filters, regarded as state-of-the-art, are widely employed for large-scale estimation tasks in disciplines such as geoscience and meteorology. Although they cannot produce the true posterior distribution for nonlinear systems, their robustness and capacity for state tracking are noteworthy. In contrast, particle filters yield the correct distribution in the ensemble limit but require substantially larger ensemble sizes than ensemble-based filters to remain stable in higher-dimensional spaces. Achieving realistic uncertainty quantification requires moving beyond traditional Gaussian assumptions. One approach is to hybridise filters via tempering, harnessing the complementary strengths of different filters. We propose a new adaptive tempering method to tune the underlying schedule, aiming to systematically improve on previously achieved performance. Although promising numerical results for certain filter combinations on toy examples exist in the literature, tuning the hyperparameters remains a considerable challenge, and a deeper understanding of these interactions is crucial for practical applications.
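A common ingredient of adaptive tempering schemes is choosing the next tempering exponent so that the intermediate weights retain a target effective sample size (ESS). The sketch below is a generic ESS-based bisection, our illustration of that standard ingredient rather than the specific adaptive method proposed here.

```python
import math

def ess(log_w):
    """Effective sample size of weights proportional to exp(log_w)."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    return sum(w) ** 2 / sum(v * v for v in w)

def next_temperature(log_like, phi, target_frac=0.5):
    """Bisect for the largest step d <= 1 - phi such that the intermediate
    weights exp(d * log_like) keep ESS >= target_frac * N."""
    target = target_frac * len(log_like)
    lo, hi = 0.0, 1.0 - phi
    if ess([hi * l for l in log_like]) >= target:
        return 1.0                      # can jump straight to the posterior
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ess([mid * l for l in log_like]) >= target:
            lo = mid
        else:
            hi = mid
    return phi + lo
```

Repeating this step from `phi = 0` until it returns 1 yields a data-driven tempering schedule: many small steps when the likelihood is informative, few when it is not.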
Predicting potential long-time contributors (LTCs) early allows project maintainers to effectively allocate resources and mentoring to enhance their development and retention. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predicting LTCs do not consider programming language skills. This paper reports an empirical study on the usage of knowledge units (KUs) of the Java programming language to predict LTCs. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We build a prediction model called KULTC, which leverages KU-based features along five different dimensions. We detect and analyze KUs from the 75 studied Java projects (353K commits and 168K pull requests) as well as 4,219 other Java projects in which the studied developers previously worked (1.7M commits). We compare the performance of KULTC with the state-of-the-art model, which we call BAOLTC. Even though KULTC focuses exclusively on the programming language perspective, KULTC achieves a median AUC of at least 0.75 and significantly outperforms BAOLTC. Combining the features of KULTC with the features of BAOLTC results in an enhanced model (KULTC+BAOLTC) that significantly outperforms BAOLTC, with a normalized AUC improvement of 16.5%. Our feature importance analysis with SHAP reveals that developer expertise in the studied project is the most influential feature dimension for predicting LTCs. Finally, we develop a cost-effective model (KULTC_DEV_EXP+BAOLTC) that significantly outperforms BAOLTC. These encouraging results can be helpful to researchers who wish to further study developers' engagement and retention in FLOSS projects or build models for predicting LTCs.
Characteristic formulae give a complete logical description of the behaviour of processes modulo some chosen notion of behavioural semantics. They allow one to reduce equivalence or preorder checking to model checking, and are exactly the formulae in the modal logics characterizing classic behavioural equivalences and preorders for which model checking can be reduced to equivalence or preorder checking. This paper studies the complexity of determining whether a formula is characteristic for some finite, loop-free process in each of the logics providing modal characterizations of the simulation-based semantics in van Glabbeek's branching-time spectrum. Since characteristic formulae in each of those logics are exactly the consistent and prime ones, it presents complexity results for the satisfiability and primality problems, and investigates the boundary between modal logics for which those problems can be solved in polynomial time and those for which they become computationally hard. Amongst other contributions, this article also studies the complexity of constructing characteristic formulae in the modal logics characterizing simulation-based semantics, both when such formulae are presented in explicit form and via systems of equations.
We present a novel formal system for proving quantitative-leakage properties of programs. Based on a theory of Quantitative Information Flow (QIF) that models information leakage as a noisy communication channel, it uses "gain-functions" for the description and measurement of expected leaks. We use a small imperative programming language, augmented with leakage features, and with it express adversaries' activities in the style of, but more generally than, the Hoare triples or expectation transformers that traditionally express deterministic or probabilistic correctness but without information flow. The programs are annotated with "gain-expressions" that capture simple adversarial settings such as "Guess the secret in one try", but also much more general ones; and our formal syntax- and logic-based framework enables us to transform gain-expressions that apply after a program has finished into ones that apply, equivalently, before the program has begun. In that way we enable a formal proof-based reasoning system for QIF at the source level. We apply it to the programming language we have chosen, and demonstrate its effectiveness in a number of small but sometimes intricate situations.
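The channel-based quantities underlying gain-functions can be illustrated numerically. The sketch below uses the standard QIF definitions of prior and posterior g-vulnerability for a discrete channel (our own illustration, not the paper's source-level logic); with the identity gain function it reduces to Bayes vulnerability, i.e. "guess the secret in one try".

```python
def prior_g_vulnerability(prior, gains):
    """V_g(pi) = max_w sum_x pi_x * g(w, x); gains[w][x] = g(w, x)."""
    return max(sum(p * g for p, g in zip(prior, row)) for row in gains)

def posterior_g_vulnerability(prior, channel, gains):
    """V_g[pi > C] = sum_y max_w sum_x pi_x * C[x][y] * g(w, x),
    where channel[x][y] = P(observe y | secret x)."""
    total = 0.0
    for y in range(len(channel[0])):
        total += max(sum(p * channel[x][y] * row[x]
                         for x, p in enumerate(prior))
                     for row in gains)
    return total
```

The ratio of posterior to prior vulnerability is the multiplicative g-leakage of the channel: how much the adversary's expected gain improves after observing the output.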
This work presents several new results concerning the analysis of the convergence of binary, univariate, and linear subdivision schemes, all related to the {\it contractivity factor} of a convergent scheme. First, we prove that a convergent scheme cannot have a contractivity factor smaller than $\frac{1}{2}$. Since a lower contractivity factor implies faster convergence, schemes with contractivity factor $\frac{1}{2}$, such as those generating spline functions, have an optimal convergence rate. Additionally, we provide further insights and conditions for the convergence of linear schemes and demonstrate their applicability in an improved algorithm for determining the convergence of such subdivision schemes.
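Chaikin's corner-cutting scheme, which generates quadratic splines, illustrates the optimal contractivity factor $\frac{1}{2}$: one refinement step at most halves the maximal difference between consecutive control points, and the bound is attained. A small sketch (our illustration):

```python
def chaikin(points):
    """One refinement step of Chaikin's corner-cutting scheme."""
    refined = []
    for p, q in zip(points, points[1:]):
        refined.append(0.75 * p + 0.25 * q)
        refined.append(0.25 * p + 0.75 * q)
    return refined

def max_diff(points):
    """Maximal absolute difference of consecutive control points."""
    return max(abs(b - a) for a, b in zip(points, points[1:]))
```

Numerically, `max_diff` shrinks by a factor of at most 1/2 per step, matching the contractivity bound the abstract refers to.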
Mobile and Internet of Things (IoT) devices nowadays generate large amounts of heterogeneous spatial-temporal data, and modeling the spatial-temporal dynamics under privacy constraints remains a challenging problem. Federated learning (FL) has been proposed as a framework for training models across distributed devices without sharing original data, thereby reducing privacy concerns. Personalized federated learning (PFL) methods further address the data heterogeneity problem. However, these methods do not consider the natural spatial relations among nodes. To model spatial relations, Graph Neural Network (GNN) based FL approaches have been proposed, but they do not account for dynamic spatial-temporal relations among edge nodes. Several approaches model spatial-temporal dynamics in a centralized environment, while less effort has been made in the federated setting. To overcome these challenges, we propose a novel Federated Adaptive Spatial-Temporal Attention (FedASTA) framework to model dynamic spatial-temporal relations. On the client nodes, FedASTA extracts temporal relations and trend patterns from the decomposed terms of the original time series. On the server node, FedASTA then uses the clients' trend patterns to construct an adaptive spatial-temporal graph that captures dynamic correlations between clients. In addition, we design a masked spatial attention module that uses both a static graph and the constructed adaptive graph to model spatial dependencies among clients. Extensive experiments on five real-world public traffic flow datasets demonstrate that our method achieves state-of-the-art performance in the federated scenario. Experiments in the centralized setting further show the effectiveness of our adaptive graph construction approach compared with other popular dynamic spatial-temporal aware methods.
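The masked spatial attention idea can be sketched generically: compute attention scores between client embeddings, then mask entries not supported by the (static or adaptive) graph to `-inf` before the softmax. The code below is our schematic illustration, not the FedASTA module; the shapes and the self-loop assumption are ours.

```python
import numpy as np

def masked_spatial_attention(x, adj):
    """Dot-product attention over client embeddings x (n, d); scores of
    pairs absent from adjacency matrix adj (n, n) are masked to -inf.
    Assumes adj has self-loops so every row keeps at least one entry."""
    n, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)
    scores = np.where(adj > 0, scores, -np.inf)    # graph-based mask
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x, w
```

Masked entries receive exactly zero attention weight, so each client only aggregates information from its graph neighbours.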