The world's digital information ecosystem continues to struggle with the spread of misinformation. Prior work has suggested that users who consistently disseminate a disproportionate amount of low-credibility content -- so-called superspreaders -- are at the center of this problem. We quantitatively confirm this hypothesis and introduce simple metrics to predict the top superspreaders several months into the future. We then conduct a qualitative review to characterize the most prolific superspreaders and analyze their sharing behaviors. Superspreaders include pundits with large followings, low-credibility media outlets, personal accounts affiliated with those media outlets, and a range of influencers. They are primarily political in nature and use more toxic language than the typical user sharing misinformation. We also find concerning evidence that suggests Twitter may be overlooking prominent superspreaders. We hope this work will further public understanding of bad actors and promote steps to mitigate their negative impacts on healthy digital discourse.
Conformal inference is a popular tool for constructing prediction intervals (PIs). We consider here the scenario of post-selection (selective) conformal inference, in which PIs are reported only for individuals selected from unlabeled test data. To account for multiplicity, we develop a general split conformal framework to construct selective PIs with false coverage-statement rate (FCR) control. We first investigate the FCR-adjusted method of Benjamini and Yekutieli (2005) in the present setting, and show that it achieves FCR control but yields uniformly inflated PIs. We then propose a novel solution to the problem, named Selective COnditional conformal Predictions (SCOP), which entails performing selection procedures on both the calibration set and the test set and constructing marginal conformal PIs on the selected sets with the aid of the conditional empirical distribution obtained from the calibration set. Under a unified framework and exchangeability assumptions, we show that SCOP can exactly control the FCR. More importantly, we provide non-asymptotic miscoverage bounds for a general class of selection procedures beyond exchangeability and discuss the conditions under which SCOP is able to control the FCR. As special cases, SCOP with quantile-based selection or conformal p-value-based multiple testing procedures enjoys a valid coverage guarantee under mild conditions. Numerical results confirm the effectiveness and robustness of SCOP in FCR control and show that it achieves narrower PIs than existing methods in many settings.
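For background, the standard (non-selective) split conformal interval that such frameworks build on can be sketched as follows. This is not SCOP itself: it omits the selection step and the conditional calibration described in the abstract. The function name, the absolute-residual score, and the toy numbers are assumptions made for illustration only.

```python
import math


def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Standard split conformal interval with absolute-residual scores.

    Hypothetical helper for illustration; it is not the SCOP procedure.
    """
    # Nonconformity scores on the held-out calibration set.
    scores = sorted(abs(y - p) for y, p in zip(cal_labels, cal_preds))
    n = len(scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (test_pred - q, test_pred + q)


# Toy calibration data: predictions of 0 and labels 1..9 give residuals 1..9.
cal_preds = [0] * 9
cal_labels = list(range(1, 10))
interval = split_conformal_interval(cal_preds, cal_labels, test_pred=5.0, alpha=0.2)
```

With nine calibration residuals and `alpha=0.2`, the quantile rank is 8, so the interval is the test prediction plus or minus the eighth-smallest residual.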
Since coral reef ecosystems face threats from human activities and climate change, coral conservation programs are implemented worldwide. Monitoring coral health provides references for guiding conservation activities. However, current labor-intensive methods result in a backlog of unsorted images, highlighting the need for automated classification. Few studies have simultaneously utilized accurate annotations along with updated algorithms and datasets. This study aimed to create a dataset representing common coral conditions and associated stressors in the Indo-Pacific. Concurrently, it assessed existing classification algorithms and proposed a new multi-label method for automatically detecting coral conditions and extracting ecological information. A dataset containing over 20,000 high-resolution coral images of different health conditions and stressors was constructed based on the field survey. Seven representative deep learning architectures were tested on this dataset, and their performance was quantitatively evaluated using the F1 metric and the match ratio. Based on this evaluation, a new method utilizing the ensemble learning approach was proposed. The proposed method accurately classified coral conditions as healthy, compromised, dead, and rubble; it also identified corresponding stressors, including competition, disease, predation, and physical issues. This method can help develop the coral image archive, guide conservation activities, and provide references for decision-making for reef managers and conservationists. The proposed ensemble learning approach outperforms others on the dataset, showing State-Of-The-Art (SOTA) performance. Future research should improve its generalizability and accuracy to support global coral conservation efforts.
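The two evaluation metrics named above can be sketched for multi-label predictions as follows. This assumes that "match ratio" means the exact-match ratio (the share of images whose predicted label set equals the true set) and that F1 is micro-averaged; the abstract does not specify the averaging, and the function names and toy labels are hypothetical.

```python
def match_ratio(true_sets, pred_sets):
    """Fraction of samples whose predicted label set matches the true set exactly."""
    matches = sum(t == p for t, p in zip(true_sets, pred_sets))
    return matches / len(true_sets)


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1: pool true/false positives and false negatives over all samples."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with three images (labels are illustrative only).
y_true = [{"healthy"}, {"compromised", "disease"}, {"dead"}]
y_pred = [{"healthy"}, {"compromised"}, {"dead", "predation"}]

ratio = match_ratio(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
```

In this toy case only the first image is an exact match, while three of the four true labels are recovered with one spurious label, which shows why the two metrics can diverge.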
There has been a growing interest in recent years in modelling multiple modalities (or views) of data to, for example, understand the relationship between modalities or to generate missing data. Multi-view autoencoders have gained significant traction for their adaptability and versatility in modelling multi-modal data, demonstrating an ability to tailor their approach to suit the characteristics of the data at hand. However, most multi-view autoencoders have inconsistent notation and are often implemented using different coding frameworks. To address this, we present a unified mathematical framework for multi-view autoencoders, consolidating their formulations. Moreover, we offer insights into the motivation and theoretical advantages of each model. To facilitate accessibility and practical use, we extend the documentation and functionality of the previously introduced \texttt{multi-view-AE} library. This library offers Python implementations of numerous multi-view autoencoder models, presented within a user-friendly framework. Through benchmarking experiments, we evaluate our implementations against previous ones, demonstrating comparable or superior performance. This work aims to establish a cohesive foundation for multi-modal modelling, serving as a valuable educational resource in the field.
Most businesses impose a supervisory hierarchy on employees to facilitate management, decision-making, and collaboration, yet routine inter-employee communication patterns within workplaces tend to emerge more naturally as a consequence of both supervisory relationships and the needs of the organization. What then is the relationship between a formal organizational structure and the emergent communications between its employees? Understanding the nature of this relationship is critical for the successful management of an organization. While scholars of organizational management have proposed theories relating organizational trees to communication dynamics, and separately, network scientists have studied the topological structure of communication patterns in different types of organizations, existing empirical analyses are both lacking in representativeness and limited in size. In fact, much of the methodology used to study the relationship between organizational hierarchy and communication patterns comes from analyses of the Enron email corpus, reflecting a uniquely dysfunctional corporate environment. In this paper, we develop new methodology for assessing the relationship between organizational hierarchy and communication dynamics and apply it to Microsoft Corporation, currently the highest valued company in the world, consisting of approximately 200,000 employees divided into 88 teams. This reveals distinct communication network structures within and between teams. We then characterize the relationship of routine employee communication patterns to these team supervisory hierarchies, while empirically evaluating several theories of organizational management and performance. To do so, we propose new measures of communication reciprocity and new shortest-path distances for trees to track the frequency of messages passed up, down, and across the organizational hierarchy.
In decision-making, maxitive functions are used for worst-case and best-case evaluations. Maxitivity gives rise to a rich structure that is well-studied in the context of the pointwise order. In this article, we investigate maxitivity with respect to general preorders and provide a representation theorem for such functionals. The results are illustrated for different stochastic orders in the literature, including the usual stochastic order, the increasing convex/concave order, and the dispersive order.
Charts, figures, and text derived from data play an important role in decision making, from data-driven policy development to day-to-day choices informed by online articles. Making sense of, or fact-checking, outputs means understanding how they relate to the underlying data. Even for domain experts with access to the source code and data sets, this poses a significant challenge. In this paper we introduce a new program analysis framework which supports interactive exploration of fine-grained I/O relationships directly through computed outputs, making use of dynamic dependence graphs. Our main contribution is a novel notion in data provenance which we call related inputs, a relation of mutual relevance or "cognacy" which arises between inputs when they contribute to common features of the output. Queries of this form allow readers to ask questions like "What outputs use this data element, and what other data elements are used along with it?". We show how Jonsson and Tarski's concept of conjugate operators on Boolean algebras appropriately characterises the notion of cognacy in a dependence graph, and give a procedure for computing related inputs over such a graph.
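The kind of query quoted above can be sketched on a flattened dependence relation. This is a simplification: it assumes the dynamic dependence graph has already been collapsed to a map from each output element to the set of input elements it depends on, and it does not reproduce the paper's procedure or its Boolean-algebra formulation. All names and the example data are hypothetical.

```python
def outputs_using(deps, selected_inputs):
    """Outputs that depend on at least one of the selected inputs."""
    return {out for out, ins in deps.items() if ins & selected_inputs}


def related_inputs(deps, selected_inputs):
    """Inputs that contribute to some output shared with the selected inputs."""
    related = set()
    for out in outputs_using(deps, selected_inputs):
        related |= deps[out]
    return related


# Hypothetical chart: two bars and a total, each computed from table rows.
deps = {
    "bar_2019": {"row1", "row2"},
    "bar_2020": {"row3"},
    "total": {"row1", "row2", "row3"},
}

used_by = outputs_using(deps, {"row3"})
cognate = related_inputs(deps, {"row3"})
```

Here `row3` feeds both `bar_2020` and `total`, and because `total` also consumes the other rows, all three rows come back as related to `row3`.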
Principal component analysis (PCA) is a simple and popular tool for processing high-dimensional data. We investigate its effectiveness for matrix denoising. We consider the setting in which the clean data are generated from a low-dimensional subspace but masked by independent high-dimensional sub-Gaussian noise with standard deviation $\sigma$. Under the low-rank assumption on the clean data with a mild spectral gap assumption, we prove that the distance between each PCA-denoised data point and the corresponding clean data point is uniformly bounded by $O(\sigma \log n)$. To illustrate the spectral gap assumption, we show it can be satisfied when the clean data are independently generated with a non-degenerate covariance matrix. We then provide a general lower bound for the error of the denoised data matrix, which indicates PCA denoising gives a uniform error bound that is rate-optimal. Furthermore, we examine how the error bound impacts downstream applications such as clustering and manifold learning. Numerical results validate our theoretical findings and reveal the importance of the uniform error.
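A minimal sketch of PCA denoising, read here as projecting the noisy data matrix onto its top singular subspace, is given below. It assumes no mean-centering and a known rank, and it uses Gaussian noise as one instance of sub-Gaussian noise; the function names, dimensions, and seed are illustrative choices rather than the paper's experimental setup.

```python
import numpy as np


def pca_denoise(noisy, rank):
    """Project the rows of `noisy` onto the span of its top-`rank` right singular vectors."""
    u, s, vt = np.linalg.svd(noisy, full_matrices=False)
    # Keep only the leading `rank` singular triplets (truncated SVD).
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def max_row_error(estimate, clean):
    """Largest Euclidean distance between a row of `estimate` and the matching clean row."""
    return float(np.max(np.linalg.norm(estimate - clean, axis=1)))


def make_example(n=200, d=100, rank=2, sigma=0.5, seed=0):
    """Low-rank clean data plus independent Gaussian noise (illustrative setup)."""
    rng = np.random.default_rng(seed)
    clean = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, d))
    noisy = clean + sigma * rng.normal(size=(n, d))
    return clean, noisy


clean, noisy = make_example()
denoised = pca_denoise(noisy, rank=2)
err_noisy = max_row_error(noisy, clean)
err_denoised = max_row_error(denoised, clean)
```

Comparing `err_denoised` with `err_noisy` gives a quick empirical look at the uniform (worst-row) error that the abstract's bound concerns.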
Preference-based subjective evaluation is a key method for evaluating generative media reliably. However, the huge number of pair combinations prohibits it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selection and allocation of evaluation volumes with online learning in a crowdsourcing environment. We use a preference-based online learning method based on a sorting algorithm to identify the total order of evaluation targets with minimum sample volumes. Our online learning algorithm supports parallel and asynchronous execution under the fixed-budget conditions required for crowdsourcing. Our experiment on preference-based subjective evaluation of synthetic speech shows that our method successfully optimizes the test by reducing pair combinations from 351 to 83 and allocating optimal evaluation volumes for each pair, ranging from 30 to 663, without compromising evaluation accuracy or wasting budget.
Network diffusion models are used to study phenomena such as disease transmission, information spread, and technology adoption. However, small amounts of mismeasurement are extremely likely in the networks constructed to operationalize these models. We show that estimates of diffusions are highly non-robust to this measurement error. First, we show that even when measurement error is vanishingly small, such that the share of missed links is close to zero, forecasts about the extent of diffusion will greatly underestimate the truth. Second, a small mismeasurement in the identity of the initial seed generates a large shift in the locations of the expected diffusion path. We show that both of these results still hold when the vanishing measurement error is only local in nature. Such non-robustness in forecasting exists even under conditions where the basic reproductive number is consistently estimable. Possible solutions, such as estimating the measurement error or implementing widespread detection efforts, still face difficulties because the number of missed links is so small. Finally, we conduct Monte Carlo simulations on simulated networks and on real networks from three settings: travel data from the COVID-19 pandemic in the western US, a mobile phone marketing campaign in rural India, and an insurance experiment in China.
Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.
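As background on the underlying idea, a classical single-source control variate for a Monte Carlo estimate can be sketched as follows. This is not the joint control variate proposed above, which also handles data subsampling; the integrand, the control function, and all names are assumptions chosen only to show how subtracting a correlated term with known mean reduces variance.

```python
import math
import random
import statistics


def control_variate_terms(xs, f, g, g_mean):
    """Return plain terms f(x) and adjusted terms f(x) - c * (g(x) - E[g]).

    The coefficient c = Cov(f, g) / Var(g) is estimated from the same samples,
    which introduces a small bias that vanishes as the sample size grows.
    """
    fx = [f(x) for x in xs]
    gx = [g(x) for x in xs]
    f_bar = statistics.fmean(fx)
    g_bar = statistics.fmean(gx)
    cov = sum((a - f_bar) * (b - g_bar) for a, b in zip(fx, gx)) / len(xs)
    var_g = sum((b - g_bar) ** 2 for b in gx) / len(xs)
    c = cov / var_g
    adjusted = [a - c * (b - g_mean) for a, b in zip(fx, gx)]
    return fx, adjusted


# Illustrative target: E[exp(X)] for X ~ Uniform(0, 1), whose true value is e - 1.
# Control variate: g(x) = x, with known mean 0.5.
rng = random.Random(0)
samples = [rng.random() for _ in range(10_000)]
plain, adjusted = control_variate_terms(samples, math.exp, lambda x: x, 0.5)

plain_var = statistics.pvariance(plain)
adjusted_var = statistics.pvariance(adjusted)
estimate = statistics.fmean(adjusted)
```

Because `exp(x)` and `x` are strongly correlated on the unit interval, the adjusted terms vary far less than the plain ones while averaging to nearly the same value.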