How can citizens address hate in online discourse? We analyze a large corpus of more than 130,000 discussions on Twitter over four years. With the help of human annotators, language models, and machine learning classifiers, we identify different dimensions of discourse that might be related to the probability of hate speech in subsequent tweets. We use a matching approach and longitudinal statistical analyses to discern the effectiveness of different counter-speech strategies on the micro level (individual tweet pairs), meso level (discussion trees), and macro level (days) of discourse. We find that expressing simple opinions, not necessarily supported by facts but without insults, relates to the least hate in subsequent discussions. Sarcasm can be helpful as well, in particular in the presence of organized extreme groups. Mentioning either outgroups or ingroups is typically related to a deterioration of discourse. A pronounced emotional tone, whether negative (such as anger or fear) or positive (such as enthusiasm or pride), is also associated with worse discourse quality. We obtain similar results for other measures of discourse quality beyond hate speech, including toxicity, extremity of speech, and the presence of extreme speakers. Going beyond one-shot analyses of smaller samples of discourse, our findings have implications for the successful management of online commons through collective civic moderation.
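To illustrate the matching idea at the micro level, here is a minimal sketch: tweets that received a given counter-speech strategy are paired with similar tweets that did not, and the downstream hate rate is compared across the pairs. All column names and the covariate choices are invented for illustration; this is not the paper's exact procedure.

```python
# Hypothetical sketch of matched-pair comparison for a counter-speech strategy.
# Column names ("used_strategy", "subsequent_hate", ...) are illustrative.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def matched_effect(df: pd.DataFrame, covariates: list[str]) -> float:
    """Difference in downstream hate rate between treated and matched controls."""
    treated = df[df["used_strategy"] == 1]
    control = df[df["used_strategy"] == 0]
    # Match each treated tweet to its nearest control in covariate space
    nn = NearestNeighbors(n_neighbors=1).fit(control[covariates].to_numpy())
    _, idx = nn.kneighbors(treated[covariates].to_numpy())
    matched = control.iloc[idx.ravel()]
    return treated["subsequent_hate"].mean() - matched["subsequent_hate"].mean()

# Synthetic example data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "used_strategy": rng.integers(0, 2, 1000),
    "followers": rng.lognormal(5, 1, 1000),
    "prior_hate": rng.random(1000),
    "subsequent_hate": rng.random(1000),
})
print(matched_effect(df, ["followers", "prior_hate"]))
```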
Conventional neural network elastoplasticity models are often perceived as lacking interpretability. This paper introduces a two-step machine learning approach that returns mathematical models interpretable by human experts. In particular, we introduce a surrogate model in which yield surfaces are expressed in terms of a set of single-variable feature mappings obtained from supervised learning. A post-processing step then re-interprets the set of single-variable neural network mapping functions in mathematical form through symbolic regression. This divide-and-conquer approach provides several important advantages. First, it enables us to overcome the scaling issues of symbolic regression algorithms. Second, from a practical perspective, it enhances the portability of learned models for partial differential equation solvers written in different programming languages. Finally, it enables a concrete understanding of the attributes of the materials, such as convexity and symmetries of the models, through automated derivation and reasoning. Numerical examples are provided, along with open-source code, to enable third-party validation.
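A minimal sketch of the two-step idea on a toy single-variable mapping follows: (1) fit a one-input neural network via supervised learning, (2) re-interpret the learned mapping in closed form. A polynomial fit stands in here for a full symbolic-regression engine, and the data and network sizes are illustrative, not the paper's setup.

```python
# Step 1: supervised learning of a single-variable feature mapping.
# Step 2: recover an interpretable expression from the trained network.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 0.5 * x**2 + 0.1 * x + rng.normal(0, 0.01, x.size)  # hidden "true" mapping

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
net.fit(x.reshape(-1, 1), y)
y_net = net.predict(x.reshape(-1, 1))

# Polynomial fit as a stand-in for symbolic regression over the network output
coeffs = np.polynomial.polynomial.polyfit(x, y_net, deg=2)
print("recovered expression: %.3f + %.3f x + %.3f x^2" % tuple(coeffs))
```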
This work is motivated by a longitudinal data set on HIV CD4+ T cell counts from Livingstone district, Zambia. The corresponding histogram plots indicate a lack of symmetry in the marginal distributions, and the pairwise scatter plots show non-elliptical dependence patterns. The standard linear mixed model for longitudinal data fails to capture these features, so it seems appropriate to consider a more general framework for modeling such data. In this article, we consider generalized linear mixed models (GLMMs) for the marginals (e.g., a Gamma mixed model), and the temporal dependence of the repeated measurements is modeled by copulas corresponding to skew-elliptical distributions (such as skew-normal or skew-t). Our proposed class of copula-based mixed models simultaneously accounts for asymmetry, between-subject variability, and non-standard temporal dependence, and hence can be considered an extension of the standard linear mixed model based on multivariate normality. We estimate the model parameters using the inference functions for margins (IFM) method, and we describe how to obtain standard errors of the parameter estimates. We investigate the finite-sample performance of our procedure with extensive simulation studies involving skewed and symmetric marginal distributions and several choices of copula. We finally apply our models to the HIV data set and report the findings.
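A minimal sketch of the IFM idea on toy bivariate repeated measurements: fit the marginals first, then fit the dependence on the probability-transformed data. A Gaussian copula stands in for the skew-elliptical copulas of the paper, and the data are synthetic.

```python
# Two-stage IFM sketch: Gamma marginals + Gaussian copula (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
y = stats.gamma.ppf(stats.norm.cdf(z), a=2.0, scale=3.0)  # dependent Gammas

# Stage 1: estimate marginal parameters for each time point separately
margins = [stats.gamma.fit(y[:, t], floc=0) for t in range(2)]

# Stage 2: transform to uniforms with the fitted margins, estimate the copula
u = np.column_stack([stats.gamma.cdf(y[:, t], *margins[t]) for t in range(2)])
scores = stats.norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))
rho_hat = np.corrcoef(scores, rowvar=False)[0, 1]
print("fitted Gamma shapes:", [m[0] for m in margins], "copula rho:", rho_hat)
```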
Despite substantial efforts, neural network interpretability remains an elusive goal, with previous research failing to provide succinct explanations of most single neurons' impact on the network output. This limitation is due to the polysemantic nature of most neurons, whereby a given neuron is involved in multiple unrelated network states, complicating its interpretation. In this paper, we apply tools developed in neuroscience and information theory to propose both a novel practical approach to network interpretability and theoretical insights into polysemanticity and the density of codes. We infer levels of redundancy in the network's code by inspecting the eigenspectrum of the activations' covariance matrix. Furthermore, we show how random projections can reveal whether a network exhibits a smooth or non-differentiable code, and hence how interpretable the code is. The same framework explains the advantages of polysemantic neurons for learning performance and explains trends found in the recent results of Elhage et al. (2022). Our approach advances the pursuit of interpretability in neural networks, providing insights into their underlying structure and suggesting new avenues for circuit-level interpretability.
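As a hedged sketch of the covariance-eigenspectrum diagnostic: collect neuron activations over many inputs, eigendecompose their covariance, and summarize redundancy with an effective-dimensionality statistic. The participation ratio used below is our choice of summary, not necessarily the paper's exact measure, and the "activations" are synthetic.

```python
# Redundancy diagnostic from the eigenspectrum of the activation covariance.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic activations: 10000 inputs x 256 neurons with low-rank structure
latent = rng.normal(size=(10000, 8))
mixing = rng.normal(size=(8, 256))
acts = latent @ mixing + 0.1 * rng.normal(size=(10000, 256))

cov = np.cov(acts, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]          # descending eigenspectrum
participation_ratio = eigvals.sum() ** 2 / (eigvals**2).sum()
print("effective dimensionality ~", participation_ratio, "of", acts.shape[1])
```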
Single-chain Markov chain Monte Carlo simulates realizations from a Markov chain to estimate expectations with the empirical average. The single-chain simulation is generally of considerable length and forgoes many of the advantages of modern parallel computation. This paper constructs a novel many-short-chains Monte Carlo (MSC) estimator by averaging over multiple independent sums from Markov chains of a guaranteed short length. The computational advantage is that the independent Markov chain simulations are fast and may be run in parallel. The MSC estimator requires an importance sampling proposal and a drift condition on the Markov chain, but it does not require a convergence analysis of the Markov chain. A non-asymptotic error analysis is developed for the MSC estimator under both geometric and multiplicative drift conditions. Empirical performance is illustrated on an autoregressive process and on the Pólya-Gamma Gibbs sampler for Bayesian logistic regression to predict cardiovascular disease.
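Below is a simplified, schematic reading of the many-short-chains idea on an AR(1) process: run many independent short chains (vectorized here to mimic parallelism), weight the initial draws from a wide Gaussian proposal by importance sampling against the AR(1) stationary law, and average per-chain sums. This is our illustration of the concept, not the paper's exact estimator.

```python
# Many short chains + importance-sampled initial states on an AR(1) process.
import numpy as np

rng = np.random.default_rng(0)
rho, n_chains, length = 0.8, 10_000, 20
stat_sd = 1.0 / np.sqrt(1 - rho**2)              # stationary sd of the AR(1)

# Importance proposal for initial states: N(0, (2*stat_sd)^2)
x0 = rng.normal(0, 2 * stat_sd, n_chains)
log_w = -0.5 * (x0 / stat_sd) ** 2 + 0.5 * (x0 / (2 * stat_sd)) ** 2
w = np.exp(log_w - log_w.max())
w /= w.mean()                                    # self-normalized weights

# Simulate all short chains in parallel; accumulate per-chain sums of x^2
x, sums = x0.copy(), np.zeros(n_chains)
for _ in range(length):
    x = rho * x + rng.normal(0, 1, n_chains)
    sums += x**2
estimate = np.mean(w * sums / length)
print(estimate, "target E[X^2]:", 1 / (1 - rho**2))
```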
We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT features. After word segmentation in the second stage, we obtain an acoustic word embedding for each segment by averaging HuBERT features over the segment. These embeddings are clustered using K-means to obtain a lexicon. The result is good full-coverage segmentation with a lexicon that achieves state-of-the-art performance on the ZeroSpeech benchmarks.
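A minimal sketch of the lexicon-building step: average the frame-level features over each hypothesized word segment to get an acoustic word embedding, then cluster the embeddings with K-means. Random vectors stand in for HuBERT features, and the segment boundaries are assumed to come from the first-stage segmentation.

```python
# Acoustic word embeddings by segment-averaging, then K-means lexicon.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768))          # frames x dims (fake HuBERT)
boundaries = [(0, 40), (40, 95), (95, 160), (160, 230)]  # (start, end) frames

embeddings = np.stack([features[s:e].mean(axis=0) for s, e in boundaries])
lexicon = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print("word-type assignment per segment:", lexicon.labels_)
```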
Data generation remains a bottleneck in training surrogate models to predict molecular properties. We demonstrate that multitask Gaussian process regression overcomes this limitation by leveraging both expensive and cheap data sources. In particular, we consider training sets constructed from coupled-cluster (CC) and density functional theory (DFT) data. We report that multitask surrogates can predict at CC-level accuracy with a reduction in data generation cost of over an order of magnitude. Of note, our approach allows the training set to include DFT data generated by a heterogeneous mix of exchange-correlation functionals without imposing any artificial hierarchy on functional accuracy. More generally, the multitask framework can accommodate a wider range of training set structures -- including full disparity between the different levels of fidelity -- than existing kernel approaches based on $\Delta$-learning, though we show that the accuracy of the two approaches can be similar. Consequently, multitask regression can be a tool for reducing data generation costs even further by opportunistically exploiting existing data sources.
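A hedged sketch of multitask GP regression with an intrinsic coregionalization (ICM) kernel follows: a shared RBF kernel over inputs multiplied by a 2x2 task covariance coupling a cheap (DFT-like) and an expensive (CC-like) data source. The kernel choice, hyperparameters, and toy functions are all illustrative assumptions, not the paper's setup.

```python
# Multitask GP prediction of the expensive task from mostly cheap data.
import numpy as np

def rbf(a, b, ell=0.5):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                       # "expensive" truth (CC-like)
g = lambda x: np.sin(3 * x) + 0.3 * x             # biased cheap source (DFT-like)

x_cc, x_dft = rng.uniform(-1, 1, 5), rng.uniform(-1, 1, 40)
X = np.concatenate([x_cc, x_dft])
t = np.array([0] * 5 + [1] * 40)                  # task index per point
y = np.concatenate([f(x_cc), g(x_dft)])

B = np.array([[1.0, 0.9], [0.9, 1.0]])            # assumed task covariance
K = B[np.ix_(t, t)] * rbf(X, X) + 1e-6 * np.eye(len(X))

x_star = np.linspace(-1, 1, 5)
k_star = B[np.ix_(np.zeros(5, int), t)] * rbf(x_star, X)  # predict task 0 (CC)
mean = k_star @ np.linalg.solve(K, y)
print(np.c_[x_star, mean, f(x_star)])             # prediction vs truth
```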
Introduction: Heterogeneity in the progression of neurodegenerative diseases is one of the main challenges in developing effective therapies. With the increasing number of large clinical databases, disease progression models have led to a better understanding of this heterogeneity. Nevertheless, these diseases may have no clear onset, and the underlying biological processes may start before the first symptoms. Such an ill-defined disease reference time is an issue for current joint models, which have proven their effectiveness by combining longitudinal and survival data. Objective: In this work, we propose a joint non-linear mixed-effect model with a latent disease age, to overcome the need for a precise reference time. Method: To do so, we use an existing longitudinal model with a latent disease age as the longitudinal sub-model and associate it with a survival sub-model that estimates a Weibull distribution from the latent disease age. We then validate our model on different simulated scenarios. Finally, we benchmark our model against a state-of-the-art joint model and against reference survival and longitudinal models on simulated and real data in the context of Amyotrophic Lateral Sclerosis (ALS). Results: On real data, our model obtained significantly better results than the state-of-the-art joint model for absolute bias (4.21 (4.41) versus 4.24 (4.14); p-value = 1.4e-17) and for mean cumulative AUC for right-censored events (0.67 (0.07) versus 0.61 (0.09); p-value = 1.7e-03). Conclusion: We showed that our approach is better suited than the state of the art when the reference time is not reliable. This work opens up the prospect of designing predictive and personalized therapeutic strategies.
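To make the model structure concrete, here is a loose schematic of the joint-model link: each subject's latent disease age is an affine reparameterization of chronological time, the longitudinal marker is a function of that latent age, and the event-time distribution is a Weibull evaluated on the latent-age axis. All functional forms and parameter values below are invented for illustration.

```python
# Schematic of a latent-disease-age joint model (illustrative parameters).
import numpy as np

def latent_age(t, xi, tau):
    """Latent disease age: acceleration factor exp(xi), onset shift tau."""
    return np.exp(xi) * (t - tau)

def marker(psi):
    """Longitudinal sub-model: logistic progression on the latent age."""
    return 1.0 / (1.0 + np.exp(-0.8 * psi))

def weibull_survival(psi, k=2.0, lam=5.0):
    """Survival sub-model: Weibull survival evaluated on the latent age."""
    return np.exp(-np.clip(psi / lam, 0, None) ** k)

t = np.linspace(60, 80, 5)                        # chronological age (years)
psi = latent_age(t, xi=0.2, tau=68.0)             # subject-specific latent age
print(np.c_[t, psi, marker(psi), weibull_survival(psi)])
```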
Deep learning models have demonstrated promising results in treatment effect estimation (TEE). However, most of them overlook the variations in treatment outcomes among subgroups with distinct characteristics. This limitation hinders their ability to provide accurate estimates and treatment recommendations for specific subgroups. In this study, we introduce a novel neural network-based framework, named SubgroupTE, which incorporates subgroup identification into treatment effect estimation. SubgroupTE identifies diverse subgroups and simultaneously estimates treatment effects for each of them, improving estimation by accounting for the heterogeneity of treatment responses. Comparative experiments on synthetic data show that SubgroupTE outperforms existing models in treatment effect estimation. Furthermore, experiments on a real-world dataset related to opioid use disorder (OUD) demonstrate the potential of our approach to enhance personalized treatment recommendations for OUD patients.
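The following toy example conveys only the underlying idea of subgroup-aware effect estimation: partition patients by covariates, then estimate a separate treatment effect within each subgroup. This naive cluster-then-estimate baseline is not the SubgroupTE architecture, which learns subgroups and effects jointly; treatment here is randomized so a difference in means is valid.

```python
# Cluster-then-estimate baseline for subgroup treatment effects (synthetic).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
X[:, 0] += np.where(rng.random(2000) < 0.5, -3.0, 3.0)  # two latent subgroups
treat = rng.integers(0, 2, 2000)
true_effect = np.where(X[:, 0] > 0, 2.0, -1.0)    # effect differs by subgroup
y = X @ rng.normal(size=5) + treat * true_effect + rng.normal(0, 0.5, 2000)

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for g in np.unique(groups):
    m = groups == g
    ate = y[m & (treat == 1)].mean() - y[m & (treat == 0)].mean()
    print(f"subgroup {g}: estimated effect {ate:+.2f}")
```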
We provide a nonasymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm to a target measure in Wasserstein-2 distance without assuming log-concavity. Our analysis quantifies key theoretical properties of SGHMC as a sampler under local conditions, significantly improving on previous results. In particular, we prove that the Wasserstein-2 distance between the target and the law of SGHMC is uniformly controlled by the step size of the algorithm, thereby demonstrating that SGHMC can provide high-precision results uniformly in the number of iterations. The analysis also allows us to obtain nonasymptotic bounds for nonconvex optimization problems under local conditions and implies that SGHMC, when viewed as a nonconvex optimizer, converges to a global minimum with the best known rates. We apply our results to obtain nonasymptotic bounds for scalable Bayesian inference and nonasymptotic generalization bounds.
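For readers unfamiliar with the algorithm being analyzed, here is a compact sketch of a standard SGHMC update on a toy non-log-concave target, the double-well potential U(x) = (x^2 - 1)^2, with Gaussian noise injected into the gradient to mimic stochastic gradients. The step size, friction, and noise level are arbitrary choices, not values from the paper's analysis.

```python
# SGHMC on a double-well potential with a noisy gradient oracle.
import numpy as np

rng = np.random.default_rng(0)
grad_U = lambda x: 4 * x * (x**2 - 1)             # gradient of U(x)=(x^2-1)^2

eta, gamma, n_steps = 1e-2, 1.0, 50_000           # step size, friction
x, v = 0.0, 0.0
samples = np.empty(n_steps)
for i in range(n_steps):
    g = grad_U(x) + rng.normal(0, 0.5)            # noisy "stochastic" gradient
    v += -eta * gamma * v - eta * g + np.sqrt(2 * gamma * eta) * rng.normal()
    x += eta * v
    samples[i] = x
print("sample mean/var:", samples.mean(), samples.var())  # symmetric, bimodal
```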
Recent advances in 3D fully convolutional networks (FCNs) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafted features or class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that first uses a 3D FCN to roughly define a candidate region, which is then used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on a more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection of 150 CT scans acquired at a different hospital, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5% to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve significantly higher performance on small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN-based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download at https://github.com/holgerroth/3Dunet_abdomen_cascade.
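A schematic of the cascade's cropping step: the first stage's coarse mask defines a candidate bounding box (padded by a safety margin), the volume is cropped to that box, and only the cropped region is passed to the second-stage FCN. The FCNs themselves are stubbed out here; array shapes and the margin are illustrative.

```python
# Candidate-region cropping between the two FCN stages (shapes are fake).
import numpy as np

def candidate_crop(volume, coarse_mask, margin=8):
    """Bounding box of the coarse prediction, padded by a safety margin."""
    idx = np.argwhere(coarse_mask)
    lo = np.maximum(idx.min(axis=0) - margin, 0)
    hi = np.minimum(idx.max(axis=0) + margin + 1, volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]], (lo, hi)

ct = np.random.rand(128, 256, 256)                # fake CT volume (z, y, x)
coarse = np.zeros_like(ct, dtype=bool)
coarse[40:60, 100:140, 90:150] = True             # fake stage-1 organ mask
roi, (lo, hi) = candidate_crop(ct, coarse)
print("second-stage input:", roi.shape, "vs full volume:", ct.shape)
```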