Exponential families are statistical models that are workhorses in statistics, information theory, and machine learning. An exponential family can be normalized either subtractively by its cumulant function or, equivalently, divisively by its partition function. Both the subtractive and divisive normalizers are strictly convex and smooth functions inducing pairs of Bregman and Jensen divergences. It is well known that skewed Bhattacharyya distances between probability densities of an exponential family amount to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and that in limit cases the sided Kullback-Leibler divergences amount to reverse-sided Bregman divergences. In this note, we first show that the $\alpha$-divergences between unnormalized densities of an exponential family amount to scaled $\alpha$-skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetic means allows one to deform convex functions and define dually flat spaces with corresponding divergences when ordinary convexity is preserved.
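For reference, the standard identities alluded to above can be stated as follows for an exponential family with densities $p_\theta(x) = \exp\bigl(\langle \theta, t(x)\rangle - F(\theta)\bigr)\,h(x)$, cumulant function $F$, and natural parameters $\theta$ (the notation here is ours). The $\alpha$-skewed Bhattacharyya distance equals the $\alpha$-skewed Jensen divergence induced by $F$:
\[
D_{B,\alpha}(p_{\theta_1} : p_{\theta_2})
  := -\log \int p_{\theta_1}^{\alpha}(x)\, p_{\theta_2}^{1-\alpha}(x)\, \mathrm{d}\mu(x)
  = J_{F,\alpha}(\theta_1 : \theta_2)
  := \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F\bigl(\alpha \theta_1 + (1-\alpha)\theta_2\bigr),
\]
and in the limit cases one recovers the sided Kullback--Leibler divergences as reverse-sided Bregman divergences, e.g.
\[
\mathrm{KL}(p_{\theta_1} : p_{\theta_2})
  = B_F(\theta_2 : \theta_1)
  = F(\theta_2) - F(\theta_1) - \langle \theta_2 - \theta_1, \nabla F(\theta_1) \rangle
  = \lim_{\alpha \to 1} \tfrac{1}{1-\alpha}\, J_{F,\alpha}(\theta_1 : \theta_2).
\]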
A fundamental theme in automata theory is regular languages of words and trees, and their many equivalent definitions. Salvati has proposed a generalization to regular languages of simply typed $\lambda$-terms, defined using denotational semantics in finite sets. We provide here some evidence for its robustness. First, we give an equivalent syntactic characterization that naturally extends the seminal work of Hillebrand and Kanellakis connecting regular languages of words and syntactic $\lambda$-definability. Second, we show that any finitary extensional model of the simply typed $\lambda$-calculus, when used in Salvati's definition, recognizes exactly the same class of languages of $\lambda$-terms as the category of finite sets does. The proofs of these two results rely on logical relations and can be seen as instances of a more general construction of a categorical nature, inspired by previous categorical accounts of logical relations using the gluing construction.
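As a toy illustration of the kind of finite-set semantics underlying Salvati's definition (and not his construction itself), the following Python sketch interprets the base type as a chosen finite set and an arrow type as the finite set of all functions between the interpretations; a closed term then evaluates to an element of its type's interpretation, and a language of terms can be carved out by an accepting subset. All encodings below are illustrative.

```python
from itertools import product

def denot_type(ty, base):
    """Return the finite set (as a list) interpreting a simple type.
    Types are 'o' or ('->', A, B)."""
    if ty == 'o':
        return list(base)
    _, a, b = ty
    dom, cod = denot_type(a, base), denot_type(b, base)
    # every function dom -> cod, encoded as a tuple of (repr(arg), value) pairs
    return [tuple(zip(map(repr, dom), values))
            for values in product(cod, repeat=len(dom))]

def apply_fn(f, x):
    """Apply an encoded function to an argument."""
    return dict(f)[repr(x)]

def denot_term(term, env, base):
    """Evaluate a term: ('var', name) | ('lam', name, arg_type, body) | ('app', t, u)."""
    tag = term[0]
    if tag == 'var':
        return env[term[1]]
    if tag == 'app':
        return apply_fn(denot_term(term[1], env, base),
                        denot_term(term[2], env, base))
    _, name, ty, body = term  # lambda: build the graph of the induced function
    return tuple((repr(v), denot_term(body, {**env, name: v}, base))
                 for v in denot_type(ty, base))

# Example: the Church numeral 2 = \f. \x. f (f x), of type (o -> o) -> o -> o,
# evaluated over the two-element base set {0, 1}.
two = ('lam', 'f', ('->', 'o', 'o'),
       ('lam', 'x', 'o',
        ('app', ('var', 'f'), ('app', ('var', 'f'), ('var', 'x')))))
print(denot_term(two, {}, base=[0, 1]))
```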
Speech technologies rely on capturing a speaker's voice variability while obtaining comprehensive language information. Textual prompts and sentence selection methods have been proposed in the literature to compile such adequate phonetic data, referred to as a phonetically rich \textit{corpus}. However, they are still insufficient for acoustic modeling, which is especially critical for languages with limited resources. Hence, this paper proposes a novel approach and outlines the methodological aspects required to create a \textit{corpus} with broad phonetic coverage for a low-resourced language, Brazilian Portuguese. Our methodology spans text dataset collection through a sentence selection algorithm based on triphone distribution. Furthermore, we propose a new phonemic classification according to acoustic-articulatory speech features, since the absolute number of distinct triphones, or of low-probability triphones, does not guarantee an adequate representation of every possible combination. Using our algorithm, we achieve a 55.8\% higher percentage of distinct triphones -- for samples of similar size -- relative to a non-phonetically rich dataset, whereas the currently available phonetically rich corpora, CETUC and TTS-Portuguese, achieve only 12.6\% and 12.3\%, respectively, in the same comparison.
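As a sketch of how a triphone-distribution-based selection step might look in Python (the greedy criterion below, maximizing newly covered distinct triphones and breaking ties by triphone rarity, is an illustrative assumption rather than the paper's exact algorithm):

```python
from collections import Counter

def triphones(phonemes):
    """All overlapping triphones of a phoneme sequence."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

def select_sentences(transcribed, budget):
    """Greedily pick up to `budget` sentences maximizing distinct-triphone coverage.
    `transcribed` maps a sentence id to its phoneme sequence (a list of strings)."""
    corpus_counts = Counter(t for ph in transcribed.values() for t in triphones(ph))
    covered, selected = set(), []
    remaining = dict(transcribed)
    for _ in range(min(budget, len(remaining))):
        def gain(item):
            _, ph = item
            new = set(triphones(ph)) - covered
            # prefer many new triphones; break ties by the rarity of those triphones
            rarity = sum(1.0 / corpus_counts[t] for t in new)
            return (len(new), rarity)
        best_id, best_ph = max(remaining.items(), key=gain)
        selected.append(best_id)
        covered |= set(triphones(best_ph))
        del remaining[best_id]
    return selected, covered
```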
A deep generative model yields an implicit estimator for the unknown distribution or density function of the observation. This paper investigates some statistical properties of the implicit density estimator pursued by VAE-type methods within a nonparametric density estimation framework. More specifically, we obtain convergence rates of the VAE-type density estimator under the assumption that the underlying true density function belongs to a locally H\"{o}lder class. Remarkably, a near minimax optimal rate with respect to the Hellinger metric can be achieved by the simplest network architecture, a shallow generative model with a one-dimensional latent variable.
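As a minimal numerical sketch of the kind of estimator discussed above, the following defines an implicit density by pushing a one-dimensional latent Gaussian through a shallow (one-hidden-layer) generator and adding Gaussian observation noise; the layer sizes and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, data_dim, sigma = 32, 2, 0.1          # illustrative sizes and noise level
W1, b1 = rng.normal(size=(hidden, 1)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(data_dim, hidden)), rng.normal(size=data_dim)

def generator(z):
    """Shallow (one-hidden-layer) generator with a one-dimensional latent input z."""
    return W2 @ np.maximum(W1 @ z + b1, 0.0) + b2

def sample(n):
    """Draw from the implicit model: z ~ N(0, 1), x = g(z) + N(0, sigma^2 I)."""
    zs = rng.normal(size=(n, 1))
    return np.array([generator(z) + sigma * rng.normal(size=data_dim) for z in zs])

X = sample(1000)  # samples whose law is the implicit density defined by the generator
```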
Deep denoisers have shown excellent performance in solving inverse problems in signal and image processing. In order to guarantee convergence, the denoiser needs to satisfy Lipschitz-type conditions such as non-expansiveness. However, enforcing such constraints inevitably compromises recovery performance. This paper introduces a novel training strategy that enforces a weaker constraint on the deep denoiser, called pseudo-contractiveness. By studying the spectrum of the Jacobian matrix, relationships between different denoiser assumptions are revealed. Effective algorithms based on gradient descent and the Ishikawa process are derived, and the stronger assumption of strict pseudo-contractiveness yields efficient algorithms based on half-quadratic splitting and forward-backward splitting. The proposed algorithms are proven to converge strongly to a fixed point. A training strategy based on holomorphic transformation and functional calculi is proposed to enforce the pseudo-contractiveness assumption. Extensive experiments demonstrate the superior performance of the pseudo-contractive denoiser compared to related denoisers, and the proposed methods are competitive in terms of both visual quality and quantitative metrics.
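The following is a minimal Python sketch of an Ishikawa-type fixed-point iteration built around a plug-and-play operator; the specific operator T below, a gradient step on the data-fidelity term followed by a stand-in soft-thresholding denoiser, is only one plausible instantiation and not necessarily the paper's exact scheme.

```python
import numpy as np

def ishikawa(T, x0, alphas, betas, iters=100):
    """Ishikawa iteration: x_{k+1} = (1 - a_k) x_k + a_k T((1 - b_k) x_k + b_k T(x_k))."""
    x = x0
    for k in range(iters):
        a, b = alphas(k), betas(k)
        y = (1 - b) * x + b * T(x)
        x = (1 - a) * x + a * T(y)
    return x

# Example: denoising-regularized least squares for y = A x + noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 50))
x_true = np.zeros(50)
x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.normal(size=30)

step = 1.0 / np.linalg.norm(A, 2) ** 2                               # gradient step size
denoise = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0)   # stand-in denoiser
T = lambda x: denoise(x - step * A.T @ (A @ x - y))                  # plug-and-play operator

x_hat = ishikawa(T, np.zeros(50), alphas=lambda k: 0.5, betas=lambda k: 0.5)
```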
The logistic regression model is one of the most popular data generation models for noisy binary classification problems. In this work, we study the sample complexity of estimating the parameters of the logistic regression model up to a given $\ell_2$ error, in terms of the dimension and the inverse temperature, with standard normal covariates. The inverse temperature controls the signal-to-noise ratio of the data generation process. While both generalization bounds and the asymptotic performance of the maximum-likelihood estimator for logistic regression are well studied, a non-asymptotic sample complexity that shows the dependence on the error and the inverse temperature for parameter estimation is absent from previous analyses. We show that the sample complexity curve has two change points in terms of the inverse temperature, clearly separating the low, moderate, and high temperature regimes.
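For concreteness, the data generation model studied here can be written as follows (the notation, including $\beta$ for the inverse temperature and $\theta^\star$ for the true parameter, is ours):
\[
x_i \sim \mathcal{N}(0, I_d), \qquad
\Pr(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\bigl(-\beta \langle \theta^\star, x_i \rangle\bigr)},
\]
and the sample complexity question asks how many i.i.d. pairs $(x_i, y_i)$ suffice to produce an estimate $\hat{\theta}$ with $\|\hat{\theta} - \theta^\star\|_2 \le \epsilon$.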
Objective: Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and they are often hailed as the poster children for personalized, data-driven healthcare. Many prediction models are deployed for decision support based on their prediction accuracy in validation studies. We investigate whether this is a safe and valid approach. Materials and Methods: We show that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients, but the worse outcomes of these patients do not invalidate the predictive power of the model. Results: Our main result is a formal characterization of a set of such prediction models. Next, we show that models that are well calibrated before and after deployment are useless for decision making, as their deployment induced no change in the data distribution. Discussion: Our results point to the need to revise standard practices for the validation, deployment, and evaluation of prediction models used in medical decisions. Conclusion: Outcome prediction models can yield harmful self-fulfilling prophecies when used for decision making; a new perspective on prediction model development, deployment, and monitoring is needed.
Block Principal Component Analysis (Block PCA) of a data matrix A, in which the loadings Z are determined by maximizing $\|AZ\|^2$ over unit-norm orthogonal loadings, is difficult to use for the design of sparse PCA by $\ell_1$ regularization, due to the difficulty of handling both the orthogonality constraint on the loadings and the non-differentiable $\ell_1$ penalty. Our objective in this paper is to relax the orthogonality constraint on the loadings by introducing new objective functions expvar(Y) that measure the part of the variance of the data matrix A explained by correlated components Y = AZ. We first propose a comprehensive study of the mathematical and numerical properties of expvar(Y) for two existing definitions [Zou et al., 2006; Shen and Huang, 2008] and four new definitions. We then show that only two of these explained variance definitions are fit for use as objective functions in block PCA formulations for A that are rid of orthogonality constraints.
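For orientation, the block formulation and one of the two existing explained-variance definitions mentioned above can be written, in our notation (A of size $n \times p$, Z of size $p \times k$), as
\[
\max_{Z \in \mathbb{R}^{p \times k},\; Z^{\top} Z = I_k} \|AZ\|_F^2, \qquad Y = AZ,
\]
and, for possibly correlated components Y, the adjusted variance of Zou et al. [2006] reads
\[
\mathrm{expvar}(Y) = \sum_{j=1}^{k} R_{jj}^2, \qquad \text{where } Y = QR \text{ is a QR decomposition of } Y.
\]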
Graph representation learning (GRL) is critical for extracting insights from complex network structures, but it also raises security concerns due to potential privacy vulnerabilities in the learned representations. This paper investigates structural vulnerabilities in graph neural models whereby sensitive topological information can be inferred through edge reconstruction attacks. Our research primarily addresses the theoretical underpinnings of cosine-similarity-based edge reconstruction attacks (COSERA), providing theoretical and empirical evidence that such attacks can perfectly reconstruct sparse Erdős-Rényi graphs with independent random features as the graph size increases. Conversely, we establish that sparsity is a critical factor for COSERA's effectiveness, as demonstrated through analysis and experiments on stochastic block models. Finally, we explore the resilience of (provably) private graph representations produced via the noisy aggregation (NAG) mechanism against COSERA. We empirically delineate instances in which COSERA demonstrates both efficacy and deficiency in its capacity to serve as an instrument for elucidating the trade-off between privacy and utility.
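A minimal Python sketch of a cosine-similarity edge reconstruction attack of the kind analyzed above: an edge is predicted between two nodes whenever the cosine similarity of their representations exceeds a threshold. The threshold value and the evaluation metric below are illustrative assumptions.

```python
import numpy as np

def reconstruct_edges(embeddings, threshold=0.5):
    """embeddings: (n, d) array of node representations -> predicted adjacency matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sims = unit @ unit.T                      # pairwise cosine similarities
    pred = (sims >= threshold).astype(int)
    np.fill_diagonal(pred, 0)                 # ignore self-loops
    return pred

def edge_f1(adj_true, adj_pred):
    """F1 score over the upper triangle, a simple reconstruction-quality metric."""
    iu = np.triu_indices_from(adj_true, k=1)
    t, p = adj_true[iu], adj_pred[iu]
    tp = np.sum((t == 1) & (p == 1))
    prec = tp / max(np.sum(p == 1), 1)
    rec = tp / max(np.sum(t == 1), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)
```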
Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced data sets. Asymptotically, we prove that SMOTE (with its default parameter) regenerates the original distribution by simply copying the original minority samples. We also prove that the SMOTE density vanishes near the boundary of the support of the minority distribution, thereby justifying the common Borderline SMOTE strategy. We then introduce two new SMOTE-related strategies and compare them with state-of-the-art rebalancing procedures. We show that rebalancing strategies are only required when the data set is highly imbalanced; for such data sets, SMOTE, our proposals, or undersampling procedures are the best strategies.
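For reference, a minimal Python sketch of the classical SMOTE interpolation step (the neighborhood size k=5 below is the usual default, assumed here): each synthetic sample is drawn uniformly on the segment between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """X_min: (n, d) minority samples -> (n_new, d) synthetic minority samples."""
    rng = rng or np.random.default_rng(0)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise squared distances within the minority class
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]        # k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nn[i, rng.integers(k)]
        lam = rng.uniform()                   # uniform point on the segment
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```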
This paper investigates the language of propaganda and its stylistic features. It presents the PPN dataset, standing for Propagandist Pseudo-News, a multisource, multilingual, multimodal dataset composed of news articles extracted from websites identified as propaganda sources by expert agencies. A limited sample from this set was randomly mixed with articles from the regular French press, with their URLs masked, to conduct a human annotation experiment using 11 distinct labels. The results show that human annotators were able to reliably discriminate between the two types of press on each of the labels. We propose different NLP techniques to identify the cues used by the annotators and to compare them with machine classification. They include the VAGO analyzer, which measures discourse vagueness and subjectivity, a TF-IDF model serving as a baseline, and four different classifiers: two RoBERTa-based models, CATS using syntax, and an XGBoost model combining syntactic and semantic features. Keywords: Propaganda, Fake News, Explainability, AI alignment, Vagueness, Subjectivity, Exaggeration, Stylistic analysis
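As a sketch of a TF-IDF baseline of the kind mentioned above, using scikit-learn; the train/test split and the logistic regression classifier are illustrative assumptions, not the paper's exact experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def tfidf_baseline(texts, labels):
    """texts: list of articles; labels: 1 = propagandist source, 0 = regular press."""
    X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=0)
    vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_tr), y_tr)           # fit on TF-IDF features of the training set
    y_pred = clf.predict(vec.transform(X_te))        # predict labels for held-out articles
    return classification_report(y_te, y_pred)
```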