For codes equipped with metrics such as Hamming metric, symbol pair metric or cover metric, the Johnson bound guarantees list-decodability of such codes. That is, the Johnson bound provides a lower bound on the list-decoding radius of a code in terms of its relative minimum distance $\delta$, list size $L$ and the alphabet size $q.$ For study of list-decodability of codes with insertion and deletion errors (we call such codes insdel codes), it is natural to ask the open problem whether there is also a Johnson-type bound. The problem was first investigated by Wachter-Zeh and the result was amended by Hayashi and Yasunaga where a lower bound on the list-decodability for insdel codes was derived. The main purpose of this paper is to move a step further towards solving the above open problem. In this work, we provide a new lower bound for the list-decodability of an insdel code. As a consequence, we show that unlike the Johnson bound for codes under other metrics that is tight, the bound on list-decodability of insdel codes given by Hayashi and Yasunaga is not tight. Our main idea is to show that if an insdel code with a given Levenshtein distance $d$ is not list-decodable with list size $L$, then the list decoding radius is lower bounded by a bound involving $L$ and $d$. In other words, if the list decoding radius is less than this lower bound, the code must be list-decodable with list size $L$. At the end of the paper we use such bound to provide an insdel-list-decodability bound for various well-known codes, which has not been extensively studied before.
Our work addresses the critical issue of distinguishing text generated by Large Language Models (LLMs) from human-produced text, a task essential for numerous applications. Despite ongoing debate about the feasibility of such differentiation, we present evidence supporting its consistent achievability, except when human and machine text distributions are indistinguishable across their entire support. Drawing from information theory, we argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases. We establish precise sample complexity bounds for detecting AI-generated text, laying groundwork for future research aimed at developing advanced, multi-sample detectors. Our empirical evaluations across multiple datasets (Xsum, Squad, IMDb, and Kaggle FakeNews) confirm the viability of enhanced detection methods. We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including oBERTa-Large/Base-Detector, GPTZero. Our findings align with OpenAI's empirical data related to sequence length, marking the first theoretical substantiation for these observations.
Change blindness is a phenomenon where an individual fails to notice alterations in a visual scene when a change occurs during a brief interruption or distraction. Understanding this phenomenon is specifically important for the technique that uses a visual stimulus, such as Virtual Reality (VR) or Augmented Reality (AR). Previous research had primarily focused on 2D environments or conducted limited controlled experiments in 3D immersive environments. In this paper, we design and conduct two formal user experiments to investigate the effects of different visual attention-disrupting conditions (Flickering and Head-Turning) and object alternative conditions (Removal, Color Alteration, and Size Alteration) on change blindness detection in VR and AR environments. Our results reveal that participants detected changes more quickly and had a higher detection rate with Flickering compared to Head-Turning. Furthermore, they spent less time detecting changes when an object disappeared compared to changes in color or size. Additionally, we provide a comparison of the results between VR and AR environments.
Deep neural networks are over-parameterized and easily overfit the datasets they train on. In the extreme case, it has been shown that these networks can memorize a training set with fully randomized labels. We propose using the curvature of loss function around each training sample, averaged over training epochs, as a measure of memorization of the sample. We use this metric to study the generalization versus memorization properties of different samples in popular image datasets and show that it captures memorization statistics well, both qualitatively and quantitatively. We first show that the high curvature samples visually correspond to long-tailed, mislabeled, or conflicting samples, those that are most likely to be memorized. This analysis helps us find, to the best of our knowledge, a novel failure mode on the CIFAR100 and ImageNet datasets: that of duplicated images with differing labels. Quantitatively, we corroborate the validity of our scores via two methods. First, we validate our scores against an independent and comprehensively calculated baseline, by showing high cosine similarity with the memorization scores released by Feldman and Zhang (2020). Second, we inject corrupted samples which are memorized by the network, and show that these are learned with high curvature. To this end, we synthetically mislabel a random subset of the dataset. We overfit a network to it and show that sorting by curvature yields high AUROC values for identifying the corrupted samples. An added advantage of our method is that it is scalable, as it requires training only a single network as opposed to the thousands trained by the baseline, while capturing the aforementioned failure mode that the baseline fails to identify.
In exterior calculus on smooth manifolds, the exterior derivative and wedge product are natural with respect to smooth maps between manifolds, that is, these operations commute with pullback. In discrete exterior calculus (DEC), simplicial cochains play the role of discrete forms, the coboundary operator serves as the discrete exterior derivative, and the antisymmetrized cup product provides a discrete wedge product. In this paper we show that these discrete operations in DEC are natural with respect to abstract simplicial maps. A second contribution is a new combinatorial averaging interpretation of the discrete wedge product in DEC.
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e. black-box setting). A variety of methods have been proposed in literature for this task, but they have serious shortcomings such as a lack of realistic outputs and strong requirements for the data set and accessibility of the face recognition model. By analyzing the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity both qualitatively and quantitatively, and our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process.
We study universal traits which emerge both in real-world complex datasets, as well as in artificially generated ones. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) The power-law scalings that the bulk of its eigenvalues exhibit are vastly different for uncorrelated normally distributed data compared to real-world data, (ii) this scaling behavior can be completely modeled by generating gaussian data with long range correlations, (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems, (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior, (v) the Shannon entropy is correlated with local RMT structure and eigenvalues scaling, and substantially smaller in strongly correlated datasets compared to uncorrelated synthetic data, and requires fewer samples to reach the distribution entropy. These findings show that with sufficient sample size, the Gram matrix of natural image datasets can be well approximated by a Wishart random matrix with a simple covariance structure, opening the door to rigorous studies of neural network dynamics and generalization which rely on the data Gram matrix.
While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture and leverage the rich multimodal structures and dynamics from video and text? Or are they merely exploiting shortcuts to achieve high scores? Hence, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to critically analyze multimodal representations. QUAG facilitates combined dataset-model study by systematic ablation of model's coupled multimodal understanding during inference. Surprisingly, it demonstrates that the models manage to maintain high performance even under multimodal impairment. We extend QUAG to design "QUAG-attention", a simplistic and less-expressive replacement of self-attention. We find that the models with QUAG-attention achieve similar performance with significantly less mulops without any finetuning. These findings indicate that the current VideoQA benchmarks and metrics do not penalize models that find shortcuts and discount joint multimodal understanding. Motivated by this, we propose the $\textit{CLAVI}$ (Counterfactual in LAnguage and VIdeo), a diagnostic dataset for coupled multimodal understanding in VideoQA. CLAVI consists of temporal questions and videos that are augmented to curate balanced counterfactuals in language and video domains. We evaluate models on CLAVI and find that all models achieve high performance on multimodal shortcut instances, but most of them have poor performance on the counterfactual instances that necessitate joint multimodal understanding. Overall, with the multimodal representation analysis using QUAG and diagnostic analysis using CLAVI, we show that many VideoQA models are incapable of learning multimodal representations and that their success on standard datasets is an illusion of joint multimodal understanding.
The optimal branch number of MDS matrices has established their prominence in the design of diffusion layers for various block ciphers and hash functions. Consequently, several matrix structures have been proposed for designing MDS matrices, including Hadamard and circulant matrices. In this paper, we first provide the count of Hadamard MDS matrices of order $4$ over the field $\mathbb{F}_{2^r}$. Subsequently, we present the counts of order $2$ MDS matrices and order $2$ involutory MDS matrices over the field $\mathbb{F}_{2^r}$. Finally, leveraging these counts of order $2$ matrices, we derive an upper bound for the number of all involutory MDS matrices of order $4$ over $\mathbb{F}_{2^r}$.
The accurate representation and prediction of physical phenomena through numerical computer codes remains to be a vast and intricate interdisciplinary topic of research. Especially within the last decades, there has been a considerable push toward high performance numerical schemes to solve partial differential equations (PDEs) from the applied mathematics and numerics community. The resulting landscape of choices regarding numerical schemes for a given system of PDEs can thus easily appear daunting for an application expert that is familiar with the relevant physics, but not necessarily with the numerics. Bespoke high performance schemes in particular pose a substantial hurdle for domain scientists regarding their theory and implementation. Here, we propose a unifying scheme for grid based approximation methods to address this issue. We introduce some well defined restrictions to systematically guide an application expert through the process of classifying a given multiphysics problem, identifying suitable numerical schemes and implementing them. We introduce a fixed set of input parameters, amongst them for example the governing equations and the hardware configuration. This method not only helps to identify and assemble suitable schemes, but enables the unique combination of multiple methods on a per field basis. We exemplarily demonstrate this process and its effectiveness using different approaches and systematically show how one should exploit some given properties of a PDE problem to arrive at an efficient compound discretisation.
Current synthetic speech detection (SSD) methods perform well on certain datasets but still face issues of robustness and interpretability. A possible reason is that these methods do not analyze the deficiencies of synthetic speech. In this paper, the flaws of the speaker features inherent in the text-to-speech (TTS) process are analyzed. Differences in the temporal consistency of intra-utterance speaker features arise due to the lack of fine-grained control over speaker features in TTS. Since the speaker representations in TTS are based on speaker embeddings extracted by encoders, the distribution of inter-utterance speaker features differs between synthetic and bonafide speech. Based on these analyzes, an SSD method based on temporal consistency and distribution of speaker features is proposed. On one hand, modeling the temporal consistency of intra-utterance speaker features can aid speech anti-spoofing. On the other hand, distribution differences in inter-utterance speaker features can be utilized for SSD. The proposed method offers low computational complexity and performs well in both cross-dataset and silence trimming scenarios.