DNA strands serve as a storage medium for $4$-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replicability. However, information in this intriguing storage technology might be corrupted. Experiments have revealed that DNA sequences with long homopolymers and/or with low $GC$-content are notably more subject to errors upon storage. This paper investigates the utilization of the recently-introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage. This paper introduces DNA LOCO (D-LOCO) codes, over the alphabet $\{A,T,G,C\}$ with limited runs of identical symbols. These codes come with an encoding-decoding rule we derive, which provides affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balancing over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetecting errors. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates at moderate lengths.
The shuffle model of differential privacy (DP) has recently emerged as a powerful one for decentralized computation without fully trustable parties. Since it anonymizes and permutes messages from clients through a shuffler, the privacy can be amplified and utility can be improved. However, the shuffling procedure in turn restricts its applications only to statistical tasks that are permutation-invariant. This work explores the feasibility of shuffle privacy amplification for prevalent non-statistical computations: spatial crowdsourcing, combinatorial optimization, location-based social systems, and federated learning with incentives, which suffer either computationally intractability or intolerable utility loss in existing approaches (e.g., secure MPC and local DP). We proposes a new paradigm of shuffle model that can provide critical security functionalities like message authorization and result access control, meanwhile maintaining the most of privacy amplification effects. It incurs almost the same computation/communication costs as the non-private setting, and permits the server to run arbitrary algorithms on (noisy) client information in plaintext. Our novel technique is introducing statistically random identity into DP and force identical random distribution on all clients, so as to support secure functionalities even after message shuffling and to maintain privacy amplification simultaneously. Given that existing DP randomizers fails in the new shuffle model, we also propose a new mechanism and prove its optimality therein. Experimental results on spatial crowdsourcing, location-based social system, and federated learning with incentives, show that our paradigm and mechanism is fast as non-private settings, while reducing up to 90% error and increasing utility performance indicates by 100%-300% relatively, and can be practical under reasonable privacy budget.
Demographic biases in source datasets have been shown as one of the causes of unfairness and discrimination in the predictions of Machine Learning models. One of the most prominent types of demographic bias are statistical imbalances in the representation of demographic groups in the datasets. In this paper, we study the measurement of these biases by reviewing the existing metrics, including those that can be borrowed from other disciplines. We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics. To illustrate the utility of our framework, and to further understand the practical characteristics of the metrics, we conduct a case study of 20 datasets used in Facial Emotion Recognition (FER), analyzing the biases present in them. Our experimental results show that many metrics are redundant and that a reduced subset of metrics may be sufficient to measure the amount of demographic bias. The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models. The code is available at //github.com/irisdominguez/dataset_bias_metrics.
Large Language Models (LLMs) excel at providing information acquired during pretraining on large-scale corpora and following instructions through user prompts. This study investigates whether the quality of LLM responses varies depending on the demographic profile of users. Considering English as the global lingua franca, along with the diversity of its dialects among speakers of different native languages, we explore whether non-native English speakers receive lower-quality or even factually incorrect responses from LLMs more frequently. Our results show that performance discrepancies occur when LLMs are prompted by native versus non-native English speakers and persist when comparing native speakers from Western countries with others. Additionally, we find a strong anchoring effect when the model recognizes or is made aware of the user's nativeness, which further degrades the response quality when interacting with non-native speakers. Our analysis is based on a newly collected dataset with over 12,000 unique annotations from 124 annotators, including information on their native language and English proficiency.
In graph theory, the minimum directed feedback vertex set (FVS) problem consists in identifying the smallest subsets of vertices in a directed graph whose deletion renders the directed graph acyclic. Although being known as NP-hard since 1972, this problem can be solved in a reasonable time on small instances, or on instances having special combinatorial structure. In this paper we investigate graph reductions preserving all or some minimum FVS and focus on their properties, especially the Church-Rosser property, also called confluence. The Church-Rosser property implies the irrelevance of reduction order, leading to a unique directed graph. The study seeks the largest subset of reductions with the Church-Rosser property and explores the adaptability of reductions to meet this criterion. Addressing these questions is crucial, as it may impact algorithmic implications, allowing for parallelization and speeding up sequential algorithms.
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible at hugging face: //huggingface.co/arcee-ai/Llama-3-SEC-Base, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) constitutes an exciting opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of the MLLM to learn from information from various sets, 2) detect objects and identify elements in thermal cameras, 3) determine whether two independent modality images show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Similarly, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% and 81.48%, respectively. Moreover, the MAPE for bike, car, and motorcycle detection were 78.4%, 55.81%, and 96.15%, respectively. Gemini produced MAPE of 66.53%, 59.35% and 78.18% respectively. This finding further demonstrates that MLLM can identify thermal images and can be employed in advanced imaging automation technologies for ITS applications.
Missing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.
It is increasingly common to evaluate the same coreference resolution (CR) model on multiple datasets. Do these multi-dataset evaluations allow us to draw meaningful conclusions about model generalization? Or, do they rather reflect the idiosyncrasies of a particular experimental setup (e.g., the specific datasets used)? To study this, we view evaluation through the lens of measurement modeling, a framework commonly used in the social sciences for analyzing the validity of measurements. By taking this perspective, we show how multi-dataset evaluations risk conflating different factors concerning what, precisely, is being measured. This in turn makes it difficult to draw more generalizable conclusions from these evaluations. For instance, we show that across seven datasets, measurements intended to reflect CR model generalization are often correlated with differences in both how coreference is defined and how it is operationalized; this limits our ability to draw conclusions regarding the ability of CR models to generalize across any singular dimension. We believe the measurement modeling framework provides the needed vocabulary for discussing challenges surrounding what is actually being measured by CR evaluations.
The limited size of pain datasets are a challenge in developing robust deep learning models for pain recognition. Transfer learning approaches are often employed in these scenarios. In this study, we investigate whether deep learned feature representation for one type of experimentally induced pain can be transferred to another. Participating in the AI4Pain challenge, our goal is to classify three levels of pain (No-Pain, Low-Pain, High-Pain). The challenge dataset contains data collected from 65 participants undergoing varying intensities of electrical pain. We utilize the video recording from the dataset to investigate the transferability of deep learned heat pain model to electrical pain. In our proposed approach, we leverage an existing heat pain convolutional neural network (CNN) - trained on BioVid dataset - as a feature extractor. The images from the challenge dataset are inputted to the pre-trained heat pain CNN to obtain feature vectors. These feature vectors are used to train two machine learning models: a simple feed-forward neural network and a long short-term memory (LSTM) network. Our approach was tested using the dataset's predefined training, validation, and testing splits. Our models outperformed the baseline of the challenge on both the validation and tests sets, highlighting the potential of models trained on other pain datasets for reliable feature extraction.
The ability of CodeLLMs to generate executable and functionally correct code at the \textit{repository-level scale }remains largely unexplored. We introduce \methodnamews, a novel benchmark for evaluating code generation at the repository-level scale, emphasizing executability and correctness. \methodnamews provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuning models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. \methodnamews aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.