Multivariate count data with many zeros frequently occur in a variety of application areas, such as text mining with document-term matrices and cluster analysis with microbiome abundance data. Exponential family PCA (Collins et al., 2001) is a widely used dimension-reduction tool for understanding and capturing the underlying low-rank structure of count data. It produces principal component scores by fitting Poisson regression models with estimated loadings as covariates. For sparse count data, this tends to produce extreme scores that deviate significantly from the true scores. We consider two major sources of bias in this estimation procedure and propose ways to reduce their effects. First, the discrepancy between true loadings and their estimates under a limited sample size substantially degrades the quality of score estimates. By treating estimated loadings as covariates subject to bias and measurement error, we debias the score estimates, using the iterative bootstrap method for the loadings and considering classical measurement error models. Second, the MLE bias is often ignored in score estimation, but it can be removed through well-known MLE bias-reduction methods. We demonstrate the effectiveness of the proposed bias-correction procedure through experiments on both simulated data and real data.
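As a point of reference for the estimation step described above, here is a minimal Python sketch of plain (uncorrected) score estimation: a Poisson regression of one count vector on given loading estimates, fit by Newton's method. It illustrates the baseline that the bias corrections target, not the proposed debiasing procedure itself.

import numpy as np

def poisson_scores(y, V, n_iter=25, tol=1e-8):
    """Estimate a principal-component score vector s for one count vector y
    by Poisson regression with log link, treating the columns of the
    (estimated) loading matrix V as covariates.

    y : (d,) nonnegative counts, V : (d, k) estimated loadings.
    Returns s : (k,) maximum-likelihood score estimate.
    """
    _, k = V.shape
    s = np.zeros(k)
    for _ in range(n_iter):
        eta = V @ s                      # linear predictor
        mu = np.exp(eta)                 # Poisson mean
        grad = V.T @ (y - mu)            # gradient of the log-likelihood
        hess = V.T @ (V * mu[:, None])   # observed information
        step = np.linalg.solve(hess + 1e-8 * np.eye(k), grad)
        s = s + step
        if np.max(np.abs(step)) < tol:
            break
    return s

# Toy usage: sparse counts generated from a rank-2 model
rng = np.random.default_rng(0)
V_hat = rng.normal(scale=0.3, size=(50, 2))   # stands in for estimated loadings
s_true = np.array([1.0, -0.5])
y = rng.poisson(np.exp(V_hat @ s_true))
print(poisson_scores(y, V_hat))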
Network telemetry based on data models is expected to become the standard mechanism for efficiently collecting operational data from network devices. However, the wide variety of standard and proprietary data models, together with the differing implementations of telemetry protocols offered by network vendors, becomes a barrier when monitoring heterogeneous network infrastructures. To facilitate the integration and sharing of context information related to model-driven telemetry, this work proposes a semantic network inventory that integrates new information models specifically developed to capture context information in a vendor-agnostic fashion, using current standards defined for context management. To automate the integration of this context information within the network inventory, a reference architecture is designed. Finally, a prototype of the solution is implemented and validated through a case study that illustrates how the network inventory can ease the operation of model-driven telemetry in multi-vendor networks.
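Purely as an illustration of what a vendor-agnostic inventory record might contain, the following Python sketch stores telemetry context per device in a neutral structure; the field names and values are placeholders of our own, not the information models or the context-management standard used in the work.

from dataclasses import dataclass, field

@dataclass
class TelemetrySource:
    """Illustrative, vendor-agnostic context record for one telemetry-capable
    device. The field names are placeholders, not the proposed information model."""
    device_id: str
    vendor: str
    telemetry_protocol: str                                 # e.g. gNMI, NETCONF
    supported_models: list = field(default_factory=list)    # YANG module names
    active_subscriptions: list = field(default_factory=list)

# A tiny in-memory "inventory" keyed by device identifier.
inventory = {
    "r1": TelemetrySource("r1", "vendorA", "gNMI",
                          ["openconfig-interfaces"],
                          ["/interfaces/interface/state/counters"]),
    "r2": TelemetrySource("r2", "vendorB", "NETCONF",
                          ["ietf-interfaces"], []),
}

# A monitoring application can query the inventory without vendor-specific logic.
gnmi_devices = [d for d in inventory.values() if d.telemetry_protocol == "gNMI"]
print([d.device_id for d in gnmi_devices])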
While AI algorithms have shown remarkable success in various fields, their lack of transparency hinders their application to real-life tasks. Although explanations targeted at non-experts are necessary for user trust and human-AI collaboration, the majority of explanation methods for AI are focused on developers and expert users. Counterfactual explanations are local explanations that offer users advice on what can be changed in the input for the output of the black-box model to change. Counterfactuals are user-friendly and provide actionable advice for achieving the desired output from the AI system. While counterfactuals have been extensively researched in supervised learning, few methods apply them to reinforcement learning (RL). In this work, we explore the reasons for the underrepresentation of this powerful explanation method in RL. We start by reviewing the current work on counterfactual explanations in supervised learning. We then examine the differences between counterfactual explanations in supervised learning and RL and identify the main challenges that prevent the adoption of methods from supervised learning in reinforcement learning. Finally, we redefine counterfactuals for RL and propose research directions for implementing them in RL.
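To make the notion concrete, the sketch below runs a naive greedy counterfactual search against a black-box scoring function; it illustrates the general idea only and is not one of the surveyed supervised-learning methods or the RL formulation discussed here.

import numpy as np

def greedy_counterfactual(x, predict_proba, target, step=0.1, max_iter=100):
    """Very simple counterfactual search against a black-box scorer: at each
    step, move one feature by +/-step in the direction that most increases the
    predicted probability of the desired `target` class. Purely illustrative;
    real methods also optimise sparsity, plausibility and actionability."""
    x_cf = x.astype(float).copy()
    for _ in range(max_iter):
        if predict_proba(x_cf)[target] > 0.5:
            return x_cf                      # desired output reached
        best_gain, best_cand = -np.inf, None
        for i in range(len(x_cf)):
            for delta in (step, -step):
                cand = x_cf.copy()
                cand[i] += delta
                gain = predict_proba(cand)[target]
                if gain > best_gain:
                    best_gain, best_cand = gain, cand
        x_cf = best_cand
    return None                              # no counterfactual within budget

# Toy black box: logistic score on the feature sum, classes (0, 1)
def predict_proba(z):
    p1 = 1.0 / (1.0 + np.exp(-(z.sum() - 3.0)))
    return np.array([1.0 - p1, p1])

x = np.array([1.0, 1.0, 0.5])        # currently classified as 0
print(greedy_counterfactual(x, predict_proba, target=1))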
The advancement of large language models (LLMs) brings notable improvements across various applications, while simultaneously raising concerns about potential private data exposure. One notable capability of LLMs is their ability to form associations between different pieces of information, but this raises concerns when it comes to personally identifiable information (PII). This paper delves into the association capabilities of language models, aiming to uncover the factors that influence their proficiency in associating information. Our study reveals that as models scale up, their capacity to associate entities/information intensifies, particularly when target pairs demonstrate shorter co-occurrence distances or higher co-occurrence frequencies. However, there is a distinct performance gap when associating commonsense knowledge versus PII, with the latter showing lower accuracy. Despite the proportion of accurately predicted PII being relatively small, LLMs still demonstrate the capability to predict specific instances of email addresses and phone numbers when provided with appropriate prompts. These findings underscore the potential risk to PII confidentiality posed by the evolving capabilities of LLMs, especially as they continue to expand in scale and power.
The paper addresses load imbalance caused by highly skewed in-degree distributions in graphs by applying the idea of a rhizome to vertex-centric, message-driven graph processing. Rhizomatic construction of the graph creates multiple named vertex addresses for any number of large in-degree vertices. Other vertices may then point to any of the named addresses, thus sharing the in-degree load. The rhizomes communicate internally and remain consistent to provide a unified and correct view of the vertex. Simulated experiments show performance speed-ups for BFS graph traversal on large chip sizes for the tested input graph datasets with highly skewed in-degree distributions. The improvements come from sharing the in-degree compute workload among memory-processing elements and from lowering contention on the network-on-chip.
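The following Python sketch illustrates the core idea in software terms, under our own simplifications (no network-on-chip or memory-processing elements are modelled): a high in-degree vertex is represented by several named replicas, senders are spread across them, and the replicas reconcile to a single consistent value.

import random

class RhizomeVertex:
    """Illustrative sketch of a rhizomatic vertex: one logical vertex with a
    large in-degree is represented by several named replicas, each absorbing a
    share of the incoming messages. The replicas reconcile to a single,
    consistent view (here: the minimum BFS level seen by any replica)."""
    def __init__(self, vertex_id, num_replicas):
        self.vertex_id = vertex_id
        self.replicas = [{"name": f"{vertex_id}#{r}", "level": float("inf")}
                         for r in range(num_replicas)]

    def address_for(self, sender_id):
        # A sender is bound to one replica, spreading the in-degree load.
        return self.replicas[hash(sender_id) % len(self.replicas)]

    def receive(self, sender_id, level):
        rep = self.address_for(sender_id)
        rep["level"] = min(rep["level"], level)   # local, low-contention update

    def reconciled_level(self):
        # Internal communication between replicas yields the unified view.
        return min(rep["level"] for rep in self.replicas)

# Toy usage: a hub vertex receiving BFS levels from many neighbours
hub = RhizomeVertex("hub", num_replicas=4)
for sender in range(1000):
    hub.receive(f"v{sender}", level=random.randint(2, 6))
print(hub.reconciled_level())   # consistent result across replicas, e.g. 2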
Message brokers often mediate communication between data producers and consumers by adding variable-sized messages to ordered distributed queues. Our goal is to determine the number of consumers and the consumer-partition assignments needed to ensure that the rate of data consumption keeps up with the rate of data production. We model the problem as a variable item size bin packing problem. As the rate of production varies, new consumer-partition assignments are computed, which may require rebalancing a partition from one consumer to another. While a queue is being rebalanced, the data produced into it is not read, leading to additional latency. We therefore focus on the multi-objective optimization of minimizing both the number of consumers and the number of queue migrations. We present a variety of algorithms and compare them to established bin packing heuristics for this application. Comparing our proposed consumer-group assignment strategy with Kafka's commonly employed strategy, ours achieves a 90th-percentile latency of 4.52s compared to Kafka's 217s, with both using the same number of consumers. Kafka's assignment strategy only improved the consumer group's latency in configurations that used at least 60% more resources than our approach.
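For context, the sketch below shows one of the established bin-packing heuristics such approaches build on, first-fit decreasing, applied to consumer-partition assignment; it is not the migration-aware strategy proposed here, and the rates and capacity are invented for illustration.

def first_fit_decreasing(partition_rates, consumer_capacity):
    """Assign partitions to consumers with the classic first-fit-decreasing
    bin-packing heuristic: each partition's production rate is the item size,
    each consumer's maximum consumption rate is the bin capacity."""
    items = sorted(partition_rates.items(), key=lambda kv: kv[1], reverse=True)
    consumers = []          # list of dicts: {"load": float, "partitions": [...]}
    for partition, rate in items:
        for c in consumers:
            if c["load"] + rate <= consumer_capacity:
                c["load"] += rate
                c["partitions"].append(partition)
                break
        else:               # no existing consumer can keep up: open a new one
            consumers.append({"load": rate, "partitions": [partition]})
    return consumers

# Toy usage: per-partition production rates in MB/s, consumers read 10 MB/s each
rates = {"p0": 6.0, "p1": 4.5, "p2": 4.0, "p3": 3.0, "p4": 2.5}
for i, c in enumerate(first_fit_decreasing(rates, consumer_capacity=10.0)):
    print(f"consumer {i}: {c['partitions']} (load {c['load']} MB/s)")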
The graph invariant EPT-sum has cropped up in several unrelated fields in recent years: as an objective function for hierarchical clustering, as a more fine-grained version of the classical edge ranking problem, and, specifically when the input is a vertex-weighted tree, as a measure of the average/expected search length in a partially ordered set. The EPT-sum of a graph $G$ is defined as the minimum, over all edge partition trees (EPTs) of $G$, of the sum of the depths of the leaves, where an EPT is a rooted tree whose leaves correspond to the vertices of $G$ and whose internal nodes correspond to the edges of $G$. A simple algorithm that approximates the EPT-sum on trees builds an EPT of the input tree $G$ by recursively choosing the most balanced edge in $G$. Due to its fast runtime, this balanced cut algorithm is used in practice. In this paper, we show that the balanced cut algorithm gives a 1.5-approximation of the EPT-sum on trees, which amounts to a tight analysis and answers a question posed by Cicalese et al. in 2014.
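A minimal Python sketch of the balanced cut algorithm on an unweighted tree follows; it uses the observation that each cut adds one unit of depth to every leaf of the component it splits, so the EPT-sum accumulates the component sizes. The implementation details are our own.

from collections import defaultdict

def ept_sum_balanced_cut(vertices, edges):
    """Approximate the EPT-sum of an unweighted tree via the balanced-cut
    heuristic: recursively cut the edge whose removal splits the current
    component most evenly. Each cut contributes the size of the component
    being split. Illustrative sketch for unweighted trees only."""
    if len(vertices) <= 1:
        return 0
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def component(start, banned_edge):
        """Vertices reachable from `start` without crossing `banned_edge`."""
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if {x, y} == set(banned_edge) or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        return seen

    # Pick the edge minimising the size of the larger side of the split.
    def larger_side(e):
        side = component(e[0], e)
        return max(len(side), len(vertices) - len(side))
    cut = min(edges, key=larger_side)

    # Split into the two components and recurse.
    side_a = component(cut[0], cut)
    side_b = vertices - side_a
    edges_a = [e for e in edges if e != cut and e[0] in side_a]
    edges_b = [e for e in edges if e != cut and e[0] in side_b]
    return (len(vertices)
            + ept_sum_balanced_cut(side_a, edges_a)
            + ept_sum_balanced_cut(side_b, edges_b))

# Toy usage: a path on 4 vertices; the optimal EPT-sum here is 8,
# and the balanced cut (the middle edge) attains it.
print(ept_sum_balanced_cut({1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4)]))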
Foundation models, such as Large Language Models (LLMs), have attracted significant interest due to their wide range of applications. Existing work shows that appropriate prompt design, such as Chain-of-Thought, can unlock an LLM's powerful capabilities in diverse areas. However, when handling tasks that involve repetitive sub-tasks and/or deceptive content, such as arithmetic calculation and article-level fake news detection, existing prompting strategies either suffer from insufficient expressive power or from intermediate errors triggered by hallucination. To make LLMs more discerning of such intermediate errors, we propose to guide the LLM with a Divide-and-Conquer program that simultaneously ensures superior expressive power and disentangles the task decomposition, sub-task resolution, and resolution assembly processes. Theoretical analysis reveals that our strategy can guide LLMs to extend the expressive power of fixed-depth Transformers. Experiments indicate that our proposed method achieves better performance than typical prompting strategies on tasks plagued by intermediate errors and deceptive content, such as large-integer multiplication, hallucination detection, and misinformation detection.
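The sketch below shows the general shape of a divide-and-conquer program wrapped around a generic llm(prompt) callable (stubbed here so the sketch runs); the decomposition, sub-task, and merge prompts are illustrative stand-ins of our own, not the prompts used in the paper.

def divide_and_conquer_llm(task_input, llm, decompose, merge, atomic):
    """Generic divide-and-conquer guidance for an LLM: split the input into
    sub-tasks, let the model resolve each small sub-task separately, then
    assemble the sub-resolutions. `llm` is an assumed callable prompt -> text."""
    if atomic(task_input):
        return llm(f"Solve this sub-task and answer concisely:\n{task_input}")
    parts = decompose(task_input)
    sub_answers = [divide_and_conquer_llm(p, llm, decompose, merge, atomic)
                   for p in parts]
    return merge(task_input, sub_answers, llm)

# Example instantiation: paragraph-level checking for article-level claims.
def decompose(article):
    return [p for p in article.split("\n\n") if p.strip()]

def atomic(text):
    return "\n\n" not in text.strip()

def merge(article, sub_answers, llm):
    joined = "\n".join(f"- {a}" for a in sub_answers)
    return llm("Given these per-paragraph verdicts, give an overall verdict "
               f"for the whole article:\n{joined}")

# `llm` would wrap an actual model API; a stub keeps the sketch self-contained.
fake_llm = lambda prompt: f"[model answer to: {prompt[:40]}...]"
print(divide_and_conquer_llm("Paragraph one.\n\nParagraph two.", fake_llm,
                             decompose, merge, atomic))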
A possible world of an incomplete database table is obtained by imputing values from the attributes' (infinite) domains in place of \texttt{NULL}s. A table satisfies a possible key or possible functional dependency constraint if there exists a possible world of the table that satisfies the given key or functional dependency. A certain key or functional dependency is satisfied by a table if all of its possible worlds satisfy the constraint. Recently, an intermediate concept was introduced: a strongly possible key or functional dependency is satisfied by a table if there exists a strongly possible world that satisfies it, where a strongly possible world is obtained by imputing values from the active domains of the attributes, that is, from the values already appearing in the table. In the present paper, we study approximation measures of strongly possible keys and FDs. The measure $g_3$ is the ratio of the minimum number of tuples that must be removed so that the remaining table satisfies the constraint. We introduce a new measure $g_5$, the ratio of the minimum number of tuples that must be added to the table so that the result satisfies the constraint. $g_5$ is meaningful because adding tuples may extend the active domains. We prove that whenever $g_5$ is defined for a table and a constraint, the $g_3$ value is an upper bound of the $g_5$ value. However, the two measures are independent of each other in the sense that for any rational number $0\le\frac{p}{q}<1$ there are tables with an arbitrarily large number of rows and a constant number of columns that satisfy $g_3-g_5=\frac{p}{q}$. A possible world is usually obtained by adding many new values that did not occur in the table before; the measure $g_5$ captures the smallest possible distortion of the active domains. We also study the complexity of determining these approximation measures.
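To make the notions concrete, the brute-force Python sketch below checks a strongly possible key by enumerating completions from the active domains and computes $g_3$ by trying all tuple removals, normalising by the number of tuples (the usual convention for $g_3$). It is exponential and meant only for tiny examples.

from itertools import product, combinations

NULL = None

def strongly_possible_worlds(table):
    """Enumerate strongly possible worlds: every NULL is replaced by a value
    from the active domain (the values occurring in that column)."""
    n_cols = len(table[0])
    domains = [sorted({row[c] for row in table if row[c] is not NULL})
               for c in range(n_cols)]
    slots = [(r, c) for r, row in enumerate(table)
             for c in range(n_cols) if row[c] is NULL]
    for choice in product(*(domains[c] for (_, c) in slots)):
        world = [list(row) for row in table]
        for (r, c), val in zip(slots, choice):
            world[r][c] = val
        yield world

def satisfies_sp_key(table, key_cols):
    """True if some strongly possible world makes `key_cols` a key."""
    return any(len({tuple(row[c] for c in key_cols) for row in w}) == len(w)
               for w in strongly_possible_worlds(table))

def g3_sp_key(table, key_cols):
    """Brute-force g3: the minimum number of tuples to remove so that the rest
    satisfies the strongly possible key, divided by the number of tuples."""
    n = len(table)
    for k in range(n):
        for keep in combinations(range(n), n - k):
            if satisfies_sp_key([table[i] for i in keep], key_cols):
                return k / n
    return 1.0

# Toy table over attributes (A, B); key candidate {A}.
T = [(1, "x"), (NULL, "y"), (2, NULL), (NULL, "z")]
print(satisfies_sp_key(T, [0]))   # False: the active domain of A is only {1, 2}
print(g3_sp_key(T, [0]))          # 0.5: at least two tuples must be removed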
Multi-modal sensor data fusion takes advantage of complementary or reinforcing information from each sensor and can boost overall performance in applications such as scene classification and target detection. This paper presents a new method for fusing multi-modal and multi-resolution remote sensor data without requiring pixel-level training labels, which can be difficult to obtain. Previously, we developed a Multiple Instance Multi-Resolution Fusion (MIMRF) framework that addresses label uncertainty for fusion, but it can be slow to train due to the large search space for the fuzzy measures used to integrate sensor data sources. We propose a new method based on binary fuzzy measures, which reduces the search space and significantly improves the efficiency of the MIMRF framework. We present experimental results on synthetic data and a real-world remote sensing detection task and show that the proposed MIMRF-BFM algorithm can effectively and efficiently perform multi-resolution fusion given remote sensing data with uncertainty.
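For orientation, the sketch below shows a discrete Choquet integral, the kind of fuzzy-measure fusion operator that Choquet-integral-based frameworks such as MIMRF rely on, together with a toy binary fuzzy measure (values restricted to 0/1); the sources and the particular measure are invented for illustration and are not the measures learned by MIMRF-BFM.

from itertools import combinations

def choquet_integral(values, measure):
    """Discrete Choquet integral of per-source confidences `values`
    (dict source -> value in [0, 1]) with respect to a fuzzy measure
    `measure` (dict frozenset-of-sources -> value, monotone, g(all sources)=1).
    Restricting `measure` to 0/1 values gives a binary fuzzy measure,
    which is what shrinks the search space."""
    sources = sorted(values, key=values.get)           # ascending by value
    total, prev = 0.0, 0.0
    for i, s in enumerate(sources):
        coalition = frozenset(sources[i:])             # sources with value >= values[s]
        total += (values[s] - prev) * measure[coalition]
        prev = values[s]
    return total

# Toy binary fuzzy measure on three sources: a subset "counts" (measure 1)
# iff it contains 'lidar' or at least two sources.
srcs = ["rgb", "lidar", "hsi"]
measure = {frozenset(c): 1.0 if ("lidar" in c or len(c) >= 2) else 0.0
           for r in range(1, 4) for c in combinations(srcs, r)}

print(choquet_integral({"rgb": 0.2, "lidar": 0.9, "hsi": 0.6}, measure))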
Most commonly used $f$-divergences of measures, e.g., the Kullback-Leibler divergence, are subject to limitations regarding the support of the involved measures. A remedy consists of regularizing the $f$-divergence by a squared maximum mean discrepancy (MMD) associated with a characteristic kernel $K$. In this paper, we use the so-called kernel mean embedding to show that the corresponding regularization can be rewritten as the Moreau envelope of some function in the reproducing kernel Hilbert space associated with $K$. Then, we exploit well-known results on Moreau envelopes in Hilbert spaces to prove properties of the MMD-regularized $f$-divergences and, in particular, their gradients. Subsequently, we use our findings to analyze Wasserstein gradient flows of MMD-regularized $f$-divergences. Finally, we consider Wasserstein gradient flows starting from empirical measures and provide proof-of-concept numerical examples with Tsallis-$\alpha$ divergences.
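For orientation, recall the two objects being combined; the displayed form of the regularized functional, including the $\tfrac{1}{2\lambda}$ convention, is a hedged reading rather than the paper's exact definition. The squared MMD is the RKHS distance between kernel mean embeddings,
$$\mathrm{MMD}_K^2(\mu,\nu)=\|m_\mu-m_\nu\|_{\mathcal{H}_K}^2,\qquad m_\mu:=\int K(x,\cdot)\,d\mu(x),$$
and, for a proper, convex, lower semicontinuous $F$ on a Hilbert space $\mathcal{H}$, the Moreau envelope is
$$F_\lambda(h):=\min_{g\in\mathcal{H}}\Big\{F(g)+\tfrac{1}{2\lambda}\|h-g\|_{\mathcal{H}}^2\Big\}.$$
Regularizing $D_f(\cdot\,|\,\nu)$ by a squared MMD, say as $\inf_{\sigma}\big\{D_f(\sigma\,|\,\nu)+\tfrac{1}{2\lambda}\mathrm{MMD}_K^2(\mu,\sigma)\big\}$, and identifying measures with their embeddings $m_\mu\in\mathcal{H}_K$ then expresses the regularized divergence as $F_\lambda(m_\mu)$ for a suitable $F$ on $\mathcal{H}_K$, which is the Moreau-envelope viewpoint referred to above.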