99热日韩这里只有国产中文精品-亚洲国产A精品一区不卡

In this paper, we investigate the problem of deciding whether two standard normal random vectors $\mathsf{X}\in\mathbb{R}^{n}$ and $\mathsf{Y}\in\mathbb{R}^{n}$ are correlated or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these vectors are statistically independent, while under the alternative, $\mathsf{X}$ and a randomly and uniformly permuted version of $\mathsf{Y}$, are correlated with correlation $\rho$. We analyze the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$ and $\rho$. To derive our information-theoretic lower bounds, we develop a novel technique for evaluating the second moment of the likelihood ratio using an orthogonal polynomials expansion, which among other things, reveals a surprising connection to integer partition functions. We also study a multi-dimensional generalization of the above setting, where rather than two vectors we observe two databases/matrices, and furthermore allow for partial correlations between these two.

相關內容

相關系數

關注 0

掩碼 · binary · 極小點 · SODA · 確切的 ·

2024 年 3 月 8 日

Pattern Masking for Dictionary Matching

Panagiotis Charalampopoulos,Huiping Chen,Peter Christen,Grigorios Loukides,Nadia Pisanti,Solon P. Pissis,Jakub Radoszewski

from arxiv, Published in Algorithmica. Abstract abridged due to arXiv requirements

In the Pattern Masking for Dictionary Matching (PMDM) problem, we are given a dictionary $\mathcal{D}$ of $d$ strings, each of length $\ell$, a query string $q$ of length $\ell$, and a positive integer $z$, and we are asked to compute a smallest set $K\subseteq\{1,\ldots,\ell\}$, so that if $q[i]$, for all $i\in K$, is replaced by a wildcard, then $q$ matches at least $z$ strings from $\mathcal{D}$. The PMDM problem lies at the heart of two important applications featured in large-scale real-world systems: record linkage of databases that contain sensitive information, and query term dropping. In both applications, solving PMDM allows for providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known $k$-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We present a data structure for PMDM that answers queries over $\mathcal{D}$ in time $\mathcal{O}(2^{\ell/2}(2^{\ell/2}+\tau)\ell)$ and requires space $\mathcal{O}(2^{\ell}d^2/\tau^2+2^{\ell/2}d)$, for any parameter $\tau\in[1,d]$. We also approach the problem from a more practical perspective. We show an $\mathcal{O}((d\ell)^{k/3}+d\ell)$-time and $\mathcal{O}(d\ell)$-space algorithm for PMDM if $k=|K|=\mathcal{O}(1)$. We generalize our exact algorithm to mask multiple query strings simultaneously. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamt\'{a}\v{c} et al., SODA 2017]. This gives a polynomial-time $\mathcal{O}(d^{1/4+\epsilon})$-approximation algorithm for PMDM, which is tight under plausible complexity conjectures.

正則化項 · 圖 · Conformer · 正則表達式 · 代價 ·

2024 年 3 月 7 日

Distinct Shortest Walk Enumeration for RPQs

Claire David,Nadime Francis,Victor Marsault

from arxiv, 14 pages

We consider the Distinct Shortest Walks problem. Given two vertices $s$ and $t$ of a graph database $\mathcal{D}$ and a regular path query, enumerate all walks of minimal length from $s$ to $t$ that carry a label that conforms to the query. Usual theoretical solutions turn out to be inefficient when applied to graph models that are closer to real-life systems, in particular because edges may carry multiple labels. Indeed, known algorithms may repeat the same answer exponentially many times. We propose an efficient algorithm for multi-labelled graph databases. The preprocessing runs in $O{|\mathcal{D}|\times|\mathcal{A}|}$ and the delay between two consecutive outputs is in $O(\lambda\times|\mathcal{A}|)$, where $\mathcal{A}$ is a nondeterministic automaton representing the query and $\lambda$ is the minimal length. The algorithm can handle $\varepsilon$-transitions in $\mathcal{A}$ or queries given as regular expressions at no additional cost.

可約的 · 優化器 · INFORMS · Analysis · Networking ·

2024 年 3 月 7 日

Combinatorial Analysis of Coded Caching Schemes

Ruizhong Wei

from arxiv, 26 pages

Coded caching schemes are used to reduce computer network traffics in peak time. To determine the efficiency of the schemes, \cite{MN} defined the information rate of the schemes and gave a construction of optimal coded caching schemes. However, their construction needs to split the data into a large number of packets which may cause constraints in real applications. Many researchers then constructed new coded caching schemes to reduce the number of packets but that increased the information rate. We define an optimization of coded caching schemes under the limitation of the number of packets which may be used to verify the efficiency of these schemes. We also give some constructions for several infinite classes of optimal coded caching schemes under the new definition.

線性的 · INFORMS · 貝葉斯統計 · 統計量 · 振蕩 ·

2024 年 3 月 6 日

A Categorical Treatment of Open Linear Systems

Dario Stein,Richard Samuelson

An open stochastic system \`a la Willems is a system affected two qualitatively different kinds of uncertainty -- one is probabilistic fluctuation, and the other one is nondeterminism caused by lack of information. We give a formalization of open stochastic systems in the language of category theory. A new construction, which we term copartiality, is needed to model the propagating lack of information (which corresponds to varying sigma-algebras). As a concrete example, we discuss extended Gaussian distributions, which combine Gaussian probability with nondeterminism and correspond precisely to Willems' notion of Gaussian linear systems. We describe them both as measure-theoretic and abstract categorical entities, which enables us to rigorously describe a variety of phenomena like noisy physical laws and uninformative priors in Bayesian statistics. The category of extended Gaussian maps can be seen as a mutual generalization of Gaussian probability and linear relations, which connects the literature on categorical probability with ideas from control theory like signal-flow diagrams.

CASE · Readability · SimPLe · 情景 · 表示 ·

2024 年 3 月 6 日

Realizability of Rectangular Euler Diagrams

Dominik Dürrschnabel,Uta Priss

from arxiv, 16 pages, 5 figures, 2 algorithms

Euler diagrams are a tool for the graphical representation of set relations. Due to their simple way of visualizing elements in the sets by geometric containment, they are easily readable by an inexperienced reader. Euler diagrams where the sets are visualized as aligned rectangles are of special interest. In this work, we link the existence of such rectangular Euler diagrams to the order dimension of an associated order relation. For this, we consider Euler diagrams in one and two dimensions. In the one-dimensional case, this correspondence provides us with a polynomial-time algorithm to compute the Euler diagrams, while the two-dimensional case results in an exponential-time algorithm.

話題模型 · 話題 · MoDELS · Analysis · state-of-the-art ·

2024 年 3 月 6 日

The Geometric Structure of Topic Models

Johannes Hirth,Tom Hanika

Topic models are a popular tool for clustering and analyzing textual data. They allow texts to be classified on the basis of their affiliation to the previously calculated topics. Despite their widespread use in research and application, an in-depth analysis of topic models is still an open research topic. State-of-the-art methods for interpreting topic models are based on simple visualizations, such as similarity matrices, top-term lists or embeddings, which are limited to a maximum of three dimensions. In this paper, we propose an incidence-geometric method for deriving an ordinal structure from flat topic models, such as non-negative matrix factorization. These enable the analysis of the topic model in a higher (order) dimension and the possibility of extracting conceptual relationships between several topics at once. Due to the use of conceptual scaling, our approach does not introduce any artificial topical relationships, such as artifacts of feature compression. Based on our findings, we present a new visualization paradigm for concept hierarchies based on ordinal motifs. These allow for a top-down view on topic spaces. We introduce and demonstrate the applicability of our approach based on a topic model derived from a corpus of scientific papers taken from 32 top machine learning venues.

簇 · 歐氏空間 · Extensibility · 代價函數 · 相互獨立的 ·

2024 年 3 月 6 日

Space Complexity of Euclidean Clustering

Xiaoyi Zhu,Yuxiang Tian,Lingxiao Huang,Zengfeng Huang

from arxiv, Accepted by SoCG2024

The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of data involved, compression methods for the Euclidean $(k, z)$-Clustering problem, such as data compression and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, specifically, the number of bits required to compress the cost function within a multiplicative error $\varepsilon$, remains unclear in existing literature. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, serves as the optimal compression scheme. Furthermore, our lower bound result for $(k, z)$-Clustering establishes a tight space bound of $\Theta( n d )$ for terminal embedding, where $n$ represents the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may hold independent interest.

情景 · 歐氏空間 · 極小點 · 張成子空間 · Weight ·

2024 年 3 月 5 日

Maintaining Light Spanners via Minimal Updates

Hadi Khodabandeh,David Eppstein

We study the problem of maintaining a lightweight bounded-degree $(1+\varepsilon)$-spanner of a dynamic point set in a $d$-dimensional Euclidean space, where $\varepsilon>0$ and $d$ are arbitrary constants. In our fully-dynamic setting, points are allowed to be inserted as well as deleted, and our objective is to maintain a $(1+\varepsilon)$-spanner that has constant bounds on its maximum degree and its lightness (the ratio of its weight to that of the minimum spanning tree), while minimizing the recourse, which is the number of edges added or removed by each point insertion or deletion. We present a fully-dynamic algorithm that handles point insertion with amortized constant recourse and point deletion with amortized $O(\log\Delta)$ recourse, where $\Delta$ is the aspect ratio of the point set.

語言模型化 · MoDELS · 泛化理論 · 可辨認的 · Continuity ·

2023 年 7 月 12 日

A Comprehensive Overview of Large Language Models

Humza Naveed,Asad Ullah Khan,Shi Qiu,Muhammad Saqib,Saeed Anwar,Muhammad Usman,Nick Barnes,Ajmal Mian

Large Language Models (LLMs) have shown excellent generalization capabilities that have led to the development of numerous models. These models propose various new architectures, tweaking existing architectures with refined training strategies, increasing context length, using high-quality training data, and increasing training time to outperform baselines. Analyzing new developments is crucial for identifying changes that enhance training stability and improve generalization in LLMs. This survey paper comprehensively analyses the LLMs architectures and their categorization, training strategies, training datasets, and performance evaluations and discusses future research directions. Moreover, the paper also discusses the basic building blocks and concepts behind LLMs, followed by a complete overview of LLMs, including their important features and functions. Finally, the paper summarizes significant findings from LLM research and consolidates essential architectural and training strategies for developing advanced LLMs. Given the continuous advancements in LLMs, we intend to regularly update this paper by incorporating new sections and featuring the latest LLM models.

Performer · 學成 · 維數災難 · 泛化理論 · 數學 ·

2021 年 5 月 9 日

The Modern Mathematics of Deep Learning

Julius Berner,Philipp Grohs,Gitta Kutyniok,Philipp Petersen

from arxiv, This review paper will appear as a book chapter in the book "Theory of Deep Learning" by Cambridge University Press

We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.