We study the problem of identifying the source of a stochastic diffusion process spreading on a graph based on the arrival times of the diffusion at a few queried nodes. In a graph $G=(V,E)$, an unknown source node $v^* \in V$ is drawn uniformly at random, and unknown edge weights $w(e)$ for $e\in E$, representing the propagation delays along the edges, are drawn independently from a Gaussian distribution of mean $1$ and variance $\sigma^2$. An algorithm then attempts to identify $v^*$ by querying nodes $q \in V$ and being told the length of the shortest path between $q$ and $v^*$ in graph $G$ weighted by $w$. We consider two settings: non-adaptive, in which all query nodes must be decided in advance, and adaptive, in which each query can depend on the results of the previous ones. Both settings are motivated by an application of the problem to epidemic processes (where the source is called patient zero), which we discuss in detail. We characterize the query complexity when $G$ is an $n$-node path. In the non-adaptive setting, $\Theta(n\sigma^2)$ queries are needed for $\sigma^2 \leq 1$, and $\Theta(n)$ for $\sigma^2 \geq 1$. In the adaptive setting, somewhat surprisingly, only $\Theta(\log\log_{1/\sigma}n)$ queries are needed when $\sigma^2 \leq 1/2$, and $\Theta(\log \log n)+O_\sigma(1)$ when $\sigma^2 \geq 1/2$. This is the first mathematical study of source identification with time queries in a non-deterministic diffusion process.
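As an illustration of the query model only (not the paper's algorithm), here is a minimal simulation sketch; the 0-indexed node labels and the oracle interface are our own assumptions:

```python
import numpy as np

def make_instance(n, sigma, rng):
    """Query model on an n-node path: the source is uniform on the nodes,
    edge delays are i.i.d. N(1, sigma^2), and a query returns the weighted
    length of the (unique) path between the queried node and the source."""
    source = int(rng.integers(n))
    delays = rng.normal(1.0, sigma, size=n - 1)        # weight of edge (i, i+1)
    cum = np.concatenate(([0.0], np.cumsum(delays)))   # prefix sums of weights
    return source, lambda q: cum[max(q, source)] - cum[min(q, source)]

rng = np.random.default_rng(0)
source, query = make_instance(1000, sigma=0.3, rng=rng)
# A single query to node 0 already estimates the source location to within
# roughly sigma * sqrt(n), since the answer concentrates around the distance:
estimate = round(query(0))
```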
Let $\sigma$ be a first-order signature and let $\mathbf{W}_n$ be the set of all $\sigma$-structures with domain $[n] = \{1, \ldots, n\}$. We can think of each structure in $\mathbf{W}_n$ as representing a "possible (state of the) world". By an inference framework we mean a class $\mathbf{F}$ of pairs $(\mathbb{P}, L)$, where $\mathbb{P} = (\mathbb{P}_n : n = 1, 2, 3, \ldots)$ and each $\mathbb{P}_n$ is a probability distribution on $\mathbf{W}_n$, and $L$ is a logic with truth values in the unit interval $[0, 1]$. From the point of view of probabilistic and logical expressivity, one may consider an inference framework optimal if it admits every pair $(\mathbb{P}, L)$ where $\mathbb{P} = (\mathbb{P}_n : n = 1, 2, 3, \ldots)$ is a sequence of probability distributions on $\mathbf{W}_n$ and $L$ is a logic. But from the point of view of using a pair $(\mathbb{P}, L)$ from such an inference framework for making inferences on $\mathbf{W}_n$ when $n$ is large, we face the problem of computational complexity. This motivates looking for an "optimal" trade-off (in a given context) between expressivity and computational efficiency. We define a notion of one inference framework being "asymptotically at least as expressive" as another. This relation is a preorder, and we describe a (strict) partial order on the equivalence classes of some inference frameworks that, in our opinion, are natural in the context of machine learning and artificial intelligence. The results have bearing on issues concerning efficient learning and probabilistic inference, but are also new instances of results in finite model theory about "almost sure elimination" of extra syntactic features (e.g., quantifiers) beyond the connectives. Often such a result has a logical convergence law as a corollary.
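For concreteness, a toy instance of such a pair $(\mathbb{P}, L)$, our own illustrative choice rather than one from the text:

```latex
% Toy pair (P, L), for illustration only.
% Take the signature sigma = {R} with R unary, so that W_n is the set of
% all {R}-structures with domain [n].  Let each element satisfy R
% independently with probability 1/2, i.e.
\mathbb{P}_n(\mathcal{A}) = 2^{-n} \quad \text{for every } \mathcal{A} \in \mathbf{W}_n,
% and let L be first-order logic, viewed as [0,1]-valued via its
% classical truth values {0, 1}.
```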
Empirical results in software engineering have increasingly shown that findings are unlikely to be applicable to all software systems, or to any domain: results need to be evaluated in specified contexts and limited to the type of systems from which they were extracted. This is a known issue, and it requires the establishment of a classification of software types. This paper makes two contributions: the first is to evaluate the quality of the current software classification landscape. The second is to perform a case study showing how to create a classification of software types using a curated set of software systems. Our contributions show that existing, and very likely even new, classification attempts are bound to fail for one or more issues, which we name the `antipatterns' of software classification tasks. We collected 7 of these antipatterns, which emerge from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.
We consider the question of adaptive data analysis within the framework of convex optimization. We ask how many samples are needed in order to compute $\epsilon$-accurate estimates of the $O(1/\epsilon^2)$ gradients queried by gradient descent, and we provide two intermediate answers to this question. First, we show that for a general analyst (not necessarily gradient descent) $\Omega(1/\epsilon^3)$ samples are required. This rules out the possibility of a foolproof mechanism. Our construction builds upon a new lower bound (which may be of interest in its own right) for an analyst that may ask several non-adaptive questions in each batch, over a fixed and known number $T$ of rounds of adaptivity, and requires a fraction of true discoveries. We show that for such an analyst $\Omega(\sqrt{T}/\epsilon^2)$ samples are necessary. Second, we show that, under certain assumptions on the oracle, in an interaction with gradient descent $\tilde \Omega(1/\epsilon^{2.5})$ samples are necessary. Our assumptions are that the oracle has only \emph{first-order access} and is \emph{post-hoc generalizing}. First-order access means that it can only compute the gradients of the sampled function at points queried by the algorithm. Our assumption of \emph{post-hoc generalization} follows from existing lower bounds for statistical queries. More generally, we provide a generic reduction from the standard setting of statistical queries to the problem of estimating gradients queried by gradient descent. These results stand in contrast with classical bounds showing that with $O(1/\epsilon^2)$ samples one can optimize the population risk to accuracy $O(\epsilon)$, but, as it turns out, with spurious gradients.
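To make the interaction concrete, a toy sketch of an analyst (gradient descent) talking to a first-order oracle; the one-dimensional quadratic loss and all names are our own illustrative choices:

```python
import numpy as np

def first_order_oracle(sample):
    """First-order access: given a query point x, return only the gradient
    of the empirical objective at x.  Toy population loss: E[(x - z)^2 / 2]
    with z ~ N(0, 1), whose population minimizer is x = 0."""
    return lambda x: np.mean(x - sample)     # empirical gradient at x

def gradient_descent(grad, x0=0.0, eta=0.1, steps=200):
    """The analyst: each query point depends on all previous answers,
    which is precisely the adaptivity driving the lower bounds."""
    x = x0
    for _ in range(steps):
        x -= eta * grad(x)                   # one adaptive gradient query
    return x

rng = np.random.default_rng(1)
x_hat = gradient_descent(first_order_oracle(rng.normal(size=100)))
# x_hat converges to the sample mean, i.e. the answers overfit the sample
# rather than the population minimizer 0.
```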
We study the problem of testing whether a function $f: \mathbb{R}^n \to \mathbb{R}$ is a polynomial of degree at most $d$ in the \emph{distribution-free} testing model. Here, the distance between functions is measured with respect to an unknown distribution $\mathcal{D}$ over $\mathbb{R}^n$ from which we can draw samples. In contrast to previous work, we do not assume that $\mathcal{D}$ has finite support. We design a tester that, given query access to $f$ and sample access to $\mathcal{D}$, makes $(d/\varepsilon)^{O(1)}$ queries to $f$, accepts with probability $1$ if $f$ is a polynomial of degree at most $d$, and rejects with probability at least $2/3$ if every degree-$d$ polynomial $P$ disagrees with $f$ on a set of mass at least $\varepsilon$ with respect to $\mathcal{D}$. Our result also holds under mild assumptions when we receive only a polynomial number of bits of precision for each query to $f$, or when $f$ can only be queried on rational points representable using a logarithmic number of bits. Along the way, we prove a new stability theorem for multivariate polynomials that may be of independent interest.
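As a baseline to contrast with (not the paper's tester), a naive one-dimensional sketch that fits a polynomial from queried points and checks agreement on fresh samples from $\mathcal{D}$; it ignores the precision issues and the multivariate setting handled above:

```python
import numpy as np

def naive_degree_test(f, sample_D, d, eps, m=200, tol=1e-6):
    """Naive 1-D baseline: interpolate f on d+1 points drawn from D,
    then estimate the D-mass on which the interpolant disagrees with f."""
    xs = sample_D(d + 1)                          # query points from D
    coeffs = np.polyfit(xs, [f(x) for x in xs], d)
    fresh = sample_D(m)                           # fresh samples from D
    disagree = np.mean([abs(np.polyval(coeffs, x) - f(x)) > tol
                        for x in fresh])
    return disagree <= eps                        # accept iff mass is small

rng = np.random.default_rng(2)
sample_D = lambda m: rng.normal(size=m)           # here D = N(0, 1)
print(naive_degree_test(lambda x: 3 * x**2 - 1, sample_D, d=2, eps=0.1))
```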
We provide a decision-theoretic analysis of bandit experiments. The setting corresponds to a dynamic programming problem, but solving this directly is typically infeasible. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for bandit experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation (PDE). Using a limit-of-experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distributions of the rewards. The approach further identifies the state variables to which it is asymptotically sufficient to restrict attention, and therefore suggests a practical strategy for dimension reduction. The upshot is that we can approximate the dynamic programming problem defining the bandit experiment with a PDE that can be efficiently solved using sparse matrix routines. We derive the optimal Bayes and minimax policies from the numerical solutions to these equations. The proposed policies substantially dominate existing methods such as Thompson sampling. The framework also allows for substantial generalizations of the bandit problem, such as time discounting and pure exploration motives.
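For reference, a minimal sketch of the Thompson-sampling baseline that the proposed policies are compared against, for a two-armed bandit with unit-variance normal rewards and standard normal priors (our own simplification):

```python
import numpy as np

def thompson_normal(rewards_fn, T=1000, rng=None):
    """Thompson sampling for two arms with N(mu_k, 1) rewards and
    independent N(0, 1) priors on mu_k (conjugate normal updates)."""
    rng = rng or np.random.default_rng()
    n = np.zeros(2)                    # pull counts per arm
    s = np.zeros(2)                    # reward sums per arm
    for _ in range(T):
        # posterior of mu_k is N(s_k / (n_k + 1), 1 / (n_k + 1))
        draws = rng.normal(s / (n + 1), 1.0 / np.sqrt(n + 1))
        k = int(np.argmax(draws))      # play the arm whose draw is largest
        r = rewards_fn(k)
        n[k] += 1
        s[k] += r
    return n, s

rng = np.random.default_rng(3)
pulls, _ = thompson_normal(lambda k: rng.normal((0.0, 0.2)[k], 1.0), rng=rng)
```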
We study dynamic algorithms for the problem of maximizing a monotone submodular function over a stream of $n$ insertions and deletions. We show that any algorithm that maintains a $(0.5+\epsilon)$-approximate solution under a cardinality constraint, for any constant $\epsilon>0$, must have an amortized query complexity that is $\mathit{polynomial}$ in $n$. Moreover, a linear amortized query complexity is needed in order to maintain a $0.584$-approximate solution. This is in sharp contrast with recent dynamic algorithms of [LMNF+20, Mon20] that achieve $(0.5-\epsilon)$-approximation with a $\mathsf{poly}\log(n)$ amortized query complexity. On the positive side, when the stream is insertion-only, we present efficient algorithms for the problem under a cardinality constraint and under a matroid constraint with approximation guarantee $1-1/e-\epsilon$ and amortized query complexities $\smash{O(\log (k/\epsilon)/\epsilon^2)}$ and $\smash{k^{\tilde{O}(1/\epsilon^2)}\log n}$, respectively, where $k$ denotes the cardinality parameter or the rank of the matroid.
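For intuition about the insertion-only regime, a simplified single-threshold sketch under a cardinality constraint (a toy variant in the spirit of streaming thresholding algorithms, not the algorithms analyzed above, which run several thresholds in parallel):

```python
def threshold_stream(stream, f, k, tau):
    """Keep an arriving element if its marginal gain with respect to the
    current solution is at least tau, until k elements are selected."""
    S = []
    for e in stream:
        if len(S) < k and f(S + [e]) - f(S) >= tau:
            S.append(e)
    return S

# Example: maximum coverage, where f(S) is the size of the union.
sets = [{1, 2, 3}, {3, 4}, {5}, {1, 5, 6}]
f = lambda S: len(set().union(*S))
print(threshold_stream(sets, f, k=2, tau=2))   # -> [{1, 2, 3}, {1, 5, 6}]
```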
Crowd-sourcing is a powerful solution for finding correct answers to expensive and unanswered queries in databases, including those with uncertain and incomplete data. Exploiting human abilities to process these expensive queries has helped to provide accurate results by utilising the knowledge available in the crowd. Crowd-sourcing database systems (CSDBs) combine the knowledge of the crowd with a relational database, typically by extending a relational database with minor changes. This paper surveys the leading studies conducted in the area of query processing for both traditional and preference queries in CSDBs. The focus of this work is on highlighting the strengths and weaknesses of each approach. A detailed discussion of current and future research trends associated with query processing in the area of CSDBs is also presented.
The minimum energy path (MEP) describes the mechanism of a reaction, and the energy barrier along the path can be used to calculate the reaction rate in thermal systems. The nudged elastic band (NEB) method is one of the most commonly used schemes to compute MEPs numerically. It approximates an MEP by a discrete set of configuration images, where the discretization size determines both the computational cost and the accuracy of the simulations. In this paper, we consider a discrete MEP to be a stationary state of the NEB method and prove an optimal convergence rate of the discrete MEP with respect to the number of images. Numerical simulations for the transitions of several prototypical model systems are performed to support the theory.
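To fix ideas, a minimal NEB sketch on a two-dimensional toy potential of our own choosing; the projection of the true and spring forces onto the local tangent is the "nudging" referred to above:

```python
import numpy as np

def neb(V_grad, a, b, n_images=12, k_spring=1.0, eta=0.01, steps=2000):
    """Relax a discrete band of images between minima a and b: the true
    force acts perpendicular to the local tangent, the spring force
    (which keeps images equispaced) acts parallel to it."""
    path = np.linspace(a, b, n_images)            # initial straight band
    for _ in range(steps):
        for i in range(1, n_images - 1):          # endpoints stay fixed
            tau = path[i + 1] - path[i - 1]
            tau /= np.linalg.norm(tau)            # unit tangent estimate
            g = V_grad(path[i])
            f_perp = -(g - np.dot(g, tau) * tau)  # nudged true force
            f_spring = k_spring * np.dot(
                path[i + 1] - 2 * path[i] + path[i - 1], tau) * tau
            path[i] += eta * (f_perp + f_spring)
    return path

# Toy potential V(x, y) = (x^2 - 1)^2 + 2 (y - x^2/2)^2 with minima at
# (-1, 1/2) and (1, 1/2); the MEP bends through the saddle near (0, 0).
def V_grad(p):
    x, y = p
    return np.array([4 * x * (x**2 - 1) - 4 * x * (y - x**2 / 2),
                     4 * (y - x**2 / 2)])

band = neb(V_grad, np.array([-1.0, 0.5]), np.array([1.0, 0.5]))
```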
A palindromic substring $T[i..j]$ of a string $T$ is said to be a shortest unique palindromic substring (SUPS) in $T$ for an interval $[p, q]$ if $T[i..j]$ is a shortest palindromic substring such that $T[i..j]$ occurs only once in $T$ and $[i, j]$ contains $[p, q]$. The SUPS problem is, given a string $T$ of length $n$, to construct a data structure that can compute all the SUPSs for any given query interval. It is known that any SUPS query can be answered in $O(\alpha)$ time after $O(n)$-time preprocessing, where $\alpha$ is the number of SUPSs to output [Inoue et al., 2018]. In this paper, we first show that $\alpha$ is at most $4$, and that this upper bound is tight. Also, we present an algorithm to solve the SUPS problem for a sliding window that can answer any query in $O(\log\log W)$ time and update the data structures in amortized $O(\log\sigma)$ time, where $W$ is the size of the window and $\sigma$ is the alphabet size. Furthermore, we consider the SUPS problem in the after-edit model and present an efficient algorithm. Namely, we present an algorithm that uses $O(n)$ time for preprocessing and answers any $k$ SUPS queries in $O(\log n\log\log n + k\log\log n)$ time after a single-character substitution. As a by-product, we propose a fully-dynamic data structure for range minimum queries (RmQs) under the constraint that the width of each query range is polylogarithmic in $n$. The constrained RmQ data structure can answer such a query in constant time and supports a single-element edit operation in amortized constant time.
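As a reference for the definition (far from the efficient data structures above), a brute-force SUPS computation in 0-indexed form; the function name and interface are our own:

```python
def sups(T, p, q):
    """Return all SUPSs of T for the interval [p, q] (0-indexed): shortest
    palindromic substrings T[i..j] occurring exactly once in T with
    i <= p and q <= j.  Cubic time; for illustrating the definition only."""
    n = len(T)
    cand = []
    for i in range(p + 1):
        for j in range(q, n):
            s = T[i:j + 1]
            occurrences = sum(T[x:x + len(s)] == s
                              for x in range(n - len(s) + 1))
            if s == s[::-1] and occurrences == 1:
                cand.append((i, j))
    if not cand:
        return []
    best = min(j - i for i, j in cand)
    return [(i, j) for i, j in cand if j - i == best]

print(sups("abaaba", 2, 3))   # -> [(2, 3)], the unique palindrome "aa"
```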
Time Series Classification (TSC) is an important and challenging problem in data mining. With the increasing availability of time series data, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) to perform this task. This is surprising, as deep learning has seen very successful applications in recent years. DNNs have indeed revolutionized the field of computer vision, especially with the advent of novel deeper architectures such as Residual and Convolutional Neural Networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open-source deep learning framework to the TSC community in which we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we present the most exhaustive study of DNNs for TSC to date.
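To illustrate the kind of architecture such benchmarks cover, a generic fully-convolutional sketch for univariate TSC in PyTorch; the layer sizes are illustrative assumptions, not the exact models evaluated above:

```python
import torch
import torch.nn as nn

class FCN(nn.Module):
    """A fully-convolutional baseline for univariate TSC: three conv
    blocks followed by global average pooling and a linear classifier."""
    def __init__(self, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=8, padding="same"),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding="same"),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding="same"),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # global average pooling
            nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):                     # x: (batch, 1, series_length)
        return self.net(x)

model = FCN(n_classes=10)
logits = model(torch.randn(32, 1, 128))       # a batch of 32 toy series
```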