Suppose we observe a Poisson process in real time whose intensity takes one of two possible values $\lambda_0$ and $\lambda_1$. Suppose further that the prior probability of the true intensity is not given. We solve a minimax version of the Bayesian problem of sequentially testing two simple hypotheses, minimizing a linear combination of the probability of wrong detection and the expected waiting time under the worst case over all possible prior distributions. An equivalent characterization of the least favorable distribution is derived, and a sufficient condition for its existence is established.
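As a point of reference, a minimal simulation sketch of the likelihood-ratio process underlying such tests: for a Poisson process observed up to time $t$ with $N_t$ arrivals, the likelihood ratio of $\lambda_1$ against $\lambda_0$ is $(\lambda_1/\lambda_0)^{N_t}e^{-(\lambda_1-\lambda_0)t}$. The fixed posterior thresholds below are illustrative, and this rule is a simple stand-in, not the paper's minimax-optimal procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sequential_test(lam_true, lam0=1.0, lam1=2.0, prior=0.5,
                    low=0.05, high=0.95, t_max=200.0):
    """Posterior-threshold stopping rule for lam0 vs lam1, checked only
    at arrival epochs (a simplification: the posterior also drifts
    between arrivals as t grows)."""
    t, n = 0.0, 0
    while t < t_max:
        t += rng.exponential(1.0 / lam_true)  # wait for the next arrival
        n += 1
        lr = (lam1 / lam0) ** n * np.exp(-(lam1 - lam0) * t)  # likelihood ratio
        post = prior * lr / (prior * lr + 1 - prior)          # P(H1 | data)
        if post <= low:
            return "accept H0", t
        if post >= high:
            return "accept H1", t
    return "no decision", t

print(sequential_test(lam_true=2.0))
```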
We develop a general theory for optimizing the frequentist regret in sequential learning problems, where efficient bandit and reinforcement learning algorithms can be derived from unified Bayesian principles. We propose a novel optimization approach that generates "algorithmic beliefs" at each round and uses Bayesian posteriors to make decisions. The optimization objective that creates these "algorithmic beliefs," which we term the "Algorithmic Information Ratio," is an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematic approach to making Bayesian-type algorithms prior-free and applicable to adversarial settings in a generic and optimal manner. Moreover, the algorithms are simple and often efficient to implement. As a major application, we present a novel algorithm for multi-armed bandits that achieves "best-of-all-worlds" empirical performance in stochastic, adversarial, and non-stationary environments. We also illustrate how these principles can be used in linear bandits, bandit convex optimization, and reinforcement learning.
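The optimization that produces the "algorithmic beliefs" is not reproduced here; as a baseline for the posterior-based decision step it builds on, the following is plain Thompson sampling for Bernoulli bandits, the prior-dependent algorithm that the proposed approach is meant to make prior-free.

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_bernoulli(true_means, horizon=10_000):
    """Standard Thompson sampling with Beta(1,1) priors: sample a belief
    about each arm's mean, pull the argmax, update the posterior."""
    k = len(true_means)
    wins = np.ones(k)    # Beta posterior alpha parameters
    losses = np.ones(k)  # Beta posterior beta parameters
    total = 0.0
    for _ in range(horizon):
        theta = rng.beta(wins, losses)   # sampled beliefs per arm
        arm = int(np.argmax(theta))
        reward = rng.random() < true_means[arm]
        total += reward
        wins[arm] += reward
        losses[arm] += 1 - reward
    return total

print(thompson_bernoulli([0.3, 0.5, 0.7]))
```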
For a permutation $\pi: [K]\rightarrow [K]$, a sequence $f: \{1,2,\cdots, n\}\rightarrow \mathbb R$ contains a $\pi$-pattern of size $K$ if there is a sequence of indices $(i_1, i_2, \cdots, i_K)$ with $i_1<i_2<\cdots<i_K$ such that $f(i_a)<f(i_b)$ whenever $\pi(a)<\pi(b)$, for $a,b\in [K]$. Otherwise, $f$ is referred to as $\pi$-free. The special case $\pi = (1,2,\cdots, K)$ is referred to as the monotone pattern. \cite{newman2017testing} initiated the study of testing $\pi$-freeness with one-sided error, focusing on two specific problems: testing monotone permutations and testing the $(1,3,2)$ permutation. For the problem of testing the monotone permutation $(1,2,\cdots,K)$, \cite{ben2019finding} improved the $(\log n)^{O(K^2)}$ non-adaptive query complexity of \cite{newman2017testing} to $O((\log n)^{\lfloor \log_{2} K\rfloor})$. Further, \cite{ben2019optimal} proposed an adaptive algorithm with $O(\log n)$ query complexity. However, no progress has yet been made on the problem of testing $(1,3,2)$-freeness. In this work, we present an adaptive algorithm for testing $(1,3,2)$-freeness. The query complexity of our algorithm is $O(\epsilon^{-2}\log^4 n)$, which significantly improves over the $O(\epsilon^{-7}\log^{26}n)$-query adaptive algorithm of \cite{newman2017testing}. This improvement is achieved mainly by identifying a new structure embedded in the patterns.
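For concreteness, here is a brute-force checker of the containment definition (not the sublinear-query tester from this work): $f$ contains a $(1,3,2)$-pattern iff there are indices $i<j<k$ with $f(i)<f(k)<f(j)$.

```python
from itertools import combinations

def contains_132(f):
    """Brute-force O(n^3) check: does f contain indices i < j < k
    with f[i] < f[k] < f[j], i.e. a (1,3,2)-pattern?"""
    for i, j, k in combinations(range(len(f)), 3):
        if f[i] < f[k] < f[j]:
            return True
    return False

assert contains_132([1, 4, 2])       # the pattern itself
assert not contains_132([3, 2, 1])   # a decreasing sequence is (1,3,2)-free
```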
The access lemma (Sleator and Tarjan, JACM 1985) is a property of binary search trees that implies interesting consequences such as static optimality, the static finger property, and the working set property. However, there are known corollaries of dynamic optimality that cannot be derived via the access lemma, such as the dynamic finger property and any $o(\log n)$-competitive ratio to the optimal BST, where $n$ is the number of keys. In this paper, we introduce the group access bound, which is defined with respect to a reference group access tree. Group access bounds generalize the access lemma and imply properties far stronger than those implied by the access lemma. For each of the following results, there is a group access tree whose group access bound (i) is $O(\sqrt{\log n})$-competitive to the optimal BST; (ii) achieves the $k$-finger bound with an additive term of $O(m \log k \log \log n)$ (randomized) when the reference tree is an almost complete binary tree; (iii) satisfies the unified bound with an additive term of $O(m \log \log n)$; (iv) matches the unified bound with a time window $k$ with an additive term of $O(m \log k \log \log n)$ (randomized). Furthermore, we prove a simulation theorem: for every group access tree, there is an online BST algorithm that is $O(1)$-competitive with its group access bound. In particular, any new group access bound will automatically imply a new BST algorithm achieving the same bound. We thereby obtain an improved $k$-finger bound (when the reference tree is an almost complete binary tree), an improved unified bound with a time window $k$, and a bound matching the best-known bound for the unified bound in the BST model. Since any dynamically optimal BST must achieve the group access bounds, we believe our results provide a new direction towards proving the $o(\log n)$-competitiveness of Splay trees and Greedy.
We show an $O(n)$-time reduction from the problem of testing whether a multiset of positive integers can be partitioned into two multisets so that the sum of the integers in each multiset is equal to $n/2$ to the problem of testing whether an $n$-vertex biconnected outerplanar DAG admits an upward planar drawing. This constitutes the first barrier to the existence of efficient algorithms for testing the upward planarity of DAGs with no large triconnected minor. We also show a result in the opposite direction. Suppose that partitioning a multiset of positive integers into two multisets so that the sum of the integers in each multiset is $n/2$ can be solved in $f(n)$ time. Let $G$ be an $n$-vertex biconnected outerplanar DAG and $e$ be an edge incident to the outer face of an outerplanar drawing of $G$. Then it can be tested in $O(f(n))$ time whether $G$ admits an upward planar drawing with $e$ on the outer face.
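For reference, the source problem of this reduction is the classic Partition problem, which admits a standard pseudo-polynomial dynamic program; a minimal sketch of that DP (not of the reduction itself):

```python
def can_partition(nums):
    """Classic pseudo-polynomial DP for Partition: can the multiset be
    split into two parts of equal sum? Runs in O(n * sum) time."""
    total = sum(nums)
    if total % 2:
        return False
    target = total // 2
    reachable = {0}                      # subset sums reachable so far
    for x in nums:
        reachable |= {s + x for s in reachable if s + x <= target}
    return target in reachable

assert can_partition([3, 1, 1, 2, 2, 1])   # e.g. {3, 2} and {1, 1, 2, 1}
assert not can_partition([2, 2, 3])        # odd total sum
```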
In this article, we present a construction of a spanner on a set of $n$ points in $\mathbf{R}^d$ that we call a heavy path WSPD spanner. The construction is parameterized by a constant $s > 2$ called the separation ratio. The size of the graph is $O(s^dn)$ and the spanning ratio is at most $1 + 2/s + 2/(s - 1)$. We also show that this graph has a hop spanning ratio of at most $2\lg n + 1$. We present a memoryless local routing algorithm for heavy path WSPD spanners. The routing algorithm requires a vertex $v$ of the graph to store $O(\mathrm{deg}(v)\log n)$ bits of information, where $\mathrm{deg}(v)$ is the degree of $v$. The routing ratio is at most $1 + 4/s + 1/(s - 1)$ and at least $1 + 4/s$ in the worst case. The number of edges on the routing path is bounded by $2\lg n + 1$. We then show that the heavy path WSPD spanner can be constructed in metric spaces of bounded doubling dimension. These metric spaces have been studied in computational geometry as a generalization of Euclidean space. We show that, in a metric space with doubling dimension $\lambda$, the heavy path WSPD spanner has size $O(s^\lambda n)$ where $s$ is the separation ratio. The spanning ratio and hop spanning ratio are the same as in the Euclidean case. Finally, we show that the local routing algorithm works in the bounded doubling dimension case. The vertices require the same amount of storage, but the routing ratio becomes at most $1 + (2 + \frac{\tau}{\tau-1})/s + 1/(s - 1)$ in the worst case, where $\tau \ge 11$ is a constant related to the doubling dimension.
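As a sketch of the separation predicate underlying any WSPD, under the common equal-radius enclosing-ball convention (the heavy path construction itself is not shown, and the bounding-box-centered balls are one choice among several):

```python
import numpy as np

def well_separated(A, B, s):
    """Check whether point sets A and B (arrays of shape (k, d)) are
    s-well-separated: each fits in a ball of radius r, and the distance
    between the two balls is at least s * r."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    ca = (A.min(0) + A.max(0)) / 2            # bounding-box centers
    cb = (B.min(0) + B.max(0)) / 2
    r = max(np.linalg.norm(A - ca, axis=1).max(),
            np.linalg.norm(B - cb, axis=1).max())
    gap = np.linalg.norm(ca - cb) - 2 * r     # distance between the balls
    return gap >= s * r
```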
A pervasive methodological error is the post-hoc interpretation of $p$-values. A $p$-value $p$ is not the level at which we reject the null; it is the level at which we would have rejected the null had we chosen level $p$. We introduce post-hoc $p$-values, which do admit this interpretation. We show that $p$ is a post-hoc $p$-value if and only if $1/p$ is an $e$-value. This implies that the product of independent post-hoc $p$-values is a post-hoc $p$-value, making them easy to combine. If we permit external randomization, we find that any non-randomized post-hoc $p$-value can be trivially improved. However, we find that only (essentially) non-randomized post-hoc $p$-values can be arbitrarily merged through multiplication. Our results extend to post-hoc anytime validity in a sequential setting. Moreover, we introduce two-way post-hoc $p$-values, whose reciprocal is also post-hoc under the alternative. Likelihood ratios are two-way post-hoc $p$-values; this supports the 'direct' interpretation often purported for them in the context of Bayes factors and links that interpretation to post-hoc $p$-values. Finally, we extend to geometric post-hoc validity and show that GRO $e$-values are the reciprocals of post-hoc $p$-values that minimize the geometric post-hoc error under the alternative.
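A minimal numerical illustration of the characterization, using an assumed Gaussian likelihood ratio as the $e$-value: since a product of independent $e$-values is again an $e$-value, the reciprocal of the product is again a post-hoc $p$-value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def e_value(x, mu_alt=0.5):
    """Likelihood ratio of N(mu_alt, 1) vs N(0, 1): a valid e-value
    under H0, since its expectation under H0 equals 1."""
    return np.exp(norm.logpdf(x, mu_alt) - norm.logpdf(x, 0.0))

x = rng.normal(0.0, 1.0, size=5)       # data generated under H0
e_combined = np.prod(e_value(x))       # product of independent e-values
p_posthoc = min(1.0, 1.0 / e_combined) # reciprocal: a post-hoc p-value
print(e_combined, p_posthoc)
```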
We study the problem of contextual feature selection, where the goal is to learn a predictive function while identifying subsets of informative features conditioned on specific contexts. Towards this goal, we generalize the recently proposed stochastic gates (STG) of Yamada et al. [2020] by modeling the probabilistic gates as conditional Bernoulli variables whose parameters are predicted from the contextual variables. Our new scheme, termed conditional-STG (c-STG), comprises two networks: a hypernetwork that maps contextual variables to probabilistic feature selection parameters, and a prediction network that maps the selected features to the response variable. Training the two networks simultaneously ensures that context and feature selection are incorporated within a unified model. We provide a theoretical analysis examining several properties of the proposed framework. Importantly, our model leads to improved flexibility and adaptability of feature selection and can therefore better capture the nuances and variations in the data. We apply c-STG to simulated and real-world datasets, including healthcare, housing, and neuroscience, and demonstrate that it effectively selects contextually meaningful features, thereby enhancing predictive performance and interpretability.
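A minimal PyTorch-style sketch of the two-network scheme, with illustrative layer widths and gate noise; the regularization term that pushes the STG gates toward 0/1 is omitted.

```python
import torch
import torch.nn as nn

class ConditionalSTG(nn.Module):
    """Sketch of c-STG: a hypernetwork maps the context to per-feature
    gate means mu(c); gates are a clipped-Gaussian relaxation of
    Bernoulli variables, as in the original STG."""
    def __init__(self, n_features, n_context, sigma=0.5):
        super().__init__()
        self.hyper = nn.Sequential(                 # context -> gate params
            nn.Linear(n_context, 64), nn.ReLU(),
            nn.Linear(64, n_features))
        self.pred = nn.Sequential(                  # gated features -> response
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 1))
        self.sigma = sigma

    def forward(self, x, context):
        mu = self.hyper(context)
        eps = self.sigma * torch.randn_like(mu) if self.training else 0.0
        z = torch.clamp(mu + eps + 0.5, 0.0, 1.0)   # relaxed gates in [0, 1]
        return self.pred(x * z), z
```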
Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. This work focuses on the task of building theoretically principled generative models for such data. Current generative models learn $\mathcal{M}$ by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. These procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. We then learn the probability density within $\mathcal{M}$ with a constrained energy-based model, which employs a constrained variant of Langevin dynamics to train and sample from the learned manifold. In experiments on synthetic and natural data, we show that our model can learn manifold-supported distributions with complex topologies more accurately than pushforward models.
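A simplified stand-in for one step of such constrained dynamics (the paper's constrained Langevin variant is more careful, e.g. about tangent-space moves; the step sizes and projection loop here are illustrative): take a noisy gradient step on the energy, then approximately project back to the zero set of the network.

```python
import torch

def constrained_langevin_step(x, energy, manifold, step=1e-3, n_proj=3):
    """Sketched step: Langevin update on the energy, followed by a few
    gradient steps on ||manifold(x)||^2 to pull the iterate back toward
    the implicit manifold {x : manifold(x) = 0}."""
    x = x.detach().requires_grad_(True)
    grad_E = torch.autograd.grad(energy(x).sum(), x)[0]
    x = (x - step * grad_E
         + (2 * step) ** 0.5 * torch.randn_like(x)).detach()
    for _ in range(n_proj):                    # approximate projection
        x = x.requires_grad_(True)
        c = (manifold(x) ** 2).sum()
        g = torch.autograd.grad(c, x)[0]
        x = (x - 0.1 * g).detach()
    return x
```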
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks; hence, late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and share only what is necessary. We find that such a strategy improves fusion performance while reducing computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple audio-visual classification benchmarks, including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
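A minimal sketch of one such fusion layer (illustrative dimensions; residual connections, MLPs and layer norms omitted): each modality self-attends over its own tokens plus shared bottleneck tokens, and the two modality-specific bottleneck updates are averaged as one way of sharing the bottleneck.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Sketch: cross-modal information can only flow through a small
    number of shared bottleneck tokens."""
    def __init__(self, dim=256, heads=4, n_bottleneck=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))

    def forward(self, audio_tokens, video_tokens):
        b = self.bottleneck.expand(audio_tokens.size(0), -1, -1)
        xa = torch.cat([audio_tokens, b], dim=1)   # audio + bottleneck
        xv = torch.cat([video_tokens, b], dim=1)   # video + bottleneck
        ya, _ = self.attn_a(xa, xa, xa)
        yv, _ = self.attn_v(xv, xv, xv)
        n = self.bottleneck.size(1)
        new_b = (ya[:, -n:] + yv[:, -n:]) / 2      # averaged bottleneck update
        return ya[:, :-n], yv[:, :-n], new_b
```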
It is widely believed that modeling the relationships between objects is helpful for representing and eventually describing an image. Nevertheless, there has been little evidence supporting this idea in image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the objects detected in an image based on their spatial and semantic connections. The representation of each region proposal is then refined by leveraging the graph structure through a GCN. With the learnt region-level features, GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported in comparison to state-of-the-art approaches. Most remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
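A single-layer sketch of the graph-based refinement step (assuming a row-normalized relation graph as input; the relation-specific transformations and the attention LSTM decoder of the paper are omitted):

```python
import torch
import torch.nn as nn

class RegionGCN(nn.Module):
    """Sketch: refine detected-region features with one graph convolution
    over an object-relationship graph before captioning."""
    def __init__(self, dim=1024):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, regions, adj):
        # regions: (batch, n_regions, dim)
        # adj: (batch, n_regions, n_regions), row-normalized adjacency
        # built from spatial/semantic relations between regions.
        neighbors = torch.bmm(adj, regions)         # aggregate related regions
        return torch.relu(self.w(neighbors) + regions)
```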