We introduce a random recursive tree model with two communities, called the balanced community modulated random recursive tree, or BCMRT for short. In this setting, pairs of nodes, one of each type, appear sequentially. Each of them independently decides to attach to its own type with probability $1-q$, or to the other type with probability $q$, and then chooses its parent uniformly within the set of existing nodes of the selected type. We find that the limiting degree distributions coincide for different values of $q$. Therefore, as far as inference is concerned, other statistics have to be studied. We first consider the setting where the time labels of the nodes, i.e. their times of arrival, are observed but their types are not. In this setting, we design a consistent estimator for $q$ and provide bounds for the feasibility of testing between two different values of $q$. Moreover, we show that if $q$ is small enough, then it is possible to cluster the nodes in a way that is correlated with the true partition, although the algorithm requires exponential time. In the unlabelled setting, i.e. when only the tree structure is observed, we show that it is possible to test between different values of $q$ strictly better than by random guessing. This follows from a delicate analysis of the sum-of-distances statistic.
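To make the growth rule concrete, here is a minimal Python sketch of the BCMRT process; the function name, the two-root initialization, and the within-pair ordering are illustrative choices of ours, not taken from the paper.

import random

def simulate_bcmrt(n_pairs, q, seed=None):
    # Grow a BCMRT: each arriving node attaches to its own community with
    # probability 1 - q and to the other community with probability q, then
    # picks its parent uniformly among existing nodes of that community.
    rng = random.Random(seed)
    types = [0, 1]              # types[v]: community of node v (one root per type, a convention of ours)
    parents = [None, None]      # parents[v]: parent of node v
    by_type = {0: [0], 1: [1]}  # existing nodes of each community
    for _ in range(n_pairs):
        for t in (0, 1):        # a pair arrives: one node of each type
            target = t if rng.random() < 1 - q else 1 - t
            parent = rng.choice(by_type[target])
            v = len(types)
            types.append(t)
            parents.append(parent)
            by_type[t].append(v)
    return parents, types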
We present a novel method to compute $\textit{assume-guarantee contracts}$ in non-zero-sum two-player games over finite graphs where each player has a different $\omega$-regular winning condition. Given a game graph $G$ and two parity winning conditions $\Phi_0$ and $\Phi_1$ over $G$, we compute $\textit{contracted strategy-masks}$ ($\texttt{csm}$) $(\Psi_{i},\Phi_{i})$ for each Player $i$. Within a $\texttt{csm}$, $\Phi_{i}$ is a $\textit{permissive strategy template}$ which collects an infinite number of winning strategies for Player $i$ under the assumption that Player $1-i$ chooses any strategy from the $\textit{permissive assumption template}$ $\Psi_{i}$. The main feature of $\texttt{csm}$'s is their power to $\textit{fully decentralize all remaining strategy choices}$ -- if the two players' $\texttt{csm}$'s are compatible, they provide a pair of new local specifications $\Phi_0^\bullet$ and $\Phi_1^\bullet$ such that Player $i$ can locally and fully independently choose any strategy satisfying $\Phi_i^\bullet$, and the resulting strategy profile is ensured to be winning in the original two-objective game $(G,\Phi_0,\Phi_1)$. In addition, the new specifications $\Phi_i^\bullet$ are $\textit{maximally cooperative}$, i.e., they allow for the distributed synthesis of any cooperative solution. Further, our algorithmic computation of $\texttt{csm}$'s is complete and ensured to terminate. We illustrate how the unique features of our synthesis framework effectively address multiple challenges in the context of \enquote{correct-by-design} logical control software synthesis for cyber-physical systems, and we provide empirical evidence that our approach possesses desirable structural and computational properties compared to state-of-the-art techniques.
We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) where the goal of the decision maker is to reach a target set while avoiding an unsafe set (or sets) with certain probabilistic guarantees. Therefore, the underlying Markov chain for any control policy will be multichain, since by definition there exist a target set and an unsafe set. The decision maker also has to be optimal (with respect to a cost function) while navigating to the target set. This gives rise to a multi-objective optimization problem. We highlight the fact that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure (as shown by the counterexample due to Haviv). We resolve the counterexample by formulating the aforementioned multi-objective optimization problem as a zero-sum game and thereafter construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for the same setting and construct a modified $Q$-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with corresponding error bounds.
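As a rough illustration of the learning step, the sketch below runs tabular Q-learning on the Lagrangian reward $r - \lambda c$ with a dual update on the multiplier. The environment interface (a step returning a safety cost $c$), the $\varepsilon$-greedy choice, the budget parameter, and the per-step dual ascent are simplifying assumptions of ours; the paper's asynchronous, Shapley-style scheme differs in its details.

import numpy as np

def lagrangian_q_learning(env, n_states, n_actions, episodes, budget,
                          gamma=0.95, alpha=0.1, eta=0.01, eps=0.1):
    # Sketch of primal-dual Q-learning on the scalarized reward r - lam * c.
    Q = np.zeros((n_states, n_actions))
    lam = 1.0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(Q[s])))
            s2, r, c, done = env.step(a)  # hypothetical interface: reward r, safety cost c
            target = (r - lam * c) + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            lam = max(0.0, lam + eta * (c - budget))  # dual ascent on the multiplier
            s = s2
    return Q, lam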
Linear State Space Models (SSMs) have demonstrated strong performance in a variety of sequence modeling tasks due to their efficient encoding of the recurrent structure. However, in more comprehensive tasks like language modeling and machine translation, self-attention-based models still outperform SSMs. Hybrid models employing both SSM and self-attention generally show promising performance, but current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. In this work, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. By allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption at both the training and inference stages of sequence modeling. As a specific instantiation of SMA, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to conduct local attention only on the activated inputs, SeqBoat can achieve linear inference complexity with a theoretically infinite attention span, and it provides a substantially better quality-efficiency trade-off than chunking-based models. In experiments on a wide range of tasks, including language modeling, speech classification and Long Range Arena, SeqBoat achieves new state-of-the-art results among hybrid models with linear complexity and reveals how much attention each task needs through the learned sparse activation patterns.
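The routing idea behind SMA can be sketched in a few lines. The code below shows only a sparse forward pass that lets non-activated positions skip a per-element sub-module; the differentiable relaxation used for training and the GAU's local attention over activated inputs are not reproduced here, and all names are ours.

import torch

def sparse_forward(x, gate_logits, module, tau=0.0):
    # x: (batch, seq, dim); gate_logits: (batch, seq).
    # Elements whose gate does not fire skip `module` entirely,
    # saving its compute and memory for those positions.
    active = gate_logits > tau           # boolean activation mask
    out = x.clone()                      # non-activated positions pass through unchanged
    if active.any():
        out[active] = module(x[active])  # sub-module applied only to activated elements
    return out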
We provide a rigorous analysis of training by variational inference (VI) of Bayesian neural networks in the two-layer and infinite-width case. We consider a regression problem with a regularized evidence lower bound (ELBO), which decomposes into the expected log-likelihood of the data and the Kullback-Leibler (KL) divergence between the prior distribution and the variational posterior. With an appropriate weighting of the KL term, we prove a law of large numbers for three different training schemes: (i) the idealized case with exact estimation of a multiple Gaussian integral arising from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes by Backprop, and (iii) a new and computationally cheaper algorithm which we introduce as Minimal VI. An important result is that all methods converge to the same mean-field limit. Finally, we illustrate our results numerically and discuss the need for the derivation of a central limit theorem.
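Concretely, a regularized ELBO of the kind described can be written as the weighted objective below, where $\pi$ denotes the prior, $q$ the variational posterior, and $\eta$ the KL weight; the width-dependent scaling of $\eta$ used in the paper is not reproduced here.

$$\mathcal{L}(q) \;=\; \mathbb{E}_{w \sim q}\big[\log p(\mathcal{D} \mid w)\big] \;-\; \eta\, \mathrm{KL}\big(q(w) \,\|\, \pi(w)\big),$$

and VI maximizes $\mathcal{L}$ over a variational family, e.g. Gaussians under the reparametrization $w = \mu + \sigma \odot \xi$ with $\xi \sim \mathcal{N}(0, I)$, which is what makes the Gaussian integral in scheme (i) explicit.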
We consider the classic 1-center problem: given a set $P$ of $n$ points in a metric space, find the point in $P$ that minimizes the maximum distance to the other points of $P$. We study the complexity of this problem in $d$-dimensional $\ell_p$-metrics and in edit and Ulam metrics over strings of length $d$. Our results for the 1-center problem may be classified based on $d$ as follows. $\bullet$ Small $d$: Assuming the hitting set conjecture (HSC), we show that when $d=\omega(\log n)$, no subquadratic algorithm can solve the 1-center problem in any of the $\ell_p$-metrics, or in the edit or Ulam metrics. $\bullet$ Large $d$: When $d=\Omega(n)$, we extend our conditional lower bound to rule out subquartic algorithms for the 1-center problem in the edit metric (assuming Quantified SETH). On the other hand, we give a $(1+\varepsilon)$-approximation for 1-center in the Ulam metric with running time $\tilde{O}_{\varepsilon}(nd+n^2\sqrt{d})$. We also strengthen some of the above lower bounds by allowing approximations or by reducing the dimension $d$, but only against a weaker class of algorithms which list all requisite solutions. Moreover, we extend one of our hardness results to rule out subquartic algorithms for the well-studied 1-median problem in the edit metric, where, given a set of $n$ strings each of length $n$, the goal is to find a string in the set that minimizes the sum of the edit distances to the rest of the strings in the set.
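For reference, the naive exact algorithm that the conditional lower bounds above address is the obvious quadratic scan; the helper below is a generic sketch that works for any metric passed in as a function.

def one_center(P, dist):
    # Exact 1-center by brute force: O(n^2) metric evaluations.
    return min(P, key=lambda c: max(dist(c, x) for x in P))

# Example with the l_infinity metric on points given as tuples.
points = [(0, 0), (1, 3), (4, 1), (2, 2)]
linf = lambda a, b: max(abs(u - v) for u, v in zip(a, b))
print(one_center(points, linf))  # -> (2, 2)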
We prove that it is NP-hard to properly PAC learn decision trees with queries, resolving a longstanding open problem in learning theory (Bshouty 1993; Guijarro-Lavin-Raghavan 1999; Mehta-Raghavan 2002; Feldman 2016). While there has been a long line of work, dating back to (Pitt-Valiant 1988), establishing the hardness of properly learning decision trees from random examples, the more challenging setting of query learners necessitates different techniques, and no previous lower bounds were known. En route to our main result, we simplify and strengthen the best known lower bounds for a related problem, Decision Tree Minimization (Zantema-Bodlaender 2000; Sieling 2003). On a technical level, we introduce the notion of hardness distillation, which we study for decision tree complexity but which can be considered for any complexity measure: for a function that requires large decision trees, we give a general method for identifying a small set of inputs that is responsible for its complexity. Our technique even rules out query learners that are allowed constant error. This contrasts with existing lower bounds for the setting of random examples, which only hold for inverse-polynomial error. Our result, taken together with a recent almost-polynomial time query algorithm for properly learning decision trees under the uniform distribution (Blanc-Lange-Qiao-Tan 2022), demonstrates the dramatic impact of distributional assumptions on the problem.
A canonical problem in social choice is how to aggregate ranked votes: given $n$ voters' rankings over $m$ candidates, what voting rule $f$ should we use to aggregate these votes into a single winner? One standard method for comparing voting rules is by their satisfaction of axioms - properties that we want a "reasonable" rule to satisfy. Unfortunately, this approach leads to several impossibilities: no voting rule can simultaneously satisfy all the properties we want, at least in the worst case over all possible inputs. Motivated by this, we consider a relaxation of these worst-case requirements. We do so using a "smoothed" model of social choice, where votes are perturbed with small amounts of noise. If, no matter which input profile we start with, the probability (post-noise) of an axiom being satisfied is large, we consider the axiom as good as satisfied - we call it "smoothed-satisfied" - even if it may be violated in the worst case. Our model is a mild restriction of Lirong Xia's, and corresponds closely to that in Spielman and Teng's original work on smoothed analysis. Much work on axiom satisfaction under such noise has already been done in several papers by Xia. In our paper, we aim to give a more cohesive overview of when smoothed analysis of social choice is useful. Within our model, we give simple sufficient conditions for smoothed-satisfaction or smoothed-violation of several previously-unstudied axioms and paradoxes, plus many of those studied by Xia. We then observe that, in a practically important subclass of noise models, although convergence eventually occurs, known rates may require an extremely large number of voters. Motivated by this, we prove bounds specifically within a canonical noise model from this subclass - the Mallows model. Here, we present a more nuanced picture of exactly when smoothed analysis can help.
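To make "smoothed-satisfied" concrete, the sketch below estimates, by Monte Carlo, the probability that a profile perturbed by Mallows noise admits a Condorcet winner (one illustrative axiom). The sampler uses the standard repeated insertion method; candidates are assumed to be labeled 0..m-1, and all function names are ours.

import random

def sample_mallows(center, phi, rng):
    # Repeated insertion method: the i-th item of the center ranking is
    # inserted at position j (1-based, j <= i) with probability
    # proportional to phi**(i - j), for dispersion phi in (0, 1].
    r = []
    for i, item in enumerate(center, start=1):
        weights = [phi ** (i - j) for j in range(1, i + 1)]
        u, j = rng.random() * sum(weights), 0
        while u > weights[j]:
            u -= weights[j]
            j += 1
        r.insert(j, item)
    return r

def smoothed_condorcet_rate(profile, phi, trials=1000, seed=0):
    # Fraction of noisy copies of `profile` that have a Condorcet winner.
    rng = random.Random(seed)
    m, hits = len(profile[0]), 0
    for _ in range(trials):
        pos = [{c: k for k, c in enumerate(sample_mallows(v, phi, rng))}
               for v in profile]  # pos[v][c]: rank of candidate c in noisy vote v
        if any(all(2 * sum(p[c] < p[d] for p in pos) > len(pos)
                   for d in range(m) if d != c) for c in range(m)):
            hits += 1
    return hits / trials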
We consider the problem of uncertainty quantification in change point regressions, where the signal can be piecewise polynomial of arbitrary but fixed degree. That is, we seek disjoint intervals which, uniformly at a given confidence level, must each contain a change point location. We propose a procedure based on performing local tests at a number of scales and locations on a sparse grid, which adapts to the choice of grid in the sense that, by choosing a sparser grid, one explicitly pays a lower price for multiple testing. The procedure is fast, as its computational complexity is always of the order $\mathcal{O}(n \log n)$, where $n$ is the length of the data, and optimal in the sense that, under certain mild conditions, every change point is detected with high probability and the widths of the intervals returned match the minimax localisation rates for the associated change point problem up to log factors. A detailed simulation study shows that our procedure is competitive against state-of-the-art algorithms for similar problems. Our procedure is implemented in the R package ChangePointInference, which is available via //github.com/gaviosha/ChangePointInference.
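A stripped-down sketch of the multiscale idea, for the piecewise-constant (degree-zero) case only: scan a sparse dyadic grid of intervals with a CUSUM test and return a disjoint subfamily of significant intervals, each of which should contain a change point. The actual procedure handles arbitrary polynomial degree and calibrates the threshold so the coverage guarantee holds uniformly; threshold calibration, variance estimation, and the exact grid are our simplifications here.

import numpy as np

def cusum(y, s, e):
    # CUSUM statistic for a single change in mean on y[s:e].
    n = e - s
    csum = np.cumsum(y[s:e])
    best = 0.0
    for k in range(1, n):
        gap = abs(csum[k - 1] / k - (csum[-1] - csum[k - 1]) / (n - k))
        best = max(best, np.sqrt(k * (n - k) / n) * gap)
    return best

def significant_intervals(y, threshold):
    # Dyadic grid: lengths 2, 4, 8, ... with half-length steps, so the total
    # work over all scales is O(n log n).
    n = len(y)
    cands, length = [], 2
    while length <= n:
        for s in range(0, n - length + 1, length // 2):
            if cusum(y, s, s + length) > threshold:
                cands.append((s, s + length))
        length *= 2
    cands.sort(key=lambda ab: ab[1] - ab[0])  # prefer narrow intervals
    chosen = []
    for a, b in cands:
        if all(b <= c or a >= d for c, d in chosen):
            chosen.append((a, b))
    return chosen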
Multimodal learning helps to comprehensively understand the world by integrating different senses. Accordingly, multiple input modalities are expected to boost model performance, but we actually find that they are not fully exploited even when the multimodal model outperforms its uni-modal counterpart. Specifically, in this paper we point out that existing multimodal discriminative models, in which a uniform objective is designed for all modalities, can leave uni-modal representations under-optimized when another modality dominates in some scenarios, e.g., sound in a blowing-wind event, vision in a drawing-picture event, etc. To alleviate this optimization imbalance, we propose on-the-fly gradient modulation to adaptively control the optimization of each modality, by monitoring the discrepancy between their contributions towards the learning objective. Further, extra Gaussian noise that changes dynamically is introduced to avoid a possible generalization drop caused by the gradient modulation. As a result, we achieve considerable improvements over common fusion methods on different multimodal tasks, and this simple strategy can also boost existing multimodal methods, which illustrates its efficacy and versatility. The source code is available at \url{//github.com/GeWu-Lab/OGM-GE_CVPR2022}.
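The modulation step can be sketched as follows. The contribution scores and the tanh-shaped coefficient below are a simplified rendering of the idea (the hyperparameter alpha and the exact score definition may differ from the paper's), and the Gaussian-noise step is only indicated in a comment.

import math

def modulation_coeffs(score_a, score_v, alpha=1.0):
    # Given scalar contribution scores of two modalities (e.g., the summed
    # confidence of each uni-modal branch on the batch), down-weight the
    # gradients of whichever modality currently dominates.
    ratio = score_a / (score_v + 1e-8)
    k_a = 1.0 - math.tanh(alpha * ratio) if ratio > 1 else 1.0
    k_v = 1.0 - math.tanh(alpha / ratio) if ratio < 1 else 1.0
    return k_a, k_v

# In the training loop, each modality's gradients are multiplied by its
# coefficient, and zero-mean Gaussian noise whose scale tracks the gradient
# statistics is added back to counteract the generalization drop noted above.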
This paper focuses on two fundamental tasks of graph analysis: community detection and node representation learning, which capture the global and local structures of graphs, respectively. In the current literature, these two tasks are usually studied independently, although they are actually highly correlated. We propose a probabilistic generative model called vGraph to learn community membership and node representations collaboratively. Specifically, we assume that each node can be represented as a mixture of communities, and each community is defined as a multinomial distribution over nodes. Both the mixing coefficients and the community distributions are parameterized by the low-dimensional representations of the nodes and communities. We design an effective variational inference algorithm which regularizes the community membership of neighboring nodes to be similar in the latent space. Experimental results on multiple real-world graphs show that vGraph is very effective in both community detection and node representation learning, outperforming many competitive baselines in both tasks. We also show that the framework of vGraph is quite flexible and can be easily extended to detect hierarchical communities.
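The generative story can be sketched directly. The matrix representation below (rows as normalized distributions) stands in for the embedding parameterization, and the names are ours.

import numpy as np

def vgraph_sample_edge(w, node_community, community_node, rng):
    # For node w: draw a community z ~ p(z | w) (w's mixture weights),
    # then draw a linked neighbor c ~ p(c | z) (the community's
    # multinomial over nodes), generating the edge (w, c).
    z = rng.choice(node_community.shape[1], p=node_community[w])
    c = rng.choice(community_node.shape[1], p=community_node[z])
    return z, c

# node_community: (N, K) with rows p(z | w); community_node: (K, N) with rows
# p(c | z); in vGraph both are derived from low-dimensional node and community
# embeddings, e.g. via softmax. Usage: rng = np.random.default_rng(0).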