We consider the linear causal representation learning setting where we observe a linear mixing of $d$ unknown latent factors, which follow a linear structural causal model. Recent work has shown that it is possible to recover the latent factors as well as the underlying structural causal model over them, up to permutation and scaling, provided that we have at least $d$ environments, each of which corresponds to perfect interventions on a single latent node (factor). After this powerful result, a key open problem faced by the community has been to relax these conditions: allow for coarser than perfect single-node interventions, and allow for fewer than $d$ of them, since the number of latent factors $d$ could be very large. In this work, we consider precisely such a setting, where we allow a smaller than $d$ number of environments, and also allow for very coarse interventions that can very coarsely \textit{change the entire causal graph over the latent factors}. On the flip side, we relax what we wish to extract to simply the \textit{list of nodes that have shifted between one or more environments}. We provide a surprising identifiability result that it is indeed possible, under some very mild standard assumptions, to identify the set of shifted nodes. Our identifiability proof moreover is a constructive one: we explicitly provide necessary and sufficient conditions for a node to be a shifted node, and show that we can check these conditions given observed data. Our algorithm lends itself very naturally to the sample setting where instead of just interventional distributions, we are provided datasets of samples from each of these distributions. We corroborate our results on both synthetic experiments as well as an interesting psychometric dataset. The code can be found at //github.com/TianyuCodings/iLCS.
Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach has theoretical true feature recovery guarantees under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.
Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution constraint under the Information Bottleneck (IB) principle to enhance representation learning. However, these proposed regularization terms only consider the constraint of each latent variable, omitting the structural information between latent variables. In this paper, we propose a novel structural entropy-guided probabilistic coding model, named SEPC. Specifically, we incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss. Besides, as traditional structural information theory is not well-suited for regression tasks, we propose a probabilistic encoding tree, transferring regression tasks to classification tasks while diminishing the influence of the transformation. Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC compared to other state-of-the-art models in terms of effectiveness, generalization capability, and robustness to label noise. The codes and datasets are available at //github.com/SELGroup/SEPC.
We study the constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs) where both the transition kernel and the reward function can be approximated by some linear function up to misspecification level $\zeta$. At the core of Cert-LSVI-UCB is an innovative \method, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant w.r.t. the number of episodes. Specifically, we demonstrate that for a linear MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}H^2))$. Here $d$ is the dimension of the feature space and $H$ is the horizon. Remarkably, this regret bound is independent of the number of episodes $K$. To the best of our knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent, high-probability regret bound in RL with linear function approximation without relying on prior distribution assumptions.
High-dimensional problems have long been considered the Achilles' heel of Bayesian optimization algorithms. Spurred by the curse of dimensionality, a large collection of algorithms aim to make it more performant in this setting, commonly by imposing various simplifying assumptions on the objective. In this paper, we identify the degeneracies that make vanilla Bayesian optimization poorly suited to high-dimensional tasks, and further show how existing algorithms address these degeneracies through the lens of lowering the model complexity. Moreover, we propose an enhancement to the prior assumptions that are typical to vanilla Bayesian optimization algorithms, which reduces the complexity to manageable levels without imposing structural restrictions on the objective. Our modification - a simple scaling of the Gaussian process lengthscale prior with the dimensionality - reveals that standard Bayesian optimization works drastically better than previously thought in high dimensions, clearly outperforming existing state-of-the-art algorithms on multiple commonly considered real-world high-dimensional tasks.
We consider temporal numeric planning problems $\Pi$ expressed in PDDL2.1 level 3, and show how to produce SMT formulas $(i)$ whose models correspond to valid plans of $\Pi$, and $(ii)$ that extend the recently proposed planning with patterns approach from the numeric to the temporal case. We prove the correctness and completeness of the approach and show that it performs very well on 10 domains with required concurrency.
The paper "Sorting with Bialgebras and Distributive Laws" by Hinze et al. uses the framework of bialgebraic semantics to define sorting algorithms. From distributive laws between functors they construct pairs of sorting algorithms using both folds and unfolds. Pairs of sorting algorithms arising this way include insertion/selection sort and quick/tree sort. We extend this work to define intrinsically correct variants in cubical Agda. Our key idea is to index our data types by multisets, which concisely captures that a sorting algorithm terminates with an ordered permutation of its input list. By lifting bialgebraic semantics to the indexed setting, we obtain the correctness of sorting algorithms purely from the distributive law.
Not many tests exist for testing the equality for two or more multivariate distributions with compositional data, perhaps due to their constrained sample space. At the moment, there is only one test suggested that relies upon random projections. We propose a novel test termed {\alpha}-Energy Based Test ({\alpha}-EBT) to compare the multivariate distributions of two (or more) compositional data sets. Similar to the aforementioned test, the new test makes no parametric assumptions about the data and, based on simulation studies it exhibits higher power levels.
The Knaster-Tarski theorem, also known as Tarski's theorem, guarantees that every monotone function defined on a complete lattice has a fixed point. We analyze the query complexity of finding such a fixed point on the $k$-dimensional grid of side length $n$ under the $\leq$ relation. Specifically, there is an unknown monotone function $f: \{0,1,\ldots, n-1\}^k \to \{0,1,\ldots, n-1\}^k$ and an algorithm must query a vertex $v$ to learn $f(v)$. A key special case of interest is the Boolean hypercube $\{0,1\}^k$, which is isomorphic to the power set lattice -- the original setting of the Knaster-Tarski theorem. Our lower bound characterizes the randomized and deterministic query complexity of the Tarski search problem on the Boolean hypercube as $\Theta(k)$. More generally, we prove a randomized lower bound of $\Omega\left( k + \frac{k \cdot \log{n}}{\log{k}} \right)$ for the $k$-dimensional grid of side length $n$, which is asymptotically tight in high dimensions when $k$ is large relative to $n$.
Data augmentation has been widely used to improve generalizability of machine learning models. However, comparatively little work studies data augmentation for graphs. This is largely due to the complex, non-Euclidean structure of graphs, which limits possible manipulation operations. Augmentation operations commonly used in vision and language have no analogs for graphs. Our work studies graph data augmentation for graph neural networks (GNNs) in the context of improving semi-supervised node-classification. We discuss practical and theoretical motivations, considerations and strategies for graph data augmentation. Our work shows that neural edge predictors can effectively encode class-homophilic structure to promote intra-class edges and demote inter-class edges in given graph structure, and our main contribution introduces the GAug graph data augmentation framework, which leverages these insights to improve performance in GNN-based node classification via edge prediction. Extensive experiments on multiple benchmarks show that augmentation via GAug improves performance across GNN architectures and datasets.
We investigate a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word sequence information. Compared with word-based methods, lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow our model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.