We re-visit the complexity of kernelization for the $d$-Hitting Set problem. This is a classic problem in Parameterized Complexity, which encompasses several other of the most well-studied problems in this field, such as Vertex Cover, Feedback Vertex Set in Tournaments (FVST) and Cluster Vertex Deletion (CVD). In fact, $d$-Hitting Set encompasses any deletion problem to a hereditary property that can be characterized by a finite set of forbidden induced subgraphs. With respect to bit size, the kernelization complexity of $d$-Hitting Set is essentially settled: there exists a kernel with $O(k^d)$ bits ($O(k^d)$ sets and $O(k^{d-1})$ elements) and this it tight by the result of Dell and van Melkebeek [STOC 2010, JACM 2014]. Still, the question of whether there exists a kernel for $d$-Hitting Set with fewer elements has remained one of the most major open problems~in~Kernelization. In this paper, we first show that if we allow the kernelization to be lossy with a qualitatively better loss than the best possible approximation ratio of polynomial time approximation algorithms, then one can obtain kernels where the number of elements is linear for every fixed $d$. Further, based on this, we present our main result: we show that there exist approximate Turing kernelizations for $d$-Hitting Set that even beat the established bit-size lower bounds for exact kernelizations -- in fact, we use a constant number of oracle calls, each with ``near linear'' ($O(k^{1+\epsilon})$) bit size, that is, almost the best one could hope for. Lastly, for two special cases of implicit 3-Hitting set, namely, FVST and CVD, we obtain the ``best of both worlds'' type of results -- $(1+\epsilon)$-approximate kernelizations with a linear number of vertices. In terms of size, this substantially improves the exact kernels of Fomin et al. [SODA 2018, TALG 2019], with simpler arguments.
We present the Fast Chebyshev Transform (FCT), a fast, randomized algorithm to compute a Chebyshev approximation of functions in high-dimensions from the knowledge of the location of its nonzero Chebyshev coefficients. Rather than sampling a full-resolution Chebyshev grid in each dimension, we randomly sample several grids with varied resolutions and solve a least-squares problem in coefficient space in order to compute a polynomial approximating the function of interest across all grids simultaneously. We theoretically and empirically show that the FCT exhibits quasi-linear scaling and high numerical accuracy on challenging and complex high-dimensional problems. We demonstrate the effectiveness of our approach compared to alternative Chebyshev approximation schemes. In particular, we highlight our algorithm's effectiveness in high dimensions, demonstrating significant speedups over commonly-used alternative techniques.
We consider the problem of sampling from a distribution governed by a potential function. This work proposes an explicit score based MCMC method that is deterministic, resulting in a deterministic evolution for particles rather than a stochastic differential equation evolution. The score term is given in closed form by a regularized Wasserstein proximal, using a kernel convolution that is approximated by sampling. We demonstrate fast convergence on various problems and show improved dimensional dependence of mixing time bounds for the case of Gaussian distributions compared to the unadjusted Langevin algorithm (ULA) and the Metropolis-adjusted Langevin algorithm (MALA). We additionally derive closed form expressions for the distributions at each iterate for quadratic potential functions, characterizing the variance reduction. Empirical results demonstrate that the particles behave in an organized manner, lying on level set contours of the potential. Moreover, the posterior mean estimator of the proposed method is shown to be closer to the maximum a-posteriori estimator compared to ULA and MALA in the context of Bayesian logistic regression. Additional examples demonstrate competitive performance for Bayesian neural network training.
Contrastive learning often relies on comparing positive anchor samples with multiple negative samples to perform Self-Supervised Learning (SSL). However, non-contrastive approaches like BYOL, SimSiam, and Barlow Twins achieve SSL without explicit negative samples. In this paper, we introduce a unified matrix information-theoretic framework that explains many contrastive and non-contrastive learning methods. We then propose a novel method Matrix-SSL based on matrix information theory. Experimental results reveal that Matrix-SSL significantly outperforms state-of-the-art methods on the ImageNet dataset under linear evaluation settings and on MS-COCO for transfer learning tasks. Specifically, when performing 100 epochs pre-training, our method outperforms SimCLR by 4.6%, and when performing transfer learning tasks on MS-COCO, our method outperforms previous SOTA methods such as MoCo v2 and BYOL up to 3.3% with only 400 epochs compared to 800 epochs pre-training. Code available at //github.com/yifanzhang-pro/Matrix-SSL.
Researchers use Twitter and sentiment analysis to predict Cardiovascular Disease (CVD) risk. We developed a new dictionary of CVD-related keywords by analyzing emotions expressed in tweets. Tweets from eighteen US states, including the Appalachian region, were collected. Using the VADER model for sentiment analysis, users were classified as potentially at CVD risk. Machine Learning (ML) models were employed to classify individuals' CVD risk and applied to a CDC dataset with demographic information to make the comparison. Performance evaluation metrics such as Test Accuracy, Precision, Recall, F1 score, Mathew's Correlation Coefficient (MCC), and Cohen's Kappa (CK) score were considered. Results demonstrated that analyzing tweets' emotions surpassed the predictive power of demographic data alone, enabling the identification of individuals at potential risk of developing CVD. This research highlights the potential of Natural Language Processing (NLP) and ML techniques in using tweets to identify individuals with CVD risks, providing an alternative approach to traditional demographic information for public health monitoring.
Bayesian optimization (BO) is a popular black-box function optimization method, which makes sequential decisions based on a Bayesian model, typically a Gaussian process (GP), of the function. To ensure the quality of the model, transfer learning approaches have been developed to automatically design GP priors by learning from observations on "training" functions. These training functions are typically required to have the same domain as the "test" function (black-box function to be optimized). In this paper, we introduce MPHD, a model pre-training method on heterogeneous domains, which uses a neural net mapping from domain-specific contexts to specifications of hierarchical GPs. MPHD can be seamlessly integrated with BO to transfer knowledge across heterogeneous search spaces. Our theoretical and empirical results demonstrate the validity of MPHD and its superior performance on challenging black-box function optimization tasks.
The Bayesian Cram\'er-Rao bound (CRB) provides a lower bound on the error of any Bayesian estimator under mild regularity conditions. It can be used to benchmark the performance of estimators, and provides a principled design metric for guiding system design and optimization. However, the Bayesian CRB depends on the prior distribution, which is often unknown for many problems of interest. This work develops a new data-driven estimator for the Bayesian CRB using score matching, a statistical estimation technique, to model the prior distribution. The performance of the estimator is analyzed in both the classical parametric modeling regime and the neural network modeling regime. In both settings, we develop novel non-asymptotic bounds on the score matching error and our Bayesian CRB estimator. Our proofs build on results from empirical process theory, including classical bounds and recently introduced techniques for characterizing neural networks, to address the challenges of bounding the score matching error. The performance of the estimator is illustrated empirically on a denoising problem example with a Gaussian mixture prior.
Designing models that are both expressive and preserve known invariances of tasks is an increasingly hard problem. Existing solutions tradeoff invariance for computational or memory resources. In this work, we show how to leverage randomness and design models that are both expressive and invariant but use less resources. Inspired by randomized algorithms, our key insight is that accepting probabilistic notions of universal approximation and invariance can reduce our resource requirements. More specifically, we propose a class of binary classification models called Randomized Linear Classifiers (RLCs). We give parameter and sample size conditions in which RLCs can, with high probability, approximate any (smooth) function while preserving invariance to compact group transformations. Leveraging this result, we design three RLCs that are provably probabilistic invariant for classification tasks over sets, graphs, and spherical data. We show how these models can achieve probabilistic invariance and universality using less resources than (deterministic) neural networks and their invariant counterparts. Finally, we empirically demonstrate the benefits of this new class of models on invariant tasks where deterministic invariant neural networks are known to struggle.
Few-shot Knowledge Graph (KG) completion is a focus of current research, where each task aims at querying unseen facts of a relation given its few-shot reference entity pairs. Recent attempts solve this problem by learning static representations of entities and references, ignoring their dynamic properties, i.e., entities may exhibit diverse roles within task relations, and references may make different contributions to queries. This work proposes an adaptive attentional network for few-shot KG completion by learning adaptive entity and reference representations. Specifically, entities are modeled by an adaptive neighbor encoder to discern their task-oriented roles, while references are modeled by an adaptive query-aware aggregator to differentiate their contributions. Through the attention mechanism, both entities and references can capture their fine-grained semantic meanings, and thus render more expressive representations. This will be more predictive for knowledge acquisition in the few-shot scenario. Evaluation in link prediction on two public datasets shows that our approach achieves new state-of-the-art results with different few-shot sizes.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis, thereby allowing manual manipulation in predicting the final answer.
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.