The purpose of this paper is to present a {\em fresh} idea on how symbolic learning might be realized via analogical reasoning. For this, we introduce directed analogical proportions between logic programs of the form "$P$ transforms into $Q$ as $R$ transforms into $S$" as a mechanism for deriving similar programs by analogy-making. The idea is to instantiate a fragment of a recently introduced abstract algebraic framework of analogical proportions in the domain of logic programming. Technically, we define proportions in terms of modularity where we derive abstract forms of concrete programs from a "known" source domain which can then be instantiated in an "unknown" target domain to obtain analogous programs. To this end, we introduce algebraic operations for syntactic logic program composition and concatenation. Interestingly, our work suggests a close relationship between modularity, generalization, and analogy which we believe should be explored further in the future. In a broader sense, this paper is a further step towards a mathematical theory of logic-based analogical reasoning and learning with potential applications to open AI-problems like commonsense reasoning and computational learning and creativity.
Bidirectional typing is a discipline in which the typing judgment is decomposed explicitly into inference and checking modes, allowing to control the flow of type information in typing rules and to specify algorithmically how they should be used. Bidirectional typing has been fruitfully studied and bidirectional systems have been developed for many type theories. However, the formal development of bidirectional typing has until now been kept confined to specific theories, with general guidelines remaining informal. In this work, we give a generic account of bidirectional typing for a general class of dependent type theories. This is done by first giving a general definition of type theories (or equivalently, a logical framework), for which we define declarative and bidirectional type systems. We then show, in a theory-independent fashion, that the two systems are equivalent. This equivalence is then explored to establish the decidability of typing for weak normalizing theories, yielding a generic type-checking algorithm that has been implemented in a prototype and used in practice with many theories.
Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, only on which we actually compute rewards for more efficient policy gradient iterations. We build a Gaussian process modeling of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an "episodic" kernel quadrature method to compress the information of sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo and causal discovery tasks.
Manifold learning flows are a class of generative modelling techniques that assume a low-dimensional manifold description of the data. The embedding of such manifold into the high-dimensional space of the data is achieved via learnable invertible transformations. Therefore, once the manifold is properly aligned via a reconstruction loss, the probability density is tractable on the manifold and maximum likelihood can be used optimize the network parameters. Naturally, the lower-dimensional representation of the data requires an injective-mapping. Recent approaches were able to enforce that density aligns with the modelled manifold, while efficiently calculating the density volume-change term when embedding to the higher-dimensional space. However, unless the injective-mapping is analytically predefined, the learned manifold is not necessarily an efficient representation of the data. Namely, the latent dimensions of such models frequently learn an entangled intrinsic basis with degenerate information being stored in each dimension. Alternatively, if a locally orthogonal and/or sparse basis is to be learned, here coined canonical intrinsic basis, it can serve in learning a more compact latent space representation. Towards this end, we propose a canonical manifold learning flow method, where a novel optimization objective enforces the transformation matrix to have few prominent and orthogonal basis functions. Canonical manifold flow yields a more efficient use of the latent space, automatically generating fewer prominent and distinct dimensions to represent data, and consequently a better approximation of target distributions than other manifold flow methods in most experiments we conducted, resulting in lower FID scores.
In this paper the interpolating rational functions introduced by Floater and Hormann are generalized leading to a whole new family of rational functions depending on $\gamma$, an additional positive integer parameter. For $\gamma = 1$, the original Floater--Hormann interpolants are obtained. When $\gamma>1$ we prove that the new rational functions share a lot of the nice properties of the original Floater--Hormann functions. Indeed, for any configuration of nodes in a compact interval, they have no real poles, interpolate the given data, preserve the polynomials up to a certain fixed degree, and have a barycentric-type representation. Moreover, we estimate the associated Lebesgue constants in terms of the minimum ($h^*$) and maximum ($h$) distance between two consecutive nodes. It turns out that, in contrast to the original Floater-Hormann interpolants, for all $\gamma > 1$ we get uniformly bounded Lebesgue constants in the case of equidistant and quasi-equidistant nodes configurations (i.e., when $h\sim h^*$). For such configurations, as the number of nodes tends to infinity, we prove that the new interpolants ($\gamma>1$) uniformly converge to the interpolated function $f$, for any continuous function $f$ and all $\gamma>1$. The same is not ensured by the original FH interpolants ($\gamma=1$). Moreover, we provide uniform and pointwise estimates of the approximation error for functions having different degrees of smoothness. Numerical experiments illustrate the theoretical results and show a better error profile for less smooth functions compared to the original Floater-Hormann interpolants.
Heteroskedasticity testing in nonparametric regression is a classic statistical problem with important practical applications, yet fundamental limits are unknown. Adopting a minimax perspective, this article considers the testing problem in the context of an $\alpha$-H\"{o}lder mean and a $\beta$-H\"{o}lder variance function. For $\alpha > 0$ and $\beta \in (0, \frac{1}{2})$, the sharp minimax separation rate $n^{-4\alpha} + n^{-\frac{4\beta}{4\beta+1}} + n^{-2\beta}$ is established. To achieve the minimax separation rate, a kernel-based statistic using first-order squared differences is developed. Notably, the statistic estimates a proxy rather than a natural quadratic functional (the squared distance between the variance function and its best $L^2$ approximation by a constant) suggested in previous work. The setting where no smoothness is assumed on the variance function is also studied; the variance profile across the design points can be arbitrary. Despite the lack of structure, consistent testing turns out to still be possible by using the Gaussian character of the noise, and the minimax rate is shown to be $n^{-4\alpha} + n^{-1/2}$. Exploiting noise information happens to be a fundamental necessity as consistent testing is impossible if nothing more than zero mean and unit variance is known about the noise distribution. Furthermore, in the setting where $V$ is $\beta$-H\"{o}lder but heteroskedasticity is measured only with respect to the design points, the minimax separation rate is shown to be $n^{-4\alpha} + n^{-\left(\frac{1}{2} \vee \frac{4\beta}{4\beta+1}\right)}$ when the noise is Gaussian and $n^{-4\alpha} + n^{-\frac{4\beta}{4\beta+1}} + n^{-2\beta}$ when the noise distribution is unknown.
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at //github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at //gitee.com/mindspore/models/tree/master/research/cv/TNT.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
Deep learning is usually described as an experiment-driven field under continuous criticizes of lacking theoretical foundations. This problem has been partially fixed by a large volume of literature which has so far not been well organized. This paper reviews and organizes the recent advances in deep learning theory. The literature is categorized in six groups: (1) complexity and capacity-based approaches for analyzing the generalizability of deep learning; (2) stochastic differential equations and their dynamic systems for modelling stochastic gradient descent and its variants, which characterize the optimization and generalization of deep learning, partially inspired by Bayesian inference; (3) the geometrical structures of the loss landscape that drives the trajectories of the dynamic systems; (4) the roles of over-parameterization of deep neural networks from both positive and negative perspectives; (5) theoretical foundations of several special structures in network architectures; and (6) the increasingly intensive concerns in ethics and security and their relationships with generalizability.
We propose a novel method for automatic reasoning on knowledge graphs based on debate dynamics. The main idea is to frame the task of triple classification as a debate game between two reinforcement learning agents which extract arguments -- paths in the knowledge graph -- with the goal to promote the fact being true (thesis) or the fact being false (antithesis), respectively. Based on these arguments, a binary classifier, called the judge, decides whether the fact is true or false. The two agents can be considered as sparse, adversarial feature generators that present interpretable evidence for either the thesis or the antithesis. In contrast to other black-box methods, the arguments allow users to get an understanding of the decision of the judge. Since the focus of this work is to create an explainable method that maintains a competitive predictive accuracy, we benchmark our method on the triple classification and link prediction task. Thereby, we find that our method outperforms several baselines on the benchmark datasets FB15k-237, WN18RR, and Hetionet. We also conduct a survey and find that the extracted arguments are informative for users.
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.