Let $k,\ell\geq 2$ be two multiplicatively independent integers. Cobham's famous theorem states that a set $X\subseteq \mathbb{N}$ is both $k$-recognizable and $\ell$-recognizable if and only if it is definable in Presburger arithmetic. Here we show the following strengthening: let $X\subseteq \mathbb{N}^m$ be $k$-recognizable, let $Y\subseteq \mathbb{N}^m$ be $\ell$-recognizable such that both $X$ and $Y$ are not definable in Presburger arithmetic. Then the first-order logical theory of $(\mathbb{N},+,X,Y)$ is undecidable. This is in contrast to a well-known theorem of B\"uchi that the first-order logical theory of $(\mathbb{N},+,X)$ is decidable.
We consider the problem of untangling a given (non-planar) straight-line circular drawing $\delta_G$ of an outerplanar graph $G=(V, E)$ into a planar straight-line circular drawing by shifting a minimum number of vertices to a new position on the circle. For an outerplanar graph $G$, it is clear that such a crossing-free circular drawing always exists and we define the circular shifting number shift$(\delta_G)$ as the minimum number of vertices that are required to be shifted in order to resolve all crossings of $\delta_G$. We show that the problem Circular Untangling, asking whether shift$(\delta_G) \le K$ for a given integer $K$, is NP-complete. For $n$-vertex outerplanar graphs, we obtain a tight upper bound of shift$(\delta_G) \le n - \lfloor\sqrt{n-2}\rfloor -2$. Based on these results we study Circular Untangling for almost-planar circular drawings, in which a single edge is involved in all the crossings. In this case, we provide a tight upper bound shift$(\delta_G) \le \lfloor \frac{n}{2} \rfloor-1$ and present a constructive polynomial-time algorithm to compute the circular shifting number of almost-planar drawings.
This manuscript portrays optimization as a process. In many practical applications the environment is so complex that it is infeasible to lay out a comprehensive theoretical model and use classical algorithmic theory and mathematical optimization. It is necessary as well as beneficial to take a robust approach, by applying an optimization method that learns as one goes along, learning from experience as more aspects of the problem are observed. This view of optimization as a process has become prominent in varied fields and has led to some spectacular success in modeling and systems that are now part of our daily lives.
The performance of a distillation-based compressed network is governed by the quality of distillation. The reason for the suboptimal distillation of a large network (teacher) to a smaller network (student) is largely attributed to the gap in the learning capacities of given teacher-student pair. While it is hard to distill all the knowledge of a teacher, the quality of distillation can be controlled to a large extent to achieve better performance. Our experiments show that the quality of distillation is largely governed by the quality of teacher's response, which in turn is heavily affected by the presence of similarity information in its response. A well-trained large capacity teacher loses similarity information between classes in the process of learning fine-grained discriminative properties for classification. The absence of similarity information causes the distillation process to be reduced from one example-many class learning to one example-one class learning, thereby throttling the flow of diverse knowledge from the teacher. With the implicit assumption that only the instilled knowledge can be distilled, instead of focusing only on the knowledge distilling process, we scrutinize the knowledge inculcation process. We argue that for a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher. We discuss the steps to find this sweet spot for better distillation. We also propose the distillation hypothesis to differentiate the behavior of the distillation process between knowledge distillation and regularization effect. We conduct all our experiments on three different datasets.
Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in probability theory, statistics and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(\sigma)}$, for arbitrary $p\geq 1$. After establishing basic metric and topological properties of $\mathsf{W}_p^{(\sigma)}$, we explore the asymptotic statistical behavior of $\mathsf{W}_p^{(\sigma)}(\hat{\mu}_n,\mu)$, where $\hat{\mu}_n$ is the empirical distribution of $n$ independent observations from $\mu$. We prove that $\mathsf{W}_p^{(\sigma)}$ enjoys a parametric empirical convergence rate of $n^{-1/2}$, which contrasts the $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_p$ when $d \geq 3$. Our proof relies on controlling $\mathsf{W}_p^{(\sigma)}$ by a $p$th-order smooth Sobolev distance $\mathsf{d}_p^{(\sigma)}$ and deriving the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(\sigma)}(\hat{\mu}_n,\mu)$, for all dimensions $d$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using $\mathsf{W}_p^{(\sigma)}$, with experiments for $p=2$ using a maximum mean discrepancy formulation of $\mathsf{d}_2^{(\sigma)}$.
This paper is concerned with the regularity of solutions to parabolic evolution equations. Special attention is paid to the smoothness in the specific anisotropic scale $\ B^{r\mathbf{a}}_{\tau,\tau}, \ \frac{1}{\tau}=\frac{r}{d}+\frac{1}{p}\ $ of Besov spaces where $\mathbf{a}$ measures the anisotropy. The regularity in these spaces determines the approximation order that can be achieved by fully space-time adaptive approximation schemes. In particular, we show that for the heat equation our results significantly improve previous results by Aimar and Gomez [3].
Hamiltonian cycles in graphs were first studied in the 1850s. Since then, an impressive amount of research has been dedicated to identifying classes of graphs that allow Hamiltonian cycles, and to related questions. The corresponding decision problem, that asks whether a given graph is Hamiltonian (i.\,e.\ admits a Hamiltonian cycle), is one of Karp's famous NP-complete problems. In this paper we study graphs of bounded degree that are \emph{far} from being Hamiltonian, where a graph $G$ on $n$ vertices is \emph{far} from being Hamiltonian, if modifying a constant fraction of $n$ edges is necessary to make $G$ Hamiltonian. We give an explicit deterministic construction of a class of graphs of bounded degree that are locally Hamiltonian, but (globally) far from being Hamiltonian. Here, \emph{locally Hamiltonian} means that every subgraph induced by the neighbourhood of a small vertex set appears in some Hamiltonian graph. More precisely, we obtain graphs which differ in $\Theta(n)$ edges from any Hamiltonian graph, but non-Hamiltonicity cannot be detected in the neighbourhood of $o(n)$ vertices. Our class of graphs yields a class of hard instances for one-sided error property testers with linear query complexity. It is known that any property tester (even with two-sided error) requires a linear number of queries to test Hamiltonicity (Yoshida, Ito, 2010). This is proved via a randomised construction of hard instances. In contrast, our construction is deterministic. So far only very few deterministic constructions of hard instances for property testing are known. We believe that our construction may lead to future insights in graph theory and towards a characterisation of the properties that are testable in the bounded-degree model.
In this paper, we consider the time-inhomogeneous nonlinear time series regression for a general class of locally stationary time series. On one hand, we propose sieve nonparametric estimators for the time-varying regression functions which can achieve the min-max optimal rate. On the other hand, we develop a unified simultaneous inferential theory which can be used to conduct both structural and exact form testings on the functions. Our proposed statistics are powerful even under locally weak alternatives. We also propose a multiplier bootstrapping procedure for practical implementation. Our methodology and theory do not require any structural assumptions on the regression functions and we also allow the functions to be supported in an unbounded domain. We also establish sieve approximation theory for 2-D functions in unbounded domain and a Gaussian approximation result for affine and quadratic forms for high dimensional locally stationary time series, which can be of independent interest. Numerical simulations and a real financial data analysis are provided to support our results.
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$ which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.