We consolidate two widely believed conjectures about tautologies -- no optimal proof system exists, and most tautologies require superpolynomial-size proofs in any system -- into a single $p$-isomorphism-invariant condition satisfied by all paddable $\textbf{coNP}$-complete languages or by none. The condition is: for any Turing machine (TM) $M$ accepting the language, $\textbf{P}$-uniform input families requiring superpolynomial time by $M$ exist (equivalent to the first conjecture) and appear with positive upper density in an enumeration of input families (this implies the second). In that case, no such language is easy on average (i.e., in $\textbf{AvgP}$) for a distribution assigning non-negligible weight to the hard families. The hardness of proving tautologies is likely related to the hardness of proving theorems. Motivated by the fact that arithmetic sentences encoding "string $x$ is Kolmogorov random" are true but unprovable with positive density in a finitely axiomatized theory $\mathcal{T}$ (Calude and J{\"u}rgensen), we conjecture that any propositional proof system requires superpolynomial-size proofs for a dense set of $\textbf{P}$-uniform families of tautologies encoding "there is no $\mathcal{T}$ proof of size $\leq t$ showing that string $x$ is Kolmogorov random". This conjecture implies the above condition. It suggests that there is no optimal proof system because undecidable theories help prove tautologies and do so more efficiently as axioms are added, and that constructing hard tautologies seems difficult because it is impossible to construct Kolmogorov random strings. Similar conjectures that computational blind spots are manifestations of noncomputability would resolve other open problems.
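One way to state the density condition formally (the notation here is ours, not the abstract's): fix an enumeration $F_1, F_2, \dots$ of the $\textbf{P}$-uniform input families for the language $L$; the condition asserts that for every TM $M$ with $L(M) = L$,
\[
\overline{d}\big(\{\, i : M \text{ requires superpolynomial time on } F_i \,\}\big) > 0,
\qquad \text{where} \quad
\overline{d}(S) = \limsup_{k\to\infty} \frac{|S \cap \{1,\dots,k\}|}{k}
\]
is the upper density. In particular, the set in question is nonempty, which is the existence clause equivalent to the first conjecture.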
A support vector machine (SVM) is an algorithm that finds a hyperplane optimally separating labeled data points in $\mathbb{R}^n$ into positive and negative classes. The data points on the margin of this separating hyperplane are called support vectors. We connect the possible configurations of support vectors to Radon's theorem, which gives guarantees for when a set of points can be divided into two classes (positive and negative) whose convex hulls intersect. If the convex hulls of the positive and negative support vectors are projected onto a separating hyperplane, then the projections intersect if and only if the hyperplane is optimal. Further, under a particular type of general position assumption, we show that (a) the projected convex hulls of the support vectors intersect in exactly one point, (b) the support vectors are stable under perturbation, (c) there are at most $n+1$ support vectors, and (d) every number of support vectors from 2 up to $n+1$ is possible. Finally, we perform computer simulations studying the expected number of support vectors, and their configurations, for randomly generated data. We observe that as the distance between classes of this type of randomly generated data increases, configurations with fewer support vectors become more likely.
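A quick empirical illustration of claim (c) (our own toy experiment, not from the paper): fit a hard-margin linear SVM on well-separated random data in $\mathbb{R}^n$ and count the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy illustration: two linearly separable Gaussian clouds in R^n,
# inspected through the support vectors of a (near) hard-margin SVM.
rng = np.random.default_rng(0)
n = 3                                   # ambient dimension
X_pos = rng.normal(loc=+2.0, size=(50, n))
X_neg = rng.normal(loc=-2.0, size=(50, n))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# A very large C approximates the hard-margin SVM of the abstract.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv = clf.support_vectors_
print(f"number of support vectors: {len(sv)} (at most n+1 = {n + 1} in general position)")
```

Random Gaussian data of this kind is in general position with probability 1, so the printed count should fall between 2 and $n+1$, matching claims (c) and (d).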
Normalized Discounted Cumulative Gain (NDCG) is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimizing NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Unlike existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the total number of items. To improve effectiveness for deep learning, we further propose practical strategies using initial warm-up and a stop-gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first work to propose stochastic algorithms that optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at //libauc.org/.
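For intuition, here is a minimal differentiable NDCG surrogate on a single mini-batch (a generic sigmoid-smoothed-rank surrogate of our own; the paper's compositional and bilevel formulations, which correct the bias such naive per-batch estimates incur, are more involved):

```python
import torch

def smooth_ndcg_loss(scores, rels, tau=1.0):
    """Differentiable NDCG surrogate for one query's mini-batch of items.

    The hard rank of item i is replaced by a sigmoid-smoothed count of the
    items scored above it, so gradients flow through the sorting step.
    """
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)    # diff[j, i] = s_j - s_i
    ranks = 0.5 + torch.sigmoid(diff / tau).sum(dim=0)  # soft rank of each item
    gains = 2.0 ** rels - 1.0
    dcg = (gains / torch.log2(1.0 + ranks)).sum()
    ideal_gains, _ = torch.sort(gains, descending=True)
    positions = torch.arange(len(rels), dtype=scores.dtype)
    idcg = (ideal_gains / torch.log2(2.0 + positions)).sum()
    return 1.0 - dcg / idcg                             # minimize 1 - NDCG

scores = torch.randn(8, requires_grad=True)             # model scores for 8 items
rels = torch.tensor([3., 2., 0., 1., 0., 0., 2., 0.])   # graded relevances
smooth_ndcg_loss(scores, rels).backward()                # gradients w.r.t. scores
print(scores.grad)
```

Note that each iteration here touches only the items in the mini-batch, which is the complexity property the abstract emphasizes.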
The precision matrix, which encodes the conditional linear dependency relations among a set of variables, is an important object of interest in multivariate analysis. Sparse estimation procedures for precision matrices, such as the graphical lasso (Glasso), have gained popularity because they facilitate interpretability by separating pairs of variables that are conditionally dependent from those that are independent (given all other variables). The Glasso, however, lacks robustness to outliers. To overcome this problem, one typically applies a robust plug-in procedure, computing the Glasso from a robust covariance estimate instead of the sample covariance and thereby providing protection against outliers. In this paper, we study such estimators theoretically by deriving and comparing their influence functions, sensitivity curves, and asymptotic variances.
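A minimal sketch of the robust plug-in procedure (our own illustration; the paper studies the theory of such estimators, not a particular implementation), using the minimum covariance determinant (MCD) estimator as the robust covariance plug-in:

```python
import numpy as np
from sklearn.covariance import MinCovDet, graphical_lasso

# Toy sketch of the robust plug-in Glasso: replace the sample covariance by
# a robust scatter estimate (here the MCD) before running the graphical lasso.
rng = np.random.default_rng(0)
p = 5
cov_true = 0.5 * np.eye(p) + 0.5           # equicorrelated truth (diag 1, off-diag 0.5)
X = rng.multivariate_normal(np.zeros(p), cov_true, size=500)
X[:10] += 10.0                              # contaminate with a few gross outliers

cov_mcd = MinCovDet(random_state=0).fit(X).covariance_    # robust plug-in
_, prec_robust = graphical_lasso(cov_mcd, alpha=0.05)     # sparse precision estimate
_, prec_naive = graphical_lasso(np.cov(X, rowvar=False), alpha=0.05)
print("zeroed off-diagonal entries, robust vs. naive:",
      (np.abs(prec_robust) < 1e-8).sum(), (np.abs(prec_naive) < 1e-8).sum())
```

The comparison shows how the outliers distort the sample-covariance-based Glasso while the plug-in version stays close to the clean-data solution.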
We study a patrolling game played on a network $Q$, considered as a metric space. The Attacker chooses a point of $Q$ (not necessarily a node) to attack during a chosen time interval of fixed duration. The Patroller chooses a unit speed path on $Q$ and intercepts the attack (and wins) if she visits the attacked point during the attack time interval. This zero-sum game models the problem of protecting roads or pipelines from an adversarial attack. The payoff to the maximizing Patroller is the probability that the attack is intercepted. Our results include the following: (i) a solution to the game for any network $Q$, as long as the time required to carry out the attack is sufficiently short, (ii) a solution to the game for all tree networks that satisfy a certain condition on their extremities, and (iii) a solution to the game for any attack duration for stars with one long arc and the remaining arcs equal in length. We present a conjecture on the solution of the game for arbitrary trees and establish it in certain cases.
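As a toy illustration of the payoff structure (our own sketch; it does not implement the paper's optimal strategies), the following Monte Carlo estimates the interception probability on a unit-interval network when the Patroller simply oscillates at unit speed and the Attacker plays uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(0)

def patroller_position(u):
    """Unit-speed oscillation 0 -> 1 -> 0 with period 2."""
    phase = u % 2.0
    return phase if phase <= 1.0 else 2.0 - phase

t = 0.3                                     # attack duration
trials = 100_000
hits = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0)               # attacked point
    s = rng.uniform(0.0, 2.0)               # attack start time (one period)
    # The patroller's position is continuous, so the set of points visited
    # during [s, s + t] is exactly [min, max] of the position over that window.
    pos = np.array([patroller_position(u) for u in np.linspace(s, s + t, 64)])
    hits += pos.min() <= x <= pos.max()
print(f"estimated interception probability: {hits / trials:.3f}")
```

Replacing the fixed oscillation and the uniform attack with optimized mixed strategies is exactly what the game-theoretic solutions in the paper provide.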
As machine learning becomes prevalent, mitigating any unfairness present in the training data becomes critical. Among the various notions of fairness, this paper focuses on the well-known individual fairness, which states that similar individuals should be treated similarly. While individual fairness can be improved when training a model (in-processing), we contend that fixing the data before model training (pre-processing) is a more fundamental solution. In particular, we show that label flipping is an effective pre-processing technique for improving individual fairness. Our system iFlipper solves the optimization problem of minimally flipping labels given a limit on the number of individual fairness violations, where a violation occurs when two similar examples in the training data have different labels. We first prove that this problem is NP-hard. We then propose an approximate linear programming algorithm and provide theoretical guarantees on how close its result is to the optimal solution in terms of the number of label flips. We also propose techniques for pushing the linear programming solution closer to the optimum without exceeding the violation limit. Experiments on real datasets show that iFlipper significantly outperforms other pre-processing baselines in terms of individual fairness and accuracy on unseen test sets. In addition, iFlipper can be combined with in-processing techniques for even better results.
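To make the optimization problem concrete, here is a minimal sketch of its LP relaxation with naive rounding (our own formulation of the problem as stated above; iFlipper's actual algorithm and its approximation guarantees are more sophisticated):

```python
import numpy as np
from scipy.optimize import linprog

# Relax binary labels x_i to [0, 1]; minimize total label change
# sum_i |x_i - y_i| subject to a budget m on the weighted violations
# sum_{(i,j)} w_ij * |x_i - x_j| over similar pairs.
y = np.array([1, 1, 0, 0, 1])                  # original binary labels
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]       # similar pairs
w = np.ones(len(edges))                        # similarity weights
m = 0.5                                        # violation budget
n, k = len(y), len(edges)

# Variable layout: [x_0..x_{n-1}, z_0..z_{n-1}, d_0..d_{k-1}],
# where z_i >= |x_i - y_i| and d_e >= |x_i - x_j|.
c = np.concatenate([np.zeros(n), np.ones(n), np.zeros(k)])  # minimize sum z
A, b = [], []
for i in range(n):
    row = np.zeros(2 * n + k); row[i], row[n + i] = 1, -1
    A.append(row); b.append(y[i])              # x_i - z_i <= y_i
    row = np.zeros(2 * n + k); row[i], row[n + i] = -1, -1
    A.append(row); b.append(-y[i])             # -x_i - z_i <= -y_i
for e, (i, j) in enumerate(edges):
    row = np.zeros(2 * n + k); row[i], row[j], row[2 * n + e] = 1, -1, -1
    A.append(row); b.append(0.0)               # x_i - x_j - d_e <= 0
    row = np.zeros(2 * n + k); row[i], row[j], row[2 * n + e] = -1, 1, -1
    A.append(row); b.append(0.0)               # x_j - x_i - d_e <= 0
row = np.zeros(2 * n + k); row[2 * n:] = w
A.append(row); b.append(m)                     # violation budget

res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
              bounds=[(0, 1)] * n + [(0, None)] * (n + k))
flipped = np.round(res.x[:n])                  # naive rounding back to {0, 1}
print("relaxed labels:", res.x[:n].round(3), "-> flipped labels:", flipped)
```

Naive rounding can in general exceed the budget; the guarantees in the paper come from a more careful rounding and post-processing of the LP solution.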
We investigate models that can generate arbitrary natural language text (e.g., all English sentences) from a bounded, convex, and well-behaved control space. We call them universal vec2text models. Such models would allow making semantic decisions in the vector space (e.g., via reinforcement learning) while the natural language generation is handled by the vec2text model. We propose four desired properties that such vec2text models should possess: universality, diversity, fluency, and semantic structure. We provide quantitative and qualitative methods to assess them. We implement a vec2text model by adding a bottleneck to a 250M-parameter Transformer model and training it with an auto-encoding objective on 400M sentences (10B tokens) extracted from a massive web corpus. We propose a simple data augmentation technique based on round-trip translations and show in extensive experiments that the resulting vec2text model surprisingly leads to vector spaces that fulfill our four desired properties, and that this model strongly outperforms both standard and denoising auto-encoders.
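Schematically, the bottleneck construction looks like the following (a minimal PyTorch sketch under our own assumptions about sizes and pooling; the paper's 250M-parameter model is far larger and its exact architecture may differ):

```python
import torch
import torch.nn as nn

# Minimal sketch of a bottlenecked seq2seq auto-encoder: the encoder output
# is pooled and squeezed through a fixed-size, bounded vector, which is all
# the decoder sees, turning the auto-encoder into a vec2text model.
class BottleneckAutoencoder(nn.Module):
    def __init__(self, vocab=32000, d_model=512, d_bottleneck=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.down = nn.Linear(d_model, d_bottleneck)   # text -> control vector
        self.up = nn.Linear(d_bottleneck, d_model)     # control vector -> decoder memory
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.lm_head = nn.Linear(d_model, vocab)

    def encode(self, tokens):
        h = self.encoder(self.embed(tokens))           # (batch, seq, d_model)
        return torch.tanh(self.down(h.mean(dim=1)))    # bounded control space

    def forward(self, tokens):
        z = self.encode(tokens)                        # (batch, d_bottleneck)
        memory = self.up(z).unsqueeze(1)               # single memory slot
        out = self.decoder(self.embed(tokens), memory) # teacher forcing (causal mask omitted)
        return self.lm_head(out)                       # reconstruction logits

model = BottleneckAutoencoder()
logits = model(torch.randint(0, 32000, (4, 16)))       # 4 sentences of 16 tokens
print(logits.shape)                                    # torch.Size([4, 16, 32000])
```

The `tanh` keeps the control space bounded; convexity and the other desired properties are what the paper's training objective and data augmentation are designed to encourage.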
In this report, we propose pipelines for Goodness of Pronunciation (GoP) computation that solve the out-of-vocabulary (OOV) problem at test time using vocabulary/lexicon expansion techniques. The pipelines use different components of an ASR system to quantify accent and automatically evaluate it as scores. We use the posteriors of an ASR model trained on native English speech, along with phone-level boundaries, to obtain phone-level pronunciation scores. Using this as a baseline pipeline, we implement methods to remove UNK and SPN phonemes from the GoP output by building three pipelines: Online, Offline, and Hybrid, each of which returns the scores while also preventing unknown words in the final output. The Online method operates per utterance, the Offline method pre-incorporates a set of OOV words for a given dataset, and the Hybrid method combines the two ideas to expand the lexicon while also working per utterance. We further provide utilities such as phoneme-to-posterior mappings, GoP scores of each utterance as a vector, and the word boundaries used in the GoP pipeline, for use in future research.
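For reference, the core posterior-based GoP computation has roughly the following shape (a generic sketch of the common definition, averaging the log posterior of the canonical phone over its aligned segment; the report's pipelines build on top of such scores):

```python
import numpy as np

def gop_scores(log_posteriors, boundaries, canonical_phones):
    """Phone-level GoP scores from frame-level ASR posteriors.

    log_posteriors: (frames, phones) frame-level log posteriors from the ASR model.
    boundaries: list of (start_frame, end_frame) per aligned phone segment.
    canonical_phones: index of the expected phone in each segment.
    """
    scores = []
    for (start, end), phone in zip(boundaries, canonical_phones):
        scores.append(log_posteriors[start:end, phone].mean())
    return np.array(scores)

# Example: 3 phone segments of a fake 10-frame, 4-phone posteriorgram.
lp = np.log(np.random.default_rng(0).dirichlet(np.ones(4), size=10))
print(gop_scores(lp, [(0, 4), (4, 7), (7, 10)], [2, 0, 3]))
```

A low (very negative) score flags a segment where the model found the canonical phone unlikely, i.e., a likely mispronunciation.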
Symbolic regression is a nonlinear regression method that is commonly performed by an evolutionary computation method such as genetic programming. Quantifying the uncertainty of regression models is important for the interpretation of models and for decision making. The linear approximation and so-called likelihood profiles are well-known techniques for calculating confidence and prediction intervals for nonlinear regression models. These simple and effective techniques have so far been completely ignored in the genetic programming literature. In this work, we describe the calculation of likelihood profiles in detail and provide illustrative examples with models created by three different symbolic regression algorithms on two different datasets. The examples highlight the importance of likelihood profiles for understanding the limitations of symbolic regression models and for helping the user make an informed post-prediction decision.
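The profiling recipe itself is generic: fix the parameter of interest on a grid, re-optimize the remaining parameters, and invert the likelihood-ratio test. A minimal sketch for a toy exponential model (our own example, not one of the paper's):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Profile-likelihood confidence interval for one parameter of the
# nonlinear model y = a * exp(b * x) with Gaussian errors.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = 2.0 * np.exp(-0.7 * x) + rng.normal(scale=0.05, size=x.size)

def sse(theta):                            # Gaussian errors: min SSE = max likelihood
    a, b = theta
    return np.sum((y - a * np.exp(b * x)) ** 2)

fit = minimize(sse, x0=[1.0, -1.0])
sse_min, (a_hat, b_hat) = fit.fun, fit.x
sigma2 = sse_min / x.size                  # plug-in error variance

def profile(b_fixed):                      # re-optimize a with b held fixed
    return minimize(lambda a: sse([a[0], b_fixed]), x0=[a_hat]).fun

# 95% CI: points where the profiled deviance stays under the chi-square quantile.
grid = np.linspace(b_hat - 0.2, b_hat + 0.2, 201)
dev = np.array([(profile(b) - sse_min) / sigma2 for b in grid])
inside = grid[dev <= chi2.ppf(0.95, df=1)]
print(f"b_hat = {b_hat:.3f}, 95% profile CI ~ [{inside.min():.3f}, {inside.max():.3f}]")
```

Unlike the linear approximation, the resulting interval may be asymmetric around the estimate, which is exactly the behavior the profiles are meant to reveal.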
Let $L$ be an $n\times n$ array whose top left $r\times r$ subarray is filled with $k$ different symbols, each occurring at most once in each row and at most once in each column. We establish necessary and sufficient conditions that ensure the remaining cells of $L$ can be filled such that each symbol occurs at most once in each row and at most once in each column, $L$ is symmetric with respect to the main diagonal, and each symbol occurs a prescribed number of times in $L$. The case where the prescribed number of occurrences of each symbol is $n$ was solved by Cruse (J. Combin. Theory Ser. A 16 (1974), 18-22), and the case where the top left subarray is $r\times n$ and symmetry is not required was settled by Goldwasser et al. (J. Combin. Theory Ser. A 130 (2015), 26-41). Our result allows the entries of the main diagonal to be specified as well, which leads to an extension of the Andersen-Hoffman theorem (Annals of Disc. Math. 15 (1982), 9-26; European J. Combin. 4 (1983), 33-35).
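A small illustration (a toy example of ours, not from the paper): with $n = 3$, $r = 2$, and each of the symbols $1, 2, 3$ prescribed to occur exactly $n = 3$ times, the filled top-left subarray on the left extends to the symmetric array on the right, with each symbol occurring at most once in each row and column:
\[
\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \\ 3 & 1 & 2 \end{pmatrix}.
\]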
When is heterogeneity in the composition of an autonomous robotic team beneficial, and when is it detrimental? We investigate and answer this question in the context of a minimally viable model that examines the role of heterogeneous speeds in perimeter defense problems, where defenders share a total allocated speed budget. We consider two distinct problem settings and develop strategies based on dynamic programming and on local interaction rules. We present a theoretical analysis of both approaches and extensively validate our results in simulation. Interestingly, our results demonstrate that the viability of heterogeneous teams depends on the amount of information available to the defenders. Moreover, our results suggest a universality property: across a wide range of problem parameters, the optimal ratio of the speeds of the defenders remains nearly constant.