As a signal recovery algorithm, compressed sensing is particularly useful when the data has low-complexity and samples are rare, which matches perfectly with the task of quantum phase estimation (QPE). In this work we present a new Heisenberg-limited QPE algorithm for early quantum computers based on compressed sensing. More specifically, given many copies of a proper initial state and queries to some unitary operators, our algorithm is able to recover the frequency with a total runtime $\mathcal{O}(\epsilon^{-1}\text{poly}\log(\epsilon^{-1}))$, where $\epsilon$ is the accuracy. Moreover, the maximal runtime satisfies $T_{\max}\epsilon \ll \pi$, which is comparable to the state of art algorithms, and our algorithm is also robust against certain amount of noise from sampling. We also consider the more general quantum eigenvalue estimation problem (QEEP) and show numerically that the off-grid compressed sensing can be a strong candidate for solving the QEEP.
Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance, is costly. The present work explores the need for evolution capabilities within pipeline frameworks. In this context dealing with evolution is defined as a two-step process consisting of self-awareness and self-adaption. Furthermore, a conceptual requirements model is provided, which encompasses criteria for self-awareness and self-adaption as well as covering the dimensions data, operator, pipeline and environment. A lack of said capabilities in existing frameworks exposes a major gap. Filling this gap will be a significant contribution for practitioners and scientists alike. The present work envisions and lays the foundation for a framework which can handle evolutionary change.
Instruments can be used to identify causal effects in the presence of unobserved confounding, under the famous relevance and exogeneity (unconfoundedness and exclusion) assumptions. As exogeneity is difficult to justify and to some degree untestable, it often invites criticism in applications. Hoping to alleviate this problem, we propose a novel identification approach, which relaxes traditional IV exogeneity to exogeneity conditional on some unobserved common confounders. We assume there exist some relevant proxies for the unobserved common confounders. Unlike typical proxies, our proxies can have a direct effect on the endogenous regressor and the outcome. We provide point identification results with a linearly separable outcome model in the disturbance, and alternatively with strict monotonicity in the first stage. General doubly robust and Neyman orthogonal moments are derived consecutively to enable the straightforward root-n estimation of low-dimensional parameters despite the high-dimensionality of nuisances, themselves non-uniquely defined by Fredholm integral equations. Using this novel method with NLS97 data, we separate ability bias from general selection bias in the economic returns to education problem.
Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.
Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is, "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case, and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at //github.com/thomaskeefe/sigclust.
The mainstream of data-driven abstractive summarization models tends to explore the correlations rather than the causal relationships. Among such correlations, there can be spurious ones which suffer from the language prior learned from the training corpus and therefore undermine the overall effectiveness of the learned model. To tackle this issue, we introduce a Structural Causal Model (SCM) to induce the underlying causal structure of the summarization data. We assume several latent causal factors and non-causal factors, representing the content and style of the document and summary. Theoretically, we prove that the latent factors in our SCM can be identified by fitting the observed training data under certain conditions. On the basis of this, we propose a Causality Inspired Sequence-to-Sequence model (CI-Seq2Seq) to learn the causal representations that can mimic the causal factors, guiding us to pursue causal information for summary generation. The key idea is to reformulate the Variational Auto-encoder (VAE) to fit the joint distribution of the document and summary variables from the training corpus. Experimental results on two widely used text summarization datasets demonstrate the advantages of our approach.
This manuscript portrays optimization as a process. In many practical applications the environment is so complex that it is infeasible to lay out a comprehensive theoretical model and use classical algorithmic theory and mathematical optimization. It is necessary as well as beneficial to take a robust approach, by applying an optimization method that learns as one goes along, learning from experience as more aspects of the problem are observed. This view of optimization as a process has become prominent in varied fields and has led to some spectacular success in modeling and systems that are now part of our daily lives.
The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning and multi-scale supervision. By exploiting noise distribution and boundary artifact surrounding tampered regions, the former aims to learn semantic-agnostic and thus more generalizable features. The latter allows us to learn from authentic images which are nontrivial to be taken into account by current semantic segmentation network based methods. Our thoughts are realized by a new network which we term MVSS-Net. Extensive experiments on five benchmark sets justify the viability of MVSS-Net for both pixel-level and image-level manipulation detection.
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
We investigate a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word sequence information. Compared with word-based methods, lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow our model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.
Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but how to produce them with high perceptual quality and more efficiently requires more research efforts. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. For AdvGAN, once the generator is trained, it can generate adversarial perturbations efficiently for any instance, so as to potentially accelerate adversarial training as defenses. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models have high attack success rate under state-of-the-art defenses compared to other attacks. Our attack has placed the first with 92.76% accuracy on a public MNIST black-box attack challenge.