A Transformer-based deep direct sampling method is proposed for a class of boundary value inverse problems. A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and the reconstructed images. An effort is made to give a specific example to a fundamental question: whether and how one can benefit from the theoretical structure of a mathematical problem to develop task-oriented and structure-conforming deep neural networks? Specifically, inspired by direct sampling methods for inverse problems, the 1D boundary data in different frequencies are preprocessed by a partial differential equation-based feature map to yield 2D harmonic extensions as different input channels. Then, by introducing learnable non-local kernels, the direct sampling is recast to a modified attention mechanism. The proposed method is then applied to electrical impedance tomography, a well-known severely ill-posed nonlinear inverse problem. The new method achieves superior accuracy over its predecessors and contemporary operator learners, as well as shows robustness with respect to noise. This research shall strengthen the insights that the attention mechanism, despite being invented for natural language processing tasks, offers great flexibility to be modified in conformity with the a priori mathematical knowledge, which ultimately leads to the design of more physics-compatible neural architectures.
Molecular conformation generation (MCG) is a fundamental and important problem in drug discovery. Many traditional methods have been developed to solve the MCG problem, such as systematic searching, model-building, random searching, distance geometry, molecular dynamics, Monte Carlo methods, etc. However, they have some limitations depending on the molecular structures. Recently, there are plenty of deep learning based MCG methods, which claim they largely outperform the traditional methods. However, to our surprise, we design a simple and cheap algorithm (parameter-free) based on the traditional methods and find it is comparable to or even outperforms deep learning based MCG methods in the widely used GEOM-QM9 and GEOM-Drugs benchmarks. In particular, our design algorithm is simply the clustering of the RDKIT-generated conformations. We hope our findings can help the community to revise the deep learning methods for MCG. The code of the proposed algorithm could be found at //gist.github.com/ZhouGengmo/5b565f51adafcd911c0bc115b2ef027c.
Mixed-precision quantization (MPQ) suffers from time-consuming policy search process (i.e., the bit-width assignment for each layer) on large-scale datasets (e.g., ISLVRC-2012), which heavily limits its practicability in real-world deployment scenarios. In this paper, we propose to search the effective MPQ policy by using a small proxy dataset for the model trained on a large-scale one. It breaks the routine that requires a consistent dataset at model training and MPQ policy search time, which can improve the MPQ searching efficiency significantly. However, the discrepant data distributions bring difficulties in searching for such a transferable MPQ policy. Motivated by the observation that quantization narrows the class margin and blurs the decision boundary, we search the policy that guarantees a general and dataset-independent property: discriminability of feature representations. Namely, we seek the policy that can robustly keep the intra-class compactness and inter-class separation. Our method offers several advantages, i.e., high proxy data utilization, no extra hyper-parameter tuning for approximating the relationship between full-precision and quantized model and high searching efficiency. We search high-quality MPQ policies with the proxy dataset that has only 4% of the data scale compared to the large-scale target dataset, achieving the same accuracy as searching directly on the latter, and improving the MPQ searching efficiency by up to 300 times.
Collecting large-scale datasets is crucial for training deep models, annotating the data, however, inevitably yields noisy labels, which poses challenges to deep learning algorithms. Previous efforts tend to mitigate this problem via identifying and removing noisy samples or correcting their labels according to the statistical properties (e.g., loss values) among training samples. In this paper, we aim to tackle this problem from a new perspective, delving into the deep feature maps, we empirically find that models trained with clean and mislabeled samples manifest distinguishable activation feature distributions. From this observation, a novel robust training approach termed adversarial noisy masking is proposed. The idea is to regularize deep features with a label quality guided masking scheme, which adaptively modulates the input data and label simultaneously, preventing the model to overfit noisy samples. Further, an auxiliary task is designed to reconstruct input data, it naturally provides noise-free self-supervised signals to reinforce the generalization ability of deep models. The proposed method is simple and flexible, it is tested on both synthetic and real-world noisy datasets, where significant improvements are achieved over previous state-of-the-art methods.
We consider the problem of sequentially optimizing a time-varying objective function using time-varying Bayesian optimization (TVBO). Here, the key challenge is the exploration-exploitation trade-off under time variations. Current approaches to TVBO require prior knowledge of a constant rate of change. However, the rate of change is usually neither known nor constant. We propose an event-triggered algorithm, ET-GP-UCB, that treats the optimization problem as static until it detects changes in the objective function online and then resets the dataset. This allows the algorithm to adapt to realized temporal changes without the need for prior knowledge. The event-trigger is based on probabilistic uniform error bounds used in Gaussian process regression. We provide regret bounds for ET-GP-UCB and show in numerical experiments that it is competitive with state-of-the-art algorithms even though it requires no knowledge about the temporal changes. Further, ET-GP-UCB outperforms these baselines if the rate of change is misspecified, and we demonstrate that it is readily applicable to various settings without tuning hyperparameters.
Waveform inversion is concerned with estimating a heterogeneous medium, modeled by variable coefficients of wave equations, using sources that emit probing signals and receivers that record the generated waves. It is an old and intensively studied inverse problem with a wide range of applications, but the existing inversion methodologies are still far from satisfactory. The typical mathematical formulation is a nonlinear least squares data fit optimization and the difficulty stems from the non-convexity of the objective function that displays numerous local minima at which local optimization approaches stagnate. This pathological behavior has at least three unavoidable causes: (1) The mapping from the unknown coefficients to the wave field is nonlinear and complicated. (2) The sources and receivers typically lie on a single side of the medium, so only backscattered waves are measured. (3) The probing signals are band limited and with high frequency content. There is a lot of activity in the computational science and engineering communities that seeks to mitigate the difficulty of estimating the medium by data fitting. In this paper we present a different point of view, based on reduced order models (ROMs) of two operators that control the wave propagation. The ROMs are called data driven because they are computed directly from the measurements, without any knowledge of the wave field inside the inaccessible medium. This computation is non-iterative and uses standard numerical linear algebra methods. The resulting ROMs capture features of the physics of wave propagation in a complementary way and have surprisingly good approximation properties that facilitate waveform inversion.
Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.
We devise and analyze $C^0$-conforming hybrid high-order (HHO) methods to approximate biharmonic problems with either clamped or simply supported boundary conditions. $C^0$-conforming HHO methods hinge on cell unknowns which are $C^0$-conforming polynomials of order $(k+2)$ approximating the solution in the mesh cells and on face unknowns which are polynomials of order $k\ge0$ approximating the normal derivative of the solution on the mesh skeleton. Such methods deliver $O(h^{k+1})$ $H^2$-error estimates for smooth solutions. An important novelty in the error analysis is to lower the minimal regularity requirement on the exact solution. The technique to achieve this has broader applicability than just $C^0$-conforming HHO methods, and to illustrate this point, we outline the error analysis for the well-known $C^0$-conforming interior penalty discontinuous Galerkin (IPDG) methods as well. The present technique does not require bubble functions or a $C^1$-smoother to evaluate the right-hand side in case of rough loads. Finally, numerical results including comparisons to various existing methods showcase the efficiency of the proposed $C^0$-conforming HHO methods.
Inductive programming frequently relies on some form of search in order to identify candidate solutions. However, the size of the search space limits the use of inductive programming to the production of relatively small programs. If we could somehow correctly predict the subset of instructions required for a given problem then inductive programming would be more tractable. We will show that this can be achieved in a high percentage of cases. This paper presents a novel model of programming language instruction co-occurrence that was built to support search space partitioning in the Zoea distributed inductive programming system. This consists of a collection of intersecting instruction subsets derived from a large sample of open source code. Using the approach different parts of the search space can be explored in parallel. The number of subsets required does not grow linearly with the quantity of code used to produce them and a manageable number of subsets is sufficient to cover a high percentage of unseen code. This approach also significantly reduces the overall size of the search space - often by many orders of magnitude.
Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al., (2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.
Classic algorithms and machine learning systems like neural networks are both abundant in everyday life. While classic computer science algorithms are suitable for precise execution of exactly defined tasks such as finding the shortest path in a large graph, neural networks allow learning from data to predict the most likely answer in more complex tasks such as image classification, which cannot be reduced to an exact algorithm. To get the best of both worlds, this thesis explores combining both concepts leading to more robust, better performing, more interpretable, more computationally efficient, and more data efficient architectures. The thesis formalizes the idea of algorithmic supervision, which allows a neural network to learn from or in conjunction with an algorithm. When integrating an algorithm into a neural architecture, it is important that the algorithm is differentiable such that the architecture can be trained end-to-end and gradients can be propagated back through the algorithm in a meaningful way. To make algorithms differentiable, this thesis proposes a general method for continuously relaxing algorithms by perturbing variables and approximating the expectation value in closed form, i.e., without sampling. In addition, this thesis proposes differentiable algorithms, such as differentiable sorting networks, differentiable renderers, and differentiable logic gate networks. Finally, this thesis presents alternative training strategies for learning with algorithms.