The compositionality and local reasoning properties of separation logic have led to significant advances in scalable static analysis. But new requirements for program analysis have emerged -- many programs exhibit computational effects (such as randomization) and, orthogonally, static analysis for incorrectness has proved to be very effective. We present Outcome Separation Logic (OSL), the first variant of separation logic that is sound for both correctness and incorrectness reasoning in the presence of varying computational effects. OSL has a frame rule that resembles that of standard Separation Logic; however, we make different underlying assumptions in order to lift the restrictions imposed by SL that preclude reasoning about incorrectness and effects. Building on this fundamental theory, we also define symbolic execution algorithms that use bi-abduction to derive specifications for programs with effects. This involves a new tri-abduction procedure to analyze programs whose execution branches due to effects such as nondeterministic or probabilistic choice. This work furthers the compositionality promised by separation logic by opening up the possibility of greater reuse of analysis tools across two dimensions: bug-finding and verification, across programs with varying effects.
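For reference, the frame rule of standard Separation Logic, which the OSL rule is said to resemble, reads as follows (with the usual side condition that the command $C$ does not modify any free variable of the frame $R$):

\[
\frac{\{P\}\; C\; \{Q\}}{\{P \ast R\}\; C\; \{Q \ast R\}}
\]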
Latent variable models are powerful tools for modeling complex phenomena, in particular those involving partially observed data, unobserved variables, or underlying complex unknown structures. Inference is often difficult due to the latent structure of the model. To deal with parameter estimation in the presence of latent variables, well-known efficient methods exist, such as gradient-based and EM-type algorithms, but they come with practical and theoretical limitations. In this paper, we propose, as an alternative, an efficient preconditioned stochastic gradient algorithm for parameter estimation. Our method includes a preconditioning step based on a positive definite estimate of the Fisher information matrix. We prove convergence results for the proposed algorithm under mild assumptions for very general latent variable models. We illustrate through relevant simulations the performance of the proposed methodology in a nonlinear mixed effects model and in a stochastic block model.
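To make the preconditioning step concrete, here is a minimal sketch of a Fisher-preconditioned stochastic gradient update, assuming a user-supplied per-sample score function; the empirical Fisher estimate, the ridge term, and the fixed step size are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def preconditioned_sgd_step(theta, scores, lr, ridge=1e-4):
    """One Fisher-preconditioned stochastic gradient ascent step on the log-likelihood.

    theta  : current parameter vector, shape (d,)
    scores : per-sample score vectors grad log p(x_i; theta), shape (m, d)
    """
    grad = scores.mean(axis=0)                  # stochastic gradient estimate
    fisher = scores.T @ scores / len(scores)    # empirical Fisher information estimate
    fisher += ridge * np.eye(len(theta))        # keep the preconditioner positive definite
    return theta + lr * np.linalg.solve(fisher, grad)
```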
The optimal branch number of MDS matrices makes them a preferred choice for designing diffusion layers in many block ciphers and hash functions. However, in lightweight cryptography, Near-MDS (NMDS) matrices with sub-optimal branch numbers offer a better balance between security and efficiency as a diffusion layer, compared to MDS matrices. In this paper, we study NMDS matrices, exploring their construction in both recursive and nonrecursive settings. We provide several theoretical results and explore the hardware efficiency of the construction of NMDS matrices. Additionally, we compare the results for NMDS and MDS matrices whenever possible. For the recursive approach, we study DLS matrices and provide some theoretical results on their use; some of these results are used to restrict the search space of DLS matrices. We also show that, over a field of characteristic 2, any sparse matrix of order $n\geq 4$ with a fixed XOR value of 1 cannot be NMDS when raised to a power $k\leq n$. Following that, we use generalized DLS (GDLS) matrices to provide some lightweight recursive NMDS matrices of several orders that perform better than existing matrices in terms of hardware cost or the number of iterations. For the nonrecursive construction of NMDS matrices, we study various structures, such as circulant and left-circulant matrices, and their generalizations, Toeplitz and Hankel matrices. In addition, we prove that Toeplitz matrices of order $n>4$ cannot be simultaneously NMDS and involutory over a field of characteristic 2. Finally, we use GDLS matrices to provide some lightweight NMDS matrices that can be computed in one clock cycle. The proposed nonrecursive NMDS matrices of orders 4, 5, 6, 7, and 8 can be implemented with 24, 50, 65, 96, and 108 XORs over $\mathbb{F}_{2^4}$, respectively.
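As background for the branch-number terminology: an $n \times n$ matrix is MDS exactly when its differential branch number attains the optimum $n+1$, while a branch number of $n$ is the defining condition on the NMDS side. A brute-force check over $\mathbb{F}_{2^4}$ can be sketched as follows, assuming matrix entries are given as integers 0-15; the reduction polynomial and exhaustive search are illustrative, and a full NMDS verification would additionally require checking the linear branch number:

```python
import itertools

POLY = 0b10011  # x^4 + x + 1, a common reduction polynomial for GF(2^4)

def gf16_mul(a, b):
    """Carry-less multiplication in GF(2^4), reducing by POLY after each shift."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= POLY
        b >>= 1
    return r

def branch_number(M):
    """Differential branch number: min over nonzero x of wt(x) + wt(Mx)."""
    n = len(M)
    best = n + 1  # Singleton-style upper bound, attained exactly by MDS matrices
    for x in itertools.product(range(16), repeat=n):
        if not any(x):
            continue
        y = [0] * n
        for i in range(n):
            for j in range(n):
                y[i] ^= gf16_mul(M[i][j], x[j])
        best = min(best, sum(v != 0 for v in x) + sum(v != 0 for v in y))
    return best

# branch_number(M) == n + 1 indicates MDS; == n is the NMDS-side condition.
```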
Business Intelligence (BI) is crucial in modern enterprises and billion-dollar businesses. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power BI) in recent years creates a strong demand for less-technical end-users to build BI-models themselves. We develop an Auto-BI system that can accurately predict BI-models given a set of input tables, using a principled graph-based optimization problem we propose called \textit{k-Min-Cost-Arborescence} (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called \textit{arborescence}. While we prove that k-MCA is intractable and inapproximable in general, we develop novel algorithms that solve k-MCA optimally and are shown to be efficient in practice, with sub-second latency, scaling to the largest BI-models we encounter (with close to 100 tables). Auto-BI is rigorously evaluated on a unique dataset with over 100K real BI-models we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.
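k-MCA is the paper's own formulation, but its classical single-root special case, the minimum-cost arborescence, is a standard primitive. Below is a hypothetical sketch of how predicted join probabilities between tables might feed into it using networkx; the synthetic ROOT node, the 1 - probability edge costs, and the function names are assumptions for illustration, not the paper's algorithm:

```python
import networkx as nx

def predict_bi_model(tables, join_prob):
    """Pick a spanning set of join edges via a minimum-cost arborescence.

    tables    : list of table names
    join_prob : dict mapping directed (u, v) table pairs to a predicted
                join probability in [0, 1]
    """
    G = nx.DiGraph()
    for t in tables:
        G.add_edge("ROOT", t, weight=1.0)   # fallback: a table joins nothing
    for (u, v), p in join_prob.items():
        G.add_edge(u, v, weight=1.0 - p)    # high probability = low cost
    arb = nx.minimum_spanning_arborescence(G)
    return [(u, v) for u, v in arb.edges() if u != "ROOT"]
```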
Integer linear programming (ILP) models a wide range of practical combinatorial optimization problems and has a significant impact in industrial and management sectors. This work proposes new characterizations of ILP based on the concept of boundary solutions. Motivated by these new characterizations, we develop an efficient local search solver, which is the first local search solver for general ILP validated on a large heterogeneous problem dataset. We propose a new local search framework that switches between three modes, namely Search, Improve, and Restore. We design tailored operators adapted to different modes, thus improving the quality of the current solution according to the situation. For the Search and Restore modes, we propose an operator named tight move, which adaptively modifies variables' values, trying to make some constraint tight. For the Improve mode, we propose an efficient operator named lift move to improve the objective value while maintaining feasibility. Putting these together, we develop a local search solver for integer linear programming called Local-ILP. Experiments conducted on the MIPLIB dataset show the effectiveness of our solver in solving large-scale hard integer linear programming problems within a reasonably short time. Local-ILP is competitive and complementary to the state-of-the-art commercial solver Gurobi and significantly outperforms the state-of-the-art non-commercial solver SCIP. Moreover, our solver establishes new records for 6 MIPLIB open instances. We also present a theoretical analysis of our algorithm, showing that it avoids visiting unnecessary regions and maintains good connectivity of targeted solutions.
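A minimal reading of the tight-move idea, for a single linear constraint $a \cdot x \leq b$: pick a variable and set it to the extreme integer value at which the constraint becomes tight (or as tight as integrality allows). This sketch only illustrates that one step, not the paper's adaptive scoring of which variable and constraint to pick:

```python
import math

def tight_move(a, b, x, j):
    """Set x[j] so that the constraint sum_i a[i]*x[i] <= b becomes (nearly) tight.

    a : coefficient list; b : right-hand side; x : current assignment;
    j : index of the variable chosen for the move (a[j] must be nonzero).
    """
    rest = sum(a[i] * x[i] for i in range(len(x)) if i != j)
    if a[j] > 0:
        x[j] = math.floor((b - rest) / a[j])  # largest value keeping a.x <= b
    else:
        x[j] = math.ceil((b - rest) / a[j])   # smallest value keeping a.x <= b
    return x
```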
Motivated by the Acute Respiratory Distress Syndrome Network (ARDSNetwork) ARDS respiratory management (ARMA) trial, we developed a flexible Bayesian machine learning approach to estimate the average causal effect and heterogeneous causal effects in the always-survivors stratum when clinical outcomes are subject to truncation. We adopted Bayesian additive regression trees (BART) to flexibly specify separate models for the potential outcomes and latent strata membership. In the analysis of the ARMA trial, we found that the low tidal volume treatment had an overall benefit on time to returning home for participants sustaining acute lung injuries, but that treatment effects among the always-survivors were substantially heterogeneous, driven most strongly by sex and the alveolar-arterial oxygen gradient at baseline (a physiologic measure of lung function and source of hypoxemia). These findings illustrate how the proposed methodology could guide the prognostic enrichment of future trials in the field. We also demonstrated through a simulation study that our proposed Bayesian machine learning approach outperforms other parametric methods in reducing the estimation bias in both the average causal effect and heterogeneous causal effects for always-survivors.
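The two-model structure (one model for latent strata membership, separate models for each potential outcome) can be sketched as follows. Gradient boosting here is only a stand-in for BART, and the hard classification of always-survivors is a simplified plug-in version of the paper's fully Bayesian treatment of strata membership; the cutoff and monotonicity assumption are illustrative:

```python
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              HistGradientBoostingRegressor)

def always_survivor_effects(X, t, s, y, cutoff=0.9):
    """Plug-in sketch: treatment effects among predicted always-survivors.

    X : covariates (numpy array); t : treatment (0/1); s : survived (0/1);
    y : outcome, observed only where s == 1.
    """
    # Under a monotonicity assumption, surviving under control is the
    # binding condition for membership in the always-survivors stratum.
    surv0 = HistGradientBoostingClassifier().fit(X[t == 0], s[t == 0])
    always = surv0.predict_proba(X)[:, 1] > cutoff
    # Separate potential-outcome models, fit on survivors in each arm.
    m1 = HistGradientBoostingRegressor().fit(X[(t == 1) & (s == 1)],
                                             y[(t == 1) & (s == 1)])
    m0 = HistGradientBoostingRegressor().fit(X[(t == 0) & (s == 1)],
                                             y[(t == 0) & (s == 1)])
    ite = m1.predict(X[always]) - m0.predict(X[always])  # heterogeneous effects
    return ite.mean(), ite                               # average + individual
```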
Tensor train (TT) representation has achieved tremendous success in visual data completion tasks, especially when it is combined with tensor folding. However, folding an image or video tensor breaks the original data structure, leading to local information loss, as nearby pixels may be assigned to different dimensions and become far away from each other. In this paper, to fully preserve the local information of the original visual data, we explore leaving the data tensor unfolded while adopting graph information to regularize the local similarity between nearby entries. To overcome the high computational complexity introduced by the graph-based regularization in the TT completion problem, we propose to break the original problem into multiple sub-problems with respect to each TT core fiber, instead of each TT core as in traditional methods. Furthermore, to avoid heavy parameter tuning, a sparsity-promoting probabilistic model is built based on the generalized inverse Gaussian (GIG) prior, and an inference algorithm is derived under the mean-field approximation. Experiments on both synthetic data and real-world visual data show the superiority of the proposed methods.
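For readers unfamiliar with the TT format the paper builds on: a $d$-way tensor is represented by cores $G_k$ of shape $(r_{k-1}, n_k, r_k)$ with boundary ranks $r_0 = r_d = 1$, and the full tensor is recovered by chaining contractions over the rank dimensions. A minimal numpy sketch of that reconstruction (the completion, fiber-wise splitting, and graph-regularization machinery are the paper's contribution and are not shown):

```python
import numpy as np

def tt_to_full(cores):
    """Contract TT cores G_k of shape (r_{k-1}, n_k, r_k) into the full tensor."""
    full = cores[0]                               # shape (1, n_1, r_1)
    for G in cores[1:]:
        full = np.tensordot(full, G, axes=([-1], [0]))  # contract shared rank index
    return full.squeeze(axis=(0, -1))             # drop boundary ranks r_0 = r_d = 1
```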
We give a simple, direct, and reusable logical relations technique for languages with recursive features and partially defined differentiable functions. We demonstrate it by working out the case of Automatic Differentiation (AD) correctness: namely, we present a correctness proof of a dual-numbers-style AD macro for realistic functional languages in the ML family. We also show how this macro provides us with correct forward- and \textit{reverse-mode} AD. The starting point is to interpret a functional programming language in a suitable freely generated categorical structure. In this setting, by the universal property of the syntactic categorical structure, the dual numbers AD macro and the basic $\omega$-cpo semantics arise as structure-preserving functors. The proof then follows by a novel logical relations argument. The key to much of our contribution is a powerful monadic logical relations technique for term recursion and recursive types. It provides us with a semantic correctness proof based on a simple approach to denotational semantics, making use only of the very basic concrete model of $\omega$-cpos.
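The dual-numbers construction underlying the macro is easy to render concretely: every value is paired with a tangent, and each primitive operation propagates both. The paper works with an ML-family language and a categorical semantics; this minimal Python rendering only illustrates the idea:

```python
from dataclasses import dataclass
import math

@dataclass
class Dual:
    val: float  # primal value
    tan: float  # tangent (derivative component)

    def __add__(self, o):
        return Dual(self.val + o.val, self.tan + o.tan)

    def __mul__(self, o):  # product rule baked into the primitive
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

def sin(x: Dual) -> Dual:  # chain rule baked into the primitive
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

# d/dx [x * sin(x)] at x = 2.0: seed the tangent with 1.0
x = Dual(2.0, 1.0)
print((x * sin(x)).tan)  # sin(2) + 2*cos(2)
```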
A fundamental problem in data management is to find the elements in an array that match a query. Recently, learned indexes have been extensively used to solve this problem, where they learn a model to predict the location of the items in the array. They are empirically shown to outperform non-learned methods (e.g., B-trees or binary search, which answer queries in $O(\log n)$ time) by orders of magnitude. However, the success of learned indexes has not been theoretically justified. The only existing attempt shows the same query time of $O(\log n)$, but with a constant-factor improvement in space complexity over non-learned methods, under some assumptions on the data distribution. In this paper, we significantly strengthen this result, showing that, under mild assumptions on the data distribution, and with the same space complexity as non-learned methods, learned indexes can answer queries in $O(\log\log n)$ expected query time. We also show that, allowing for a slightly larger but still near-linear space overhead, a learned index can achieve $O(1)$ expected query time. Our results theoretically prove that learned indexes can be orders of magnitude faster than non-learned methods, grounding their empirical success in theory.
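A minimal sketch of the learned-index pattern, assuming a single linear model over a sorted array (real learned indexes use hierarchies of models and tuned error bounds; the single model, fixed initial window, and doubling search here are illustrative):

```python
import bisect

def build(keys):
    """Fit one linear model, position = a*key + b, over a sorted array."""
    a = (len(keys) - 1) / (keys[-1] - keys[0])
    return a, -a * keys[0]

def lookup(keys, model, q, eps=32):
    a, b = model
    n = len(keys)
    guess = min(max(int(a * q + b), 0), n - 1)       # clamp model prediction
    lo, hi = max(0, guess - eps), min(n, guess + eps)
    # Exponentially widen the window until q is bracketed, then binary search.
    while lo > 0 and keys[lo] > q:
        eps *= 2
        lo = max(0, lo - eps)
    while hi < n and keys[hi - 1] < q:
        eps *= 2
        hi = min(n, hi + eps)
    i = bisect.bisect_left(keys, q, lo, hi)
    return i if i < n and keys[i] == q else -1       # index of q, or -1 if absent
```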
This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence the estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the classical and the proposed estimators in their power to estimate the causal quantities. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.
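The paper does not spell out its estimator in the abstract, so as context only, here is the textbook augmented IPW (doubly robust) estimator, the standard way to correct a naive outcome comparison for confounding; it is not the paper's construction:

```python
import numpy as np

def aipw_ate(y, t, ps, mu1, mu0):
    """Augmented IPW estimate of the average treatment effect.

    y        : observed outcomes (e.g., repayment)
    t        : binary treatment (e.g., credit decision), 0/1
    ps       : estimated propensity scores P(t=1 | covariates)
    mu1, mu0 : outcome-model predictions under treatment / control
    """
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / ps              # reweighted treated residuals
                   - (1 - t) * (y - mu0) / (1 - ps)) # reweighted control residuals
```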
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph structure is available. In practice, however, real-world graphs are often noisy and incomplete, or might not be available at all. In this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution over the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
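To show the shape of "a discrete probability distribution over edges feeding a GCN": the sketch below parameterizes per-edge Bernoulli probabilities, samples a discrete graph, and propagates through one GCN layer. It is a simplified single-level surrogate with a straight-through gradient, not the paper's bilevel formulation; all names and sizes are illustrative:

```python
import torch

n, d, h = 50, 16, 32
theta = torch.zeros(n, n, requires_grad=True)  # edge logits, learned jointly
W = torch.randn(d, h, requires_grad=True)      # GCN layer weights
X = torch.randn(n, d)                          # node features

def gcn_layer(X, theta, W):
    p = torch.sigmoid(theta)                   # edge probabilities
    A_hard = torch.bernoulli(p.detach())       # sample a discrete graph
    A = A_hard + p - p.detach()                # straight-through gradient to theta
    A = A + torch.eye(n)                       # add self-loops
    deg = A.sum(dim=1).clamp(min=1e-6)
    D = torch.diag(deg.pow(-0.5))              # symmetric normalization
    return torch.relu(D @ A @ D @ X @ W)
```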