Guided by the principles of differential privacy, the Australian Bureau of Statistics modifies the data summaries from the Australian Census provided through TableBuilder to researchers at approved institutions. This modification algorithm injects a small degree of artificial noise into every nonzero cell count and then suppresses very small cell counts to zero. Researchers working with small-area TableBuilder outputs with a high suppression fraction have proposed various algorithmic solutions for reconciling these with less suppressed outputs from larger enclosing areas. Here we propose that a Bayesian, likelihood-based statistical approach in which the perturbation algorithm itself is explicitly represented is well suited to analyses with such randomly perturbed data. Using both real (TableBuilder) and mock datasets representing dwelling classifications in the Perth Greater Capital City Area, we demonstrate the feasibility and utility of multi-scale Bayesian reconstruction of modified cell counts in a spatial setting.
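To make the perturbation concrete, here is a minimal Python sketch of a TableBuilder-style modification step; the integer noise distribution and suppression threshold below are illustrative assumptions, not the ABS's actual (confidential) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_counts(counts, threshold=3):
    """Add small integer noise to every nonzero cell count, then
    suppress very small counts to zero (illustrative parameters only)."""
    counts = np.asarray(counts)
    noisy = counts.copy()
    nz = counts > 0
    noisy[nz] += rng.integers(-2, 3, size=nz.sum())  # assumed noise in {-2,...,2}
    noisy = np.maximum(noisy, 0)        # counts cannot be negative
    noisy[noisy < threshold] = 0        # assumed suppression threshold
    return noisy

print(perturb_counts([0, 1, 2, 12, 47]))  # e.g. [0 0 0 13 46]
```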
In this work, we develop first-order (Hessian-free) and zero-order (derivative-free) implementations of the cubically regularized Newton method for solving general non-convex optimization problems. For that, we employ finite-difference approximations of the derivatives. We use a special adaptive search procedure in our algorithms, which simultaneously fits both the regularization constant and the parameters of the finite-difference approximations. This makes our schemes free from the need to know the actual Lipschitz constants. Additionally, we equip our algorithms with a lazy Hessian update that reuses a previously computed Hessian approximation matrix for several iterations. Specifically, we prove a global complexity bound of $\mathcal{O}( n^{1/2} \epsilon^{-3/2})$ function and gradient evaluations for our new Hessian-free method, and a bound of $\mathcal{O}( n^{3/2} \epsilon^{-3/2} )$ function evaluations for the derivative-free method, where $n$ is the dimension of the problem and $\epsilon$ is the desired accuracy for the gradient norm. These complexity bounds significantly improve the previously known ones in terms of the joint dependence on $n$ and $\epsilon$ for first-order and zero-order non-convex optimization.
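For reference, the forward finite-difference gradient used by such Hessian-free schemes costs $n$ extra function evaluations per gradient; the fixed step size below is an illustrative simplification of the paper's adaptive search, which fits it jointly with the regularization constant:

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference approximation of the gradient of f at x,
    using n additional function evaluations (h fixed for illustration)."""
    n = x.size
    fx = f(x)
    g = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

# Example: gradient of f(x) = ||x||^2 at x = (1, 2) is approximately (2, 4).
print(fd_gradient(lambda x: x @ x, np.array([1.0, 2.0])))
```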
Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can pose a significant challenge in estimating correlation coefficients. In this paper, we compare the effects of various missing-data methods on the correlation plot, focusing on two common missing patterns: random and monotone. We aim to provide practical strategies and recommendations for researchers and practitioners in creating and analyzing the correlation plot. Our experimental results suggest that while imputation is commonly used for missing data, using imputed data for plotting the correlation matrix may lead to significantly misleading inferences about the relationships between the features. We recommend using DPER, a direct parameter estimation approach, for plotting the correlation matrix, based on its performance in the experiments.
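As a point of comparison for the imputation-based plots studied in the paper, pandas computes correlations from pairwise-complete observations by default, so a correlation matrix can be visualized without filling in missing values (this baseline is not DPER itself):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("ABCDE"))
df.iloc[:50, 0] = np.nan  # inject a monotone-style block of missing values

corr = df.corr()  # pairwise-complete observations, no imputation
plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.show()
```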
This work considers Bayesian experimental design for the inverse boundary value problem of linear elasticity in a two-dimensional setting. The aim is to optimize the positions of compactly supported pressure activations on the boundary of the examined body in order to maximize the value of the resulting boundary deformations as data for the inverse problem of reconstructing the Lam\'e parameters inside the object. We resort to a linearized measurement model and adopt the framework of Bayesian experimental design, under the assumption that the prior and measurement noise distributions are mutually independent Gaussians. This enables the use of the standard Bayesian A-optimality criterion for deducing optimal positions for the pressure activations. The (second) derivatives of the boundary measurements with respect to the Lam\'e parameters and the positions of the boundary pressure activations are derived to enable minimization of the corresponding objective function, i.e., the trace of the covariance matrix of the posterior distribution, by a gradient-based optimization algorithm. Two-dimensional numerical experiments demonstrate the functionality of our approach.
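Under these Gaussian assumptions, the posterior covariance and the A-optimality objective take the familiar closed form (the notation is ours: $p$ collects the activation positions and $J(p)$ is the Jacobian of the linearized measurement map with respect to the Lam\'e parameters):
\[
\Gamma_{\mathrm{post}}(p) = \big( J(p)^\top \Gamma_{\mathrm{noise}}^{-1} J(p) + \Gamma_{\mathrm{pr}}^{-1} \big)^{-1},
\qquad
\Phi_{\mathrm{A}}(p) = \operatorname{tr}\big( \Gamma_{\mathrm{post}}(p) \big) \to \min_{p},
\]
so that gradient-based minimization of $\Phi_{\mathrm{A}}$ requires exactly the derivatives of the measurements with respect to both the parameters and the positions.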
We propose a novel algorithm for solving the composite Federated Learning (FL) problem. This algorithm manages non-smooth regularization by strategically decoupling the proximal operator from communication, and addresses client drift without any assumptions about data similarity. Moreover, each worker uses local updates to reduce the communication frequency with the server and transmits only a $d$-dimensional vector per communication round. We prove that our algorithm converges linearly to a neighborhood of the optimal solution and demonstrate its superiority over state-of-the-art methods in numerical experiments.
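For intuition, when the non-smooth regularizer is $\lambda \lVert x \rVert_1$, the proximal operator being decoupled from communication is soft-thresholding; the sketch below illustrates the operator itself, not the paper's exact update rule:

```python
import numpy as np

def prox_l1(v, step, lam):
    """Proximal operator of lam * ||x||_1 with step size `step`
    (soft-thresholding), applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

# Example: small coordinates are zeroed, large ones shrunk toward zero.
print(prox_l1(np.array([0.05, -0.5, 2.0]), step=1.0, lam=0.1))  # [0. -0.4 1.9]
```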
Binary responses arise in a multitude of statistical problems, including binary classification, bioassay, current status data problems and sensitivity estimation. There has been an interest in such problems in the Bayesian nonparametrics community since the early 1970s, but inference given binary data is intractable for a wide range of modern simulation-based models, even when employing MCMC methods. Recently, Christensen (2023) introduced a novel simulation technique based on counting permutations, which can estimate both posterior distributions and marginal likelihoods for any model from which a random sample can be generated. However, the accompanying implementation of this technique struggles when the sample size is too large ($n > 250$). Here we present perms, a new implementation of said technique which is substantially faster and able to handle larger data problems than the original implementation. It is available both as an R package and a Python library. The basic usage of perms is illustrated via two simple examples: a tractable toy problem and a bioassay problem. A more complex example involving changepoint analysis is also considered. We also cover the details of the implementation and illustrate the computational speed gain of perms via a simple simulation study.
With the emergence of Machine Learning, there has been a surge in leveraging its capabilities for problem-solving across various domains. In the code clone realm, the identification of type-4 or semantic clones has emerged as a crucial yet challenging task. Researchers aim to utilize Machine Learning to tackle this challenge, often relying on the BigCloneBench dataset. However, BigCloneBench, originally not designed for semantic clone detection, presents several limitations that hinder its suitability as a comprehensive training dataset for this specific purpose. Furthermore, the CLCDSA dataset suffers from a lack of reusable examples aligning with real-world software systems, rendering it inadequate for cross-language clone detection approaches. In this work, we present a comprehensive semantic clone and cross-language clone benchmark, GPTCloneBench, by exploiting SemanticCloneBench and OpenAI's GPT-3 model. In particular, using code fragments from SemanticCloneBench as sample inputs, along with appropriate prompt engineering for the GPT-3 model, we generate semantic and cross-language clones for these specific fragments, and then conduct a combination of extensive manual analysis, tool-assisted filtering, functionality testing, and automated validation in building the benchmark. From 79,928 clone pairs of GPT-3 output, we created a benchmark with 37,149 true semantic clone pairs, 19,288 false semantic pairs (Type-1/Type-2), and 20,770 cross-language clones across four languages (Java, C, C#, and Python). Our benchmark is 15-fold larger than SemanticCloneBench, has more functional code examples for software systems and programming language support than CLCDSA, and overcomes BigCloneBench's limitations in quality, quantity, and language variety.
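The generation step can be pictured with the legacy OpenAI completions API as follows; the prompt wording, model choice, and file name are illustrative placeholders rather than the exact pipeline behind GPTCloneBench, and every candidate still has to pass the manual, tool-assisted, and automated validation described above:

```python
import openai  # legacy (pre-1.0) OpenAI Python client

fragment = open("fragment.java").read()  # a SemanticCloneBench code fragment
prompt = (
    "Write a semantically equivalent Java implementation of the following "
    "function, using different syntax and identifiers:\n\n" + fragment
)
resp = openai.Completion.create(
    model="text-davinci-003",  # an illustrative GPT-3 completion model
    prompt=prompt,
    max_tokens=512,
)
candidate = resp["choices"][0]["text"]  # candidate clone, pending validation
```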
We provide a new sequent calculus that enjoys syntactic cut-elimination and strongly terminating backward proof search for the intuitionistic Strong L\"ob logic $\sf{iSL}$, an intuitionistic modal logic with a provability interpretation. A novel measure on sequents is used to prove both the termination of the naive backward proof search strategy, and the admissibility of cut in a syntactic and direct way, leading to a straightforward cut-elimination procedure. All proofs have been formalised in the interactive theorem prover Coq.
Graph Neural Networks (GNNs) are becoming increasingly popular due to their superior performance in critical graph-related tasks. While quantization is widely used to accelerate GNN computation, quantized training faces unprecedented challenges. Current quantized GNN training systems often have longer training times than their full-precision counterparts for two reasons: (i) addressing the accuracy challenge leads to excessive overhead, and (ii) the optimization potential exposed by quantization is not adequately leveraged. This paper introduces Tango, which rethinks quantization challenges and opportunities for graph neural network training on GPUs, with three contributions: First, we introduce efficient rules to maintain accuracy during quantized GNN training. Second, we design and implement quantization-aware primitives and inter-primitive optimizations that can speed up GNN training. Finally, we integrate Tango with the popular Deep Graph Library (DGL) system and demonstrate its superior performance over state-of-the-art approaches on various GNN models and datasets.
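As a flavor of the primitives involved, symmetric per-tensor int8 quantization and dequantization of a tensor can be sketched in a few lines of PyTorch; Tango's actual accuracy-preserving rules and fused kernels are considerably more involved:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]
    via a single scale factor derived from the tensor's max magnitude."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate float tensor from its int8 representation."""
    return q.to(torch.float32) * scale

x = torch.randn(4, 4)
q, s = quantize_int8(x)
print((x - dequantize_int8(q, s)).abs().max())  # small quantization error
```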
Hawkes processes are often applied to model dependence and interaction phenomena in multivariate event data sets, such as neuronal spike trains, social interactions, and financial transactions. In the nonparametric setting, learning the temporal dependence structure of Hawkes processes is generally a computationally expensive task, all the more so with Bayesian estimation methods. In particular, for generalised nonlinear Hawkes processes, Markov chain Monte Carlo methods applied to compute the doubly intractable posterior distribution are not scalable to high-dimensional processes in practice. Recently, efficient algorithms targeting a mean-field variational approximation of the posterior distribution have been proposed. In this work, we first unify existing variational Bayes approaches under a general nonparametric inference framework, and analyse the asymptotic properties of these methods under easily verifiable conditions on the prior, the variational class, and the nonlinear model. Second, we propose a novel sparsity-inducing procedure, and derive an adaptive mean-field variational algorithm for the popular sigmoid Hawkes processes. Our algorithm is parallelisable and therefore computationally efficient in high-dimensional settings. Through an extensive set of numerical simulations, we also demonstrate that our procedure is able to adapt to the dimensionality of the parameter of the Hawkes process, and is partially robust to some types of model mis-specification.
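For concreteness, the sigmoid model above is a special case of the generalised nonlinear Hawkes process, whose conditional intensity functions have the form (our notation)
\[
\lambda_k(t) = \phi_k\Big( \nu_k + \sum_{l=1}^{K} \int_0^{t^-} h_{lk}(t-s)\, \mathrm{d}N_l(s) \Big), \qquad k = 1, \dots, K,
\]
where $\nu_k$ is a background rate, $h_{lk}$ encodes the influence of events of type $l$ on type $k$, and the sigmoid case corresponds to the link $\phi_k(x) = \Lambda_k / (1 + e^{-x})$ for some scale $\Lambda_k > 0$.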
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, and endeavours to extend this knowledge without targeting the original task result in catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task-incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern 1) a taxonomy and extensive overview of the state of the art, 2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner, and 3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks: Tiny ImageNet, the large-scale unbalanced iNaturalist dataset, and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time, and storage.