Motivation: A Genomic Dictionary, i.e., the set of the k-mers appearing in a genome, is a fundamental source of genomic information: its collection is the first step in strategic computational methods ranging from assembly to sequence comparison and phylogeny. Unfortunately, it is costly to store. This motivates some recent studies regarding the compression of those k-mer sets. However, such an area does not have the maturity of genomic compression, lacking a homogeneous and methodologically sound experimental foundation that allows a fair comparison of the relative merits of the available solutions, and that also takes into account the rich choice of compression methods that can be used. Results: We provide such a foundation here, supporting it with an extensive set of experiments that use reference datasets and a carefully selected set of representative data compressors. Our results highlight the spectrum of compressor choices one has in terms of Pareto Optimality of compression vs. post-processing, the latter being important when the Dictionary needs to be decompressed many times. In addition to the useful indications, not available elsewhere, that this study offers to the researchers interested in storing k-mer dictionaries in compressed form, a software system that can be readily used to explore the Pareto Optimal solutions available for a given Dictionary is also provided. Availability: The software system is available at //github.com/GenGrim76/Pareto-Optimal-GDC, together with user manuals and installation instructions. Contact: [email protected] Supplementary information: Additional data are available in the Supplementary Material.
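To make the two central notions of the abstract above concrete, the following is a minimal sketch, not the authors' software. It builds the k-mer set of a short string and filters a list of candidate compressors down to the Pareto-optimal ones with respect to compressed size and decompression time. The function names, the candidate labels and the numbers in the usage example are hypothetical and chosen only for illustration.

```python
def kmer_dictionary(genome: str, k: int) -> set[str]:
    """Return the set of distinct k-mers (length-k substrings) of `genome`."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def pareto_front(candidates):
    """Return the names of the non-dominated candidates.

    `candidates` is a list of (name, compressed_size, decompression_secs)
    tuples; smaller is better on both axes. A candidate is dominated when
    another one is at least as good on both axes and strictly better on one.
    """
    front = []
    for name, size, secs in candidates:
        dominated = any(
            (s <= size and t <= secs) and (s < size or t < secs)
            for _, s, t in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical measurements: (label, compressed size in MB, decompression seconds).
measurements = [("a", 10, 5.0), ("b", 8, 7.0), ("c", 12, 6.0)]
print(sorted(kmer_dictionary("ACGTAC", 3)))
print(pareto_front(measurements))
```

In this toy example, candidate "c" is both larger and slower than "a", so it is dropped. Candidates "a" and "b" trade size against decompression time, and both remain on the front.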
Virtual reality (VR) is known to cause a "time compression" effect, where the time spent in VR seems to pass faster than the actual elapsed time. Our goal with this research is to investigate whether the physical realism of a VR experience reduces the time compression effect on a gas monitoring training task that requires precise time estimation. We used physical props and passive haptics in a VR task with high physical realism and compared it to an equivalent standard VR task with only virtual objects. We also used an identical real-world task as a baseline time estimation task. Each scenario includes the user picking up a device, opening a door, navigating a corridor with obstacles, performing five short time estimations, and estimating the total time from task start to end. Contrary to previous work, there was a consistent time dilation effect in all conditions, including the real world. However, no significant effects were found when comparing the estimated differences between the high and low physical realism conditions. We discuss implications of the results and limitations of the study and propose future work that may better address this important question for virtual reality training.
We analyze to what extent final users can infer information about the level of protection of their data when the data obfuscation mechanism is a priori unknown to them (the so-called ''black-box'' scenario). In particular, we delve into the investigation of two notions of local differential privacy (LDP), namely {\epsilon}-LDP and R\'enyi LDP. On one hand, we prove that, without any assumption on the underlying distributions, it is not possible to have an algorithm able to infer the level of data protection with provable guarantees; this result also holds for the central versions of the two notions of DP considered. On the other hand, we demonstrate that, under reasonable assumptions (namely, Lipschitzness of the involved densities on a closed interval), such guarantees exist and can be achieved by a simple histogram-based estimator. We validate our results experimentally and we note that, on a particularly well-behaved distribution (namely, the Laplace noise), our method gives even better results than expected, in the sense that in practice the number of samples needed to achieve the desired confidence is smaller than the theoretical bound, and the estimation of {\epsilon} is more precise than predicted.
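As an illustration of the histogram-based idea mentioned in the abstract above, the following is a minimal sketch under stated assumptions, not the authors' estimator and without their guarantees. It queries a black-box mechanism on two inputs, bins the outputs on a closed interval, and takes the largest absolute log-ratio of bin counts as a plug-in estimate of {\epsilon}. The Laplace mechanism, the two inputs 0 and 1, the interval, the number of bins and the `min_count` cutoff are all assumptions made for this example.

```python
import math
import random


def laplace_mechanism(x, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise of scale sensitivity/epsilon to x.

    The difference of two i.i.d. exponential variables with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    scale = sensitivity / epsilon
    return x + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def histogram(samples, lo, hi, bins):
    """Count the samples falling in each of `bins` equal-width bins on [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for y in samples:
        if lo <= y < hi:
            counts[min(int((y - lo) / width), bins - 1)] += 1
    return counts


def estimate_epsilon(mechanism, x0, x1, n, lo, hi, bins, min_count=50):
    """Plug-in estimate: max over bins of |log(count0 / count1)|.

    Both histograms use the same number of samples, so the ratio of counts
    equals the ratio of empirical frequencies. Sparse bins are skipped
    because their log-ratio is too noisy (an ad hoc choice for this sketch).
    """
    c0 = histogram((mechanism(x0) for _ in range(n)), lo, hi, bins)
    c1 = histogram((mechanism(x1) for _ in range(n)), lo, hi, bins)
    estimate = 0.0
    for a, b in zip(c0, c1):
        if a >= min_count and b >= min_count:
            estimate = max(estimate, abs(math.log(a / b)))
    return estimate


rng = random.Random(0)  # fixed seed so the run is reproducible
eps_hat = estimate_epsilon(
    lambda x: laplace_mechanism(x, 1.0, rng), 0.0, 1.0,
    n=200_000, lo=-2.0, hi=3.0, bins=20,
)
print(f"estimated epsilon: {eps_hat:.3f} (mechanism was built with epsilon = 1.0)")
```

Taking a maximum over noisy bins biases the estimate slightly upward, which is one reason a careful analysis of the number of samples and bins, as in the abstract, is needed before any confidence statement can be attached to it.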
In this paper, a novel framework is established for uncertainty quantification via information bottleneck (IB-UQ) for scientific machine learning tasks, including deep neural network (DNN) regression and neural operator learning (DeepONet). Specifically, we first employ the General Incompressible-Flow Networks (GIN) model to learn a "wide" distribution from noisy observation data. Then, following the information bottleneck objective, we learn a stochastic map from input to some latent representation that can be used to predict the output. A tractable variational bound on the IB objective is constructed with a normalizing flow reparameterization. Hence, we can optimize the objective using the stochastic gradient descent method. IB-UQ can provide both mean and variance in the label prediction by explicitly modeling the representation variables. Compared to most DNN regression methods and the deterministic DeepONet, the proposed model can be trained on noisy data and provide accurate predictions with reliable uncertainty estimates on unseen noisy data. We demonstrate the capability of the proposed IB-UQ framework via several representative examples, including discontinuous function regression, real-world dataset regression and learning nonlinear operators for a diffusion-reaction partial differential equation.
The automated detection of cancerous tumors has attracted interest mainly during the last decade, due to the necessity of early and efficient diagnosis that will lead to the most effective possible treatment of the impending risk. Several machine learning and artificial intelligence methodologies have been employed with the aim of providing trustworthy assistive tools that contribute efficiently to this effort. In this article, we present a low-complexity convolutional neural network architecture for tumor classification enhanced by a robust image augmentation methodology. The effectiveness of the presented deep learning model has been investigated based on 3 datasets containing brain, kidney and lung images, showing remarkable diagnostic efficiency with classification accuracies of 99.33%, 100% and 99.7% for the 3 datasets respectively. The impact of the augmentation preprocessing step has also been extensively examined using 4 evaluation measures. The proposed low-complexity scheme, in contrast to other models in the literature, renders our model quite robust to cases of overfitting that typically accompany small datasets frequently encountered in medical classification challenges. Finally, the model can be easily re-trained in case additional volume images are included, as its simple architecture does not impose a significant computational burden.
Backward Stochastic Differential Equations (BSDEs) have been widely employed in various areas of social and natural sciences, such as the pricing and hedging of financial derivatives, stochastic optimal control problems, optimal stopping problems and gene expression. Most BSDEs cannot be solved analytically and thus numerical methods must be applied to approximate their solutions. There have been a variety of numerical methods proposed over the past few decades as well as many more currently being developed. For the most part, they exist in a complex and scattered manner with each requiring a variety of assumptions and conditions. The aim of the present work is thus to systematically survey various numerical methods for BSDEs, and in particular, compare and categorize them, for further developments and improvements. To achieve this goal, we focus primarily on the core features of each method based on an extensive collection of 333 references: the main assumptions, the numerical algorithm itself, key convergence properties and advantages and disadvantages, to provide an up-to-date coverage of numerical methods for BSDEs, with insightful summaries of each and a useful comparison and categorization.
Many real-world systems can be described by mathematical formulas that are human-comprehensible, easy to analyze and can be helpful in explaining the system's behaviour. Symbolic regression is a method that generates nonlinear models from data in the form of analytic expressions. Historically, symbolic regression has been predominantly realized using genetic programming, a method that iteratively evolves a population of candidate solutions that are sampled by genetic operators crossover and mutation. This gradient-free evolutionary approach suffers from several deficiencies: it does not scale well with the number of variables and samples in the training data, models tend to grow in size and complexity without an adequate accuracy gain, and it is hard to fine-tune the inner model coefficients using just genetic operators. Recently, neural networks have been applied to learn the whole analytic formula, i.e., its structure as well as the coefficients, by means of gradient-based optimization algorithms. We propose a novel neural network-based symbolic regression method that constructs physically plausible models based on limited training data and prior knowledge about the system. The method employs an adaptive weighting scheme to effectively deal with multiple loss function terms and an epoch-wise learning process to reduce the chance of getting stuck in poor local optima. Furthermore, we propose a parameter-free method for choosing the model with the best interpolation and extrapolation performance out of all models generated through the whole learning process. We experimentally evaluate the approach on the TurtleBot 2 mobile robot, the magnetic manipulation system, the equivalent resistance of two resistors in parallel, and the anti-lock braking system. The results clearly show the potential of the method to find sparse and accurate models that comply with the prior knowledge provided.
Motivated by applications to COVID dynamics, we describe a branching process in random environments model $\{Z_n\}$ whose characteristics change when crossing upper and lower thresholds. This introduces a cyclical path behavior involving periods of increase and decrease leading to supercritical and subcritical regimes. Even though the process is not Markov, we identify subsequences at random time points $\{(\tau_j, \nu_j)\}$ - specifically the values of the process at crossing times, {\it{viz.}}, $\{(Z_{\tau_j}, Z_{\nu_j})\}$ - along which the process retains the Markov structure. Under mild moment and regularity conditions, we establish that the subsequences possess a regenerative structure and prove that the limiting normal distribution of the growth rates of the process in supercritical and subcritical regimes decouple. For this reason, we establish limit theorems concerning the length of supercritical and subcritical regimes and the proportion of time the process spends in these regimes. As a byproduct of our analysis, we explicitly identify the limiting variances in terms of the functionals of the offspring distribution, threshold distribution, and environmental sequences.
A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in Natural Language Processing (NLP), which has traditionally placed more emphasis on predictive tasks. This distinction is beginning to fade, with an emerging area of interdisciplinary research at the convergence of causal inference and language processing. Still, research on causality in NLP remains scattered across domains without unified definitions, benchmark datasets and clear articulations of the remaining challenges. In this survey, we consolidate research across academic areas and situate it in the broader NLP landscape. We introduce the statistical challenge of estimating causal effects, encompassing settings where text is used as an outcome, treatment, or as a means to address confounding. In addition, we explore potential uses of causal inference to improve the performance, robustness, fairness, and interpretability of NLP models. We thus provide a unified overview of causal inference for the computational linguistics community.
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey the recent advanced techniques developed for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, followed by the other techniques. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. Then we go through a few very recent additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating the model performance and recent benchmarking efforts. Finally, we conclude this paper and discuss remaining challenges and possible directions on this topic.