Discovering patterns in data that best describe the differences between classes allows one to hypothesize and reason about class-specific mechanisms. In molecular biology, for example, this holds the promise of advancing the understanding of cellular processes that differ between tissues or diseases, which could lead to novel treatments. To be useful in practice, methods that tackle the problem of finding such differential patterns have to be readily interpretable by domain experts and scalable to extremely high-dimensional data. In this work, we propose DiffNaps, a novel, inherently interpretable binary neural network architecture that extracts differential patterns from data. DiffNaps is scalable to hundreds of thousands of features and robust to noise, thus overcoming the limitations of current state-of-the-art methods in large-scale applications such as in biology. We show on synthetic and real-world data, including three biological applications, that, unlike its competitors, DiffNaps consistently yields accurate, succinct, and interpretable class descriptions.
Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance because the distribution of the sample size per class in a dataset is generally exponential unless the sample size is intentionally adjusted. Various methods have been devised to address these problems. Recently, weight balancing, which combines well-known classical regularization techniques with two-stage training, has been proposed. Despite its simplicity, it is known for its high performance compared with a variety of existing methods. However, there is a lack of understanding as to why this method is effective for long-tailed data. In this study, we analyze weight balancing by focusing on neural collapse and the cone effect at each training stage and find that it can be decomposed into (i) an increase in Fisher's discriminant ratio of the feature extractor, caused by weight decay and cross-entropy loss, and (ii) implicit logit adjustment, caused by weight decay and class-balanced loss. Our analysis enables the training method to be further simplified by reducing the number of training stages to one while increasing accuracy.
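To make the notion of logit adjustment concrete, the following is a minimal sketch of the explicit, post-hoc variant: each logit is shifted by the log of its class prior so that rare classes are no longer penalized for being rare. It is not the implicit mechanism analyzed in the abstract. The function names, the scaling parameter `tau`, and the example counts are assumptions made for illustration.

```python
import math

def adjust_logits(logits, class_counts, tau=1.0):
    """Subtract tau * log(class prior) from each logit (post-hoc logit adjustment)."""
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]

def predict(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical example: three classes with a long-tailed training distribution.
raw_logits = [2.0, 1.5, 1.0]
train_counts = [900, 90, 10]
balanced_logits = adjust_logits(raw_logits, train_counts)

# The head class wins on the raw logits; the tail class wins after adjustment.
print(predict(raw_logits), predict(balanced_logits))
```

With `tau=0.0` the logits are returned unchanged, so `tau` acts as a dial between the raw classifier and the fully prior-corrected one.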
Numerical models have long been used to understand geoscientific phenomena, including tidal currents, which are crucial for renewable energy production and coastal engineering. However, their computational cost hinders generating data of varying resolutions. As an alternative, deep learning-based downscaling methods have gained traction due to their faster inference speeds. Yet most of them are limited to inference at a fixed scale and overlook important characteristics of the target geoscientific data. In this paper, we propose a novel downscaling framework for tidal current data that addresses its unique characteristics, which are dissimilar to those of images: heterogeneity and local dependency. Moreover, our framework can generate output at any arbitrary scale by utilizing a continuous representation model. Our proposed framework improves flow velocity predictions by 93.21% (MSE) and 63.85% (MAE) compared to the Baseline model while achieving a 33.2% reduction in FLOPs.
Refreshable tactile displays (RTDs) are predicted to soon become a viable option for the provision of accessible graphics for people who are blind or have low vision (BLV). This new technology for the tactile display of braille and graphics, usually using raised pins, makes it easier to generate and access a large number of graphics. However, it differs from existing tactile graphics in terms of scale, height and fidelity. Here, we share the perspectives of four key stakeholders -- blind touch readers, vision specialist teachers, accessible format producers and assistive technology providers -- to explore the potential uses, advantages and needs relating to the introduction of RTDs. We also provide advice on what role the data visualisation community can take to help ensure that people who are BLV are best able to benefit from the introduction of affordable RTDs.
Proximal causal inference is a recently proposed framework for evaluating causal effects in the presence of unmeasured confounding. For point identification of causal effects, it leverages a pair of so-called treatment and outcome confounding proxy variables to identify a bridge function that matches the dependence of potential outcomes or treatment variables on the hidden factors to corresponding functions of observed proxies. Unique identification of a causal effect via a bridge function crucially requires that proxies are sufficiently relevant for hidden factors, a requirement that has previously been formalized as a completeness condition. However, completeness is well known not to be empirically testable. Moreover, although a bridge function may be well defined, lack of completeness, sometimes manifested by the availability of a single type of proxy, may severely limit prospects for identification of a bridge function and thus a causal effect, thereby potentially restricting the application of the proximal causal framework. In this paper, we propose partial identification methods that do not require completeness and obviate the need for identification of a bridge function. That is, we establish that proxies of unobserved confounders can be leveraged to obtain bounds on the causal effect of the treatment on the outcome even if available information does not suffice to identify either a bridge function or a corresponding causal effect of interest. Our bounds are non-smooth functionals of the observed data distribution. As a consequence, for inference, we first provide a smooth approximation of our bounds and then construct bootstrap confidence intervals on the approximated bounds. We further establish analogous partial identification results in related settings where identification hinges upon hidden mediators for which proxies are available.
The assumption that data are invariant under the action of a compact group is implicit in many statistical modeling assumptions such as normality, or the assumption of independence and identical distributions. Hence, testing for the presence of such invariances offers a principled way to falsify various statistical models. In this article, we develop sequential, anytime-valid tests of distributional symmetry under the action of general compact groups. The tests that are developed allow for the continuous monitoring of data as it is collected while keeping type-I error guarantees, and include tests for exchangeability and rotational symmetry as special examples. The main tool to this end is the machinery developed for conformal prediction. The resulting test statistic, called a conformal martingale, can be interpreted as a likelihood ratio. We use this interpretation to show that the test statistics are optimal -- in a specific log-optimality sense -- against certain alternatives. Furthermore, we draw a connection between conformal prediction, anytime-valid tests of distributional invariance, and current developments on anytime-valid testing. In particular, we extend existing anytime-valid tests of independence, which leverage exchangeability, to work under general group invariances. Additionally, we discuss testing for invariance under subgroups of the permutation group and orthogonal group, the latter of which corresponds to testing the assumptions behind linear regression models.
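As an illustration of the simplest special case mentioned above, testing exchangeability, the following is a minimal sketch of a conformal martingale. It does not cover the general compact-group setting of the article. Each new observation is turned into a smoothed conformal p-value, and the p-values are combined with the power betting function eps * p^(eps - 1), which is the Beta(eps, 1) density, so the running product is a likelihood ratio against the uniform distribution. Using the observation itself as nonconformity score, the value `eps=0.5`, and all function names are assumptions made for illustration.

```python
import math
import random

def conformal_p_values(xs, rng):
    """Smoothed conformal p-values; the nonconformity score is the observation itself."""
    ps = []
    for n in range(1, len(xs) + 1):
        a_n = xs[n - 1]
        greater = sum(1 for a in xs[:n] if a > a_n)
        equal = sum(1 for a in xs[:n] if a == a_n)  # includes a_n itself
        ps.append((greater + rng.random() * equal) / n)
    return ps

def log_power_martingale(ps, eps=0.5):
    """Running log-value of the product of eps * p**(eps - 1) over the p-values."""
    log_m, path = 0.0, []
    for p in ps:
        p = max(p, 1e-12)  # guard against log(0); this only lowers the martingale
        log_m += math.log(eps) + (eps - 1.0) * math.log(p)
        path.append(log_m)
    return path

def first_rejection(path, alpha=0.05):
    """First time the martingale reaches 1/alpha, or None if it never does."""
    threshold = math.log(1.0 / alpha)
    for n, value in enumerate(path, start=1):
        if value >= threshold:
            return n
    return None

# Hypothetical data stream with a strong upward trend, which violates exchangeability.
rng = random.Random(0)
trend = list(range(200))
path = log_power_martingale(conformal_p_values(trend, rng))
print(first_rejection(path) is not None)
```

Because the martingale has expectation one at every stopping time under exchangeability, Ville's inequality bounds the probability of ever crossing 1/alpha by alpha, which is what permits monitoring after every observation.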
Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generations. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed that search documents from a knowledge corpus and append them unconditionally or selectively to the input of LLMs for generation. However, when existing methods are applied to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a reliable framework for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG, which can assess its generated explanations with customized reflective tokens. Our work shows that domain-specific components, such as a retriever, a domain-related document corpus, and instruction sets, are necessary for adhering to domain-related instructions. On three major medical question-answering benchmark datasets, Self-BioRAG achieves a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, our analysis indicates that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge, as a medical expert does. We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.
We explore a novel methodology for constructing confidence regions for parameters of linear models, using predictions from an arbitrary predictor. Our framework requires minimal assumptions on the noise and can be extended to functions deviating from strict linearity up to some adjustable threshold, thereby accommodating a comprehensive and pragmatically relevant set of functions. The derived confidence regions can be cast as constraints within a Mixed Integer Linear Programming framework, enabling optimization of linear objectives. This representation enables robust optimization and the extraction of confidence intervals for specific parameter coordinates. Unlike previous methods, the confidence region can be empty, which can be used for hypothesis testing. Finally, we validate the empirical applicability of our method on synthetic data.
Because of their excellent asymptotic and finite-length performance, spatially-coupled (SC) codes are a class of low-density parity-check codes that is gaining increasing attention. Multi-dimensional (MD) SC codes are constructed by connecting copies of an SC code via relocations in order to mitigate various sources of non-uniformity and improve performance in many data storage and data transmission systems. As the number of degrees of freedom in the MD-SC code design increases, appropriately exploiting them becomes more difficult because of the complexity growth of the design process. In this paper, we propose a probabilistic framework for the MD-SC code design, which is based on the gradient-descent (GD) algorithm, to design better MD codes and address this challenge. In particular, we express the expected number of short cycles, which we seek to minimize, in the graph representation of the code in terms of entries of a probability-distribution matrix that characterizes the MD-SC code design. We then find a locally-optimal probability distribution, which serves as the starting point of a finite-length algorithmic optimizer that produces the final MD-SC code. We offer the theoretical analysis as well as the algorithms, and we present experimental results demonstrating that our MD codes, conveniently called GD-MD codes, have notably lower short cycle numbers compared with the available state-of-the-art. Moreover, our algorithms converge on solutions in few iterations, which confirms the complexity reduction as a result of limiting the search space via the locally-optimal GD-MD distributions.
The ability to handle a large volume of data generated by scientific applications is crucial. We have seen an increase in the heterogeneity of storage technologies available to scientific applications, such as burst buffers, local temporary block storage, managed cloud parallel file systems (PFS), and non-POSIX object stores. However, scientific applications designed for traditional HPC systems cannot easily exploit those storage systems due to cost, throughput, and programming model challenges. We present iFast, a new library-level approach to transparently accelerating scientific applications based on MPI-IO. It decouples application I/O, data caching, and data storage to support heterogeneous storage models. The design decisions of iFast place a strong emphasis on deployability. It is highly general with only MPI as a core dependency, allowing users to run unmodified MPI-based applications with unmodified MPI implementations, even proprietary ones like IntelMPI and Cray MPICH. Our approach supports a wide range of networked storage, including traditional PFS, ordinary NFS, and S3-based cloud storage. Unlike previous approaches, iFast ensures crash consistency even across compute nodes. We demonstrate iFast on a cloud HPC platform, a small local cluster, and a hybrid of both to show its generality. Our results show that iFast reduces end-to-end execution time by 13-26% for three popular scientific applications on the cloud. It also outperforms the state-of-the-art system, SymphonyFS, a filesystem-based approach with similar goals but without crash consistency, by 12-23%.
Polar codes are the first class of structured channel codes that achieve the symmetric capacity of binary channels with efficient encoding and decoding. In 2019, Arikan proposed a new polar coding scheme referred to as polarization-adjusted convolutional (PAC) codes. In contrast to polar codes, PAC codes precode the information word using a convolutional code prior to polar encoding. This results in a material coding gain over polar codes under Fano sequential decoding as well as successive cancellation list (SCL) decoding. Given the advantages of SCL decoding over Fano decoding in certain scenarios, such as the low-SNR regime or when a constraint on the worst-case decoding latency exists, in this paper we focus on SCL decoding and present a simplified SCL (SSCL) decoding algorithm for PAC codes. SSCL decoding of PAC codes reduces the decoding latency by identifying special nodes in the decoding tree and processing them at the intermediate stages of the graph. Our simulation results show that the performance of PAC codes under SSCL decoding is nearly the same as under SCL decoding, with lower decoding latency.
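The encoding chain described above (convolutional precoding followed by polar encoding) can be sketched as follows. This is only an illustration of the encoder side, not of the SSCL decoder the abstract proposes. The block length 8, the information positions, and the generator taps `(1, 0, 1, 1, 0, 1, 1)` are illustrative assumptions and are not taken from the abstract; the polar transform is applied without bit reversal.

```python
def rate_profile(data_bits, info_positions, n):
    """Place the data bits at the chosen positions of a length-n word; the rest stay 0."""
    v = [0] * n
    for bit, pos in zip(data_bits, info_positions):
        v[pos] = bit
    return v

def conv_precode(v, taps):
    """Rate-1 convolutional precoding over GF(2): u[i] = XOR of taps[j] * v[i - j]."""
    u = []
    for i in range(len(v)):
        acc = 0
        for j, tap in enumerate(taps):
            if tap and i - j >= 0:
                acc ^= v[i - j]
        u.append(acc)
    return u

def polar_transform(u):
    """Multiply u by the n-fold Kronecker power of [[1, 0], [1, 1]] using butterflies."""
    x = list(u)
    n = len(x)  # must be a power of two
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]
        step *= 2
    return x

def pac_encode(data_bits, info_positions, n, taps):
    """Rate profiling, then convolutional precoding, then the polar transform."""
    return polar_transform(conv_precode(rate_profile(data_bits, info_positions, n), taps))

# Hypothetical (8, 4) example; positions and taps are assumptions for illustration.
TAPS = (1, 0, 1, 1, 0, 1, 1)
INFO_POSITIONS = (3, 5, 6, 7)
codeword = pac_encode([1, 0, 1, 1], INFO_POSITIONS, 8, TAPS)
print(codeword)
```

Setting `taps=(1,)` turns the precoder into the identity, in which case `pac_encode` reduces to a plain polar encoder with the same rate profile.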