The context of this paper is the creation of large uniform archaeological datasets from heterogeneous published resources, such as find catalogues - with the help of AI and Big Data. The paper is concerned with the challenge of consistent assemblages of archaeological data. We cannot simply combine existing records, as they differ in terms of quality and recording standards. Thus, records have to be recreated from published archaeological illustrations. This is only a viable path with the help of automation. The contribution of this paper is a new workflow for collecting data from archaeological find catalogues available as legacy resources, such as archaeological drawings and photographs in large unsorted PDF files; the workflow relies on custom software (AutArch) supporting image processing, object detection, and interactive means of validating and adjusting automatically retrieved data. We integrate artificial intelligence (AI) in terms of neural networks for object detection and classification into the workflow, thereby speeding up, automating, and standardising data collection. Objects commonly found in archaeological catalogues - such as graves, skeletons, ceramics, ornaments, stone tools and maps - are detected. Those objects are spatially related and analysed to extract real-life attributes, such as the size and orientation of graves based on the north arrow and the scale. We also automate recording of geometric whole-outlines through contour detection, as an alternative to landmark-based geometric morphometrics. Detected objects, contours, and other automatically retrieved data can be manually validated and adjusted. We use third millennium BC Europe (encompassing cultures such as 'Corded Ware' and 'Bell Beaker', and their burial practices) as a 'testing ground' and for evaluation purposes; this includes a user study for the workflow and the AutArch software.
Advances in high-throughput technologies have originated an ever-increasing availability of omics datasets. The integration of multiple heterogeneous data sources is currently an issue for biology and bioinformatics. Multiple kernel learning (MKL) has shown to be a flexible and valid approach to consider the diverse nature of multi-omics inputs, despite being an underused tool in genomic data mining.We provide novel MKL approaches based on different kernel fusion strategies.To learn from the meta-kernel of input kernels, we adaptedunsupervised integration algorithms for supervised tasks with support vector machines.We also tested deep learning architectures for kernel fusion and classification.The results show that MKL-based models can compete with more complex, state-of-the-art, supervised multi-omics integrative approaches. Multiple kernel learning offers a natural framework for predictive models in multi-omics genomic data. Our results offer a direction for bio-data mining research and further development of methods for heterogeneous data integration.
This scientific report presents a novel methodology for the early prediction of important political events using News datasets. The methodology leverages natural language processing, graph theory, clique analysis, and semantic relationships to uncover hidden predictive signals within the data. Initially, we designed a preliminary version of the method and tested it on a few events. This analysis revealed limitations in the initial research phase. We then enhanced the model in two key ways: first, we added a filtration step to only consider politically relevant news before further processing; second, we adjusted the input features to make the alert system more sensitive to significant spikes in the data. After finalizing the improved methodology, we tested it on eleven events including US protests, the Ukraine war, and French protests. Results demonstrate the superiority of our approach compared to baseline methods. Through targeted refinements, our model can now provide earlier and more accurate predictions of major political events based on subtle patterns in news data.
A simple greedy algorithm to find a maximal independent set (MIS) in a graph starts with the empty set and visits every vertex, adding it to the set if and only if none of its neighbours are already in the set. In this paper, we consider the generalisation of this MIS algorithm by letting it start with any set of vertices and we prove the hardness of many decision problems related to this generalisation. Our results are based on two main strategies. Firstly, we view the MIS algorithm as a sequential update of a Boolean network, which we refer to as the MIS network, according to a permutation of the vertex set. The set of fixed points of the MIS network corresponds to the set of MIS of the graph. Our generalisation then consists in starting from any configuration and following a sequential update given by a word of vertices. Secondly, we introduce the concept of a colony of a graph, that is a set of vertices that is dominated by an independent set. Deciding whether a set of vertices is a colony is NP-complete; decision problems related to the MIS algorithm will be reduced from the Colony problem. We first show that deciding whether a configuration can reach all maximal independent sets is coNP-complete. Second, we consider so-called fixing words, that allow to reach a MIS for any initial configuration, and fixing permutations, which we call permises; deciding whether a permutation is fixing is coNP-complete. Third, we show that deciding whether a graph has a permis is coNP-hard. Finally, we generalise the MIS algorithm to digraphs. The algorithm then uses the so-called kernel network, whose fixed points are the kernels of the digraph. Deciding whether the kernel network of a given digraph is fixable is coNP-hard, even for digraphs that have a kernel. Alternatively, we introduce two fixable Boolean networks whose sets of fixed points contain all kernels.
This research's primary motivation of this study is to address the high hardware and computational demands typically associated with LLMs.Therefore,our goal is to find a balance between model lightness and performance,striving to maximize performance while using a comparatively lightweight model. Hyacinth6B was developed with this objective in mind,aiming to fully leverage the core capabilities of LLMs without incurring substantial resource costs, effectively pushing the boundaries of smaller model's performance. The training approach involves parameter efficient finetuning using the LoRA method.
Quantifying the heterogeneity is an important issue in meta-analysis, and among the existing measures, the $I^2$ statistic is the most commonly used measure in the literature. In this paper, we show that the $I^2$ statistic was, in fact, defined as problematic or even completely wrong from the very beginning. To confirm this statement, we first present a motivating example to show that the $I^2$ statistic is heavily dependent on the study sample sizes, and consequently it may yield contradictory results for the amount of heterogeneity. Moreover, by drawing a connection between ANOVA and meta-analysis, the $I^2$ statistic is shown to have, mistakenly, applied the sampling errors of the estimators rather than the variances of the study populations. Inspired by this, we introduce an Intrinsic measure for Quantifying the heterogeneity in meta-analysis, and meanwhile study its statistical properties to clarify why it is superior to the existing measures. We further propose an optimal estimator, referred to as the IQ statistic, for the new measure of heterogeneity that can be readily applied in meta-analysis. Simulations and real data analysis demonstrate that the IQ statistic provides a nearly unbiased estimate of the true heterogeneity and it is also independent of the study sample sizes.
Battery degradation remains a pivotal concern in the energy storage domain, with machine learning emerging as a potent tool to drive forward insights and solutions. However, this intersection of electrochemical science and machine learning poses complex challenges. Machine learning experts often grapple with the intricacies of battery science, while battery researchers face hurdles in adapting intricate models tailored to specific datasets. Beyond this, a cohesive standard for battery degradation modeling, inclusive of data formats and evaluative benchmarks, is conspicuously absent. Recognizing these impediments, we present BatteryML - a one-step, all-encompass, and open-source platform designed to unify data preprocessing, feature extraction, and the implementation of both traditional and state-of-the-art models. This streamlined approach promises to enhance the practicality and efficiency of research applications. BatteryML seeks to fill this void, fostering an environment where experts from diverse specializations can collaboratively contribute, thus elevating the collective understanding and advancement of battery research.The code for our project is publicly available on GitHub at //github.com/microsoft/BatteryML.
In this paper, we derive high-dimensional asymptotic properties of the Moore-Penrose inverse and the ridge-type inverse of the sample covariance matrix. In particular, the analytical expressions of the weighted sample trace moments are deduced for both generalized inverse matrices and are present by using the partial exponential Bell polynomials which can easily be computed in practice. The existent results are extended in several directions: (i) First, the population covariance matrix is not assumed to be a multiple of the identity matrix; (ii) Second, the assumption of normality is not used in the derivation; (iii) Third, the asymptotic results are derived under the high-dimensional asymptotic regime. Our findings are used to construct improved shrinkage estimators of the precision matrix, which asymptotically minimize the quadratic loss with probability one. Finally, the finite sample properties of the derived theoretical results are investigated via an extensive simulation study.
In this paper, we present SIMAP, a novel layer integrated into deep learning models, aimed at enhancing the interpretability of the output. The SIMAP layer is an enhanced version of Simplicial-Map Neural Networks (SMNNs), an explainable neural network based on support sets and simplicial maps (functions used in topology to transform shapes while preserving their structural connectivity). The novelty of the methodology proposed in this paper is two-fold: Firstly, SIMAP layers work in combination with other deep learning architectures as an interpretable layer substituting classic dense final layers. Secondly, unlike SMNNs, the support set is based on a fixed maximal simplex, the barycentric subdivision being efficiently computed with a matrix-based multiplication algorithm.
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.