High-dimensional problems have long been considered the Achilles' heel of Bayesian optimization algorithms. Spurred by the curse of dimensionality, a large collection of algorithms aim to make it more performant in this setting, commonly by imposing various simplifying assumptions on the objective. In this paper, we identify the degeneracies that make vanilla Bayesian optimization poorly suited to high-dimensional tasks, and further show how existing algorithms address these degeneracies through the lens of lowering the model complexity. Moreover, we propose an enhancement to the prior assumptions that are typical to vanilla Bayesian optimization algorithms, which reduces the complexity to manageable levels without imposing structural restrictions on the objective. Our modification - a simple scaling of the Gaussian process lengthscale prior with the dimensionality - reveals that standard Bayesian optimization works drastically better than previously thought in high dimensions, clearly outperforming existing state-of-the-art algorithms on multiple commonly considered real-world high-dimensional tasks.
Meshfree simulation methods are emerging as compelling alternatives to conventional mesh-based approaches, particularly in the fields of Computational Fluid Dynamics (CFD) and continuum mechanics. In this publication, we provide a comprehensive overview of our research combining Machine Learning (ML) and Fraunhofer's MESHFREE software (www.meshfree.eu), a powerful tool utilizing a numerical point cloud in a Generalized Finite Difference Method (GFDM). This tool enables the effective handling of complex flow domains, moving geometries, and free surfaces, while allowing users to finely tune local refinement and quality parameters for an optimal balance between computation time and results accuracy. However, manually determining the optimal parameter combination poses challenges, especially for less experienced users. We introduce a novel ML-optimized approach, using active learning, regression trees, and visualization on MESHFREE simulation data, demonstrating the impact of input combinations on results quality and computation time. This research contributes valuable insights into parameter optimization in meshfree simulations, enhancing accessibility and usability for a broader user base in scientific and engineering applications.
We present a mechanized embedding of higher-order logic (HOL) and algebraic data types (ADT) into first-order logic with ZFC axioms. We implement this in the Lisa proof assistant for schematic first-order logic and its library based on axiomatic set theory. HOL proof steps are implemented as proof producing tactics in Lisa, and the types are interpreted as sets, with function (or arrow) types coinciding with set-theoretic function spaces. The embedded HOL proofs, as opposed to being a layer over the existing proofs, are interoperable with the existing library. This yields a form of soft type system supporting top-level polymorphism and ADTs over set theory, and offer tools to reason about functions in set theory.
The interactive theorem prover, Lean, enables the verification of formal mathematical proofs and is backed by an expanding community. Central to this ecosystem is its mathematical library, mathlib4, which lays the groundwork for the formalization of an expanding range of mathematical theories. However, searching for theorems in mathlib4 can be challenging. To successfully search in mathlib4, users often need to be familiar with its naming conventions or documentation strings. Therefore, creating a semantic search engine that can be used easily by individuals with varying familiarity with mathlib4 is very important. In this paper, we present a semantic search engine for mathlib4 that accepts informal queries and finds the relevant theorems. We also establish a benchmark for assessing the performance of various search engines for mathlib4.
In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error is closer on average to the true error rate instead of the average prediction error. This is opposite the findings of Bates et al. (2023) which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
We present an algorithm to solve the dispersive depth-averaged Serre-Green-Naghdi (SGN) equations using patch-based adaptive mesh refinement. These equations require adding additional higher derivative terms to the nonlinear shallow water equations. This has been implemented as a new component of the open source GeoClaw software that is widely used for modeling tsunamis, storm surge, and related hazards, improving its accuracy on shorter wavelength phenomena. We use a formulation that requires solving an elliptic system of equations at each time step, making the method implicit. The adaptive algorithm allows different time steps on different refinement levels, and solves the implicit equations level by level. Computational examples are presented to illustrate the stability and accuracy on a radially symmetric test case and two realistic tsunami modeling problems, including a hypothetical asteroid impact creating a short wavelength tsunami for which dispersive terms are necessary.
With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users need to specify the resources to be reserved for tasks so that sufficient resources are allocated on the target cluster infrastructure. Crucially, underestimating a task's memory requirements can result in task failures. Therefore, users often resort to overprovisioning, resulting in significant resource wastage and decreased throughput. In this paper, we propose a novel online method that uses monitoring time series data to predict task memory usage in order to reduce the memory wastage of scientific workflow tasks. Our method predicts a task's runtime, divides it into k equally-sized segments, and learns the peak memory value for each segment depending on the total file input size. We evaluate the prototype implementation of our method using workflows from the publicly available nf-core repository, showing an average memory wastage reduction of 29.48% compared to the best state-of-the-art approach.
We consider metrical task systems on general metric spaces with $n$ points, and show that any fully randomized algorithm can be turned into a randomized algorithm that uses only $2\log n$ random bits, and achieves the same competitive ratio up to a factor $2$. This provides the first order-optimal barely random algorithms for metrical task systems, i.e. which use a number of random bits that does not depend on the number of requests addressed to the system. We put forward an equivalent view that we call collective metrical task systems where $k$ agents in a metrical task system team up, and suffer the average cost paid by each agent. Our results imply that such team can be $O(\log n^2)$-competitive, as soon as $k\geq n^2$ (in comparison, a single agent is $\Omega(n)$-competitive at best). We discuss implications on various aspects of online decision making such as: distributed systems, transaction costs, and advice complexity, suggesting broad applicability.
As emerging digital assets, NFTs are susceptible to anomalous trading behaviors due to the lack of stringent regulatory mechanisms, potentially causing economic losses. In this paper, we conduct the first systematic analysis of four non-fungible tokens (NFT) markets. Specifically, we analyze more than 25 million transactions within these markets, to explore the evolution of wash trade activities. Furthermore, we propose a heuristic algorithm that integrates the network characteristics of transactions with behavioral analysis, to detect wash trading activities in NFT markets. Our findings indicate that NFT markets with incentivized structures exhibit higher proportions of wash trading volume compared to those without incentives. Notably, the LooksRare and X2Y2 markets are detected with wash trading volume proportions as high as 94.5% and 84.2%, respectively.
This study addresses the challenge of forming effective groups in collaborative problem-solving environments. Recognizing the complexity of human interactions and the necessity for efficient collaboration, we propose a novel approach leveraging graph theory and reinforcement learning. Our methodology involves constructing a graph from a dataset where nodes represent participants, and edges signify the interactions between them. We conceptualize each participant as an agent within a reinforcement learning framework, aiming to learn an optimal graph structure that reflects effective group dynamics. Clustering techniques are employed to delineate clear group structures based on the learned graph. Our approach provides theoretical solutions based on evaluation metrics and graph measurements, offering insights into potential improvements in group effectiveness and reductions in conflict incidences. This research contributes to the fields of collaborative work and educational psychology by presenting a data-driven, analytical approach to group formation. It has practical implications for organizational team building, classroom settings, and any collaborative scenario where group dynamics are crucial. The study opens new avenues for exploring the application of graph theory and reinforcement learning in social and behavioral sciences, highlighting the potential for empirical validation in future work.
Recently, a considerable literature has grown up around the theme of Graph Convolutional Network (GCN). How to effectively leverage the rich structural information in complex graphs, such as knowledge graphs with heterogeneous types of entities and relations, is a primary open challenge in the field. Most GCN methods are either restricted to graphs with a homogeneous type of edges (e.g., citation links only), or focusing on representation learning for nodes only instead of jointly propagating and updating the embeddings of both nodes and edges for target-driven objectives. This paper addresses these limitations by proposing a novel framework, namely the Knowledge Embedding based Graph Convolutional Network (KE-GCN), which combines the power of GCNs in graph-based belief propagation and the strengths of advanced knowledge embedding (a.k.a. knowledge graph embedding) methods, and goes beyond. Our theoretical analysis shows that KE-GCN offers an elegant unification of several well-known GCN methods as specific cases, with a new perspective of graph convolution. Experimental results on benchmark datasets show the advantageous performance of KE-GCN over strong baseline methods in the tasks of knowledge graph alignment and entity classification.