Docker allows applications and their dependencies to be packaged, with the build instructions described in Dockerfiles. Nowadays, version pinning is recommended to avoid unexpected changes introduced by the latest version of a package. However, version pinning is not yet widely adopted in Dockerfiles (only 17k of the 141k Dockerfiles we analyzed), because of the maintenance difficulties it brings. To maintain Dockerfiles with version-pinned packages, it is important to update package versions, not only for improved functionality, but also for software supply chain security, as packages are updated to address vulnerabilities and bugs. However, updating multiple version-pinned packages requires understanding the dependencies between packages and ensuring version compatibility, which is not easy. To address this issue, we explore the applicability of the meta-maintenance approach, which aims to distribute successful updates made by part of a group of projects that independently maintain a common artifact. We conduct an exploratory analysis of 7,914 repositories on GitHub that contain Dockerfiles retrieving packages from GitHub via URLs. We identified 385 repository groups sharing the same combination of multiple packages, and in 208 of these groups some Dockerfiles had newer version combinations than others, making them candidates for meta-maintenance. Our findings support the potential of meta-maintenance for updating multiple version-pinned packages and also reveal future challenges.
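As a minimal sketch (not the paper's tooling) of the grouping step described above, the following assumes each repository's Dockerfile has already been reduced to a dictionary of pinned packages; repositories are grouped by the combination of packages they pin, and a group is flagged as a meta-maintenance candidate when one member pins newer versions than another.

```python
# Illustrative only: repo names, package names, and versions are hypothetical.
from collections import defaultdict
from packaging.version import Version  # assumes the third-party `packaging` library is installed

# Hypothetical input: repo -> {package: pinned version} extracted from its Dockerfile.
repos = {
    "org/app-a": {"cli/cli": "2.30.0", "jqlang/jq": "1.6"},
    "org/app-b": {"cli/cli": "2.40.1", "jqlang/jq": "1.7"},
    "org/app-c": {"cli/cli": "2.30.0", "jqlang/jq": "1.6"},
}

# Group repositories that pin the same combination of packages.
groups = defaultdict(list)
for repo, pins in repos.items():
    groups[frozenset(pins)].append((repo, pins))

def dominates(a, b):
    """True if pin set `a` is at least as new as `b` for every package, and strictly newer somewhere."""
    return all(Version(a[p]) >= Version(b[p]) for p in a) and a != b

# A group is a meta-maintenance candidate if some member could propagate its newer pins to the rest.
for packages, members in groups.items():
    for repo_new, pins_new in members:
        if any(dominates(pins_new, pins_old) for _, pins_old in members):
            print(f"{repo_new} could propagate its newer pins to the group {sorted(packages)}")
```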
R is a language and environment for statistical computing and graphics that provides a wide variety of statistical tools (modeling, statistical testing, time series analysis, classification problems, machine learning, ...), together with powerful graphical techniques and the great advantage of being highly extensible. Nowadays, there is no doubt that it is the software par excellence in statistical courses at any level, for theoretical and applied subjects alike. Moreover, it has become an almost essential tool for any research work that involves analysis or data visualization, and it is one of the most widely used general-purpose programming languages. The goal of this work is to share ideas and resources for improving teaching and/or research with the statistical software R. We will cover its benefits, show how to get started and where to locate specific resources, and offer recommendations for using R based on our experience. For the classroom, we will develop a curricular and assessment infrastructure to support both dissemination and evaluation, while for research we will offer a broader approach to quantitative studies that provides excellent support for work in science and technology.
Equivariant Transformers such as Equiformer have demonstrated the efficacy of applying Transformers to the domain of 3D atomistic systems. However, they are still limited to small degrees of equivariant representations due to their computational complexity. In this paper, we investigate whether these architectures can scale well to higher degrees. Starting from Equiformer, we first replace $SO(3)$ convolutions with eSCN convolutions to efficiently incorporate higher-degree tensors. Then, to better leverage the power of higher degrees, we propose three architectural improvements -- attention re-normalization, separable $S^2$ activation and separable layer normalization. Putting this all together, we propose EquiformerV2, which outperforms previous state-of-the-art methods on the large-scale OC20 dataset by up to $12\%$ on forces and $4\%$ on energies, offers better speed-accuracy trade-offs, and provides a $2\times$ reduction in the DFT calculations needed for computing adsorption energies.
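To give a rough feel for one of the ingredients, the sketch below illustrates the general idea of normalizing scalar (degree-0) and higher-degree features separately: the invariant part gets a standard layer normalization, while the equivariant part is only rescaled so that equivariance is preserved. Shapes and details are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def separable_layer_norm(x0, x_higher, eps=1e-6):
    """x0: (N, C) degree-0 features; x_higher: (N, M, C) features of degrees > 0,
    with M the total number of spherical-harmonic components (e.g. 3 + 5 for degrees 1 and 2)."""
    # Standard layer norm over channels for the invariant (degree-0) part.
    x0 = (x0 - x0.mean(dim=-1, keepdim=True)) / (x0.std(dim=-1, keepdim=True) + eps)
    # For the equivariant part, divide by an RMS over components and channels;
    # no mean subtraction, since that would break equivariance.
    rms = x_higher.pow(2).mean(dim=(-2, -1), keepdim=True).sqrt()
    return x0, x_higher / (rms + eps)

# Example: 4 nodes, 8 channels, degrees up to 2 (M = 3 + 5 = 8).
x0, xh = separable_layer_norm(torch.randn(4, 8), torch.randn(4, 8, 8))
```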
The integration of permissioned blockchains such as Hyperledger Fabric (HF) with the Industrial Internet of Things (IIoT) has opened new opportunities for interdependent supply chain partners to improve their performance through data sharing and coordination. HF's multichannel mechanism, private data collections, and querying mechanism enable private data sharing, transparency, traceability, and verification across the supply chain. However, the existing querying mechanism of HF needs further improvement for statistical data sharing because queries are evaluated on the original data recorded on the ledger. As a result, it gives rise to privacy issues such as leakage of business secrets, tracking of resources and assets, and disclosure of personal information. Therefore, we address this problem by proposing a differentially private enhanced permissioned blockchain, named EDH-IIoT, for private data sharing in IIoT supply chains. We propose algorithms to efficiently utilize the privacy budget $\epsilon$ by reusing it for repeated queries. Furthermore, the reuse and tracking of $\epsilon$ enable the data owner to ensure that the total spending of $\epsilon$ does not exceed the maximum privacy budget ($\epsilon_{t}$). Finally, we model two privacy attacks, namely the linking attack and the composition attack, to evaluate privacy preservation and to compare the efficiency of $\epsilon$ reuse with the default chaincode of HF and the traditional differential privacy model, respectively. The results confirm that EDH-IIoT obtains an accuracy of 97% in the shared data for $\epsilon_{t} = 1$, and a reduction of 35.96% in the spending of $\epsilon$.
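A minimal, generic sketch of the budget-reuse idea (not the EDH-IIoT chaincode): repeated identical queries return the previously released noisy answer instead of spending additional privacy budget, and total spending is tracked against the maximum budget $\epsilon_{t}$. The class and parameter names are illustrative assumptions.

```python
import random

class BudgetedQueryEngine:
    def __init__(self, epsilon_t):
        self.epsilon_t = epsilon_t      # maximum privacy budget
        self.spent = 0.0                # total epsilon spent so far
        self.cache = {}                 # query id -> previously released noisy answer

    def answer(self, query_id, true_value, sensitivity, epsilon):
        if query_id in self.cache:      # reuse: no extra budget is consumed
            return self.cache[query_id]
        if self.spent + epsilon > self.epsilon_t:
            raise RuntimeError("maximum privacy budget epsilon_t would be exceeded")
        # Laplace(scale = sensitivity/epsilon) noise as the difference of two exponentials.
        noise = random.expovariate(epsilon / sensitivity) - random.expovariate(epsilon / sensitivity)
        noisy = true_value + noise
        self.spent += epsilon
        self.cache[query_id] = noisy
        return noisy

engine = BudgetedQueryEngine(epsilon_t=1.0)
print(engine.answer("total_shipments", true_value=120, sensitivity=1, epsilon=0.2))
print(engine.answer("total_shipments", true_value=120, sensitivity=1, epsilon=0.2))  # reused, no new spending
```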
We present an implicit-explicit finite volume scheme for two-fluid single-temperature flow in all Mach number regimes, based on a symmetric hyperbolic thermodynamically compatible description of the fluid flow. The scheme is stable for large time steps controlled by the interface transport and is computationally efficient due to its linearly implicit character. The latter is achieved by linearizing along constant reference states given by the asymptotic analysis of the single-temperature model. Thus, the use of a stiffly accurate IMEX Runge-Kutta time integration and the centered treatment of pressure-based quantities provably guarantee the asymptotic-preserving property of the scheme for the weakly compressible Euler equations with variable volume fraction. The properties of the first- and second-order schemes are validated by several numerical test cases.
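As a generic illustration of the linearly implicit IMEX idea (with the flux splitting left abstract, not the specific splitting of the single-temperature model), a first-order step for a system $\partial_t q + \nabla\cdot F_{\mathrm{ex}}(q) + \nabla\cdot F_{\mathrm{im}}(q) = 0$ reads
\[
\frac{q^{n+1}-q^{n}}{\Delta t} + \nabla\cdot F_{\mathrm{ex}}(q^{n}) + \nabla\cdot\big(F_{\mathrm{im}}(\bar q) + \partial_q F_{\mathrm{im}}(\bar q)\,(q^{n+1}-\bar q)\big) = 0,
\]
where the nonstiff flux $F_{\mathrm{ex}}$ is treated explicitly and the stiff flux $F_{\mathrm{im}}$ is linearised about a constant reference state $\bar q$ and treated implicitly, so that only a linear system has to be solved per time step; higher-order versions replace this single step by a stiffly accurate IMEX Runge-Kutta tableau.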
Warm-Start reinforcement learning (RL), aided by a prior policy obtained from offline training, is emerging as a promising RL approach for practical applications. Recent empirical studies have demonstrated that the performance of Warm-Start RL can improve \textit{quickly} in some cases but become \textit{stagnant} in others, especially when function approximation is used. To this end, the primary objective of this work is to build a fundamental understanding of ``\textit{whether and when online learning can be significantly accelerated by a warm-start policy from offline RL}''. Specifically, we consider the widely used Actor-Critic (A-C) method with a prior policy. We first quantify the approximation errors in the Actor update and the Critic update, respectively. Next, we cast the Warm-Start A-C algorithm as Newton's method with perturbation, and study the impact of the approximation errors on the finite-time learning performance with inaccurate Actor/Critic updates. Under some general technical conditions, we derive upper bounds, which shed light on achieving the desired finite-time learning performance in the Warm-Start A-C algorithm. In particular, our findings reveal that it is essential to reduce the algorithm bias in online learning. We also obtain lower bounds on the sub-optimality gap of the Warm-Start A-C algorithm to quantify the impact of the bias and error propagation.
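For concreteness, a minimal, generic sketch of warm-start actor-critic is given below (illustrative only, not the paper's exact algorithm or error analysis): the actor is initialized from a prior policy obtained offline, and one-step TD errors then drive standard online updates. A gymnasium-style discrete-action environment and an actor that outputs action logits are assumed.

```python
import copy
import torch
import torch.nn as nn

def warm_start_actor_critic(prior_actor: nn.Module, critic: nn.Module, env, steps=1000,
                            actor_lr=1e-3, critic_lr=1e-3, gamma=0.99):
    actor = copy.deepcopy(prior_actor)          # warm start: initialize from the offline policy
    a_opt = torch.optim.Adam(actor.parameters(), lr=actor_lr)
    c_opt = torch.optim.Adam(critic.parameters(), lr=critic_lr)
    obs, _ = env.reset()
    for _ in range(steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=actor(obs_t))
        action = dist.sample()
        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        # Critic update: squared one-step TD error.
        next_v = 0.0 if terminated else critic(torch.as_tensor(next_obs, dtype=torch.float32)).squeeze().detach()
        td_error = reward + gamma * next_v - critic(obs_t).squeeze()
        c_opt.zero_grad(); td_error.pow(2).backward(); c_opt.step()
        # Actor update: policy gradient weighted by the detached TD error (advantage estimate).
        a_opt.zero_grad(); (-dist.log_prob(action) * td_error.detach()).backward(); a_opt.step()
        obs = env.reset()[0] if terminated or truncated else next_obs
    return actor
```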
In the investment industry, it is often essential to carry out fine-grained company similarity quantification for a range of purposes, including market mapping, competitor analysis, and mergers and acquisitions. We propose and publish a knowledge graph, named CompanyKG, to represent and learn diverse company features and relations. Specifically, 1.17 million companies are represented as nodes enriched with company description embeddings; and 15 different inter-company relations result in 51.06 million weighted edges. To enable a comprehensive assessment of methods for company similarity quantification, we have devised and compiled three evaluation tasks with annotated test sets: similarity prediction, competitor retrieval and similarity ranking. We present extensive benchmarking results for 11 reproducible predictive methods categorized into three groups: node-only, edge-only, and node+edge. To the best of our knowledge, CompanyKG is the first large-scale heterogeneous graph dataset originating from a real-world investment platform, tailored for quantifying inter-company similarity.
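A minimal sketch of the simplest "node-only" style baseline mentioned above: scoring company pairs by the cosine similarity of their description embeddings, which also yields a ranking for competitor retrieval. The variable names and data are illustrative assumptions, not the CompanyKG API.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical description embeddings, e.g. taken from the dataset's node features.
embeddings = {"acme": np.random.rand(64), "globex": np.random.rand(64), "initech": np.random.rand(64)}

# Similarity prediction: score a candidate pair.
print(cosine_similarity(embeddings["acme"], embeddings["globex"]))

# Competitor retrieval / similarity ranking: rank all other companies for a query company.
ranking = sorted((c for c in embeddings if c != "acme"),
                 key=lambda c: cosine_similarity(embeddings["acme"], embeddings[c]), reverse=True)
print(ranking)
```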
Although existing techniques have proposed automated approaches to alleviate the path explosion problem of symbolic execution, users still need to optimize symbolic execution by carefully applying various searching strategies. As existing approaches mainly support only coarse-grained global searching strategies, they cannot efficiently traverse complex code structures. In this paper, we propose Eunomia, a symbolic execution technique that allows users to specify local domain knowledge to enable fine-grained search. In Eunomia, we design an expressive DSL, Aes, that lets users precisely pinpoint local searching strategies to different parts of the target program. To further optimize local searching strategies, we design an interval-based algorithm that automatically isolates the context of variables for different local searching strategies, avoiding conflicts between local searching strategies for the same variable. We implement Eunomia as a symbolic execution platform targeting WebAssembly, which enables us to analyze applications written in various languages (such as C and Go) that can be compiled into WebAssembly. To the best of our knowledge, Eunomia is the first symbolic execution engine that supports the full features of the WebAssembly runtime. We evaluate Eunomia with a dedicated microbenchmark suite for symbolic execution and six real-world applications. Our evaluation shows that Eunomia accelerates bug detection in real-world applications by up to three orders of magnitude. According to the results of a comprehensive user study, users can significantly improve the efficiency and effectiveness of symbolic execution by writing a simple and intuitive Aes script. Besides verifying six known real-world bugs, Eunomia also detected two new zero-day bugs in a popular open-source project, Collections-C.
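To convey the intuition of "local" searching strategies, here is a minimal, generic sketch in which symbolic states are prioritized differently depending on which program region they are currently in, rather than by one global strategy. It illustrates the concept only; the region table, strategy names, and state interface are hypothetical and it is neither the Aes DSL nor Eunomia's scheduler.

```python
# Hypothetical region table: (start_pc, end_pc) -> local strategy name.
REGION_STRATEGIES = {(0x100, 0x1FF): "prefer-deep", (0x200, 0x2FF): "prefer-shallow"}

def priority(state):
    """Lower value = scheduled earlier; `state` is assumed to expose .pc and .depth."""
    for (lo, hi), strategy in REGION_STRATEGIES.items():
        if lo <= state.pc <= hi:
            return -state.depth if strategy == "prefer-deep" else state.depth
    return 0  # neutral priority outside any annotated region

def pick_next(worklist):
    """Select the next symbolic state to explore according to its region-local strategy."""
    return min(worklist, key=priority)
```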
With various AI tools such as ChatGPT becoming increasingly popular, we are entering a true AI era. We can foresee that exceptional AI tools will soon reap considerable profits. A crucial question arises: should AI tools share revenue with their training data providers in addition to traditional stakeholders and shareholders? The answer is yes. Large AI tools, such as large language models, always require more and better-quality data to continuously improve, but current copyright laws limit their access to various types of data. Sharing revenue between AI tools and their data providers could transform the current hostile, zero-sum relationship between AI tools and the majority of copyrighted data owners into a collaborative and mutually beneficial one, which is necessary to create a virtuous cycle among AI tools, their users, and data providers that drives AI technology forward and builds a healthy AI ecosystem. However, current revenue-sharing business models do not work for AI tools in the forthcoming AI era, since the most widely used metrics for website-based traffic and actions, such as clicks, will be replaced by new metrics such as prompts and cost per prompt for generative AI tools. A completely new revenue-sharing business model, which must be almost independent of AI tools and easily explained to data providers, needs to establish a prompt-based scoring system to measure the data engagement of each data provider. This paper systematically discusses how to build such a scoring system for all data providers of AI tools based on classification and content similarity models, and outlines the requirements for AI tools or third parties to build it. Sharing revenue with data providers using such a scoring system would encourage more data owners to participate in the revenue-sharing program. This will be a utilitarian AI era in which all parties benefit.
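A minimal sketch of a prompt-based scoring system of the kind argued for above: each prompt is attributed to data providers in proportion to the content similarity between the prompt and that provider's data. The embedding function, provider corpora, and attribution rule are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice a content-similarity or classification model would be used."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def attribute_prompt(prompt: str, provider_docs: dict) -> dict:
    """Return each provider's share of engagement for one prompt (shares sum to 1)."""
    p = embed(prompt)
    scores = {}
    for provider, docs in provider_docs.items():
        sims = [float(p @ embed(d) / (np.linalg.norm(p) * np.linalg.norm(embed(d)) + 1e-12)) for d in docs]
        scores[provider] = max(max(sims), 0.0)   # best-matching document, clipped at zero
    total = sum(scores.values()) or 1.0
    return {provider: s / total for provider, s in scores.items()}

shares = attribute_prompt("how do I pin package versions in a Dockerfile?",
                          {"provider_a": ["docker documentation mirror"], "provider_b": ["cooking recipes"]})
print(shares)   # revenue for this prompt would be split according to these shares
```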
Explainable Information Retrieval (XIR) is a growing research area focused on enhancing the transparency and trustworthiness of the complex decision-making processes taking place in modern information retrieval systems. While there has been progress in developing XIR systems, empirical evaluation tools to assess the degree of explainability attained by such systems are lacking. To close this gap and gain insights into the true merit of XIR systems, we extend existing insights from a factor analysis of search explainability to introduce SSE (Search System Explainability), an evaluation metric for XIR search systems. Through a crowdsourced user study, we demonstrate SSE's ability to distinguish between explainable and non-explainable systems, showing that systems with higher scores are indeed perceived as more interpretable. Additionally, we observe comparable perceived temporal demand and performance levels between non-native and native English speakers. We hope that, aside from these concrete contributions to XIR, this line of work will serve as a blueprint for similar explainability evaluation efforts in other domains of machine learning and natural language processing.
Two numerical schemes are proposed and investigated for the Yang--Mills equations, which can be seen as a nonlinear generalisation of the Maxwell equations set on Lie algebra-valued functions, with similarities to certain formulations of General Relativity. Both schemes are built on the Discrete de Rham (DDR) method, and inherit its main features: an arbitrary order of accuracy, and applicability to generic polyhedral meshes. They make use of the complex property of the DDR, together with a Lagrange-multiplier approach, to preserve, at the discrete level, a nonlinear constraint associated with the Yang--Mills equations. We also show that the schemes satisfy a discrete energy dissipation property (the dissipation coming solely from the implicit time stepping). Issues around the practical implementation of the schemes are discussed; in particular, the assembly of the local contributions in a way that minimises the cost of dealing with the nonlinear terms, in conjunction with the tensorisation coming from the Lie algebra. Numerical tests are provided using a manufactured solution, and show that both schemes display convergence in the $L^2$-norm of the potential and electric fields in $\mathcal O(h^{k+1})$ (provided that the time step is of that order), where $k$ is the polynomial degree chosen for the DDR complex. We also numerically demonstrate the preservation of the constraint.
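For context (and as an assumption about notation, not a statement of the paper's exact formulation), the continuous Yang--Mills equations can be written in terms of the Lie algebra-valued potential $A$ as
\[
F = \mathrm{d}A + \tfrac12 [A \wedge A], \qquad \mathrm{d}_A \star F = 0,
\]
where $\mathrm{d}_A$ denotes the covariant exterior derivative and $\star$ the Hodge star; when the Lie algebra is abelian the bracket vanishes and the system reduces to the Maxwell equations, the constraint being the non-abelian counterpart of the Gauss (divergence) constraint on the electric field.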