Physical experiments in the national security domain are often expensive and time-consuming. Test engineers must certify the compatibility of aircraft and their weapon systems before they can be deployed in the field, but the required testing is time-consuming, expensive, and resource-limited. Adopting Bayesian adaptive designs is a promising way to borrow from the successes seen in the clinical trials domain. The use of predictive probability (PP) to stop testing early and make faster decisions is particularly appealing given these constraints. Given the high-consequence nature of tests performed in the national security space, new methods must be well understood before they are deployed. Although PP has been thoroughly studied for binary data, there is less work with continuous data, which often arises in reliability studies interested in certifying the specification limits of components. A simulation study evaluating the robustness of this approach indicates that early stopping based on PP is reasonably robust to minor assumption violations, especially when only a few interim analyses are conducted. A post-hoc analysis exploring whether the release requirements of a weapon system from an aircraft are within specification with the desired reliability resulted in stopping the experiment early, saving 33% of the experimental runs.
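The predictive-probability early-stopping idea described above can be sketched in a few lines of Monte Carlo. This is a minimal illustration assuming a normal model with known standard deviation and a flat prior on the mean; the model, the success criterion, and all names here are illustrative assumptions, not the paper's actual design.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def predictive_probability(y_obs, n_final, spec_limit, sigma=1.0,
                           post_thresh=0.95, n_sims=2000):
    """Monte Carlo predictive probability that the experiment will
    'succeed' at the final sample size, under a normal model with known
    sigma and a flat prior on the mean (simplifying assumptions).
    Success here means the final posterior puts probability at least
    `post_thresh` on the mean exceeding `spec_limit`."""
    y_obs = np.asarray(y_obs, dtype=float)
    n = len(y_obs)
    n_future = n_final - n
    successes = 0
    for _ in range(n_sims):
        # Draw a mean from the interim posterior N(ybar, sigma^2 / n) ...
        mu = rng.normal(y_obs.mean(), sigma / math.sqrt(n))
        # ... simulate the remaining observations ...
        y_future = rng.normal(mu, sigma, size=n_future)
        # ... and evaluate the final-analysis posterior N(mean, sigma^2 / N).
        final_mean = np.concatenate([y_obs, y_future]).mean()
        p_above = 1.0 - normal_cdf(spec_limit, final_mean,
                                   sigma / math.sqrt(n_final))
        successes += p_above > post_thresh
    return successes / n_sims
```

In an adaptive design of this flavor, testing would stop early for success when the returned probability exceeds a pre-set upper bound, or for futility when it falls below a lower bound.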
Safety assurance is of paramount importance across various domains, including automotive, aerospace, and nuclear energy, where the reliability and acceptability of mission-critical systems are imperative. This assurance is effectively realized through the use of safety assurance cases, which allow for verifying the correctness of a created system's capabilities and thereby preventing system failure. Such failure may result in loss of life, severe injuries, large-scale environmental damage, property destruction, and major economic loss. Still, the emergence of complex technologies such as cyber-physical systems (CPSs), characterized by their heterogeneity, autonomy, machine learning capabilities, and the uncertainty of their operational environments, poses significant challenges for safety assurance activities. Several papers have proposed solutions to tackle these challenges, but to the best of our knowledge, no secondary study investigates the trends, patterns, and relationships characterizing the safety case scientific literature. This makes it difficult to have a holistic view of the safety case landscape and to identify the most promising future research directions. In this paper, we therefore rely on state-of-the-art bibliometric tools (e.g., VOSviewer) to conduct a bibliometric analysis that allows us to generate valuable insights, identify key authors and venues, and gain a bird's-eye view of the current state of research in the safety assurance area. By revealing knowledge gaps and highlighting potential avenues for future research, our analysis provides an essential foundation for researchers, corporate safety analysts, and regulators seeking to adopt or enhance safety practices that align with their specific needs and objectives.
Static analysis tools are commonly used to detect defects before code is released. Previous research has focused on their overall effectiveness and their ability to detect defects. However, little is known about the usage patterns of warning suppressions: the configurations developers set up to prevent the appearance of specific warnings. We address this gap by analyzing how often warning suppression features are used, which features are used and for what purpose, and how the use of warning suppression annotations could be avoided. To answer these questions we examine 1,425 open-source Java-based projects that utilize FindBugs or SpotBugs for warning-suppressing configurations and source code annotations. We find that although most warnings are suppressed, only a small portion of them are frequently suppressed. Contrary to expectations, false positives account for a minor proportion of suppressions. A significant number of suppressions introduce technical debt, suggesting potential disregard for code quality or a lack of appropriate guidance from the tool. Misleading suggestions and incorrect assumptions also lead to suppressions. These findings underscore the need for better communication and education around the use of static analysis tools, improved bug pattern definitions, and better code annotation. Future research can extend these findings to other static analysis tools and apply them to improve the effectiveness of static analysis.
The adoption of cyber-physical systems (CPS) is on the rise in complex physical environments, encompassing domains such as autonomous vehicles, the Internet of Things (IoT), and smart cities. A critical attribute of CPS is robustness, denoting their capacity to operate safely despite potential disruptions and uncertainties in the operating environment. This paper proposes a novel specification-based notion of robustness that characterizes the effectiveness of a controller in meeting a specified system requirement, articulated through Signal Temporal Logic (STL), while accounting for possible deviations in the system. The paper also proposes the robustness falsification problem based on this definition, which involves identifying minor deviations capable of violating the specified requirement. We present an innovative two-layer simulation-based analysis framework designed to identify subtle robustness violations. To assess our methodology, we devise a series of benchmark problems wherein system parameters can be adjusted to emulate various forms of uncertainties and disturbances. Initial evaluations indicate that our falsification approach proficiently identifies robustness violations, providing valuable insights for comparing robustness between conventional and reinforcement learning (RL)-based controllers.
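The STL requirements mentioned above come with a standard quantitative semantics: a specification evaluates not just to true/false on a trace, but to a real-valued margin. As a minimal sketch for two basic operators (the paper's framework is far richer than this), the robustness of "always x < c" over a finite trace is the worst-case margin, and of "eventually x > c" the best-case margin:

```python
import numpy as np

def rob_always_lt(signal, c):
    """Robustness of G(x < c): the minimum over time of (c - x(t)).
    Positive -> satisfied with that much margin; negative -> violated."""
    return float(np.min(c - np.asarray(signal, dtype=float)))

def rob_eventually_gt(signal, c):
    """Robustness of F(x > c): the maximum over time of (x(t) - c)."""
    return float(np.max(np.asarray(signal, dtype=float) - c))
```

A falsifier in this style searches for system deviations that drive such a robustness value below zero.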
Mathematical formulas serve as a means of communication between humans and nature, encapsulating the operational laws governing natural phenomena. The concise formulation of these laws is a crucial objective in scientific research and an important challenge for artificial intelligence (AI). While traditional artificial neural networks such as multilayer perceptrons (MLPs) excel at data fitting, they often yield uninterpretable black-box results that hinder our understanding of the relationship between variables x and predicted values y. Moreover, the fixed network architecture of an MLP often gives rise to redundancy in both network structure and parameters. To address these issues, we propose MetaSymNet, a novel neural network that dynamically adjusts its structure in real time, allowing for both expansion and contraction. This adaptive network employs the PANGU meta function as its activation function, a unique type capable of evolving into various basic functions during training to compose mathematical formulas tailored to specific needs. We then evolve the neural network into a concise, interpretable mathematical expression. To evaluate MetaSymNet's performance, we compare it with four state-of-the-art symbolic regression algorithms across more than 10 public datasets comprising 222 formulas. Our experimental results demonstrate that our algorithm consistently outperforms the others, with and without noise. Furthermore, we assess MetaSymNet against MLP and SVM in terms of fitting ability and extrapolation capability, two essential aspects of machine learning algorithms. The findings reveal that our algorithm excels in both areas. Finally, we compare the network structural complexity of MetaSymNet with that of an MLP subjected to iterative pruning. The results show that MetaSymNet's network structure is markedly less complex than the MLP's at the same goodness of fit.
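The idea of an activation function that can "evolve into" different basic functions can be illustrated with a learnable softmax mixture over a small library of primitives. This is a hypothetical sketch of the general mechanism, not the actual PANGU meta function, whose definition is given in the paper.

```python
import numpy as np

# A small library of primitive functions the activation can collapse to.
PRIMITIVES = [np.sin, np.cos, np.tanh, lambda x: x, lambda x: x ** 2]

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def meta_activation(x, weights):
    """A learnable mixture of primitive functions. As training sharpens
    `weights`, the mixture collapses toward a single primitive, which can
    then be read off as a symbolic operator in the extracted formula."""
    probs = softmax(np.asarray(weights, dtype=float))
    return sum(p * f(x) for p, f in zip(probs, PRIMITIVES))
```

With near-one-hot weights, the activation behaves like one named primitive, which is what makes the trained network convertible into a readable expression.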
The recent emergence of large language models (LLMs) has attracted considerable attention. LLMs may interact with users in the form of dialogue and generate responses following their instructions, which naturally requires dialogue comprehension abilities. Without correct comprehension of the dialogue, the model may inevitably generate incorrect responses. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to perform the evaluation with the help of the dialogue summarization task. Besides evaluating and analyzing dialogue summarization performance (DIAC-Sum), we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest evaluated model, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average accuracy of all evaluated LLMs is only 62.8%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject/object of the conversation remains the most challenging problem for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data. The experimental results demonstrate that our method achieves an accuracy improvement of 8.9% on DIAC-FactQA.
This paper addresses a critical challenge in robotics: enabling robots to operate seamlessly in human environments through natural language interactions. Our primary focus is to equip robots with the ability to understand and execute complex instructions in coherent dialogs to facilitate intricate task-solving scenarios. To explore this, we build upon the Execution from Dialog History (EDH) task from the TEACh benchmark. We employ a multi-transformer model with a BART language model. We observe that our best configuration outperforms the baseline with a success rate score of 8.85 and a goal-conditioned success rate score of 14.02. In addition, we suggest an alternative methodology for completing this task. Moreover, we introduce a new task by expanding the EDH task and making predictions about game plans instead of individual actions. We evaluated multiple BART models and a LLaMA 2 LLM, which achieved a ROUGE-L score of 46.77 on this task.
Although the synthesis of programs encoding policies often carries the promise of interpretability, systematic evaluations to assess the interpretability of these policies were never performed, likely because of the complexity of such an evaluation. In this paper, we introduce a novel metric that uses large language models (LLMs) to assess the interpretability of programmatic policies. For our metric, an LLM is given both a program and a description of its associated programming language. The LLM then formulates a natural language explanation of the program. This explanation is subsequently fed into a second LLM, which tries to reconstruct the program from the natural language explanation. Our metric measures the behavioral similarity between the reconstructed program and the original. We validate our approach using obfuscated programs that are used to solve classic programming problems. We also assess our metric with programmatic policies synthesized for playing a real-time strategy game, comparing the interpretability scores of programmatic policies synthesized by an existing system to lightly obfuscated versions of the same programs. Our LLM-based interpretability score consistently ranks less interpretable programs lower and more interpretable ones higher. These findings suggest that our metric could serve as a reliable and inexpensive tool for evaluating the interpretability of programmatic policies.
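The final scoring step, behavioral similarity between the original and the reconstructed program, can be sketched as agreement over a set of test inputs. Treating both programs as callables is a simplifying stand-in for however the paper executes and compares policies.

```python
def behavioral_similarity(prog_a, prog_b, test_inputs):
    """Fraction of test inputs on which the two programs (here modeled as
    Python callables: the original policy and the LLM-reconstructed one)
    produce identical outputs. 1.0 means behaviorally indistinguishable
    on this input set; lower values indicate the explanation lost
    information about the program's behavior."""
    matches = sum(prog_a(x) == prog_b(x) for x in test_inputs)
    return matches / len(test_inputs)
```

Under the metric's logic, a highly interpretable program yields a faithful explanation, a faithful reconstruction, and hence a similarity near 1.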
We explain how to use the Kolmogorov Superposition Theorem (KST) to break the curse of dimensionality when approximating a dense class of multivariate continuous functions. We first show that there is a class of functions, called $K$-Lipschitz continuous, in $C([0,1]^d)$ which can be approximated by a special ReLU neural network with two hidden layers at a dimension-independent approximation rate $O(n^{-1})$, with the approximation constant increasing quadratically in $d$. The number of parameters used in such a neural network approximation equals $(6d+2)n$. Next we introduce KB-splines by using linear B-splines to replace the $K$-outer function, and smooth the KB-splines to obtain the so-called LKB-splines as the basis for approximation. Our numerical evidence shows that the curse of dimensionality is broken in the following sense: when using the standard discrete least squares (DLS) method to approximate a continuous function, there exists a pivotal set of points in $[0,1]^d$ of size at most $O(nd)$ such that the root mean squared error (RMSE) of the DLS based on the pivotal set is similar to the RMSE of the DLS based on the original set of size $O(n^d)$. In addition, by using a matrix cross approximation technique, the number of LKB-splines used for approximation is the same as the size of the pivotal data set. Therefore, we need neither many basis functions nor many function values to approximate a high-dimensional continuous function $f$.
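The discrete least squares (DLS) step referenced above is standard: assemble a collocation matrix from the basis functions at the sample points and solve the normal problem in the least squares sense. The sketch below uses a 1-D polynomial basis and toy target as illustrative stand-ins for the LKB-spline basis and high-dimensional function; only the DLS mechanics are the point.

```python
import numpy as np

def dls_fit(basis, xs, fvals):
    """Discrete least squares: find coefficients c minimizing
    ||A c - f||_2, where A[i, j] = basis[j](xs[i])."""
    A = np.column_stack([b(xs) for b in basis])
    c, *_ = np.linalg.lstsq(A, fvals, rcond=None)
    return c

# Toy 1-D example; the quadratic basis stands in for LKB-splines.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]
xs = np.linspace(0.0, 1.0, 50)
f = 2.0 + 3.0 * xs          # exactly representable in the basis
coef = dls_fit(basis, xs, f)
```

The paper's contribution is then about *which* sample points are needed: a pivotal subset of size $O(nd)$, selected via matrix cross approximation, suffices in place of the full $O(n^d)$ grid.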
When is heterogeneity in the composition of an autonomous robotic team beneficial and when is it detrimental? We investigate and answer this question in the context of a minimally viable model that examines the role of heterogeneous speeds in perimeter defense problems, where defenders share a total allocated speed budget. We consider two distinct problem settings and develop strategies based on dynamic programming and on local interaction rules. We present a theoretical analysis of both approaches and our results are extensively validated using simulations. Interestingly, our results demonstrate that the viability of heterogeneous teams depends on the amount of information available to the defenders. Moreover, our results suggest a universality property: across a wide range of problem parameters the optimal ratio of the speeds of the defenders remains nearly constant.
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy-, computation-, and memory-intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically with regard to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.
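As a concrete illustration of category (1), uniform affine quantization maps a float weight tensor to 8-bit integers plus a scale and offset for dequantization. This is a generic sketch of the idea, not any particular framework's API.

```python
import numpy as np

def quantize_uint8(w):
    """Uniform affine quantization of a weight tensor to 8 bits.
    Returns the uint8 codes plus the (scale, offset) needed to map
    them back to approximate float values."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, offset):
    return q.astype(np.float32) * scale + offset

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale, offset = quantize_uint8(w)
w_hat = dequantize(q, scale, offset)
```

The memory win is immediate (1 byte per weight instead of 4), and the per-weight reconstruction error is bounded by half the quantization step; the surveyed techniques differ mainly in how they keep that error from accumulating into accuracy loss.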