This paper presents a critical analysis of generative Artificial Intelligence (AI) detection tools in higher education assessments. The rapid advancement and widespread adoption of generative AI, particularly in education, necessitate a reevaluation of traditional academic integrity mechanisms. We explore the effectiveness, vulnerabilities, and ethical implications of AI detection tools in the context of preserving academic integrity. Our study synthesises insights from various case studies, newspaper articles, and student testimonies to scrutinise the practical and philosophical challenges associated with AI detection. We argue that reliance on detection mechanisms is misaligned with an educational landscape in which AI plays an increasingly prominent role. This paper advocates for a strategic shift towards robust assessment methods and educational policies that embrace generative AI usage while ensuring academic integrity and authenticity in assessments.
Source code plagiarism is a significant issue in educational practice, and educators need user-friendly tools to cope with such academic dishonesty. This article introduces the latest version of Dolos, a state-of-the-art ecosystem of tools for detecting and preventing plagiarism in educational source code. In this new version, the primary focus has been on enhancing the user experience. Educators can now run the entire plagiarism detection pipeline from a new web app in their browser, eliminating the need for any installation or configuration. Completely redesigned analytics dashboards provide an instant assessment of whether a collection of source files contains suspected cases of plagiarism and how widespread plagiarism is within the collection. The dashboards support hierarchically structured navigation to facilitate zooming in and out of suspect cases. Clusters are an essential new component of the dashboard design, reflecting the observation that plagiarism can occur among larger groups of students. To meet various user needs, the Dolos software stack for source code plagiarism detection now includes a web interface, a JSON application programming interface (API), a command line interface (CLI), a JavaScript library and a preconfigured Docker container. Clear documentation and a free-to-use instance of the web app can be found at //dolos.ugent.be. The source code is also available on GitHub.
This paper introduces a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on the adaptation of Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals significant insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges faced by current LLMs in generating diverse and high-quality text.
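The abstract does not specify how precision and recall are computed for text. The following is a minimal sketch that assumes the k-nearest-neighbour manifold estimate used for image generation, applied to embedding vectors of reference and generated texts. The function names, the choice of k, and the use of plain Euclidean distance are illustrative assumptions and are not taken from the paper. Precision is the fraction of generated samples that fall inside the estimated support of the reference samples, which measures quality. Recall is the fraction of reference samples that fall inside the estimated support of the generated samples, which measures diversity.

```python
import numpy as np

def knn_radii(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour within the same set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting, column 0 is the point itself (distance 0), so column k
    # holds the distance to the k-th nearest *other* point.
    return np.sort(d, axis=1)[:, k]

def coverage(queries: np.ndarray, support: np.ndarray, radii: np.ndarray) -> float:
    """Fraction of query points lying inside at least one ball B(support_j, radii_j)."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real: np.ndarray, gen: np.ndarray, k: int = 3):
    """Hypothetical helper: k-NN precision (quality) and recall (diversity) on embeddings."""
    precision = coverage(gen, real, knn_radii(real, k))  # generated inside real support
    recall = coverage(real, gen, knn_radii(gen, k))      # real inside generated support
    return precision, recall
```

When a generator reproduces only part of the reference distribution, precision stays high and recall drops. This is the kind of quality-diversity trade-off that the abstract reports.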
The possibility of unmeasured confounding is one of the main limitations for causal inference from observational studies. Several methods exist for partially assessing the plausibility of unconfoundedness empirically. However, most currently available methods require (at least partial) assumptions about the confounding structure, which may be difficult to know in practice. In this paper we describe a simple strategy for empirically assessing the plausibility of conditional unconfoundedness (i.e., whether the candidate set of covariates suffices for confounding adjustment) that does not require any assumptions about the confounding structure, requiring instead assumptions related to temporal ordering between covariates, exposure and outcome (which can be guaranteed by design), measurement error, and selection into the study. The proposed method essentially relies on testing the association between a subset of covariates (those associated with the exposure given all other covariates) and the outcome, conditional on the remaining covariates and the exposure. We describe the assumptions underlying the method, provide proofs, use simulations to corroborate the theory, and illustrate the method with an applied example assessing the causal effect of length-for-age measured in childhood on intelligence quotient measured in adulthood, using data from the 1982 Pelotas (Brazil) birth cohort. We also discuss the implications of measurement error and some important limitations.
The increasing requirements for data protection and privacy have attracted considerable research interest in distributed artificial intelligence and specifically in federated learning, an emerging machine learning approach that allows several participants, each holding their own private data, to jointly build a model. In the initial proposal of federated learning, the architecture was centralised and aggregation was done with federated averaging, meaning that a central server orchestrates the federation using the most straightforward averaging strategy. This research focuses on testing different federated strategies in a peer-to-peer environment. The authors propose various aggregation strategies for federated learning, including weighted averaging aggregation using different factors and strategies based on participant contribution. The strategies are tested with varying data sizes to identify the most robust ones. This research tests the strategies with several biomedical datasets, and the results of the experiments show that the accuracy-based weighted average outperforms the classical federated averaging method.
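The sketch below contrasts classic federated averaging, where weights are proportional to local dataset size, with an accuracy-based weighted average. It makes several hypothetical assumptions. Each model is a list of per-layer NumPy arrays. The accuracy weights come from some validation score reported by each participant. In a peer-to-peer setting, each peer would apply the same aggregation to the models it receives instead of relying on a central server. The exact weighting factors used by the authors are not reproduced.

```python
import numpy as np

def weighted_average(models, weights):
    """Aggregate per-layer parameters as a convex combination of participant models."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    # zip(*models) groups the i-th layer of every participant together.
    return [sum(wi * layer for wi, layer in zip(w, layers)) for layers in zip(*models)]

def fedavg(models, n_samples):
    """Classic federated averaging: weight each model by its local dataset size."""
    return weighted_average(models, n_samples)

def accuracy_weighted(models, accuracies):
    """Hypothetical variant: weight each model by its (validation) accuracy."""
    return weighted_average(models, accuracies)
```

Both strategies share one aggregation routine and differ only in where the weights come from. This makes it straightforward to plug in other contribution-based factors.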
Objective This study investigates what kind of conceptions primary school students have about machine learning (ML) if they are not conceptually "primed" with the idea that in ML, humans teach computers. Method Qualitative survey responses from 197 Finnish primary schoolers were analyzed via an abductive method. Findings We identified three partly overlapping ML conception categories, starting from the most accurate one: ML is about teaching machines (34%), ML is about coding (7.6%), and ML is about learning via or about machines (37.1%). Implications The findings suggest that without conceptual clues, children's conceptions of ML are varied and may include misconceptions, such as the idea that ML is about learning via or about machines. The findings underline the importance of clear and systematic use of key concepts in computer science education. Besides researchers, this study offers insights for teachers, teacher educators, curriculum developers, and policymakers.
What is the best paradigm to recognize objects -- discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data. We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.
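One common way to turn a text-conditioned diffusion model into a classifier is to noise the input image, ask the model to predict the noise under each candidate class prompt, and pick the class with the lowest average denoising error. We assume this scoring rule here and do not claim that it matches the paper's exact procedure. The sketch replaces the real denoiser with a hypothetical stand-in: the analytically optimal noise predictor when x given class c is distributed as N(mu_c, I). Under that assumption, E[eps | x_t, c] = sqrt(1 - a) * (x_t - sqrt(a) * mu_c), where a is the cumulative noise-schedule value.

```python
import numpy as np

def toy_noise_predictor(x_t, alpha_bar, mu_c):
    """Hypothetical stand-in for a text-conditioned denoiser: the Bayes-optimal noise
    prediction when x | c ~ N(mu_c, I), so x_t | c ~ N(sqrt(a) * mu_c, I)."""
    return np.sqrt(1.0 - alpha_bar) * (x_t - np.sqrt(alpha_bar) * mu_c)

def generative_classify(x, class_means, n_samples=256, rng=None):
    """Return (argmin class, per-class mean denoising errors) for input vector x."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Share the same noise draws and noise levels across classes to reduce variance.
    eps = rng.normal(size=(n_samples, x.shape[0]))
    alpha_bar = rng.uniform(0.05, 0.95, size=(n_samples, 1))
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    errors = []
    for mu_c in class_means:
        eps_hat = toy_noise_predictor(x_t, alpha_bar, mu_c)
        errors.append(float(np.mean(np.sum((eps - eps_hat) ** 2, axis=1))))
    return int(np.argmin(errors)), errors
```

Inference requires one denoiser pass per class and noise sample. This is why the abstract describes generative classification as slow compared with a single discriminative forward pass.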
Instrumental variables are widely used in econometrics and epidemiology for identifying and estimating causal effects when an exposure of interest is confounded by unmeasured factors. Despite this popularity, the assumptions invoked to justify the use of instruments differ substantially across the literature. Similarly, statistical approaches for estimating the resulting causal quantities vary considerably, and often rely on strong parametric assumptions. In this work, we compile and organize structural conditions that nonparametrically identify conditional average treatment effects, average treatment effects among the treated, and local average treatment effects, with a focus on identification formulae invoking the conditional Wald estimand. Moreover, we build upon existing work and propose nonparametric efficient estimators of functionals corresponding to marginal and conditional causal contrasts resulting from the various identification paradigms. We illustrate the proposed methods on an observational study examining the effects of operative care on adverse events for cholecystitis patients, and a randomized trial assessing the effects of market participation on political views.
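For a binary instrument Z, exposure A, outcome Y and covariates X, the conditional Wald estimand is (E[Y | Z=1, X=x] - E[Y | Z=0, X=x]) / (E[A | Z=1, X=x] - E[A | Z=0, X=x]). The sketch below computes a simple plug-in version from stratum means for a discrete covariate. It is not one of the efficient estimators proposed in the paper. The simulated data are hypothetical: they assume a valid instrument and a treatment effect that is homogeneous within each stratum (1 + x), under which the ratio recovers that effect despite the unmeasured confounder u.

```python
import numpy as np

def conditional_wald(y, a, z, x):
    """Plug-in conditional Wald ratio for each level of a discrete covariate x."""
    estimates = {}
    for level in np.unique(x):
        m = x == level
        num = y[m & (z == 1)].mean() - y[m & (z == 0)].mean()  # instrument -> outcome
        den = a[m & (z == 1)].mean() - a[m & (z == 0)].mean()  # instrument -> exposure
        estimates[int(level)] = num / den
    return estimates

# Hypothetical simulation: z is randomised, u confounds a and y, effect is 1 + x.
rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
u = rng.normal(size=n)
a = (z + 0.3 * x + u > 0.5).astype(float)
y = (1.0 + x) * a + u + x + rng.normal(size=n)
est = conditional_wald(y, a, z, x)
```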
Motivated by the important statistical role of sparsity, the paper uncovers four reparametrizations for covariance matrices in which sparsity is associated with conditional independence graphs in a notional Gaussian model. The intimate relationship between the Iwasawa decomposition of the general linear group and the open cone of positive definite matrices allows a unifying perspective. Specifically, the positive definite cone can be reconstructed without loss or redundancy from the exponential map applied to four Lie subalgebras determined by the Iwasawa decomposition of the general linear group. This accords geometric interpretations to the reparametrizations and the corresponding notion of sparsity. Conditions that ensure legitimacy of the reparametrizations for statistical models are identified. While the focus of this work is on understanding population-level structure, there are strong methodological implications. In particular, since the population-level sparsity manifests in a vector space, imposition of sparsity on relevant sample quantities produces a covariance estimate that respects the positive definite cone constraint.
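The closing claim, that sparsity imposed in a vector-space parametrization still yields a positive definite covariance estimate, can be illustrated with the matrix logarithm. Any symmetric matrix exponentiates to a positive definite one. The matrix logarithm serves here only as a familiar example of such a parametrization. We do not claim that it is one of the paper's four reparametrizations, nor that hard thresholding of off-diagonal entries is the authors' estimator.

```python
import numpy as np

def sym_fn(m: np.ndarray, fn) -> np.ndarray:
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * fn(vals)) @ vecs.T  # V diag(fn(vals)) V^T

def sparse_log_covariance_estimate(sample_cov: np.ndarray, threshold: float) -> np.ndarray:
    """Hypothetical estimator: threshold the matrix log of a PD sample covariance."""
    log_s = sym_fn(sample_cov, np.log)      # lives in the vector space of symmetric matrices
    log_s = 0.5 * (log_s + log_s.T)         # remove round-off asymmetry before thresholding
    off_diag = ~np.eye(len(log_s), dtype=bool)
    log_s[off_diag & (np.abs(log_s) < threshold)] = 0.0  # impose sparsity
    return sym_fn(log_s, np.exp)            # exp of a symmetric matrix is positive definite
```

No projection back onto the positive definite cone is needed. The eigenvalues of the result are exponentials of real numbers and are therefore strictly positive.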
Artificial intelligence (AI) advances and the rapid adoption of generative AI tools like ChatGPT present new opportunities and challenges for higher education. While substantial literature discusses AI in higher education, there is a lack of a systemic approach that captures a holistic view of the AI transformation of higher education institutions (HEIs). To fill this gap, this article, taking a complex systems approach, develops a causal loop diagram (CLD) to map the causal feedback mechanisms of AI transformation in a typical HEI. Our model accounts for the forces that drive the AI transformation and the consequences of the AI transformation on value creation in a typical HEI. The article identifies and analyzes several reinforcing and balancing feedback loops, showing how, motivated by AI technology advances, the HEI invests in AI to improve student learning, research, and administration. The HEI must take measures to deal with academic integrity problems and adapt to changes in available jobs due to AI, emphasizing AI-complementary skills for its students. However, HEIs face a competitive threat and several policy traps that may lead to decline. HEI leaders need to become systems thinkers to manage the complexity of the AI transformation and benefit from the AI feedback loops while avoiding the associated pitfalls. We also discuss long-term scenarios, the notion of HEIs influencing the direction of AI, and directions for future research on AI transformation.
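In a causal loop diagram, a feedback loop is reinforcing when the product of its link polarities is positive and balancing when it is negative. The sketch below applies that rule. The variable names and link signs are hypothetical illustrations loosely inspired by the themes in the abstract (AI investment, student learning, academic integrity). They are not the article's actual model.

```python
from math import prod

def loop_polarity(loop, links):
    """'reinforcing' if the product of link signs around the loop is positive, else
    'balancing'. `loop` lists variables in order; the last one links back to the first."""
    signs = [links[(src, dst)] for src, dst in zip(loop, loop[1:] + loop[:1])]
    return "reinforcing" if prod(signs) > 0 else "balancing"

# Hypothetical signed links (+1: same direction, -1: opposite direction).
links = {
    ("AI investment", "learning outcomes"): +1,
    ("learning outcomes", "HEI reputation"): +1,
    ("HEI reputation", "AI investment"): +1,
    ("student AI use", "integrity incidents"): +1,
    ("integrity incidents", "integrity measures"): +1,
    ("integrity measures", "student AI use"): -1,
}
```

A single negative link makes a loop balancing. An even number of negative links keeps it reinforcing.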
Artificial neural networks excel at solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, and endeavours to extend this knowledge without targeting the original task result in catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern 1) a taxonomy and extensive overview of the state-of-the-art, 2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner, and 3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks: Tiny ImageNet, the large-scale unbalanced iNaturalist, and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time, and storage.
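The abstract does not detail how the stability-plasticity trade-off is set for each task. One plausible reading, assumed here, works in two steps for each new task. First, measure the accuracy reachable with plain fine-tuning (maximal plasticity). Then start from a strong stability hyperparameter and decay it until new-task accuracy comes within a tolerated margin of that reference. The function `train_new_task` is hypothetical: it trains on the new task with a given regularization strength and returns new-task accuracy. The decay factor and margin are likewise assumptions.

```python
from typing import Callable

def select_stability(train_new_task: Callable[[float], float], lam_init: float,
                     decay: float = 0.5, margin: float = 0.2, max_steps: int = 20) -> float:
    """Return the first strength in lam_init * decay**i whose new-task accuracy is
    within a relative `margin` of the plain fine-tuning accuracy."""
    reference = train_new_task(0.0)  # maximal plasticity: no stability regularization
    lam = lam_init
    for _ in range(max_steps):
        if train_new_task(lam) >= (1.0 - margin) * reference:
            return lam  # most stable setting that still learns the new task well enough
        lam *= decay    # trade some stability for plasticity and retry
    return 0.0          # fall back to plain fine-tuning if no setting qualifies
```

Because the search only uses new-task data, it does not require access to validation data from earlier tasks. This matters for a continual learner.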