Non-Fungible Token (NFT) marketplaces on the Ethereum blockchain saw astonishing growth in 2021, and the trend shows no sign of stopping, with a monthly trading volume of \$6 billion in January 2022. However, such a high trading volume raises questions. The primary concern is wash trading, a market manipulation in which a single entity trades an NFT multiple times to inflate its volume artificially. This paper describes several methodologies for identifying wash trading on Ethereum, from its inception to January 2022, and explores its tangible impact on NFTs. We find that 5.66% of all collections are affected by wash trading, with a total artificial volume of \$3,406,110,774. We study two different ways of profiting from wash trading: increasing the price of an NFT by displaying artificial interest in the asset, and exploiting the reward token systems of some marketplaces. We show that the latter is safer for wash traders since it guarantees a higher expected profit. Our findings indicate that wash trading is a frequent event in the blockchain ecosystem, that reward token systems can incentivize market manipulation, and that marketplaces can introduce countermeasures using the methodologies described in this paper.
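One simple heuristic that identification methodologies of this kind often build on is flagging assets whose sale history loops back to an earlier holder. Below is a minimal, illustrative sketch of that idea; the trade records, wallet addresses, and the cycle rule itself are assumptions made for exposition, not the paper's exact detection criteria.

```python
# Minimal sketch of one common wash-trading heuristic: flag an NFT whose
# sale history returns the asset to a wallet that previously held it.
# Illustrative only; real detection must also handle shared funding
# sources, sybil wallets, and pricing anomalies.
from collections import defaultdict

def find_cycle_trades(sales):
    """sales: list of (nft_id, seller, buyer) in chronological order.
    Returns the set of nft_ids whose ownership path revisits a wallet."""
    holders = defaultdict(list)          # nft_id -> ordered list of wallets seen
    flagged = set()
    for nft_id, seller, buyer in sales:
        if buyer in holders[nft_id]:     # asset returned to an earlier holder
            flagged.add(nft_id)
        holders[nft_id].append(seller)
        holders[nft_id].append(buyer)
    return flagged

sales = [("nft1", "0xA", "0xB"), ("nft1", "0xB", "0xA"),  # A -> B -> A cycle
         ("nft2", "0xC", "0xD")]
print(find_cycle_trades(sales))          # {'nft1'}
```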
With the increasing importance of data and artificial intelligence, organizations strive to become more data-driven. However, current data architectures are not necessarily designed to keep up with the scale and scope of data and analytics use cases. In fact, existing architectures often fail to deliver the promised value associated with them. Data mesh is a socio-technical concept that includes architectural aspects to promote data democratization and enables organizations to become truly data-driven. As the concept of data mesh is still novel, it lacks empirical insights from the field. Specifically, an understanding of the motivational factors for introducing data mesh, the associated challenges, best practices, its business impact, and potential archetypes is missing. To address this gap, we conduct 15 semi-structured interviews with industry experts. Our results show, among other insights, that industry experts struggle with the transition toward the federated governance associated with the data mesh concept, the shift of responsibility for the development, provision, and maintenance of data products, and the concept of a data product model. In our work, we derive multiple best practices and suggest that organizations embrace elements of data fabric, monitor data product usage, create quick wins in the early phases, and favor small dedicated teams that prioritize data products. While we acknowledge that organizations need to apply best practices according to their individual needs, we also deduce two archetypes that provide more detailed suggestions. Our findings synthesize insights from industry experts and provide researchers and professionals with guidelines for the successful adoption of data mesh.
A treatment policy defines when and which treatments are applied to affect some outcome of interest. Data-driven decision-making requires the ability to predict what happens if the policy is changed. Existing methods that predict how the outcome evolves under different scenarios assume that the tentative sequences of future treatments are fixed in advance, whereas in practice treatments are determined stochastically by a policy and may depend, for example, on the efficacy of previous treatments. The current methods are therefore not applicable when the treatment policy is unknown or a counterfactual analysis is needed. To address these limitations, we model treatments and outcomes jointly in continuous time by combining Gaussian processes and point processes. Our model enables the estimation of a treatment policy from observational sequences of treatments and outcomes, and it can predict the interventional and counterfactual progression of the outcome after an intervention on the treatment policy (in contrast to the causal effect of a single treatment). Using real-world and semi-synthetic data on blood glucose progression, we show that our method answers causal queries more accurately than existing alternatives.
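To make the distinction concrete, the toy simulation below rolls out an outcome trajectory in which treatment times are sampled from an outcome-dependent intensity, so a scenario cannot be evaluated by fixing treatments in advance. All functional forms here (the intensity, the decaying treatment response, the noise) are illustrative assumptions, not the paper's Gaussian-process/point-process model.

```python
# Toy forward simulation of the setting: treatment times are drawn from an
# outcome-dependent intensity (a point-process "policy"), so each rollout
# requires sampling the policy jointly with the outcome.
import numpy as np

rng = np.random.default_rng(0)

def intensity(y):                 # policy: treat more often when outcome is high
    return 0.1 + 0.5 * max(y - 1.0, 0.0)

def rollout(T=24.0, dt=0.1):
    t, y, treatments, ys = 0.0, 1.0, [], []
    while t < T:
        if rng.random() < intensity(y) * dt:                 # thinning-style draw
            treatments.append(t)
        effect = sum(np.exp(-(t - s)) for s in treatments)   # decaying response
        y = 1.0 + 0.5 * effect + 0.05 * rng.standard_normal()  # noisy outcome
        ys.append(y)
        t += dt
    return np.array(ys), treatments

ys, ts = rollout()
print(f"{len(ts)} treatments sampled, mean outcome {ys.mean():.2f}")
```

Intervening on the policy (e.g., replacing `intensity`) and re-running the rollout changes both the treatment times and the outcome path, which is exactly why fixed future treatment sequences cannot answer such queries.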
In this manuscript, we propose causal-inference-based single-branch ensemble trees for uplift modeling, namely CIET. Unlike standard classification methods for predictive probability modeling, CIET aims to estimate the change in the predictive probability of the outcome caused by an action or treatment. In CIET, two partition criteria are specifically designed to maximize the difference in outcome distribution between the treatment and control groups. A novel single-branch tree is then built by taking a top-down node-partition approach, and the remaining samples are censored since they are not covered by the upper node-partition logic. Repeating the tree-building process on the censored data yields single-branch ensemble trees with a set of inference rules. Experimentally, CIET significantly outperforms previous approaches to uplift modeling in terms of both the area under the uplift curve (AUUC) and the Qini coefficient. CIET has already been applied to online personal loans at a national financial holdings group in China, and it should also be of use to analysts applying machine learning to causal inference in broader business domains such as web advertising, medicine, and economics.
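A minimal sketch of the single-branch idea follows: at each step, pick the condition whose covered samples maximize the treated-versus-control outcome difference, keep that branch as a rule, and recurse on the uncovered remainder. The rate-difference score and the toy data below are illustrative stand-ins; the paper's two partition criteria are richer.

```python
# Sketch of a single-branch rule search: score each candidate split by the
# uplift (treated outcome rate minus control outcome rate) of the samples
# it covers, and keep the best-scoring rule.
import numpy as np

def uplift(y, t):                       # y: outcomes, t: 1=treated, 0=control
    treated, control = y[t == 1], y[t == 0]
    if len(treated) == 0 or len(control) == 0:
        return -np.inf
    return treated.mean() - control.mean()

def best_rule(X, y, t):
    best = (-np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            mask = X[:, j] <= thr       # candidate branch: feature j <= thr
            score = uplift(y[mask], t[mask])
            if score > best[0]:
                best = (score, (j, thr))
    return best

rng = np.random.default_rng(1)
X = rng.random((200, 3)); t = rng.integers(0, 2, 200)
y = (rng.random(200) < 0.3 + 0.3 * t * (X[:, 0] <= 0.5)).astype(float)
score, (j, thr) = best_rule(X, y, t)
print(f"rule: feature {j} <= {thr:.2f}, uplift {score:.2f}")
```

An ensemble is then formed by removing (censoring) the covered samples and repeating the search on what remains.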
In this paper, we present a numerical strategy to check the strong stability (or GKS-stability) of one-step explicit finite difference schemes for the one-dimensional advection equation with an inflow boundary condition. The strong stability is studied using the Kreiss-Lopatinskii theory. We introduce a new tool, the intrinsic Kreiss-Lopatinskii determinant, which possesses the same regularity as the vector bundle of discrete stable solutions. By applying standard results of complex analysis to this determinant, we are able to relate the strong stability of numerical schemes to the computation of a winding number, which is robust and cheap. The study is illustrated with the O3 scheme and the fifth-order Lax-Wendroff (LW5) scheme together with a reconstruction procedure at the boundary.
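The winding-number step is the computational core of the strategy: sample the determinant along the unit circle and count how many times its image encircles the origin. The sketch below illustrates that computation on a placeholder function with known zeros; the actual intrinsic Kreiss-Lopatinskii determinant must be assembled from the scheme and its boundary condition.

```python
# Sketch of a winding-number computation via accumulated phase: sample f on
# the unit circle and sum the phase increments of consecutive values.
import numpy as np

def winding_number(f, n=4096):
    z = np.exp(2j * np.pi * np.arange(n) / n)   # points on the unit circle
    w = f(z)
    dphase = np.angle(w[np.r_[1:n, 0]] / w)     # phase increments in (-pi, pi]
    return int(np.round(dphase.sum() / (2 * np.pi)))

# Placeholder determinant with two zeros inside the unit disk.
det = lambda z: (z - 0.3) * (z + 0.4j)
print(winding_number(det))                       # 2
```

The computation is robust and cheap in exactly the sense the abstract claims: it needs only function evaluations on a circle, no root-finding.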
The Lov\'{a}sz Local Lemma (LLL) is a keystone principle in probability theory, guaranteeing the existence of configurations which avoid a collection $\mathcal B$ of "bad" events which are mostly independent and have low probability. In its simplest "symmetric" form, it asserts that whenever a bad-event has probability $p$ and affects at most $d$ bad-events, and $e p d < 1$, then a configuration avoiding all $\mathcal B$ exists. A seminal algorithm of Moser & Tardos (2010) gives nearly-automatic randomized algorithms for most constructions based on the LLL. However, deterministic algorithms have lagged behind. We address three specific shortcomings of the prior deterministic algorithms. First, our algorithm applies to the LLL criterion of Shearer (1985); this is more powerful than alternate LLL criteria, removes a number of nuisance parameters, and leads to cleaner and more legible bounds. Second, we provide parallel algorithms with much greater flexibility in the functional form of the bad-events. Third, we provide a derandomized version of the MT-distribution, that is, the distribution of the variables at the termination of the MT algorithm. We show applications to non-repetitive vertex coloring, independent transversals, strong coloring, and other problems. These give deterministic algorithms which essentially match the best previous randomized sequential and parallel algorithms.
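For context, the randomized Moser-Tardos algorithm that this work derandomizes is itself remarkably short: while some bad event occurs, resample just the variables it depends on. Here is a minimal sketch on a toy variable model (the deterministic and parallel versions developed in the paper are not reproduced):

```python
# Sketch of the classical randomized Moser-Tardos resampling loop. Bad events
# are predicates over shared variables; while any event occurs, resample
# only the variables that event depends on.
import random

def moser_tardos(n_vars, bad_events, sample=lambda: random.random() < 0.5):
    """bad_events: list of (var_indices, predicate on the sub-assignment)."""
    x = [sample() for _ in range(n_vars)]
    while True:
        occurring = [(idx, p) for idx, p in bad_events
                     if p([x[i] for i in idx])]
        if not occurring:
            return x                       # no bad event occurs: done
        idx, _ = random.choice(occurring)
        for i in idx:                      # resample only that event's variables
            x[i] = sample()

# Toy instance: avoid all-equal triples among 6 boolean variables.
events = [((i, i + 1, i + 2), lambda v: len(set(v)) == 1) for i in range(4)]
print(moser_tardos(6, events))
```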
In consumer theory, ranking available objects by means of preference relations yields the most common description of individual choices. However, preference-based models assume that individuals: (1) state their preferences only between pairs of objects; (2) are always able to pick the most preferred object. In many situations, they may instead be choosing from a set with more than two elements and, because of lack of information and/or incomparability (objects with contradictory characteristics), they may not be able to select a single most preferred object. To address these situations, we need a choice model that allows an individual to express a set-valued choice. Choice functions provide such a mathematical framework. We propose a Gaussian process model to learn choice functions from choice data. The proposed model assumes a multiple-utility representation of a choice function based on the concept of Pareto rationalization, and it derives a strategy to learn both the number and the values of these latent multiple utilities. Simulation experiments demonstrate that the proposed model outperforms state-of-the-art methods.
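The multiple-utility (Pareto) representation is easy to state concretely: an object is chosen from a menu if and only if no other object dominates it on every latent utility. The sketch below uses fixed toy utility values; in the paper these utilities are latent and learned with a Gaussian process.

```python
# Sketch of set-valued choice under Pareto rationalization: the chosen set
# is the Pareto front of the menu with respect to the latent utilities.
import numpy as np

def choice(menu_utilities):
    """menu_utilities: (n_objects, n_utilities) array. Returns chosen indices."""
    n = menu_utilities.shape[0]
    chosen = []
    for i in range(n):
        dominated = any(
            np.all(menu_utilities[j] >= menu_utilities[i]) and
            np.any(menu_utilities[j] > menu_utilities[i])
            for j in range(n) if j != i)
        if not dominated:
            chosen.append(i)
    return chosen

U = np.array([[1.0, 0.2],   # strong on utility 1
              [0.2, 1.0],   # strong on utility 2
              [0.1, 0.1]])  # dominated on both utilities
print(choice(U))            # [0, 1]: a set-valued choice, no single best object
```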
The linear combination of Student's $t$ random variables (RVs) appears in many statistical applications. Unfortunately, the Student's $t$ distribution is not closed under convolution, so deriving an exact and general distribution for the linear combination of $K$ Student's $t$ RVs is infeasible, which motivates a fitting/approximation approach. Here, we focus on the scenario where the only constraint is that the number of degrees of freedom of each $t$-RV is greater than two. Since the odd moments/cumulants of the Student's $t$ distribution are zero and the even moments/cumulants do not exist when their order exceeds the number of degrees of freedom, conventional approaches based on moments/cumulants of order one or higher than two cannot be used. To circumvent this issue, we propose fitting such a distribution to that of a scaled Student's $t$ RV by exploiting the second moment together with either the first absolute moment or the characteristic function (CF). For the fitting based on the absolute moment, we start from the case of the linear combination of $K=2$ Student's $t$ RVs and then generalize to $K\ge 2$ through a simple iterative procedure. The CF-based fitting, in contrast, is direct, but its accuracy (measured in terms of the Bhattacharyya distance) depends on the CF parameter configuration, for which we propose a simple but accurate approach. We show numerically that the CF-based fitting usually outperforms the absolute-moment-based fitting and that both the scale and the number of degrees of freedom of the fitting distribution grow almost linearly with $K$.
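The moment-matching idea can be sketched directly. For a standard Student's $t$ RV with $\nu$ degrees of freedom, $E[T^2]=\nu/(\nu-2)$ for $\nu>2$ and $E|T| = 2\sqrt{\nu}\,\Gamma((\nu+1)/2)/(\sqrt{\pi}(\nu-1)\Gamma(\nu/2))$ for $\nu>1$, so the ratio $E[T^2]/E|T|^2$ depends only on $\nu$ and can be inverted by root-finding, after which the scale follows. The target moments below come from Monte Carlo on a toy combination; the paper's iterative procedure for $K\ge 2$ is not reproduced.

```python
# Sketch: fit a scaled Student's t (scale s, dof v) to a target second
# moment m2 and first absolute moment a1 by inverting the v-only ratio
# m2 / a1**2, then recovering the scale.
import numpy as np
from scipy.special import gammaln
from scipy.optimize import brentq

def abs_moment(v):        # E|T| for standard Student's t, v > 1
    return (2 * np.sqrt(v) / (np.sqrt(np.pi) * (v - 1))
            * np.exp(gammaln((v + 1) / 2) - gammaln(v / 2)))

def second_moment(v):     # E[T^2] for standard Student's t, v > 2
    return v / (v - 2)

def fit_scaled_t(m2, a1):
    g = lambda v: second_moment(v) / abs_moment(v) ** 2 - m2 / a1 ** 2
    v = brentq(g, 2.01, 200.0)           # ratio is v-only and monotone here
    s = np.sqrt(m2 / second_moment(v))
    return s, v

rng = np.random.default_rng(0)
x = 0.7 * rng.standard_t(4, 10**6) + 0.3 * rng.standard_t(6, 10**6)
s, v = fit_scaled_t(np.mean(x**2), np.mean(np.abs(x)))
print(f"fitted scale {s:.3f}, dof {v:.2f}")
```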
Computer architecture and systems have long been optimized to enable the efficient execution of machine learning (ML) algorithms and models. It is now time to reconsider the relationship between ML and systems and to let ML transform the way computer architecture and systems are designed. This carries a twofold meaning: improving designers' productivity and completing the virtuous cycle. In this paper, we present a comprehensive review of work that applies ML to system design, which can be grouped into two major categories: ML-based modelling, which involves predictions of performance metrics or some other criterion of interest, and ML-based design methodology, which directly leverages ML as the design tool. For ML-based modelling, we discuss existing studies according to their target level of the system, ranging from the circuit level to the architecture/system level. For ML-based design methodology, we follow a bottom-up path to review current work, covering (micro-)architecture design (memory, branch prediction, NoC), coordination between architecture/system and workload (resource allocation and management, data center management, and security), compilers, and design automation. We further provide a future vision of opportunities and potential directions, and we envision that applying ML to computer architecture and systems will thrive in the community.
Machine learning plays a role in many deployed decision systems, often in ways that are difficult or impossible for human stakeholders to understand. Explaining, in a human-understandable way, the relationship between the input and output of machine learning models is essential to the development of trustworthy machine-learning-based systems. A burgeoning body of research seeks to define the goals and methods of explainability in machine learning. In this paper, we review and categorize research on counterfactual explanations, a specific class of explanation that describes how a model's output would have changed had the input been altered in a particular way. Modern approaches to counterfactual explainability in machine learning draw connections to established legal doctrine in many countries, making them appealing for fielded systems in high-impact areas such as finance and healthcare. We therefore design a rubric with desirable properties of counterfactual explanation algorithms and comprehensively evaluate all currently proposed algorithms against it. Our rubric provides easy comparison and comprehension of the advantages and disadvantages of different approaches and serves as an introduction to major research themes in this field. We also identify gaps and discuss promising research directions in the space of counterfactual explainability.
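Many of the surveyed algorithms share a common core, popularized by Wachter et al.: search for an input $x'$ close to the original $x$ whose prediction reaches the desired outcome. The sketch below uses a squared-distance variant of that objective on a toy logistic model; the model, data, and hyperparameters are placeholders, not any specific surveyed method.

```python
# Sketch of gradient-based counterfactual search: minimize
# lam * (f(x') - target)^2 + ||x' - x||^2 over x'.
import numpy as np

w, b = np.array([1.5, -2.0]), 0.1             # toy logistic model
f = lambda x: 1 / (1 + np.exp(-(x @ w + b)))  # P(class = 1)

def counterfactual(x, target=0.9, lam=10.0, lr=0.05, steps=500):
    xp = x.copy()
    for _ in range(steps):
        p = f(xp)
        grad_pred = 2 * lam * (p - target) * p * (1 - p) * w  # d/dx' of fit term
        grad_dist = 2 * (xp - x)                              # d/dx' of distance
        xp -= lr * (grad_pred + grad_dist)
    return xp

x = np.array([-1.0, 1.0])                     # originally predicted class 0
xp = counterfactual(x)
print(f"f(x)={f(x):.2f} -> f(x')={f(xp):.2f}, change: {xp - x}")
```

The trade-off weight `lam` balances validity (reaching the target prediction) against proximity, two of the desirable properties a rubric like ours makes explicit.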
Small data challenges have emerged in many learning problems, since the success of deep neural networks often relies on the availability of a huge amount of labeled data that is expensive to collect. To address this challenge, many efforts have been made to train complex models with small data in unsupervised and semi-supervised fashions. In this paper, we review recent progress on these two major categories of methods. A wide spectrum of small data models are organized into a big picture, where we show how they interplay with each other to motivate the exploration of new ideas. We review the criteria for learning transformation-equivariant, disentangled, self-supervised, and semi-supervised representations, which underpin the foundations of recent developments. Many instantiations of unsupervised and semi-supervised generative models have been developed on the basis of these criteria, greatly expanding the territory of existing autoencoders, generative adversarial nets (GANs), and other deep networks by exploring the distribution of unlabeled data for more powerful representations. While we focus on unsupervised and semi-supervised methods, we also provide a broader review of other emerging topics, from unsupervised and semi-supervised domain adaptation to the fundamental roles of transformation equivariance and invariance in training a wide spectrum of deep networks. It is impossible to write an exhaustive encyclopedia covering all related works; instead, we aim to explore the main ideas, principles, and methods in this area to reveal where we are heading on the journey towards addressing the small data challenges in this big data era.