Compositional data are currently defined as positive vectors, the ratios among whose elements are of interest to the researcher. Financial statement analysis by means of accounting ratios fulfils this definition to the letter. Compositional data analysis solves the major problems in the statistical analysis of standard financial ratios at industry level, such as skewness, non-normality, non-linearity and dependence of the results on the choice of which accounting figure goes in the numerator and which in the denominator of the ratio. In spite of this, compositional applications to financial statement analysis are still rare. In this article, we present some transformations within compositional data analysis that are particularly useful for financial statement analysis. We show how to compute industry or sub-industry means of standard financial ratios from a compositional perspective. We show how to visualise firms in an industry with a compositional biplot, to classify them with compositional cluster analysis and to relate financial and non-financial indicators with compositional regression models. We show an application to the accounting statements of Spanish wineries using DuPont analysis, and a step-by-step tutorial for the compositional freeware CoDaPack.
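As a minimal sketch of why a compositional mean removes the dependence on numerator and denominator, the Python snippet below averages a ratio on the log scale, that is, it takes the geometric mean. This is assumed here as one common compositional choice and is not taken from the article. The function name and the firm figures are hypothetical.

```python
import math

def geometric_mean_ratio(numerators, denominators):
    """Geometric mean of firm-level ratios x/y, computed as exp(mean(log(x/y)))."""
    logs = [math.log(x / y) for x, y in zip(numerators, denominators)]
    return math.exp(sum(logs) / len(logs))

# Hypothetical accounting figures for three firms (illustrative only).
net_income = [2.0, 8.0, 4.0]
equity = [4.0, 4.0, 16.0]

# Mean of income/equity and mean of the inverted ratio equity/income.
roe = geometric_mean_ratio(net_income, equity)
inverse = geometric_mean_ratio(equity, net_income)

# The same two means computed arithmetically, for comparison.
arithmetic_roe = sum(x / y for x, y in zip(net_income, equity)) / 3
arithmetic_inverse = sum(y / x for x, y in zip(net_income, equity)) / 3

print(roe * inverse)                        # reciprocal up to rounding
print(arithmetic_roe * arithmetic_inverse)  # not reciprocal
```

The geometric means of a ratio and of its inverse are reciprocals of each other, so the conclusion does not depend on which figure is placed in the numerator. The arithmetic means do not share this property.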
In this paper, we show that in a parallel processing system, if a partial order is induced among the local states visited by a node, then synchronization cost can be eliminated. As a result of this partial order, a DAG is induced among the global states. Specifically, we show that in such systems, correctness is preserved even if the nodes execute asynchronously and read old information of other nodes. We present two variations for inducing DAGs -- \textit{DAG-inducing problems}, where the problem definition itself induces a DAG, and \textit{DAG-inducing algorithms}, where a DAG is induced by the algorithm. We demonstrate that the dominant clique (DC) problem and shortest path (SP) problem are DAG-inducing problems. Among these, DC allows self-stabilization, whereas the algorithm that we present for SP does not. We demonstrate that maximal matching (MM) is not a DAG-inducing problem. However, a DAG-inducing algorithm can be developed for it. The algorithm for MM allows self-stabilization. This algorithm converges in $2n$ moves and does not require a synchronous environment, which is an improvement over the existing algorithms in the literature. The algorithm for DC converges in $2m$ moves, and the algorithm for SP converges in $\mathcal{D}$ rounds. ($n$ is the number of nodes and $m$ is the number of edges in the input graph, and $\mathcal{D}$ is its diameter.) We also note that DAG-inducing problems are more general than, and encapsulate, lattice linear problems (Garg, SPAA 2020). Similarly, DAG-inducing algorithms encapsulate lattice linear algorithms (Gupta and Kulkarni, SSS 2022). We also show that a partial order induced among the local states visited by a node, as discussed above, is a necessary and sufficient condition to allow asynchrony.
Digital sources have been enabling unprecedented data-driven and large-scale investigations across a wide range of domains, including demography, sociology, geography, urbanism, criminology, and engineering. A major barrier to innovation is represented by the limited availability of dependable digital datasets, especially in the context of data gathered by mobile network operators or service providers, due to concerns about user privacy and industrial competition. The resulting lack of reference datasets curbs the production of new research methods and results, and prevents verifiability and reproducibility of research outcomes. The NetMob23 dataset offers a rare opportunity to the multidisciplinary research community to access rich data about the spatio-temporal consumption of mobile applications in a developed country. The generation process of the dataset sets a new quality standard, leading to information about the demands generated by 68 popular mobile services, geo-referenced at a high resolution of $100\times100$ $m^2$ over 20 metropolitan areas in France, and monitored during 77 consecutive days in 2019.
We develop domain theory in constructive and predicative univalent foundations (also known as homotopy type theory). That we work predicatively means that we do not assume Voevodsky's propositional resizing axioms. Our work is constructive in the sense that we do not rely on excluded middle or the axiom of (countable) choice. Domain theory studies so-called directed complete posets (dcpos) and Scott continuous maps between them and has applications in programming language semantics, higher-type computability and topology. A common approach to deal with size issues in a predicative foundation is to work with information systems, abstract bases or formal topologies rather than dcpos, and approximable relations rather than Scott continuous functions. In our type-theoretic approach, we instead accept that dcpos may be large and work with type universes to account for this. A priori one might expect that complex constructions of dcpos result in a need for ever-increasing universes and are predicatively impossible. We show that such constructions can be carried out in a predicative setting. We illustrate the development with applications in the semantics of programming languages: the soundness and computational adequacy of the Scott model of PCF and Scott's $D_\infty$ model of the untyped $\lambda$-calculus. We also give a predicative account of continuous and algebraic dcpos, and of the related notions of a small basis and its rounded ideal completion. The fact that nontrivial dcpos have large carriers is in fact unavoidable and characteristic of our predicative setting, as we explain in a complementary chapter on the constructive and predicative limitations of univalent foundations. Our account of domain theory in univalent foundations is fully formalised with only a few minor exceptions. The ability of the proof assistant Agda to infer universe levels has been invaluable for our purposes.
In many stochastic service systems, decision-makers find themselves making a sequence of decisions, with the number of decisions being unpredictable. To enhance these decisions, it is crucial to uncover the causal impact these decisions have through careful analysis of observational data from the system. However, these decisions are not made independently, as they are shaped by previous decisions and outcomes. This phenomenon is called sequential bias and violates a key assumption in causal inference that one person's decision does not interfere with the potential outcomes of another. To address this issue, we establish a connection between sequential bias and the subfield of causal inference known as dynamic treatment regimes. We expand these frameworks to account for the random number of decisions by modeling the decision-making process as a marked point process. Consequently, we can define and identify causal effects to quantify sequential bias. Moreover, we propose estimators and explore their properties, including double robustness and semiparametric efficiency. In a case study of 27,831 encounters with a large academic emergency department, we use our approach to demonstrate that the decision to route a patient to an area for low acuity patients has a significant impact on the care of future patients.
The theory of influences in product measures has profound applications in theoretical computer science, combinatorics, and discrete probability. This deep theory is intimately connected to functional inequalities and to the Fourier analysis of discrete groups. Originally, influences of functions were motivated by the study of social choice theory, wherein a Boolean function represents a voting scheme, its inputs represent the votes, and its output represents the outcome of the elections. Thus, product measures represent a scenario in which the votes of the parties are randomly and independently distributed, which is often far from the truth in real-life scenarios. We begin to develop the theory of influences for more general measures under mixing or correlation decay conditions. More specifically, we prove analogues of the KKL and Talagrand influence theorems for Markov Random Fields on bounded degree graphs with correlation decay. We show how some of the original applications of the theory in terms of voting and coalitions extend to general measures with correlation decay. Our results thus shed light both on voting with correlated voters and on the behavior of general functions of Markov Random Fields (also called ``spin-systems'') with correlation decay.
The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving the clusterability evaluation issue for categorical data as an open problem. Here we present TestCat, a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly correlated attribute pairs and hence the sum of chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms those solutions based on existing clusterability evaluation methods for numeric data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
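The test statistic described above, the sum of chi-squared statistics over all attribute pairs, can be sketched in plain Python as follows. This is an illustrative sketch only: the function names and toy data are hypothetical, and the analytical $p$-value computation of TestCat is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def chi_squared(col_a, col_b):
    """Pearson chi-squared statistic of the contingency table of two columns."""
    n = len(col_a)
    joint = Counter(zip(col_a, col_b))
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    stat = 0.0
    for a, n_a in count_a.items():
        for b, n_b in count_b.items():
            expected = n_a * n_b / n          # expected count under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def pairwise_chi_squared_sum(rows):
    """Sum of chi-squared statistics over all pairs of attributes (columns)."""
    columns = list(zip(*rows))
    pairs = combinations(range(len(columns)), 2)
    return sum(chi_squared(columns[i], columns[j]) for i, j in pairs)

# Hypothetical toy data: two clean groups versus a fully crossed design.
clustered = [("a", "x", "p")] * 2 + [("b", "y", "q")] * 2
unstructured = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]

print(pairwise_chi_squared_sum(clustered))
print(pairwise_chi_squared_sum(unstructured))
```

On the toy data, the perfectly grouped rows give a large sum, while the fully crossed rows give zero, which matches the intuition that clusterable categorical data contain strongly correlated attribute pairs.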
Recent studies have demonstrated how to assess the stereotypical bias in pre-trained English language models. In this work, we extend this branch of research in multiple dimensions by systematically investigating (a) mono- and multilingual models of (b) different underlying architectures with respect to their bias in (c) multiple languages. To that end, we make use of the English StereoSet data set (Nadeem et al., 2021), which we semi-automatically translate into German, French, Spanish, and Turkish. We find that it is of major importance to conduct this type of analysis in a multilingual setting, as our experiments show a much more nuanced picture as well as notable differences from the English-only analysis. The main takeaways from our analysis are that mGPT-2 (partly) shows surprising anti-stereotypical behavior across languages, English (monolingual) models exhibit the strongest bias, and the stereotypes reflected in the data set are least present in Turkish models. Finally, we release our codebase alongside the translated data sets and practical guidelines for the semi-automatic translation to encourage a further extension of our work to other languages.
We introduce a new quantum algorithm for computing the Betti numbers of a simplicial complex. In contrast to previous quantum algorithms that work by estimating the eigenvalues of the combinatorial Laplacian, our algorithm is an instance of the generic Incremental Algorithm for computing Betti numbers that incrementally adds simplices to the simplicial complex and tests whether or not they create a cycle. In contrast to existing quantum algorithms for computing Betti numbers that work best when the complex has close to the maximal number of simplices, our algorithm works best for sparse complexes. To test whether a simplex creates a cycle, we introduce a quantum span-program algorithm. We show that the query complexity of our span program is parameterized by quantities called the effective resistance and effective capacitance of the boundary of the simplex. Unfortunately, we also prove upper and lower bounds on the effective resistance and capacitance, showing both quantities can be exponentially large with respect to the size of the complex, implying that our algorithm would have to run for exponential time to exactly compute Betti numbers. However, as a corollary to these bounds, we show that the spectral gap of the combinatorial Laplacian can be exponentially small. As the runtimes of all previous quantum algorithms for computing Betti numbers are parameterized by the inverse of the spectral gap, our bounds show that all quantum algorithms for computing Betti numbers must run for exponentially long to exactly compute Betti numbers. Finally, we prove some novel formulas for effective resistance and effective capacitance to give intuition for these quantities.
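For intuition, the generic incremental scheme referred to above can be sketched classically: simplices are added one at a time, and each new simplex either creates a cycle (raising a Betti number) or destroys one (lowering the Betti number one dimension below). The Python sketch below is an assumption-laden illustration, not the quantum span-program algorithm. It works over $\mathbb{Z}/2$ coefficients, so the resulting numbers can differ from rational Betti numbers when the complex has torsion. It tests cycle creation by Gaussian elimination on bitmasks, and it assumes every face is listed before its cofaces. The function name is hypothetical.

```python
from itertools import combinations

def betti_numbers(simplices):
    """Z/2 Betti numbers of a complex given as tuples, faces before cofaces."""
    index = {}   # simplex -> position in insertion order (its bit position)
    basis = {}   # dimension -> {pivot bit: reduced boundary vector}
    betti = {}   # dimension -> current Betti number
    for simplex in simplices:
        simplex = tuple(sorted(simplex))
        dim = len(simplex) - 1
        betti.setdefault(dim, 0)
        # Boundary of the simplex as a bitmask over previously added faces.
        boundary = 0
        if dim > 0:
            for face in combinations(simplex, dim):
                boundary ^= 1 << index[face]
        # Reduce against boundaries of earlier simplices of the same dimension.
        pivots = basis.setdefault(dim, {})
        while boundary:
            pivot = boundary.bit_length() - 1
            if pivot not in pivots:
                break
            boundary ^= pivots[pivot]
        if boundary == 0:
            # Boundary already bounds something: the simplex creates a cycle.
            betti[dim] += 1
        else:
            # Boundary was a non-trivial cycle: the simplex fills it in.
            pivots[boundary.bit_length() - 1] = boundary
            betti[dim - 1] -= 1
        index[simplex] = len(index)
    return [betti[d] for d in sorted(betti)]

# Hollow triangle (a circle), then the same triangle filled in.
circle = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]
print(betti_numbers(circle))
print(betti_numbers(circle + [(0, 1, 2)]))
```

The hollow triangle has one connected component and one loop, and adding the 2-simplex fills the loop.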
In this paper, we investigate the impact of numerical instability on the reliability of sampling, density evaluation, and evidence lower bound (ELBO) estimation in variational flows. We first empirically demonstrate that common flows can exhibit a catastrophic accumulation of error: the numerical flow map deviates significantly from the exact map -- which affects sampling -- and the numerical inverse flow map does not accurately recover the initial input -- which affects density and ELBO computations. Surprisingly though, we find that results produced by flows are often accurate enough for applications despite the presence of serious numerical instability. In this work, we treat variational flows as dynamical systems, and leverage shadowing theory to elucidate this behavior via theoretical guarantees on the error of sampling, density evaluation, and ELBO estimation. Finally, we develop and empirically test a diagnostic procedure that can be used to validate results produced by numerically unstable flows in practice.
Physically informed neural networks (PINNs) are a promising emerging method for solving differential equations. As in many other deep learning approaches, the choice of PINN design and training protocol requires careful craftsmanship. Here, we suggest a comprehensive theoretical framework that sheds light on this important problem. Leveraging an equivalence between infinitely over-parameterized neural networks and Gaussian process regression (GPR), we derive an integro-differential equation that governs PINN prediction in the large data-set limit -- the Neurally-Informed Equation (NIE). This equation augments the original one by a kernel term reflecting architecture choices and allows quantifying implicit bias induced by the network via a spectral decomposition of the source term in the original differential equation.