We introduce continuous $R$-valuations on directed-complete posets (dcpos, for short), as a generalization of continuous valuations in domain theory, by extending the values taken by continuous valuations from the reals to so-called Abelian d-rags $R$. As with the valuation monad $\mathbf{V}$ introduced by Jones and Plotkin, we show that the construction of continuous $R$-valuations extends to a strong monad $\mathbf{V}^R$ on the category of dcpos and Scott-continuous maps. Additionally, as in recent work by the two authors and C. Th\'eron, and by the second author, B. Lindenhovius, M. Mislove and V. Zamdzhiev, we show that we can extract a commutative monad $\mathbf{V}^R_m$ out of it, whose elements we call minimal $R$-valuations. We also show that continuous $R$-valuations are closely connected to measures when $R$ is taken to be $\mathbf{I}\mathbb{R}^\star_+$, the interval domain of the extended nonnegative reals: (1) on every coherent topological space, every non-zero, bounded $\tau$-smooth measure $\mu$ (defined on the Borel $\sigma$-algebra) canonically determines a continuous $\mathbf{I}\mathbb{R}^\star_+$-valuation; and (2) such a continuous $\mathbf{I}\mathbb{R}^\star_+$-valuation is the most precise (in a certain sense) continuous $\mathbf{I}\mathbb{R}^\star_+$-valuation that approximates $\mu$, when the support of $\mu$ is a compact Hausdorff subspace of a second-countable stably compact topological space. This applies in particular to Lebesgue measure on the unit interval, which can therefore be identified with a continuous $\mathbf{I}\mathbb{R}^\star_+$-valuation; we additionally show that this valuation is minimal.
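For orientation, recall the standard notion being generalized here: a continuous valuation $\nu$ on a space assigns to each open subset a value in $\overline{\mathbb{R}}_+$ subject to strictness, modularity, and Scott-continuity,
\[ \nu(\emptyset) = 0, \qquad \nu(U) + \nu(V) = \nu(U \cup V) + \nu(U \cap V), \qquad \nu\Big(\bigcup_{i \in I}^{\uparrow} U_i\Big) = \sup_{i \in I} \nu(U_i), \]
for all opens $U, V$ and all directed families $(U_i)_{i \in I}$ of opens; continuous $R$-valuations replace the codomain $\overline{\mathbb{R}}_+$ by an Abelian d-rag $R$.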
Mobile applications are required to give privacy notices to users when they collect or share personal information. Creating consistent and concise privacy notices can be a challenging task for developers. Previous work has attempted to help developers create privacy notices through questionnaires or predefined templates. In this paper, we propose a novel approach and a framework, called PriGen, that extends this prior work. PriGen uses static analysis to identify code segments of Android applications that process sensitive information (i.e., permission-requiring code segments) and then leverages a Neural Machine Translation model to translate them into privacy captions. We present an initial evaluation of our translation task on $\sim$300,000 code segments.
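As a purely illustrative sketch of such a pipeline (the names below are placeholders, not PriGen's actual API):

    # Hypothetical sketch of a PriGen-style pipeline: find code segments
    # that call permission-protected Android APIs, then caption them with
    # a trained sequence-to-sequence (NMT) model. Names are placeholders.
    PERMISSION_APIS = {
        "getLastKnownLocation": "ACCESS_FINE_LOCATION",
        "getDeviceId": "READ_PHONE_STATE",
    }

    def extract_segments(methods):
        """Keep only methods whose body calls a permission-protected API."""
        return [m for m in methods
                if any(api in m["body"] for api in PERMISSION_APIS)]

    def caption(segment, nmt_model):
        """Translate a permission-requiring code segment into a caption."""
        return nmt_model.translate(segment["body"])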
For a convolutional code used over a symbol erasure channel, the information debt $I(t)$ at time $t$ measures the number of additional code symbols required to recover all message symbols up to time $t$. Information-debt-optimal streaming ($i$DOS) codes are convolutional codes which allow for the recovery of all message symbols up to time $t$ whenever $I(t)$ turns zero, under the following conditions: (i) the information debt is non-zero for at most $\tau$ consecutive time slots, and (ii) the information debt never increases beyond a particular threshold. The existence of periodically-time-varying $i$DOS codes is known for all parameters. In this paper, we address the problem of constructing explicit, time-invariant $i$DOS codes. We present an explicit time-invariant construction of $i$DOS codes for the unit-memory ($m=1$) case. It is also shown that a construction method for convolutional codes due to Almeida et al. leads to explicit time-invariant $i$DOS codes for all parameters. However, this general construction requires a larger field size than the first construction in the $m=1$ case.
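Under one common formalization (an assumption here, not necessarily the paper's exact definition), a rate-$k/n$ code sends $n$ code symbols per time slot, of which $r(t)$ survive erasure, and the debt evolves as $I(t) = \max(0, I(t-1) + k - r(t))$. A minimal sketch:

    def information_debt(received, k):
        """Track information debt for a rate-k/n convolutional code.

        received[t] is the number of unerased code symbols at time t.
        Assumes the recursion I(t) = max(0, I(t-1) + k - received[t]),
        a common formalization; details may differ from the paper's.
        """
        debts, debt = [], 0
        for r in received:
            debt = max(0, debt + k - r)
            debts.append(debt)
        return debts

    # Example: k = 2 message symbols per slot, varying receptions.
    print(information_debt([3, 0, 2, 4], k=2))  # [0, 2, 2, 0]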
We obtain bounds to quantify the distributional approximation in the delta method for vector statistics (the sample mean of $n$ independent random vectors) for normal and non-normal limits, measured using smooth test functions. For normal limits, we obtain bounds with the optimal order $n^{-1/2}$ rate of convergence, while for a wide class of non-normal limits, which includes quadratic forms amongst others, we achieve bounds with a faster order $n^{-1}$ convergence rate. We apply our general bounds to derive explicit bounds quantifying the distributional approximation of an estimator of the Bernoulli variance and of several statistics based on sample moments, to obtain order $n^{-1}$ bounds for the chi-square approximation of a family of rank-based statistics, and to provide an efficient independent derivation of an order $n^{-1}$ bound for the chi-square approximation of Pearson's statistic. In establishing our general results, we generalise recent results on Stein's method for functions of multivariate normal random vectors to vector-valued functions and to sums of independent random vectors whose components may be dependent. These bounds are widely applicable and of independent interest.
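For orientation, the classical (first-order) delta method asserts that if $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ and $g$ is differentiable at $\mu$, then
\[ \sqrt{n}\,\big(g(\bar{X}_n) - g(\mu)\big) \xrightarrow{d} \mathcal{N}\big(0, \nabla g(\mu)^\top \Sigma \,\nabla g(\mu)\big); \]
when $\nabla g(\mu) = 0$, the limit is instead non-normal (for example, a quadratic form in Gaussian variables), which is the kind of regime where the faster order $n^{-1}$ rates above apply.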
Applying simple linear regression models, an economist analysed a published dataset from the 2016 and 2017 editions of an influential annual ranking of consumer outlets for Dutch New Herring and concluded that the ranking was manipulated. His finding was promoted by his university in national and international media, which led to public outrage and the ensuing discontinuation of the survey. We reconstitute the dataset, correcting errors and exposing features that are already important in a descriptive analysis of the data. The economist has continued his investigations, repeating the same accusations in a follow-up publication. We point out errors in his reasoning and show that the alleged evidence for deliberate manipulation of the ranking could easily be an artefact of specification errors. Temporal and spatial factors are both important and complex, and their effects cannot be captured using simple models, given the small sample sizes and the many factors determining the perceived taste of a food product.
In this paper, we study the identifiability and estimation of the parameters of a copula-based multivariate model when the margins are unknown and arbitrary, meaning that they can be continuous, discrete, or mixtures of continuous and discrete. When at least one margin is not continuous, the range of values determining the copula is not the entire unit square, and this situation can lead to identifiability issues, which are discussed here. Next, we propose estimation methods for unknown, arbitrary margins, using a pseudo log-likelihood adapted to the case of discontinuities. In view of applications to large data sets, we also propose a pairwise composite pseudo log-likelihood. These methodologies can easily be modified to cover the case of parametric margins. One of the main theoretical results is an extension to arbitrary distributions of known convergence results for rank-based statistics when the margins are continuous. As a by-product, under smoothness assumptions, we obtain that the asymptotic distributions of the estimation errors of our estimators are Gaussian. Finally, numerical experiments are presented to assess the finite-sample performance of the estimators, and the usefulness of the proposed methodologies is illustrated with a copula-based regression model for hydrological data. The proposed estimation is implemented in the R package CopulaInference, together with a function for checking identifiability.
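As background, Sklar's theorem underpins such models: any joint distribution function $H$ with margins $F_1, \dots, F_d$ can be written as
\[ H(x_1, \dots, x_d) = C\big(F_1(x_1), \dots, F_d(x_d)\big) \]
for some copula $C$, and $C$ is uniquely determined only on $\prod_{j=1}^d \operatorname{Ran} F_j$; when some margin is not continuous, $\operatorname{Ran} F_j$ is a proper subset of $[0,1]$, which is precisely the source of the identifiability issues mentioned above.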
Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control over group assignment. We argue that getting the right experiment setup -- the allocation of users to treatment/analysis groups -- should take precedence over post-hoc variance reduction techniques in order to enable scaling of the number of experiments. We present an evaluation framework that, along with a few simple rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstances.
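As an illustrative sketch (not the paper's framework) of comparing setups by detection probability, one can compute the power of a two-sided two-sample z-test under each candidate allocation:

    # Minimal sketch: compare experiment setups by the probability of
    # detecting a treatment effect. All numbers below are made up.
    from scipy.stats import norm

    def power(delta, sigma, n_per_group, alpha=0.05):
        """Power of a two-sided two-sample z-test for mean difference delta."""
        se = sigma * (2 / n_per_group) ** 0.5
        z = norm.ppf(1 - alpha / 2)
        return norm.sf(z - delta / se) + norm.cdf(-z - delta / se)

    # Setup A: randomize all users (small average effect, large n).
    # Setup B: randomize only users eligible for personalization
    # (larger effect among them, smaller n).
    print(power(delta=0.02, sigma=1.0, n_per_group=50_000))  # ~0.89
    print(power(delta=0.05, sigma=1.0, n_per_group=10_000))  # ~0.94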
Myerson's regularity condition on a distribution is a standard assumption in economics. In this paper, we study the complexity of describing a regular distribution within a small statistical distance. Our main result is that $\tilde{\Theta}(\epsilon^{-0.5})$ bits are necessary and sufficient to describe a regular distribution with support $[0,1]$ within $\epsilon$ L\'evy distance. We prove this by showing that we can learn the regular distribution approximately with $\tilde{O}(\epsilon^{-0.5})$ queries to the cumulative distribution function. As a corollary, we show that the pricing query complexity of learning the class of regular distributions with support $[0,1]$ within $\epsilon$ L\'evy distance is $\tilde{\Theta}(\epsilon^{-2.5})$. To learn a mixture of two regular distributions, $\tilde{\Theta}(\epsilon^{-3})$ pricing queries are required.
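Recall the definition: a distribution $F$ with density $f$ is regular in Myerson's sense if its virtual value function
\[ \phi(v) = v - \frac{1 - F(v)}{f(v)} \]
is non-decreasing; equivalently, the revenue curve $R(q) = q \cdot F^{-1}(1 - q)$ is concave in the quantile $q$.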
We propose FedGT, a novel framework for identifying malicious clients in federated learning with secure aggregation. Inspired by group testing, the framework leverages overlapping groups of clients to detect the presence of malicious clients in the groups and to identify them via a decoding operation. The identified clients are then removed from the training of the model, which is performed over the remaining clients. FedGT strikes a balance between privacy and security, allowing for improved identification capabilities while still preserving data privacy. Specifically, the server learns the aggregated model of the clients in each group. The effectiveness of FedGT is demonstrated through extensive experiments on the MNIST and CIFAR-10 datasets, showing its ability to identify malicious clients with low misdetection and false alarm probabilities, resulting in high model utility.
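A minimal sketch of the group-testing idea (illustrative only; FedGT's actual group assignment and decoding are more involved):

    # Illustrative sketch: clients join overlapping groups, the server
    # tests each group's aggregate, and a naive decoder declares benign
    # every client belonging to at least one "clean" group.
    import numpy as np

    # A[g, c] = 1 if client c participates in group g.
    A = np.array([[1, 1, 0, 0, 1],
                  [0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0]])

    def decode(test_outcomes):
        """test_outcomes[g] = 1 if group g's aggregate looks malicious."""
        clean = set()
        for g, bad in enumerate(test_outcomes):
            if not bad:
                clean |= set(np.flatnonzero(A[g]).tolist())
        return sorted(set(range(A.shape[1])) - clean)

    print(decode([1, 0, 1]))  # [0, 3, 4]: clients in no clean group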
Machine learning models exhibit two seemingly contradictory phenomena: training data memorization and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure the extent to which models "forget" the specifics of training examples, becoming less susceptible to privacy attacks on examples they have not seen recently. We show that, while non-convex models can memorize data forever in the worst case, standard image, speech, and language models empirically do forget examples over time. We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget. Our results suggest that examples seen early when training with extremely large datasets -- for instance those examples used to pre-train a model -- may enjoy privacy benefits at the expense of examples seen later.
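One simple way to operationalize such a measurement (a sketch under assumptions, not the paper's exact protocol) is to run a loss-threshold membership-inference attack separately on examples seen early versus late in training:

    # Sketch: a model memorizes an example if a loss-threshold attack can
    # tell it apart from held-out data; forgetting shows up as the attack
    # succeeding on recently seen examples but not on early ones.
    import numpy as np

    def attack_accuracy(member_losses, nonmember_losses, threshold):
        """Guess 'member' iff loss < threshold; return balanced accuracy."""
        tpr = np.mean(member_losses < threshold)
        tnr = np.mean(nonmember_losses >= threshold)
        return (tpr + tnr) / 2

    rng = np.random.default_rng(0)
    early = rng.normal(1.0, 0.3, 1000)  # early examples: losses drift back up
    late = rng.normal(0.4, 0.3, 1000)   # recent examples: losses still low
    test = rng.normal(1.2, 0.3, 1000)   # held-out non-members

    print(attack_accuracy(early, test, threshold=0.8))  # near chance (~0.58)
    print(attack_accuracy(late, test, threshold=0.8))   # well above (~0.91)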
There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the costs associated with querying popular LLM APIs, e.g., GPT-4, ChatGPT, and J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of the LLM cascade strategy, which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g., GPT-4) with up to 98% cost reduction, or improve accuracy over GPT-4 by 4% at the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
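The cascade idea can be sketched as follows (a generic illustration; FrugalGPT's learned scorer and model selection are more sophisticated, and query_llm/score below are placeholder callables):

    # Generic LLM-cascade sketch: try cheap models first and escalate
    # only when a reliability score falls below the model's threshold.
    CASCADE = [("cheap-model", 0.9), ("mid-model", 0.8), ("gpt-4", 0.0)]

    def cascade_answer(prompt, query_llm, score):
        """query_llm(model, prompt) -> answer; score(prompt, answer) -> [0, 1]."""
        answer = None
        for model, threshold in CASCADE:
            answer = query_llm(model, prompt)
            if score(prompt, answer) >= threshold:
                break  # confident enough; no need to pay for a larger model
        return answer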