Injuries to the knee joint are very common among frequent and long-distance runners, an issue that is often attributed to fatigue. We address the problem of fatigue detection from biomechanical data from different sources, namely lower-extremity joint angles and ground reaction forces of running athletes, with the goal of better understanding the impact of fatigue on the biomechanics of runners in general and at the individual level. This is done by sequentially testing for change in a data stream using a simple martingale test statistic. Time-uniform probabilistic martingale bounds are provided and used as thresholds for the test statistic. Sharp bounds over all time regimes can be obtained by combining a piecewise linear bound with a law-of-the-iterated-logarithm bound, so that the probability of an early detection is controlled uniformly. If the underlying distribution of the data gradually changes over the course of a run, then a timely upcrossing of the martingale over these bounds is expected. The methods are developed for a setting in which change sets in gradually in an incoming stream of data. Parameter selection for the bounds is based on simulations, and the methodology is compared with existing approaches. The algorithms presented here can be easily adapted to an online change-detection setting. Finally, we provide a detailed data analysis based on extensive measurements of several athletes and benchmark the fatigue detection results against the runners' individual feedback over the course of the data collection. Qualitative conclusions about the biomechanical profiles of the athletes can be drawn from the shape of the martingale trajectories even in the absence of an upcrossing of the threshold.
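As a minimal sketch of the kind of sequential monitoring described above (not the authors' implementation: the cumulative-sum martingale, the boundary constants, and the toy drift are illustrative assumptions), the statistic is tracked until it first exceeds a hybrid of a linear and a law-of-the-iterated-logarithm-type boundary:

```python
import numpy as np

def hybrid_bound(t, alpha=0.05, a=1.7, b=0.5, c=2.0):
    """Illustrative hybrid threshold: the minimum of a linear bound and a
    law-of-the-iterated-logarithm-type bound (constants are placeholders,
    not the calibrated values of the paper)."""
    linear = b * t + c * np.log(1.0 / alpha)
    lil = a * np.sqrt(t * (np.log(np.log(np.e * t)) + np.log(1.0 / alpha)))
    return min(linear, lil)

def detect_change(stream, null_mean=0.0):
    """Monitor a cumulative-sum martingale and report the first upcrossing."""
    s = 0.0
    for t, x in enumerate(stream, start=1):
        s += x - null_mean                # martingale under the null
        if s > hybrid_bound(t):
            return t                      # first boundary crossing
    return None                           # no change detected

rng = np.random.default_rng(0)
# toy stream: standard normal noise with a gradual mean drift after t = 500
stream = np.concatenate([rng.normal(0, 1, 500),
                         rng.normal(0, 1, 500) + np.linspace(0, 1.5, 500)])
print(detect_change(stream))
```

Because the boundary is valid uniformly over time, the same loop can be run online on an incoming data stream.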
A (multi)set of segments in the plane may form a TSP tour, a matching, a tree, or any multigraph. If two segments cross, then we can reduce the total length with the following flip operation: we remove a pair of crossing segments and insert a pair of non-crossing segments, while keeping the same vertex degrees. The goal of this paper is to devise efficient strategies to flip the segments in order to obtain crossing-free segments after a small number of flips. Linear and near-linear bounds on the number of flips were previously known only for segments with endpoints in convex position. We generalize these results, proving linear and near-linear bounds for cases in which the endpoints are not in convex position. Our results are proved in a general setting that applies to multiple problems, using multigraphs and the distinction between removal and insertion choices when performing a flip.
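To make the flip operation concrete, here is a small sketch (our own illustration, in the matching setting) that detects a proper crossing and performs one degree-preserving reconnection; in a tour, only one of the two reconnections keeps the tour connected, so the choice below would be constrained further:

```python
import math

def ccw(p, q, r):
    """Positive if p, q, r make a left turn."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(a, b, c, d):
    """Proper crossing test for segments ab and cd (general position assumed)."""
    return ccw(a, b, c) * ccw(a, b, d) < 0 and ccw(c, d, a) * ccw(c, d, b) < 0

def flip(a, b, c, d):
    """Replace the crossing pair {ab, cd} by the shorter degree-preserving
    reconnection ({ac, bd} or {ad, bc}); by the triangle inequality this
    never increases, and generically decreases, the total length."""
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    if dist(a, c) + dist(b, d) <= dist(a, d) + dist(b, c):
        return (a, c), (b, d)
    return (a, d), (b, c)

a, b, c, d = (0, 0), (2, 2), (0, 2), (2, 0)   # ab and cd cross at (1, 1)
if segments_cross(a, b, c, d):
    print(flip(a, b, c, d))
```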
In this work, we give a statistical characterization of the $\gamma$-regret for arbitrary structured bandit problems, that is, the regret that arises when comparing against a benchmark that is $\gamma$ times the optimal solution. The $\gamma$-regret emerges in structured bandit problems over a function class $\mathcal{F}$ where finding an exact optimum of $f \in \mathcal{F}$ is intractable. Our characterization is given in terms of the $\gamma$-DEC, a statistical complexity parameter for the class $\mathcal{F}$, which is a modification of the constrained Decision-Estimation Coefficient (DEC) of Foster et al., 2023 (and closely related to the original offset DEC of Foster et al., 2021). Our lower bound shows that the $\gamma$-DEC is a fundamental limit for any model class $\mathcal{F}$: for any algorithm, there exists some $f \in \mathcal{F}$ for which the $\gamma$-regret of that algorithm scales (nearly) with the $\gamma$-DEC of $\mathcal{F}$. We provide an upper bound showing that there exists an algorithm attaining a nearly matching $\gamma$-regret. Due to significant challenges in applying the prior results on the DEC to the $\gamma$-regret case, both our lower and upper bounds require novel techniques and a new algorithm.
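For concreteness, one common way to write the $\gamma$-regret over $T$ rounds, stated here for a reward-maximization problem with decisions $\pi_1, \dots, \pi_T$ and mean-reward function $f$ (the paper's normalization and notation may differ), is
$$\mathrm{Reg}_{\gamma}(T) \;=\; \sum_{t=1}^{T} \Big( \gamma \, \max_{\pi} f(\pi) - f(\pi_t) \Big), \qquad \gamma \in (0, 1],$$
so the learner is only asked to compete with a $\gamma$-approximation of the optimal value.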
In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test and to test these hypotheses; that is, for both exploratory and confirmatory data analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias and, in turn, to many false discoveries. Selective inference is a framework that allows valid inference to be performed even when the same data are reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for selective inference in this setting, where a hierarchical clustering algorithm is used to produce the cluster assignments; this framework was then extended to k-means clustering by Chen and Witten [2022]. Both of these works rely on assuming a known covariance structure for the data, but in practice the noise level needs to be estimated, and this is particularly challenging when the true cluster structure is unknown. We extend this line of work to the setting of noise with unknown variance and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling the Type I error when the true noise level is unknown.
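As a rough illustration of the inferential target (the notation here is ours): with observations $X_i \sim N(\mu_i, \sigma^2 I_q)$ and two estimated clusters $\hat{\mathcal{C}}_1$ and $\hat{\mathcal{C}}_2$, the null hypothesis states that the cluster mean vectors coincide,
$$H_0:\; \bar{\mu}_{\hat{\mathcal{C}}_1} = \bar{\mu}_{\hat{\mathcal{C}}_2}, \qquad \bar{\mu}_{\mathcal{C}} = \frac{1}{|\mathcal{C}|} \sum_{i \in \mathcal{C}} \mu_i,$$
and a selective p-value is computed conditionally on the event that the clustering algorithm, applied to the data, returned these clusters. The complication addressed in this work is that $\sigma^2$ is unknown and must itself be estimated.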
We develop a methodology for modelling and simulating high-dimensional spatial precipitation extremes, using a combination of the spatial conditional extremes model, latent Gaussian models and integrated nested Laplace approximations (INLA). The spatial conditional extremes model requires data with Laplace marginal distributions, but precipitation distributions contain point masses at zero that complicate necessary standardisation procedures. We propose to model conditional extremes of nonzero precipitation only, while separately modelling precipitation occurrences. The two models are then combined to create a complete model for extreme precipitation. Nonzero precipitation marginals are modelled using a combination of latent Gaussian models with gamma and generalised Pareto likelihoods. Four different models for precipitation occurrence are investigated. New empirical diagnostics and parametric models are developed for describing components of the spatial conditional extremes model. We apply our framework to simulate spatial precipitation extremes over a water catchment in Central Norway, using high-density radar data. Inference on a 6000-dimensional data set is performed within hours, and the simulated data capture the main trends of the observed data well.
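For orientation, the spatial conditional extremes model for a process $X(\cdot)$ on Laplace margins, given a large value at a conditioning site $s_0$, is typically of the form (the exact parametrisation of $a$, $b$ and the residual process used in this work may differ)
$$\big\{ X(s) \mid X(s_0) = x_0,\; x_0 > u \big\} \;=\; a(s - s_0;\, x_0) + b(s - s_0;\, x_0)\, Z(s),$$
where $u$ is a high threshold and $Z(\cdot)$ is a residual spatial process; the empirical diagnostics and parametric models mentioned above concern the components $a$, $b$ and $Z$.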
In inverse problems, one attempts to infer spatially variable functions from indirect measurements of a system. To practitioners of inverse problems, the concept of "information" is familiar when discussing key questions such as which parts of the function can be inferred accurately and which cannot. For example, it is generally understood that we can identify system parameters accurately only close to detectors, or along ray paths between sources and detectors, because we have "the most information" in these places. Although referenced in many publications, the "information" that is invoked in such contexts is not a well-understood and clearly defined quantity. Herein, we present a definition of information density that is based on the variance of coefficients as derived from a Bayesian reformulation of the inverse problem. We then discuss three areas in which this information density can be useful in practical algorithms for the solution of inverse problems, and we illustrate its usefulness in one of these areas -- how to choose the discretization mesh for the function to be reconstructed -- using numerical experiments.
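One way to make such a definition concrete, offered here purely as an illustrative formalization consistent with the description above and not necessarily the paper's exact construction: expand the unknown function as $m(x) = \sum_j m_j \varphi_j(x)$, let $C$ denote the posterior covariance matrix of the coefficient vector $(m_j)$ in the Bayesian reformulation, and set
$$I(x) \;\propto\; \Big( \sum_{j,k} \varphi_j(x)\, C_{jk}\, \varphi_k(x) \Big)^{-1},$$
so that locations where the posterior variance of the reconstruction is small carry a high information density.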
Randomized algorithms, such as randomized sketching or projections, are a promising approach to easing the computational burden of analyzing large datasets. However, randomized algorithms also produce non-deterministic outputs, leading to the problem of evaluating their accuracy. In this paper, we develop a statistical inference framework for quantifying the uncertainty of the outputs of randomized algorithms. We develop appropriate statistical methods -- sub-randomization, multi-run plug-in, and multi-run aggregation inference -- by using multiple runs of the same randomized algorithm, or by estimating the unknown parameters of the limiting distribution. As an example, we develop methods for statistical inference for least squares parameters via random sketching using matrices with i.i.d. entries or uniform partial orthogonal matrices. For this, we characterize the limiting distribution of estimators obtained via sketch-and-solve as well as partial sketching methods. The analysis of i.i.d. sketches uses a trigonometric interpolation argument to establish a differential equation for the limiting expected characteristic function and to find the dependence on the kurtosis of the entries of the sketching matrix. The results are supported by a broad range of simulations.
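A minimal sketch-and-solve example with an i.i.d. Gaussian sketching matrix is shown below (our own illustration; the matrix sizes, the Gaussian choice, and the number of runs are arbitrary). The multi-run inference methods described above operate on repeated outputs of exactly this kind of primitive:

```python
import numpy as np

def sketch_and_solve(A, b, m, rng):
    """Least squares on a sketched problem: minimize ||S A x - S b|| where S
    is an m x n i.i.d. Gaussian sketch, scaled so that E[S^T S] = I."""
    n = A.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    x_hat, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x_hat

rng = np.random.default_rng(0)
n, p = 20000, 10
A = rng.normal(size=(n, p))
x_true = rng.normal(size=p)
b = A @ x_true + rng.normal(scale=0.5, size=n)

# multiple independent runs of the randomized algorithm, as used by the
# multi-run inference methods discussed above
estimates = np.stack([sketch_and_solve(A, b, m=2000, rng=rng) for _ in range(20)])
print(estimates.mean(axis=0))   # aggregate estimate
print(estimates.std(axis=0))    # run-to-run variability to be quantified
```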
Many food products involve mixtures of ingredients, where the mixtures can be expressed as combinations of ingredient proportions. In many cases, product quality and consumer preference may also depend on the way in which the mixtures are processed. The processing is generally defined by the settings of one or more process variables. Experimental designs studying the joint impact of the mixture ingredient proportions and the settings of the process variables are called mixture-process variable experiments. In this article, we show how to combine mixture-process variable experiments with discrete choice experiments to quantify and model consumer preferences for food products that can be viewed as processed mixtures. First, we describe the modeling of data from such combined experiments. Next, we describe how to generate D- and I-optimal designs for choice experiments involving mixtures and process variables, and we compare the two kinds of designs using two examples.
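For reference, choice data from such a combined experiment are commonly analysed with a multinomial logit model, in which the probability that a respondent chooses alternative $j$ from a choice set $S$ is
$$P(j \mid S) \;=\; \frac{\exp\{\mathbf{f}(\mathbf{x}_j, \mathbf{z}_j)^{\top}\boldsymbol{\beta}\}}{\sum_{k \in S} \exp\{\mathbf{f}(\mathbf{x}_k, \mathbf{z}_k)^{\top}\boldsymbol{\beta}\}},$$
where $\mathbf{x}_j$ contains the ingredient proportions, $\mathbf{z}_j$ the process-variable settings, and $\mathbf{f}(\cdot,\cdot)$ the model expansion (for instance a Scheffé-type mixture model crossed with process-variable terms); the exact expansion used in the examples may differ.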
Separating noisy or incorrectly labeled samples from hard (difficult) samples in a labeled dataset is an important yet under-explored topic. Two general and often independent lines of work exist: one focuses on addressing noisy labels, and the other deals with hard samples. However, when both types of data are present, most existing methods treat them equally, which results in a decline in the overall performance of the model. In this paper, we first design various synthetic datasets with custom hardness and noisiness levels for different samples. This systematic empirical study enables us to better understand the similarities and, more importantly, the differences between hard-to-learn samples and incorrectly labeled samples. These controlled experiments pave the way for the development of methods that distinguish between hard and noisy samples. Through our study, we introduce a simple yet effective metric that filters out noisily labeled samples while keeping the hard samples. We study various data partitioning methods in the presence of label noise and observe that separating noisy samples from hard samples with the proposed metric yields the best datasets, as evidenced by the high test accuracy achieved after models are trained on the filtered datasets. We demonstrate this both for our synthetic datasets and for datasets with real-world label noise. Furthermore, our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
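The controlled construction can be pictured with a toy example (ours, not the datasets of the paper): "hard" samples are placed near the decision boundary with their correct labels, while "noisy" samples keep their original location but have their labels flipped.

```python
import numpy as np

def make_synthetic(n=1000, hard_frac=0.2, noise_frac=0.1, seed=0):
    """Toy two-class dataset with controlled hardness and noisiness levels:
    hard samples sit near the class boundary, noisy samples have flipped
    labels.  Purely illustrative of the kind of construction described."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)
    centers = np.where(y[:, None] == 1, 2.0, -2.0) * np.ones((n, 2))
    X = centers + rng.normal(size=(n, 2))

    idx = rng.permutation(n)
    hard = idx[: int(hard_frac * n)]
    noisy = idx[int(hard_frac * n): int((hard_frac + noise_frac) * n)]

    X[hard] = rng.normal(scale=0.5, size=(len(hard), 2))   # near the boundary
    y_observed = y.copy()
    y_observed[noisy] = 1 - y_observed[noisy]               # label flips
    return X, y_observed, hard, noisy

X, y, hard_idx, noisy_idx = make_synthetic()
print(len(hard_idx), len(noisy_idx))   # 200 hard and 100 noisy samples
```

A filtering metric of the kind introduced in the paper would then be judged by how well it removes the indices in noisy_idx while retaining those in hard_idx.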
Flexible modeling of how an entire distribution changes with covariates is an important yet challenging generalization of mean-based regression that has seen growing interest over the past decades in both the statistics and machine learning literature. This review outlines selected state-of-the-art statistical approaches to distributional regression, complemented with alternatives from machine learning. Topics covered include the similarities and differences between these approaches, extensions, properties and limitations, estimation procedures, and the availability of software. In view of the increasing complexity and availability of large-scale data, this review also discusses the scalability of traditional estimation methods, current trends, and open challenges. Illustrations are provided using data on childhood malnutrition in Nigeria and Australian electricity prices.
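The core idea can be illustrated with a deliberately simple example (a stand-in, not one of the surveyed methods): a Gaussian location-scale regression in which both the mean and the variance of the response depend on a covariate and are fitted jointly by maximum likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = 1.0 + 2.0 * x + rng.normal(scale=np.exp(0.3 + 0.5 * x))  # heteroscedastic data

def negloglik(theta):
    """Gaussian distributional regression: mu = a + b*x, log sigma = c + d*x."""
    a, b, c, d = theta
    mu = a + b * x
    sigma = np.exp(c + d * x)
    return np.sum(np.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2)

fit = minimize(negloglik, x0=np.zeros(4))
print(fit.x)  # estimates of (a, b, c, d): the conditional distribution varies with x
```

The approaches covered in the review generalize this basic setup with more flexible predictors and other response distributions.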
We study the problems of sequential nonparametric two-sample and independence testing. Sequential tests process data online and allow the observed data to be used to decide whether to stop and reject the null hypothesis or to collect more data, while maintaining type I error control. We build upon the principle of (nonparametric) testing by betting, where a gambler places bets on future observations and their wealth measures the evidence against the null hypothesis. While recently developed kernel-based betting strategies often work well on simple distributions, selecting a suitable kernel for high-dimensional or structured data, such as images, is often nontrivial. To address this drawback, we design prediction-based betting strategies that rely on the following fact: if a sequentially updated predictor starts to consistently determine (a) which distribution an instance is drawn from, or (b) whether an instance is drawn from the joint distribution or from the product of the marginal distributions (the latter produced by external randomization), it provides evidence against the two-sample or independence null, respectively. We empirically demonstrate the superiority of our tests over kernel-based approaches in structured settings. Our tests can be applied beyond the case of independent and identically distributed data, remaining valid and powerful even when the data distribution drifts over time.
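A toy version of a prediction-based betting test might look as follows (our own simplified sketch, not the paper's strategy: a running mean-difference classifier plays the role of the predictor, and a fixed betting fraction is used). Under the null, the wealth is a nonnegative martingale, so rejecting once it exceeds $1/\alpha$ controls the type I error by Ville's inequality.

```python
import numpy as np

def sequential_two_sample_test(xs, ys, alpha=0.05, bet=0.2, seed=0):
    """Prediction-based betting (toy version): a coin decides whether the
    gambler is shown x_t or y_t, a predictor trained on past data guesses
    which one it was, and correct guesses multiply the wealth while wrong
    guesses shrink it.  Under the null the guess is right with probability
    1/2, so the wealth is a nonnegative martingale."""
    rng = np.random.default_rng(seed)
    wealth, mean_x, mean_y = 1.0, 0.0, 0.0
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        if t > 1:
            coin = rng.integers(2)                    # present x or y at random
            z = x if coin == 0 else y
            guess = 0 if abs(z - mean_x) < abs(z - mean_y) else 1
            wealth *= 1.0 + bet * (1.0 if guess == coin else -1.0)
            if wealth >= 1.0 / alpha:
                return t                              # stop and reject the null
        mean_x += (x - mean_x) / t                    # running means updated after
        mean_y += (y - mean_y) / t                    # the bet: each bet uses past data only
    return None                                       # never rejected

rng = np.random.default_rng(1)
print(sequential_two_sample_test(rng.normal(0, 1, 2000), rng.normal(0.5, 1, 2000)))
```

The independence test is analogous, with the predictor asked to distinguish observed pairs from pairs whose components have been shuffled by external randomization.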