The paper discusses a statistical problem related to testing for differences between two sparse networks with community structure. The community-wise edge probability matrices have entries of order $O(n^{-1}/\log n)$, where $n$ is the size of the network. The authors propose a test statistic that combines a method proposed by Wu et al. \cite{WuTwoSampleSBM2022} with a resampling process. They derive the asymptotic null distribution of the test statistic and provide a guarantee of asymptotic power against the alternative hypothesis. To evaluate the performance of the proposed test statistic, the authors conduct simulations and provide real-data examples. The results indicate that the proposed test statistic performs well in practice.
Diversification of recommendation results is a promising approach for coping with the uncertainty associated with users' information needs. Of particular importance in diversified recommendation is defining and optimizing an appropriate diversity objective. In this study, we revisit the most popular diversity objective, intra-list distance (ILD), defined as the average pairwise distance between selected items, and a similar but lesser-known objective, dispersion, defined as the minimum pairwise distance. Owing to their simplicity and flexibility, ILD and dispersion have been used in a wide range of diversified recommendation research. Nevertheless, we do not actually know what kinds of items they prefer. We present a critical reexamination of ILD and dispersion from theoretical and experimental perspectives. Our theoretical results reveal that these objectives have potential drawbacks: ILD may select duplicate items that are very close to each other, whereas dispersion may overlook distant item pairs. As a competitor to ILD and dispersion, we design a diversity objective called Gaussian ILD, which can interpolate between ILD and dispersion by tuning a bandwidth parameter. We verify our theoretical results through experiments on real-world data and confirm the extreme behavior of ILD and dispersion in practice.
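As a concrete illustration of the three objectives above, the following sketch computes ILD (average pairwise distance), dispersion (minimum pairwise distance), and a Gaussian-kernel variant of ILD for a set of selected item vectors. The kernel-based form used here, with bandwidth $\sigma$, is only an assumption standing in for the paper's exact definition of Gaussian ILD.
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import pdist

def ild(X):
    """Intra-list distance: average pairwise Euclidean distance."""
    return pdist(X).mean()

def dispersion(X):
    """Dispersion: minimum pairwise Euclidean distance."""
    return pdist(X).min()

def gaussian_ild(X, sigma):
    """Assumed kernel-based variant: average of 1 - exp(-d^2 / (2 sigma^2)).
    For large sigma this is roughly proportional to the average squared
    distance (ILD-like); for small sigma only near-duplicate pairs lower
    the score, so the objective is driven by the closest pairs
    (dispersion-like)."""
    d = pdist(X)
    return (1.0 - np.exp(-d ** 2 / (2.0 * sigma ** 2))).mean()

# Toy example: two near-duplicate items plus one distant item.
X = np.array([[0.0, 0.0], [0.01, 0.0], [10.0, 0.0]])
print(ild(X), dispersion(X), gaussian_ild(X, sigma=1.0))
\end{verbatim}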
Studies of alcohol and drug use are often interested in the number of days on which people use the substance of interest over an interval, such as the 28 days before a survey date. Although count models are often used for this purpose, they are not strictly appropriate for this type of data because the response variable is bounded above. Furthermore, if some people's substance-use behaviors follow various weekly patterns of use, summaries of days of use over longer periods can exhibit multiple modes. These characteristics of substance days-of-use data are not easily captured by conventional parametric model families. We propose a continuation ratio ordinal model for substance days-of-use data. Instead of grouping the set of possible response values into a small set of ordinal categories, each possible value is assigned its own category. This allows the exact numeric distribution implied by the predicted ordinal response to be recovered. We demonstrate the proposed model using survey data reporting days of alcohol use over 28-day intervals. We show that the continuation ratio model captures the complexity of the drinking-days dataset better than binomial, hurdle negative binomial, and beta-binomial models.
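To make the construction concrete, the following sketch recovers the exact numeric distribution over 0--28 days of use implied by continuation ratio probabilities. The conditional probabilities $h_k = P(Y = k \mid Y \ge k)$ used here are placeholders for whatever a fitted continuation ratio regression would produce.
\begin{verbatim}
import numpy as np

def count_distribution(h):
    """Recover P(Y = k) from continuation-ratio probabilities
    h[k] = P(Y = k | Y >= k), k = 0, ..., K; h[K] must be 1 because
    the response is bounded above by K."""
    K = len(h) - 1
    p = np.zeros(K + 1)
    survive = 1.0                      # P(Y >= k), equal to 1 at k = 0
    for k in range(K + 1):
        p[k] = survive * h[k]
        survive *= 1.0 - h[k]
    return p

# Hypothetical continuation ratios for days of alcohol use in a 28-day window.
K = 28
h = np.full(K + 1, 0.1)
h[0] = 0.4        # e.g., a large probability of zero drinking days
h[7] = 0.3        # a bump producing a second mode, as with weekly patterns
h[K] = 1.0        # the top category absorbs all remaining mass
p = count_distribution(h)
print(p.sum())    # 1.0: the exact numeric distribution is recovered
\end{verbatim}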
The excellent performance of deep neural networks is usually accompanied by a large number of parameters and computations, which limits their usage on resource-constrained edge devices. To address this issue, numerous methods such as pruning, quantization, and knowledge distillation have been proposed to compress neural networks and have achieved significant breakthroughs. However, most of these compression methods focus on the architecture or the training method of neural networks but ignore the influence of data augmentation. In this paper, we revisit the usage of data augmentation in model compression and give a comprehensive study of the relation between model size and the optimal data augmentation policy. In summary, we make the following three observations: (A) Models of different sizes prefer data augmentation with different magnitudes. Hence, in iterative pruning, data augmentation with varying magnitudes leads to better performance than data augmentation with a consistent magnitude. (B) Data augmentation with a high magnitude may significantly improve the performance of large models but harm the performance of small models. Fortunately, small models can still benefit from strong data augmentations by first learning them with "additional parameters" and then discarding these "additional parameters" during inference. (C) The predictions of a pre-trained large model can be used to measure the difficulty of a data augmentation, and can thus serve as a criterion for designing better data augmentation policies. We hope this paper will promote more research on the usage of data augmentation in model compression.
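A minimal sketch of observation (C), under the assumption that "difficulty" is operationalized as the loss of a frozen pre-trained large model on the augmented sample (the paper may use a different concrete measure):
\begin{verbatim}
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Frozen large model acting as the difficulty judge (an assumed choice).
large_model = models.resnet50(
    weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

@torch.no_grad()
def augmentation_difficulty(image, label, augment):
    """Cross-entropy of the large model on the augmented image:
    a higher loss is read as a harder augmentation."""
    x = augment(image).unsqueeze(0)      # augment, then add a batch dimension
    logits = large_model(x)
    return F.cross_entropy(logits, torch.tensor([label])).item()

mild = transforms.Compose([transforms.RandomCrop(224, padding=4),
                           transforms.ToTensor()])
strong = transforms.Compose([transforms.RandAugment(num_ops=4, magnitude=20),
                             transforms.ToTensor()])

img = Image.new("RGB", (224, 224))       # dummy image; use a real sample
print(augmentation_difficulty(img, 0, mild),
      augmentation_difficulty(img, 0, strong))
\end{verbatim}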
Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 decoder-only models with 1.5B parameters, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age: a temporal shift between evaluation data and pretraining data leads to performance degradation that is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and the risk of toxic generations. Our findings indicate that there is no one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, such as books and web text, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will support more informed data-centric decisions in LM development.
The (modern) arbitrary derivative (ADER) approach is a popular technique for the numerical solution of differential problems, based on iteratively solving an implicit discretization of their weak formulation. In this work, focusing on the ODE context, we investigate several strategies to improve this approach. Our initial emphasis is on the order of accuracy of the method in connection with the polynomial discretization of the weak formulation. We demonstrate that precise choices lead to higher-order convergence in comparison with the existing literature. Then, we recast ADER methods in the Deferred Correction (DeC) formalism. This allows us to determine the optimal number of iterations, which equals the formal order of accuracy of the method, and to introduce efficient $p$-adaptive modifications. These are defined by matching the order of accuracy achieved and the degree of the polynomial reconstruction at each iteration. We provide analytical and numerical results, including the stability analysis of the new modified methods, an investigation of the computational efficiency, an application to adaptivity, and an application to hyperbolic PDEs with a Spectral Difference (SD) space discretization.
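For readers unfamiliar with the DeC formalism, a generic sketch of the iteration (in our notation, not necessarily that of the paper) reads as follows: given a low-order, easily invertible operator $\mathcal{L}^1$ and a high-order implicit operator $\mathcal{L}^2$ sharing the same fixed point, one iterates
\[
\mathcal{L}^1\big(\underline{u}^{(p)}\big) \;=\; \mathcal{L}^1\big(\underline{u}^{(p-1)}\big) \;-\; \mathcal{L}^2\big(\underline{u}^{(p-1)}\big), \qquad p = 1, \dots, P .
\]
Under standard accuracy and contraction assumptions on the two operators, each iteration gains one order of accuracy, which is why the optimal number of iterations $P$ equals the formal order of the scheme; the $p$-adaptive modifications mentioned above then match the degree of the polynomial reconstruction at iteration $p$ to the accuracy attainable at that iteration.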
Post-selection inference (PoSI) is a statistical technique for obtaining valid confidence intervals and p-values when hypothesis generation and testing use the same source of data. PoSI can be used with a range of popular algorithms, including the Lasso. Data carving is a variant of PoSI in which a portion of held-out data is combined with the hypothesis-generating data at inference time. While data carving has attractive theoretical and empirical properties, existing approaches rely on computationally expensive MCMC methods to carry out inference. This paper's key contribution is to show that pivotal quantities can be constructed for the data carving procedure based on a known parametric distribution. Specifically, when the selection event is characterized by a set of polyhedral constraints on a Gaussian response, the data carving statistic follows the sum of a normal and a truncated normal (SNTN), a variant of the truncated bivariate normal distribution. The main impact of this insight is that exact inference for data carving becomes computationally trivial, since the CDF of the SNTN distribution can be evaluated using the CDF of a standard bivariate normal. A Python package, sntn, has been released to further facilitate the adoption of data carving with PoSI.
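The distributional claim can be checked directly. If $X \sim N(\mu_1, \sigma_1^2)$ and $Y$ is $N(\mu_2, \sigma_2^2)$ truncated to $[a, b]$, with $Y'$ denoting the untruncated version, then $(X + Y', Y')$ is bivariate normal, so the CDF of $W = X + Y$ reduces to a ratio of differences of (bivariate) normal CDFs. The sketch below implements this identity with scipy; the parametrization used by the released sntn package may differ.
\begin{verbatim}
import numpy as np
from scipy.stats import norm, multivariate_normal, truncnorm

def sntn_cdf(w, mu1, s1, mu2, s2, a, b):
    """CDF of W = X + Y with X ~ N(mu1, s1^2) and Y ~ N(mu2, s2^2)
    truncated to [a, b], computed as
    P(X + Y' <= w, a <= Y' <= b) / P(a <= Y' <= b),
    where (X + Y', Y') is bivariate normal."""
    mean = [mu1 + mu2, mu2]
    cov = [[s1 ** 2 + s2 ** 2, s2 ** 2],
           [s2 ** 2,           s2 ** 2]]
    bvn = multivariate_normal(mean=mean, cov=cov)
    num = bvn.cdf([w, b]) - bvn.cdf([w, a])
    den = norm.cdf((b - mu2) / s2) - norm.cdf((a - mu2) / s2)
    return num / den

# Sanity check against Monte Carlo.
mu1, s1, mu2, s2, a, b = 0.0, 1.0, 0.5, 2.0, -1.0, 3.0
rng = np.random.default_rng(0)
y = truncnorm.rvs((a - mu2) / s2, (b - mu2) / s2, loc=mu2, scale=s2,
                  size=200_000, random_state=rng)
x = rng.normal(mu1, s1, size=y.size)
print(sntn_cdf(1.0, mu1, s1, mu2, s2, a, b), np.mean(x + y <= 1.0))
\end{verbatim}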
In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. We prove that, if the data concentrate around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, because this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first estimator of the data manifold dimension based on diffusion models, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
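As a loose illustration of the idea (not the authors' exact algorithm), the sketch below uses the closed-form score of Gaussian data supported on a $d$-dimensional linear subspace of $\mathbb{R}^D$ and corrupted with small isotropic noise. The score vectors evaluated near the manifold concentrate in the $(D-d)$-dimensional normal space, so counting the dominant singular values of a matrix of scores recovers the codimension and hence the intrinsic dimension.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
D, d, sigma = 10, 3, 0.01      # ambient dim, intrinsic dim, corruption level

# Toy "manifold": a d-dimensional linear subspace spanned by the columns of U.
U = np.linalg.qr(rng.normal(size=(D, d)))[0]

# The noise-corrupted density is N(0, U U^T + sigma^2 I); its score,
# s(x) = -(U U^T + sigma^2 I)^{-1} x, stands in for a trained diffusion model.
precision = np.linalg.inv(U @ U.T + sigma ** 2 * np.eye(D))
score = lambda x: -precision @ x

# Evaluate the score at noisy samples near the manifold.
X = rng.normal(size=(500, d)) @ U.T + sigma * rng.normal(size=(500, D))
S = np.array([score(x) for x in X])
svals = np.linalg.svd(S, compute_uv=False)

# The spectrum drops sharply after D - d directions (the normal space).
codim = int(np.sum(svals > 0.5 * svals[0]))   # crude threshold for this sketch
print("estimated intrinsic dimension:", D - codim)
\end{verbatim}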
We study the problem of crowdsourced PAC learning of threshold functions. This is a challenging problem, and only recently have query-efficient algorithms been established under the assumption that a noticeable fraction of the workers are perfect. In this work, we investigate a more challenging case in which the majority may behave adversarially and the rest behave according to the Massart noise model, a significant generalization of the perfectness assumption. We show that under the semi-verified model of Charikar et al. (2017), where we have (limited) access to a trusted oracle who always returns correct annotations, it is possible to PAC learn the underlying hypothesis class with a manageable number of label queries. Moreover, we show that the labeling cost can be drastically mitigated via the more easily obtained comparison queries. Orthogonal to recent developments in semi-verified or list-decodable learning that crucially rely on data distributional assumptions, our PAC guarantee holds by exploiting the wisdom of the crowd.
Since deep neural networks were developed, they have made huge contributions to everyday life. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks remain challenging and unpredictable procedures. To lower the technical threshold for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academia and industry. This paper provides a review of the most essential topics in HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods for defining their value ranges. The review then focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy, especially for deep learning networks. Next, it reviews major services and toolkits for HPO, comparing their support for state-of-the-art search algorithms, compatibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with the problems that arise when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.
Causal inference has been a critical research topic across many domains, such as statistics, computer science, education, public policy, and economics, for decades. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and the low budget required, compared with randomized controlled trials. Coupled with the rapidly developing machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the most well-known causal inference frameworks. The methods are divided into two categories depending on whether or not they require all three assumptions of the potential outcome framework. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. Plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine, and so on. Moreover, the commonly used benchmark datasets as well as open-source codes are summarized, which facilitates researchers and practitioners in exploring, evaluating, and applying the causal inference methods.