
A good data visualization is not only a distortion-free graphical representation of data but also a way to reveal its underlying statistical properties. Despite its common use across various stages of data analysis, selecting a good visualization is often a manual process involving many iterations. Recently there has been interest in reducing this effort by developing models that recommend visualizations, but these are of limited use because they require large training samples (data and visualization pairs) and focus primarily on design aspects rather than on assessing the effectiveness of the selected visualization. In this paper, we present VizAI, a generative-discriminative framework that first generates various statistical properties of the data from a number of alternative visualizations of it, coupled with a discriminative model that selects the visualization whose generated statistics best match the true statistics of the data being visualized. VizAI can be trained with minimal supervision and adapts easily to settings with varying degrees of supervision. Using crowd-sourced judgements and a large repository of publicly available visualizations, we demonstrate that VizAI outperforms state-of-the-art methods that learn to recommend visualizations.
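
The abstract describes the selection step only at a high level, so the following is a minimal sketch of the generative-discriminative idea under assumed interfaces: hypothetical per-chart generators predict summary statistics of the data, and a discriminative step picks the chart type whose predicted statistics are closest to the statistics actually computed from the data. All names here (`CHART_TYPES`, `predict_stats_from_chart`) are illustrative placeholders, not the authors' API.

```python
import numpy as np

# Hypothetical chart types considered by the recommender.
CHART_TYPES = ["bar", "line", "scatter", "histogram"]

def data_stats(x):
    """Summary statistics computed directly from the raw data column."""
    return np.array([np.mean(x), np.std(x), np.median(x)])

def predict_stats_from_chart(x, chart_type):
    """Placeholder for a per-chart generative model that reconstructs the
    data's statistics from a rendering of x as chart_type.  Here we simply
    add chart-dependent noise to mimic imperfect generators."""
    rng = np.random.default_rng(abs(hash(chart_type)) % (2**32))
    return data_stats(x) + rng.normal(scale=0.1, size=3)

def recommend_chart(x):
    """Discriminative step: pick the chart whose generated statistics
    best match the statistics of the data being visualized."""
    truth = data_stats(x)
    errors = {c: np.linalg.norm(predict_stats_from_chart(x, c) - truth)
              for c in CHART_TYPES}
    return min(errors, key=errors.get)

x = np.random.default_rng(0).normal(size=200)
print(recommend_chart(x))
```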

Related Content

ConsumerCheck is an open-source data analysis software package tailored for the analysis of sensory and consumer data. Since some of the implemented methods, such as PCA, PLSR and PCR, are generic, data from other domains may also be analysed with ConsumerCheck. The software comes with a graphical user interface and thus gives non-statisticians and users without programming skills free access to a number of widely used analysis methods within the field of sensory and consumer science. Computational results are presented in plots that are easily generated from the tree controls within the graphical user interface. Since the construction of conjoint analysis models is not always straightforward, ConsumerCheck provides three predefined model structures of different complexity. ConsumerCheck is an ongoing research project, and the objective is to implement further statistical methods over time.
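
ConsumerCheck itself is a GUI application, but the generic methods it exposes (PCA, PLSR, PCR) are standard. As a rough illustration of what such an analysis looks like programmatically, here is a scikit-learn sketch on synthetic sensory-style data; the data, dimensions, and variable names are made up for illustration and are not tied to ConsumerCheck's internals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for a sensory panel: 30 products x 8 attributes,
# plus consumer liking scores for the same 30 products.
sensory = rng.normal(size=(30, 8))
liking = sensory @ rng.normal(size=8) + rng.normal(scale=0.5, size=30)

# PCA of the sensory profiles (scores/loadings are what a GUI would plot).
pca = PCA(n_components=2)
scores = pca.fit_transform(sensory)
print("explained variance ratio:", pca.explained_variance_ratio_)

# PLSR relating sensory attributes to consumer liking.
pls = PLSRegression(n_components=2)
pls.fit(sensory, liking)
print("PLSR R^2:", pls.score(sensory, liking))
```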

A common statistical problem is inference from positive-valued multivariate measurements where the scale (e.g., sum) of the measurements is not representative of the scale (e.g., total size) of the system being studied. This situation is common in the analysis of modern sequencing data. The field of Compositional Data Analysis (CoDA) axiomatically states that analyses must be invariant to scale. Yet many scientific questions rely on the unmeasured system scale for identifiability. Instead, many existing tools make a wide variety of assumptions to identify models, often imputing the unmeasured scale. Here, we analyze the theoretical limits on inference given these data and formalize the assumptions required to provide principled scale-reliant inference. Using statistical concepts such as consistency and calibration, we show how to provide guidance on making scale-reliant inference from these data. We prove that the Frequentist ideal is often unachievable and that existing methods can exhibit bias and a breakdown of Type-I error control. We introduce scale simulation estimators and scale sensitivity analysis as a rigorous, flexible, and computationally efficient means of performing scale-reliant inference.
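
The abstract does not give the estimator's form, so the following is only a schematic sketch of the general idea of scale simulation as we read it: draw plausible values of the unmeasured system scale from an assumed distribution, combine them with the observed composition, and report how a downstream quantity (here, a log fold change between two conditions) varies over those draws. Every modelling choice below (the log-normal scale model, the quantity of interest) is an assumption made for illustration, not the authors' estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed relative abundances (compositions) of one taxon in two conditions.
rel_a, rel_b = 0.02, 0.03

# Scale simulation: the total scale of each condition is unmeasured, so draw
# it from an assumed distribution and propagate it to the quantity of interest.
log_scale_a = rng.normal(loc=10.0, scale=0.5, size=5000)   # assumed scale model
log_scale_b = rng.normal(loc=10.0, scale=0.5, size=5000)

# Log fold change of the *absolute* abundance under each simulated scale.
lfc = (np.log(rel_b) + log_scale_b) - (np.log(rel_a) + log_scale_a)

print("summary of the simulated log fold change:")
print("mean:", lfc.mean(), "95% interval:", np.quantile(lfc, [0.025, 0.975]))
```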

The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly in the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations about the behavior of the regression target. Here we take a different, unsupervised viewpoint, aiming to determine the basis that encodes as compactly as possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can determine a unique basis that is optimal in this sense and that can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when constructing representations that correspond to high-body-order correlations. We present examples involving both molecular and condensed-phase machine-learning models.
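
One plausible reading of the construction (hedged, since the abstract omits details): collect the expansion coefficients of the atom density on the primitive basis over the training set, diagonalize their covariance, and keep the leading eigenvectors as the contracted, dataset-adapted basis; splines would then tabulate the resulting contracted radial functions. The sketch below shows only the covariance/eigenvector step on placeholder coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder: expansion coefficients of the atom density on a primitive
# basis of size 20, collected over 1000 atomic environments.
coeffs = rng.normal(size=(1000, 20))

# Covariance of the coefficients over the dataset.
cov = np.cov(coeffs, rowvar=False)

# Eigenvectors with the largest eigenvalues define the contraction that
# packs the most structural variance into the fewest basis functions.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
n_opt = 5                                  # number of optimal basis functions
contraction = eigvecs[:, order[:n_opt]]    # primitive -> optimal basis

contracted = coeffs @ contraction          # coefficients in the optimal basis
print(contracted.shape)                    # (1000, 5)
```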

Recently, several image segmentation methods that welcome and leverage different types of user assistance have been developed. In these methods, the user input can be provided by drawing bounding boxes over image objects, by drawing scribbles or planting seeds that help to differentiate between image regions, or by interactively refining mis-segmented image regions. Due to the variety in the types and amounts of these inputs, relative assessment of different segmentation methods becomes difficult. As a possible solution, we propose a simple yet effective statistical segmentation method that can handle and utilize different input types and amounts. The proposed method is based on robust hypothesis testing, specifically the DGL test, and can be implemented with time complexity that is linear in the number of pixels and quadratic in the number of image regions. It is therefore suitable as a baseline method for quick benchmarking and for assessing the relative performance improvements of different types of user-assisted segmentation algorithms. We provide a mathematical analysis of the operation of the proposed method, discuss its capabilities and limitations, provide design guidelines, and present simulations that validate its operation.
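
The abstract names the DGL test but does not state the decision rule, so the snippet below is only a schematic stand-in: it compares the intensity histogram of a candidate pixel's local window with the histogram of each user-seeded region and assigns the pixel to the closest region in total-variation distance. It is meant to convey the flavour of histogram-based robust testing, not the actual method of the paper; all names and parameters are illustrative.

```python
import numpy as np

def histogram(values, bins=16):
    """Normalized intensity histogram on [0, 1]."""
    h, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    return h / max(h.sum(), 1)

def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def assign_region(window, seed_regions, bins=16):
    """Assign a pixel (via its local window of intensities) to the seeded
    region whose empirical distribution is closest in TV distance."""
    pw = histogram(window, bins)
    dists = {name: tv_distance(pw, histogram(vals, bins))
             for name, vals in seed_regions.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(0)
seeds = {"object": rng.beta(5, 2, 500), "background": rng.beta(2, 5, 500)}
print(assign_region(rng.beta(5, 2, 49), seeds))   # expected: "object"
```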

We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying a power transformation to the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method "GCWSNet". Our extensive experiments show that GCWSNet often improves classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster; in fact, GCWS often reaches a reasonable accuracy with (less than) one epoch of training. This property is highly desirable because many applications, such as advertisement click-through rate (CTR) prediction models or data streams (i.e., data seen only once), often train for just one epoch. Another beneficial side effect is that the computations of the first layer of the neural network become additions instead of multiplications, because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet via count-sketch and develop the theory for analyzing the impact of count-sketch on the accuracy of GCWS. Our analysis shows that an "8-bit" strategy should work well, in that we can always apply 8-bit count-sketch hashing to the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS to the outputs of the last layer to boost the accuracy of trained deep neural networks.
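
The pipeline is described at a high level, so here is only a numpy sketch of the two ingredients named in the abstract as we understand them: a sign-split power transformation of the input (our reading of the pGMM construction) followed by an Ioffe-style consistent weighted sampling hash. The resulting (index, integer) pairs would then be one-hot encoded to form the sparse binary input of the network. Parameter choices are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def pgmm_transform(x, p=1.0):
    """Split each coordinate into its positive and negative parts and apply
    the power transformation (our reading of the pGMM construction)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([np.maximum(x, 0.0) ** p, np.maximum(-x, 0.0) ** p])

def cws_hash(weights, rng):
    """One Ioffe-style consistent weighted sampling hash of a non-negative
    weight vector; returns an (index, integer) pair.  To compare two vectors,
    the same rng seed (same r, c, beta per coordinate) must be reused so that
    hash collisions estimate the kernel similarity."""
    d = weights.size
    r = rng.gamma(2.0, 1.0, size=d)
    c = rng.gamma(2.0, 1.0, size=d)
    beta = rng.uniform(0.0, 1.0, size=d)
    idx = np.flatnonzero(weights > 0)
    t = np.floor(np.log(weights[idx]) / r[idx] + beta[idx])
    y = np.exp(r[idx] * (t - beta[idx]))
    a = c[idx] / (y * np.exp(r[idx]))
    k = int(np.argmin(a))
    return int(idx[k]), int(t[k])

rng = np.random.default_rng(0)
x = rng.normal(size=8)
hashes = [cws_hash(pgmm_transform(x, p=0.5), np.random.default_rng(seed))
          for seed in range(4)]     # 4 independent hashes of the same vector
print(hashes)
```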

Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and enable us to characterize scaling rates thereof that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above, with an appropriate NN growth rate, are near minimax rate-optimal, achieving the parametric rate up to logarithmic factors.
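
As a concrete instance of the variational forms being parametrized, the Donsker-Varadhan representation of the Kullback-Leibler divergence is one common choice; the neural estimator restricts the supremum to a shallow NN class $\mathcal{F}_{\mathsf{NN}}$ (our notation) and replaces expectations by sample means, with $X_i \sim P$ and $Y_j \sim Q$. The exact variational form used in the paper may differ.

```latex
\[
\mathsf{D}_{\mathsf{KL}}(P\|Q)
  \;=\; \sup_{f}\Big\{ \mathbb{E}_P[f] - \log \mathbb{E}_Q\big[e^{f}\big] \Big\},
\qquad
\widehat{\mathsf{D}}_{\mathsf{KL}}
  \;=\; \sup_{f \in \mathcal{F}_{\mathsf{NN}}}
        \Big\{ \tfrac{1}{n}\sum_{i=1}^{n} f(X_i)
               - \log \tfrac{1}{m}\sum_{j=1}^{m} e^{f(Y_j)} \Big\}.
\]
```

The approximation error comes from restricting the supremum to $\mathcal{F}_{\mathsf{NN}}$, while the empirical estimation error comes from replacing the expectations by sample means; the tradeoff between the two is what the non-asymptotic bounds quantify.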

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
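
To make the "minimal norm functions that perfectly fit the training data" concrete in the simplest setting: for overparametrized linear regression with data matrix $X \in \mathbb{R}^{n \times d}$, labels $y \in \mathbb{R}^{n}$, $d > n$, and $XX^\top$ invertible (notation ours), gradient descent or gradient flow initialized at zero on the squared loss converges to the minimum-norm interpolant, and benign-overfitting analyses decompose this estimator along the leading (useful) and trailing (spiky) directions of the data covariance.

```latex
\[
\hat{\theta}
  \;=\; \arg\min_{\theta \in \mathbb{R}^{d}} \big\{ \|\theta\|_2 \;:\; X\theta = y \big\}
  \;=\; X^{\top}\big(XX^{\top}\big)^{-1} y .
\]
```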

Modern neural network training relies heavily on data augmentation for improved generalization. After the initial success of label-preserving augmentations, there has been a recent surge of interest in label-perturbing approaches, which combine features and labels across training samples to smooth the learned decision surface. In this paper, we propose a new augmentation method that leverages the first and second moments extracted and re-injected by feature normalization. We replace the moments of the learned features of one training image by those of another, and also interpolate the target labels. As our approach is fast, operates entirely in feature space, and mixes different signals than prior methods, one can effectively combine it with existing augmentation methods. We demonstrate its efficacy across benchmark data sets in computer vision, speech, and natural language processing, where it consistently improves the generalization performance of highly competitive baseline networks.
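
Following the description in the abstract (replace the first and second moments of one sample's features with those of another and interpolate the target labels), a minimal numpy sketch of one such mixing step might look as follows. The mixing coefficient and the axes over which the moments are taken are illustrative choices, not necessarily those of the paper.

```python
import numpy as np

def moment_exchange(feat_a, feat_b, y_a, y_b, lam=0.9, eps=1e-5):
    """Replace the per-sample mean/std of feat_a with those of feat_b
    (moments taken over all feature dimensions) and interpolate labels."""
    mu_a, sig_a = feat_a.mean(), feat_a.std() + eps
    mu_b, sig_b = feat_b.mean(), feat_b.std() + eps
    mixed_feat = (feat_a - mu_a) / sig_a * sig_b + mu_b   # A's shape, B's moments
    mixed_label = lam * y_a + (1.0 - lam) * y_b           # interpolated target
    return mixed_feat, mixed_label

rng = np.random.default_rng(0)
fa = rng.normal(1.0, 2.0, size=(16, 8, 8))    # feature map of sample A
fb = rng.normal(-1.0, 0.5, size=(16, 8, 8))   # feature map of sample B
ya, yb = np.eye(10)[3], np.eye(10)[7]         # one-hot labels
mf, ml = moment_exchange(fa, fb, ya, yb)
print(mf.mean(), mf.std(), ml[3], ml[7])
```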

In recent years, with the rise of Cloud Computing (CC), many companies providing services in the cloud have added a new series of services to their catalogues, such as data mining (DM) and data processing, taking advantage of the vast computing resources available to them. Different service definition proposals have been put forward to address the problem of describing CC services in a comprehensive way. Bearing in mind that each provider has its own definition of the logic of its services, and specifically of its DM services, the ability to describe services in a flexible way across providers is fundamental for maintaining the usability and portability of this type of CC service. The use of semantic technologies based on the Linked Data (LD) proposal for the definition of services allows DM services to be designed and modelled with a high degree of interoperability. In this article, a schema for the definition of DM services on CC is presented; it covers the key aspects of a CC service, such as prices, interfaces, Service Level Agreements, instances and experimentation workflows, among others. The proposal is based on LD, so it reuses other schemata to obtain a better definition of the service. To validate the schema, a series of DM services has been created in which some well-known algorithms, such as \textit{Random Forest} and \textit{KMeans}, are modelled as services.
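
To give a feel for what a Linked-Data-based description of a DM service could look like, here is a tiny, entirely hypothetical example built with rdflib. The vocabulary terms (dms:Service, dms:algorithm, dms:pricePerHour, dms:interface) and the URIs are placeholders for illustration, not the schema proposed in the article.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD

DMS = Namespace("http://example.org/dm-service#")   # hypothetical vocabulary

g = Graph()
g.bind("dms", DMS)

# A hypothetical Random Forest classification service offered in the cloud.
svc = URIRef("http://example.org/services/random-forest")
g.add((svc, RDF.type, DMS.Service))
g.add((svc, RDFS.label, Literal("Random Forest classification service")))
g.add((svc, DMS.algorithm, Literal("RandomForest")))
g.add((svc, DMS.pricePerHour, Literal("0.12", datatype=XSD.decimal)))
g.add((svc, DMS.interface, URIRef("http://example.org/services/random-forest/api")))

print(g.serialize(format="turtle"))
```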

Bar charts are an effective way for humans to convey information to each other, but today's algorithms cannot parse them. Existing methods fail when faced with minor variations in appearance. Here, we present DVQA, a dataset that tests many aspects of bar chart understanding in a question answering framework. Unlike visual question answering (VQA), DVQA requires processing words and answers that are unique to a particular bar chart. State-of-the-art VQA algorithms perform poorly on DVQA, and we propose two strong baselines that perform considerably better. Our work will enable algorithms to automatically extract semantic information from vast quantities of literature in science, business, and other areas.
