Missing data is a commonly occurring problem in practice. Many imputation methods have been developed to fill in the missing entries. However, not all of them can scale to high-dimensional data, especially the multiple imputation techniques. Meanwhile, data nowadays tends to be high-dimensional. Therefore, in this work, we propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) that speeds up the imputation process and alleviates the memory issues of many available imputation techniques, without sacrificing imputation quality in terms of MSE. In addition, the framework can be used even when some or all of the missing features are categorical, or when the number of missing features is large. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI to classification problems with some adjustments. We validate our approach with experiments on various scenarios, which show that PCAI and PIC work with various imputation algorithms, including state-of-the-art ones, and improve imputation speed significantly, while achieving competitive mean squared error/classification accuracy compared to direct imputation (i.e., imputing directly on the missing data).
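To make the idea concrete, here is a minimal sketch of a PCAI-style pipeline, assuming the fully observed features are first compressed with PCA and an off-the-shelf imputer is then run on the reduced data; the function name, data shapes, and the choice of KNN imputation are illustrative assumptions, not the paper's implementation.

```python
# Minimal PCAI-style sketch: compress the fully observed block with PCA, then
# run any off-the-shelf imputer on (reduced features + missing features).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer

def pcai_impute(X_observed, X_missing, n_components=10, imputer=None):
    """X_observed: columns with no missing values; X_missing: columns with NaNs."""
    imputer = imputer or KNNImputer()
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(X_observed)          # compress the observed block
    stacked = np.hstack([Z, X_missing])        # reduced block + missing block
    imputed = imputer.fit_transform(stacked)   # impute in the reduced space
    return imputed[:, Z.shape[1]:]             # return the imputed missing block

# Toy usage
rng = np.random.default_rng(0)
X_obs = rng.normal(size=(200, 50))
X_mis = rng.normal(size=(200, 5))
X_mis[rng.random(X_mis.shape) < 0.2] = np.nan
X_filled = pcai_impute(X_obs, X_mis, n_components=10)
```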
The categorization of massive e-Commerce data is a crucial, well-studied task that is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company and serves multiple applications. At its core, the product categorization model is a text classification model that takes a product title as input and outputs the most suitable category out of thousands of available candidates. Upon closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements substantially changed the model's output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a sub-optimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model's consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.
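As a toy illustration of the consistency issue described above (not the proposed framework itself), one can perturb product titles with surface edits such as color or measurement changes and measure how often a classifier's predicted category stays the same; `classify` and the perturbations below are hypothetical stand-ins for the production model and its failure modes.

```python
# Toy consistency check: does a classifier keep the same category under
# surface-level edits of the product title? `classify` is a hypothetical
# stand-in for the production categorization model.
def consistency_rate(classify, titles, perturbations):
    """Fraction of (title, perturbed title) pairs assigned the same category."""
    consistent, total = 0, 0
    for title in titles:
        base = classify(title)
        for perturb in perturbations:
            consistent += int(classify(perturb(title)) == base)
            total += 1
    return consistent / total if total else 1.0

perturbations = [
    lambda t: t + " - red",               # append a color
    lambda t: t.replace("10cm", "12cm"),  # tweak a measurement
]
```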
Leveraging medical record information in the era of big data and machine learning comes with the caveat that data must be cleaned and deidentified. Facilitating data sharing and harmonization for multi-center collaborations is particularly difficult when protected health information (PHI) is contained or embedded in image meta-data. We propose a novel Python library, called PyLogik, to help alleviate this issue for ultrasound images, which are particularly challenging because PHI is frequently included directly on the images. PyLogik processes image volumes through a series of text detection/extraction, filtering, thresholding, morphological, and contour-comparison steps. This methodology deidentifies the images, reduces file sizes, and prepares image volumes for applications in deep learning and data sharing. To evaluate its effectiveness in identifying regions of interest (ROIs), a random sample of 50 cardiac ultrasounds (echocardiograms) was processed through PyLogik, and the outputs were compared with manual segmentations by an expert user. The Dice coefficient between the two approaches averaged 0.976. Next, we investigated the degree of information compression achieved by the algorithm: the resulting data were, on average, approximately 72% smaller after processing by PyLogik. Our results suggest that PyLogik is a viable methodology for ultrasound data cleaning and deidentification, ROI determination, and file compression, which will facilitate efficient storage, use, and dissemination of ultrasound data.
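The sketch below is not the PyLogik API; under assumed parameter choices, it only illustrates the kind of threshold/morphology/contour pipeline the abstract describes, keeping the largest contour (assumed to be the ultrasound beam) and masking everything else, which removes burned-in text around the border.

```python
# Illustrative threshold + morphology + contour masking (not PyLogik itself):
# keep the largest contour as the ROI and zero out the rest of the frame.
import cv2
import numpy as np

def mask_to_largest_contour(frame_gray):
    _, binary = cv2.threshold(frame_gray, 10, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # drop thin text
    contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.zeros_like(frame_gray)
    roi = max(contours, key=cv2.contourArea)                     # assumed beam/ROI
    mask = np.zeros_like(frame_gray)
    cv2.drawContours(mask, [roi], -1, 255, thickness=cv2.FILLED)
    return cv2.bitwise_and(frame_gray, mask)                     # zero outside ROI
```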
Digital transformation in buildings accumulates massive amounts of operational data, which calls for smart solutions that utilize these data to improve energy performance. This study proposes such a solution, namely Deep Energy Twin, which integrates deep learning and digital twins to better understand building energy use and identify potential energy-efficiency improvements. Ontology was adopted to create parametric digital twins that provide consistency of data format across different systems in a building. Based on the created digital twins and collected data, deep learning methods were used to perform data analytics, identify patterns, and provide insights for energy optimization. As a demonstration, a case study was conducted in a public historic building in Norrk\"oping, Sweden, comparing the performance of state-of-the-art deep learning architectures in building energy forecasting.
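As a rough illustration of the forecasting component (not the paper's configuration), a small LSTM that maps a 24-hour lookback window of energy use to a next-hour prediction could look as follows; the architecture, hyperparameters, and data shapes are assumptions.

```python
# Illustrative building-energy forecaster: hourly energy use -> next-hour value.
import torch
import torch.nn as nn

class EnergyForecaster(nn.Module):
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, lookback, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # predict the next-hour energy use

model = EnergyForecaster()
window = torch.randn(8, 24, 1)            # batch of 24-hour lookback windows
next_hour = model(window)                 # shape (8, 1)
```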
Recent advances in self-supervised learning and neural network scaling have enabled the creation of large models -- known as foundation models -- which can be easily adapted to a wide range of downstream tasks. The current paradigm for comparing foundation models involves benchmarking them with aggregate metrics on various curated datasets. Unfortunately, this method of model comparison is heavily dependent on the choice of metric, which makes it unsuitable for situations where the ideal metric is either not obvious or unavailable. In this work, we present a metric-free methodology for comparing foundation models via their embedding space geometry. Our methodology is grounded in random graph theory, and facilitates both pointwise and multi-model comparison. Further, we demonstrate how our framework can be used to induce a manifold of models equipped with a distance function that correlates strongly with several downstream metrics.
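The snippet below is not the paper's random-graph construction; it is a toy, metric-free comparison of two embedding spaces via the overlap of their k-nearest-neighbor graphs on the same inputs, a common geometry-based proxy, with `k` chosen arbitrarily.

```python
# Toy metric-free comparison: how similar are the neighborhood structures that
# two models induce on the same inputs?
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph_overlap(emb_a, emb_b, k=10):
    """Average Jaccard overlap of k-NN sets computed in each embedding space."""
    nn_a = NearestNeighbors(n_neighbors=k + 1).fit(emb_a)
    nn_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b)
    idx_a = nn_a.kneighbors(emb_a, return_distance=False)[:, 1:]  # drop self
    idx_b = nn_b.kneighbors(emb_b, return_distance=False)[:, 1:]
    overlaps = []
    for a, b in zip(idx_a, idx_b):
        sa, sb = set(a), set(b)
        overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps))
```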
Texture mapping, a fundamental task in 3D modeling, has been well established for well-acquired aerial assets under consistent illumination, yet it remains a challenge when scaled to large datasets with images taken under varying views and illuminations. A well-performing texture mapping algorithm must be able to efficiently select views, fuse and map textures from these views onto mesh models, and, at the same time, achieve consistent radiometry over the entire model. Existing approaches achieve efficiency either by limiting the number of images to one view per face or by simplifying global inference to achieve only local color consistency. In this paper, we break this trade-off by proposing a novel and efficient texture mapping framework that allows the use of multiple texture views per face while achieving global color consistency. The proposed method leverages a loopy belief propagation algorithm to perform efficient, global-level probabilistic inference that ranks candidate views per face, which enables face-level multi-view texture fusion and blending. Being non-parametric, the texture fusion algorithm offers another advantage over typical parametric post color correction methods: improved robustness to non-linear illumination differences. Experiments on three different types of datasets (i.e., a satellite dataset, an unmanned aerial vehicle dataset, and a close-range dataset) show that the proposed method produces visually pleasing and texturally consistent results in all scenarios, with the added advantage of lower running time compared to state-of-the-art methods, especially for large-scale datasets such as satellite-derived models.
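The following is a schematic min-sum loopy belief propagation over a face-adjacency graph that ranks candidate views per face from unary costs (per-face view quality) and pairwise costs (a penalty when neighboring faces select different views); the Potts-style costs and inputs are toy placeholders, not the paper's formulation.

```python
# Schematic min-sum loopy BP for per-face view ranking on a face-adjacency graph.
import numpy as np

def loopy_bp_view_ranking(unary, edges, smoothness=1.0, iters=20):
    """unary: (n_faces, n_views) view costs; edges: undirected (i, j) face pairs."""
    n_faces, n_views = unary.shape
    pairwise = smoothness * (1.0 - np.eye(n_views))     # Potts-style penalty
    neighbors = {f: [] for f in range(n_faces)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    msg = {(i, j): np.zeros(n_views) for i in neighbors for j in neighbors[i]}
    for _ in range(iters):                              # min-sum message updates
        for (i, j) in msg:
            belief_i = unary[i] + sum(msg[(k, i)] for k in neighbors[i] if k != j)
            msg[(i, j)] = (belief_i[:, None] + pairwise).min(axis=0)
    beliefs = np.array([unary[f] + sum(msg[(k, f)] for k in neighbors[f])
                        for f in range(n_faces)])
    return np.argsort(beliefs, axis=1)                  # per-face ranking, best view first
```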
The three-step alternating iteration scheme for finding an iterative solution of a singular (or nonsingular) linear system in a faster way was recently introduced by Nandi {\it et al.} [Numer. Algorithms, 84 (2) (2020) 457-483]. The authors provided its convergence criteria for a class of matrix splittings called proper G-weak regular splittings of type I. In this note, we further analyze the convergence criteria of the same scheme. In particular, we obtain sufficient conditions for its convergence for another class of matrix splittings, proper G-weak regular splittings of type II. We then show that this scheme converges faster than the two-step alternating and usual iteration schemes, even for this class of splittings. As a particular case, we also establish faster convergence criteria for the three-step scheme in a nonsingular matrix setting. It is shown that the single-step and two-step alternating iterative methods require more computational time and memory than the three-step alternating iteration method to solve nonsingular linear systems. Finally, the semiconvergence of the three-step alternating iterative scheme is established, and its faster semiconvergence is demonstrated on a singular linear system arising from a Markov process.
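The specific scheme and splittings of Nandi et al. are not reproduced here; as a generic illustration, a three-step alternating iteration built from three splittings $A = M_1 - N_1 = M_2 - N_2 = M_3 - N_3$ of the coefficient matrix for $Ax = b$ can be written as

```latex
\begin{align*}
  x^{k+1/3} &= M_1^{-1} N_1\, x^{k}     + M_1^{-1} b,\\
  x^{k+2/3} &= M_2^{-1} N_2\, x^{k+1/3} + M_2^{-1} b,\\
  x^{k+1}   &= M_3^{-1} N_3\, x^{k+2/3} + M_3^{-1} b,
\end{align*}
```

so that one outer iteration applies the three splittings in turn, and convergence is governed by the spectral radius of the composite iteration matrix $M_3^{-1}N_3\,M_2^{-1}N_2\,M_1^{-1}N_1$.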
The imputation of missing values represents a significant obstacle for many real-world data analysis pipelines. Here, we focus on time series data and put forward SSSD, an imputation model that relies on two emerging technologies: (conditional) diffusion models as state-of-the-art generative models and structured state space models as the internal model architecture, which are particularly suited to capturing long-term dependencies in time series data. We demonstrate that SSSD matches or even exceeds state-of-the-art probabilistic imputation and forecasting performance on a broad range of data sets and different missingness scenarios, including the challenging blackout-missing scenarios, where prior approaches failed to provide meaningful results.
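The snippet below is not the SSSD implementation; it only sketches the conditioning logic typical of diffusion-based imputers, where observed entries are held fixed while only the missing ("blackout") region is diffused and denoised, with `denoise_step` a hypothetical stand-in for the structured state space denoising network.

```python
# Schematic conditioning for diffusion-based imputation (not SSSD itself).
import torch

def blackout_mask(length, start, width):
    """1 = observed, 0 = missing; a contiguous blackout gap of `width` steps."""
    mask = torch.ones(length)
    mask[start:start + width] = 0.0
    return mask

def impose_observations(x_t, x_obs, mask):
    """Keep observed values fixed; only the missing region is generated."""
    return mask * x_obs + (1.0 - mask) * x_t

# One schematic reverse-diffusion update for a univariate series
x_obs = torch.randn(100)                      # observed series (gap values unused)
mask = blackout_mask(100, start=40, width=20)
x_t = torch.randn(100)                        # current noisy estimate
# x_t = denoise_step(x_t, mask, t)            # hypothetical denoising network call
x_t = impose_observations(x_t, x_obs, mask)   # re-impose the observed entries
```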
The performance of Large Language Models (LLMs) in reasoning tasks depends heavily on prompt design, with Chain-of-Thought (CoT) and self-consistency being critical methods that enhance this ability. However, these methods do not fully exploit the answers generated by the LLM to guide subsequent responses. This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP), that enables multiple automated interactions between users and LLMs by using previously generated answers as hints to progressively guide the model toward the correct answer. PHP is orthogonal to CoT and self-consistency, making it easy to combine with state-of-the-art techniques to further improve performance. We conducted an extensive and comprehensive evaluation to demonstrate the effectiveness of the proposed method. Our experimental results on six benchmarks show that combining CoT and self-consistency with PHP significantly improves accuracy while remaining highly efficient. For instance, with text-davinci-003, we observed a 4.2% improvement on GSM8K with greedy decoding compared to Complex CoT, and a 46.17% reduction in sample paths with self-consistency. With GPT-4 and PHP, we achieve state-of-the-art performance on SVAMP (89.1% -> 91.9%), GSM8K (92% -> 95.5%), AQuA (76.4% -> 79.9%) and MATH (50.2% -> 53.9%).
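A minimal sketch of the progressive-hint interaction loop described above is given below; `ask_llm` is a hypothetical stand-in for an LLM call (e.g., with a CoT prompt), and the hint phrasing and stopping rule are illustrative rather than the paper's exact template.

```python
# Sketch of a progressive-hint loop: feed previous answers back as hints until
# consecutive answers agree. `ask_llm` is a hypothetical LLM interface.
def progressive_hint(question, ask_llm, max_rounds=5):
    hints = []
    previous = None
    for _ in range(max_rounds):
        prompt = question
        if hints:
            prompt += " (Hint: the answer is near " + ", ".join(hints) + ".)"
        answer = ask_llm(prompt)
        if answer == previous:          # stop once consecutive answers agree
            return answer
        previous = answer
        hints.append(str(answer))
    return previous
```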
Training a neural network (NN) typically relies on some type of curve-following method, such as gradient descent (GD) (and stochastic gradient descent (SGD)), ADADELTA, ADAM, or limited-memory algorithms. Convergence for these algorithms usually relies on having access to a large quantity of observations in order to achieve a high level of accuracy and, for certain classes of functions, these algorithms may require multiple epochs over the data to converge. Herein, a different technique with the potential of achieving dramatically faster convergence, especially for shallow networks, is explored: it does not curve-follow but rather relies on 'decoupling' hidden layers and on updating their weighted connections through bootstrapping, resampling, and linear regression. By utilizing resampled observations, the convergence of this process is empirically shown to be remarkably fast and to require fewer data points: in particular, our experiments show that one needs only a fraction of the observations required by traditional neural network training methods to approximate various classes of functions.
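The sketch below is not the paper's exact procedure; it illustrates the general idea under simplifying assumptions: fix a shallow hidden layer, then fit the output weights by linear regression on bootstrap resamples of the data, averaging the resulting solutions instead of following gradients.

```python
# Gradient-free sketch: random hidden layer + bootstrapped least-squares output fit.
import numpy as np

def fit_shallow_net(X, y, hidden=50, n_boot=20, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))        # 'decoupled' hidden weights
    H = np.tanh(X @ W)                               # hidden activations
    betas = []
    for _ in range(n_boot):                          # bootstrap resampling
        idx = rng.integers(0, len(X), size=len(X))
        beta, *_ = np.linalg.lstsq(H[idx], y[idx], rcond=None)
        betas.append(beta)
    beta_hat = np.mean(betas, axis=0)                # aggregate the resampled fits
    return lambda X_new: np.tanh(X_new @ W) @ beta_hat

# Example: approximate a smooth 1-D function from a small sample
X = np.linspace(-2, 2, 60)[:, None]
y = np.sin(3 * X[:, 0])
predict = fit_shallow_net(X, y)
```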
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models.
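A simplified version of a ResMLP residual block (omitting the paper's Affine normalization and LayerScale details) can be written as follows; the patch count and dimensions in the usage example are illustrative.

```python
# Simplified ResMLP block: (i) a linear layer mixing patches per channel, then
# (ii) a two-layer MLP mixing channels per patch, each with a residual connection.
import torch
import torch.nn as nn

class ResMLPBlock(nn.Module):
    def __init__(self, n_patches, dim, hidden_dim):
        super().__init__()
        self.patch_mix = nn.Linear(n_patches, n_patches)   # (i) cross-patch linear
        self.channel_mlp = nn.Sequential(                  # (ii) per-patch MLP
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))

    def forward(self, x):                # x: (batch, n_patches, dim)
        x = x + self.patch_mix(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(x)
        return x

block = ResMLPBlock(n_patches=196, dim=384, hidden_dim=1536)
out = block(torch.randn(2, 196, 384))    # output has the same shape as the input
```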