曰本中文字幕一区二区三区高清,欧美精品一区二区视频在线观看,来个一级毛片一级毛片一级毛片,国产亚洲欧美在线观看的,韩国一区二区三区在线视频

Datasets in which measurements of two (or more) types are obtained from a common set of samples arise in many scientific applications. A common problem in the exploratory analysis of such data is to identify groups of features of different data types that are strongly associated. A bimodule is a pair (A,B) of feature sets from two data types such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A,B) is stable if A coincides with the set of features that have significant aggregate correlation with the features in B, and vice-versa. This paper proposes an iterative-testing based bimodule search procedure (BSP) to identify stable bimodules. Compared to existing methods for detecting cross-correlated features, BSP was the best at recovering true bimodules with sufficient signal, while limiting the false discoveries. In addition, we applied BSP to the problem of expression quantitative trait loci (eQTL) analysis using data from the GTEx consortium. BSP identified several thousand SNP-gene bimodules. While many of the individual SNP-gene pairs appearing in the discovered bimodules were identified by standard eQTL methods, the discovered bimodules revealed genomic subnetworks that appeared to be biologically meaningful and worthy of further scientific investigation.

相關內容

可辨認的

關注 4

MoDELS · 統計量 · 置信度 · GROUP · 推斷 ·

2024 年 6 月 21 日

The Influence of Nuisance Parameter Uncertainty on Statistical Inference in Practical Data Science Models

Yunrong Wan,James Spall

For multiple reasons -- such as avoiding overtraining from one data set or because of having received numerical estimates for some parameters in a model from an alternative source -- it is sometimes useful to divide a model's parameters into one group of primary parameters and one group of nuisance parameters. However, uncertainty in the values of nuisance parameters is an inevitable factor that impacts the model's reliability. This paper examines the issue of uncertainty calculation for primary parameters of interest in the presence of nuisance parameters. We illustrate a general procedure on two distinct model forms: 1) the GARCH time series model with univariate nuisance parameter and 2) multiple hidden layer feed-forward neural network models with multivariate nuisance parameters. Leveraging an existing theoretical framework for nuisance parameter uncertainty, we show how to modify the confidence regions for the primary parameters while considering the inherent uncertainty introduced by nuisance parameters. Furthermore, our study validates the practical effectiveness of adjusted confidence regions that properly account for uncertainty in nuisance parameters. Such an adjustment helps data scientists produce results that more honestly reflect the overall uncertainty.

線性的 · 工商管理碩士（MBA） · 相似度 · 變換 · 類別 ·

2024 年 6 月 21 日

Deobfuscation of Semi-Linear Mixed Boolean-Arithmetic Expressions

Colton Skees

from arxiv, - Note that Zhou et al's 2007 paper was first to prove the N-bit to 1-bit transform, as opposed to Liu et al. in 2021 - Update email href - Apply constant propagation over multiplication by two constants before computing # of nodes - Change example MBA used in the introduction section

Mixed Boolean-Arithmetic (MBA) obfuscation is a common technique used to transform simple expressions into semantically equivalent but more complex combinations of boolean and arithmetic operators. Its widespread usage in DRM systems, malware, and software protectors is well documented. In 2021, Liu et al. proposed a groundbreaking method of simplifying linear MBAs, utilizing a hidden two-way transformation between 1-bit and n-bit variables. In 2022, Reichenwallner et al. proposed a similar but more effective method of simplifying linear MBAs, SiMBA, relying on a similar but more involved theorem. However, because current linear MBA simplifiers operate in 1-bit space, they cannot handle expressions which utilize constants inside of their bitwise operands, e.g. (x&1), (x&1111) + (y&1111). We propose an extension to SiMBA that enables simplification of this broader class of expressions. It surpasses peer tools, achieving efficient simplification of a class of MBAs that current simplifiers struggle with.

優化器 · FAST · Performer · Minimax · state-of-the-art ·

2024 年 6 月 20 日

Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods

Gen Li,Yanxi Chen,Yu Huang,Yuejie Chi,H. Vincent Poor,Yuxin Chen

Efficient computation of the optimal transport distance between two distributions serves as an algorithm subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}( n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea -- in conjunction with entropy regularization and adaptive learning rates -- to accelerate convergence.

樣例 · 統計量 · 語言模型化 · Learning · Analysis ·

2024 年 6 月 20 日

Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example Associations

Xisen Jin,Xiang Ren

from arxiv, 5 pages

Language models (LMs) are known to suffer from forgetting of previously learned examples when fine-tuned, breaking stability of deployed LM systems. Despite efforts on mitigating forgetting, few have investigated whether, and how forgotten upstream examples are associated with newly learned tasks. Insights on such associations enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples while the model learns $M$ new tasks and visualize their associations with a $M \times N$ matrix. We empirically demonstrate that the degree of forgetting can often be approximated by simple multiplicative contributions of the upstream examples and newly learned tasks. We also reveal more complicated patterns where specific subsets of examples are forgotten with statistics and visualization. Following our analysis, we predict forgetting that happens on upstream examples when learning a new task with matrix completion over the empirical associations, outperforming prior approaches that rely on trainable LMs. Project website: //inklab.usc.edu/lm-forgetting-prediction/

Shark · 哈希學習 · 不可約的 · 對角矩陣 · 分離的 ·

2024 年 6 月 18 日

A Characterization of Semi-Involutory MDS Matrices

Tapas Chatterjee,Ayantika Laha

from arxiv, 19 pages

In symmetric cryptography, maximum distance separable (MDS) matrices with computationally simple inverses have wide applications. Many block ciphers like AES, SQUARE, SHARK, and hash functions like PHOTON use an MDS matrix in the diffusion layer. In this article, we first characterize all $3 \times 3$ irreducible semi-involutory matrices over the finite field of characteristic $2$. Using this matrix characterization, we provide a necessary and sufficient condition to construct MDS semi-involutory matrices using only their diagonal entries and the entries of an associated diagonal matrix. Finally, we count the number of $3 \times 3$ semi-involutory MDS matrices over any finite field of characteristic $2$.

泛函 · 塑造 · MoDELS · 廣義函數 · 估計/估計量 ·

2024 年 6 月 18 日

Intrinsic Modeling of Shape-Constrained Functional Data, With Applications to Growth Curves and Activity Profiles

Poorbita Kundu,Hans-Georg Müller

Shape-constrained functional data encompass a wide array of application fields especially in the life sciences, such as activity profiling, growth curves, healthcare and mortality. Most existing methods for general functional data analysis often ignore that such data are subject to inherent shape constraints, while some specialized techniques rely on strict distributional assumptions. We propose an approach for modeling such data that harnesses the intrinsic geometry of functional trajectories by decomposing them into size and shape components. We focus on the two most prevalent shape constraints, positivity and monotonicity, and develop individual-level estimators for the size and shape components. Furthermore, we demonstrate the applicability of our approach by conducting subsequent analyses involving Fr\'{e}chet mean and Fr\'{e}chet regression and establish rates of convergence for the empirical estimators. Illustrative examples include simulations and data applications for activity profiles for Mediterranean fruit flies during their entire lifespan and for data from the Z\"{u}rich longitudinal growth study.

MoDELS · 可行 · 模態 · 多樣性 · Learning ·

2024 年 6 月 17 日

Feasibility of Federated Learning from Client Databases with Different Brain Diseases and MRI Modalities

Felix Wagner,Wentian Xu,Pramit Saha,Ziyun Liang,Daniel Whitehouse,David Menon,Natalie Voets,J. Alison Noble,Konstantinos Kamnitsas

Segmentation models for brain lesions in MRI are commonly developed for a specific disease and trained on data with a predefined set of MRI modalities. Each such model cannot segment the disease using data with a different set of MRI modalities, nor can it segment any other type of disease. Moreover, this training paradigm does not allow a model to benefit from learning from heterogeneous databases that may contain scans and segmentation labels for different types of brain pathologies and diverse sets of MRI modalities. Is it feasible to use Federated Learning (FL) for training a single model on client databases that contain scans and labels of different brain pathologies and diverse sets of MRI modalities? We demonstrate promising results by combining appropriate, simple, and practical modifications to the model and training strategy: Designing a model with input channels that cover the whole set of modalities available across clients, training with random modality drop, and exploring the effects of feature normalization methods. Evaluation on 7 brain MRI databases with 5 different diseases shows that such FL framework can train a single model that is shown to be very promising in segmenting all disease types seen during training. Importantly, it is able to segment these diseases in new databases that contain sets of modalities different from those in training clients. These results demonstrate, for the first time, feasibility and effectiveness of using FL to train a single segmentation model on decentralised data with diverse brain diseases and MRI modalities, a necessary step towards leveraging heterogeneous real-world databases. Code will be made available at: //github.com/FelixWag/FL-MultiDisease-MRI

MoDELS · 黑盒 · 獎勵函數 · Extensibility · Stable Diffusion ·

2024 年 6 月 16 日

Improving GFlowNets for Text-to-Image Diffusion Alignment

Dinghuai Zhang,Yizhe Zhang,Jiatao Gu,Ruixiang Zhang,Josh Susskind,Navdeep Jaitly,Shuangfei Zhai

Diffusion models have become the de-facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as well as low quality in their generated samples. In this work, we explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability -- a natural scenario for the framework of generative flow networks (GFlowNets). To this end, we propose the Diffusion Alignment with GFlowNet (DAG) algorithm to post-train diffusion models with black-box property functions. Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method could effectively align large-scale text-to-image diffusion models with given reward information.

估計/估計量 · 優化器 · 平滑 · Minimax · Lipschitz ·

2024 年 6 月 16 日

Plugin Estimation of Smooth Optimal Transport Maps

Tudor Manole,Sivaraman Balakrishnan,Jonathan Niles-Weed,Larry Wasserman

from arxiv, To appear in the Annals of Statistics

We analyze a number of natural estimators for the optimal transport map between two distributions and show that they are minimax optimal. We adopt the plugin approach: our estimators are simply optimal couplings between measures derived from our observations, appropriately extended so that they define functions on $\mathbb{R}^d$. When the underlying map is assumed to be Lipschitz, we show that computing the optimal coupling between the empirical measures, and extending it using linear smoothers, already gives a minimax optimal estimator. When the underlying map enjoys higher regularity, we show that the optimal coupling between appropriate nonparametric density estimates yields faster rates. Our work also provides new bounds on the risk of corresponding plugin estimators for the quadratic Wasserstein distance, and we show how this problem relates to that of estimating optimal transport maps using stability arguments for smooth and strongly convex Brenier potentials. As an application of our results, we derive central limit theorems for plugin estimators of the squared Wasserstein distance, which are centered at their population counterpart when the underlying distributions have sufficiently smooth densities. In contrast to known central limit theorems for empirical estimators, this result easily lends itself to statistical inference for the quadratic Wasserstein distance.

MoDELS · 語言模型化 · Learning · 逼真度 · 講稿 ·

2024 年 4 月 11 日

Best Practices and Lessons Learned on Synthetic Data for Language Models

Ruibo Liu,Jerry Wei,Fangyu Liu,Chenglei Si,Yanzhe Zhang,Jinmeng Rao,Steven Zheng,Daiyi Peng,Diyi Yang,Denny Zhou,Andrew M. Dai

The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.