In this paper we focus on three major tasks: 1) discussing our methods, which capture a portion of the data in DCD flowsheets, kidney perfusion records, and flowsheets recorded around the time of organ recovery surgery; 2) demonstrating the result: we built a comprehensive, analyzable database from the 2022 OPTN data, a dataset that, even at this preliminary phase, is far larger than any previously available; and 3) showing that our methods extend to all past and future OPTN data. The scope of our study is all Organ Procurement and Transplantation Network (OPTN) data on US organ donors since 2008. These data were not analyzable at scale in the past because they were captured in PDF documents known as ``Attachments'', whereby every donor's information was recorded in dozens of PDF documents in heterogeneous formats. To make the data analyzable, the content of these PDFs must be converted into an analyzable format, such as a standard SQL database. In this paper we focus on the 2022 OPTN data, which consists of $\approx 400,000$ PDF documents spanning millions of pages; the entire OPTN data covers 15 years (2008--2022). This paper assumes that readers are familiar with the content of the OPTN data.
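As a concrete illustration of the conversion step, the sketch below extracts the raw text of one attachment PDF into a SQLite table, from which structured fields can later be parsed. It is a minimal sketch, not our production pipeline: the pdfplumber library choice, the single-table schema, and the donor_id argument are illustrative assumptions.
\begin{verbatim}
# Minimal sketch (not the production pipeline): extract raw text from
# one attachment PDF into SQLite for later parsing. The pdfplumber
# library, the schema, and donor_id are illustrative assumptions.
import sqlite3
import pdfplumber

def load_attachment(pdf_path, donor_id, db_path="optn.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS attachment_pages (
                        donor_id TEXT, pdf_path TEXT,
                        page_no INTEGER, text TEXT)""")
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            conn.execute("INSERT INTO attachment_pages VALUES (?, ?, ?, ?)",
                         (donor_id, pdf_path, i, page.extract_text() or ""))
    conn.commit()
    conn.close()
\end{verbatim}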
In this study, we systematically evaluate the impact of common design choices in Mixture-of-Experts (MoE) models on validation performance, uncovering distinct influences at the token and sequence levels. We also present empirical evidence that a learned router performs comparably to a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that sequence-level routing can result in weak, topic-specific expert specialization, in contrast to the syntactic specialization observed with token-level routing.
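To make the frozen-router comparison concrete, the sketch below shows a toy token-level MoE layer whose router is left at its random initialization and excluded from training. It is an illustrative PyTorch sketch under our own assumptions (top-1 routing, no load balancing), not the exact architecture studied here.
\begin{verbatim}
# Toy token-level MoE layer with a frozen, randomly initialized router
# (illustrative; top-1 routing, no load balancing).
import torch
import torch.nn as nn

class FrozenRouterMoE(nn.Module):
    def __init__(self, d_model, n_experts, d_ff):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        for p in self.router.parameters():
            p.requires_grad = False   # router stays at its random init
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])

    def forward(self, x):             # x: (tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)  # top-1 route per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out
\end{verbatim}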
In this paper, we study the exact recovery problem in the Gaussian weighted version of the stochastic block model with two symmetric communities. We establish the information-theoretic threshold in terms of the signal-to-noise ratio (SNR) of the model and prove that when SNR $<1$, no statistical estimator can exactly recover the community structure with probability bounded away from zero. Conversely, we show that when SNR $>1$, the maximum likelihood estimator itself exactly recovers the community structure with probability approaching one. We then provide two algorithms that achieve exact recovery: both the semidefinite relaxation and the spectral relaxation of the maximum likelihood estimator recover the community structure down to the threshold value of 1, establishing the absence of an information-computation gap for this model. Next, we compare community detection with the problem of recovering a planted densely weighted community within a graph, and prove that exact recovery of two symmetric communities is strictly easier than recovering a planted dense subgraph of size half the total number of nodes, by establishing that when SNR $< 3/2$, no statistical estimator can exactly recover the planted community with probability bounded away from zero. In particular, when $1 <$ SNR $< 3/2$, exact recovery of the two communities is possible both statistically and algorithmically, yet exact recovery of the planted community is impossible even statistically in the Gaussian weighted model. Finally, we show that when SNR $>2$, the maximum likelihood estimator itself exactly recovers the planted community with probability approaching one, and we prove that the semidefinite relaxation of the maximum likelihood estimator recovers the planted community down to the threshold value of 2.
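For readers who want to experiment, the sketch below solves a standard semidefinite relaxation of the maximum likelihood estimator for two balanced communities and rounds the solution with its top eigenvector. It is a hedged illustration: the cvxpy formulation (with the balance constraint $\langle J, X \rangle = 0$) and the eigenvector rounding are common textbook choices, not necessarily the exact relaxation analyzed here.
\begin{verbatim}
# Toy SDP relaxation of the MLE for two balanced communities, with
# eigenvector rounding (illustrative formulation and solver defaults).
import cvxpy as cp
import numpy as np

def sdp_recover(A):
    """A: observed n x n Gaussian weight matrix (symmetric)."""
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0,            # positive semidefinite
                   cp.diag(X) == 1,   # unit diagonal
                   cp.sum(X) == 0]    # balanced communities
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    w, v = np.linalg.eigh(X.value)    # round via top eigenvector
    return np.sign(v[:, -1])          # +/-1 community labels
\end{verbatim}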
In this paper, a fifth-order moment-based Hermite weighted essentially non-oscillatory scheme with unified stencils (termed HWENO-U) is proposed for hyperbolic conservation laws. The main idea of the HWENO-U scheme is to modify the first-order moment by an HWENO limiter only in the time discretization, using the same information as the spatial reconstructions; the limiter not only suppresses spurious oscillations well, but also ensures the stability of the fully discrete scheme. For the HWENO reconstructions, a new scale-invariant nonlinear weight is designed that incorporates only the integral average values of the solution; it keeps all properties of the original weight while being more robust for simulating challenging problems with sharp scale variations. Compared with previous HWENO schemes, the advantages of the HWENO-U scheme are: (1) a simpler implementation, involving only a single HWENO reconstruction applied throughout the entire procedure without any modification of the governing equations; (2) increased efficiency, by using the same candidate stencils, reconstructed polynomials, and linear and nonlinear weights in both the HWENO limiter and the spatial reconstructions; and (3) reduced problem-specific dependence and improved consistency, as the nonlinear weights are identical for the function $u$ and any non-zero multiple $\zeta u$. Besides, the proposed scheme retains the advantages of previous HWENO schemes, including compact reconstruction stencils and the use of artificial linear weights. Extensive benchmarks are carried out to validate the accuracy, efficiency, resolution, and robustness of the proposed scheme.
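The scale-invariance property in advantage (3) can be illustrated with a schematic WENO-type weight. In the toy sketch below, the classical weight $\alpha_k = \gamma_k/(\epsilon + \beta_k)^2$ changes under $u \to \zeta u$ because the smoothness indicators $\beta_k$ scale as $\zeta^2$, whereas tying $\epsilon$ to a squared solution scale built from the averages of $u$ restores invariance. This is a generic analogue under our own simplifications, not the HWENO-U formulas.
\begin{verbatim}
# Schematic check of scale invariance for WENO-type nonlinear weights
# (a generic analogue, not the HWENO-U formulas). Smoothness indicators
# beta scale as zeta^2 under u -> zeta*u, so classical weights drift
# with zeta; scaling eps by a squared solution scale fixes this.
import numpy as np

def weights(beta, gamma, scale, eps=1e-6):
    alpha = gamma / (eps * scale**2 + beta)**2 * scale**4
    return alpha / alpha.sum()

gamma = np.array([0.1, 0.6, 0.3])    # linear weights
beta  = np.array([2.0, 0.5, 1.3])    # smoothness indicators for u
for zeta in (1.0, 1e3):              # compare u and zeta*u
    print(weights(beta * zeta**2, gamma, scale=zeta))
# Both lines print identical weights: invariant under u -> zeta*u.
\end{verbatim}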
In this paper, we apply quasi-Monte Carlo (QMC) methods with an initial preintegration step to estimate cumulative distribution functions and probability density functions in uncertainty quantification (UQ). The distribution and density functions correspond to a quantity of interest involving the solution of an elliptic partial differential equation (PDE) with a lognormally distributed coefficient and a normally distributed source term. There is extensive previous work on using QMC to compute expected values in UQ, which has proven very successful across a range of different PDE problems; however, the use of QMC for density estimation in UQ problems is explored here for the first time. Density estimation is more challenging than computing an expected value because of discontinuities in the integral formulations of both the distribution and the density. Our strategy is to use preintegration to eliminate the discontinuity by integrating out a carefully selected random parameter, so that QMC can be used to approximate the remaining integral. First, we establish regularity results for the PDE quantity of interest that are required for smoothing by preintegration to be effective. We then show that an $N$-point lattice rule can be constructed for the integrands corresponding to the distribution and density, such that after preintegration the QMC error is of order $\mathcal{O}(N^{-1+\epsilon})$ for arbitrarily small $\epsilon>0$, the same rate achieved for computing the expected value of the quantity of interest. Numerical results are presented to confirm our theory.
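The mechanics of preintegration can be seen on a toy problem. In the sketch below the quantity of interest is linear, $Q(y) = a \cdot y$ with standard normal inputs, so the inner integral over $y_1$ has a closed form and the remaining smooth integrand is approximated by a randomly shifted rank-1 lattice rule. The generating vector is a placeholder rather than a CBC-constructed one, and the linear $Q$ stands in for the PDE quantity of interest.
\begin{verbatim}
# Toy preintegration + randomly shifted rank-1 lattice rule for a CDF
# P(Q <= t) with Q(y) = a . y and y ~ N(0, I). Integrating out y_1 in
# closed form removes the discontinuity; the generating vector is a
# placeholder, not a CBC-constructed one.
import numpy as np
from scipy.stats import norm

a = np.array([2.0, 1.0, 0.5, 0.25])  # Q(y) = a @ y, with a[0] > 0
t, N = 1.0, 2**10                    # threshold, number of points
z = np.array([1, 433, 229])          # placeholder generating vector
shift = np.random.default_rng(0).random(len(z))
pts = (np.outer(np.arange(N), z) / N + shift) % 1.0
y_rest = norm.ppf(pts)               # map (0,1)^s points to N(0, I)

# Preintegrated integrand: the inner integral over y_1 of the
# indicator 1{Q <= t} is Phi((t - a_rest . y_rest) / a_1), smooth.
inner = norm.cdf((t - y_rest @ a[1:]) / a[0])
print("QMC estimate:", inner.mean())
print("exact value :", norm.cdf(t / np.linalg.norm(a)))
\end{verbatim}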
Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed internally at Meta. CodeCompose is based on the InCoder LLM, which merges generative capabilities with bi-directionality. We have scaled CodeCompose up to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces, and we present our experience making design decisions about the model and system architecture to address the challenges of operating at this scale. To release an LLM at this scale, we first needed to ensure that it is sufficiently accurate: in a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of 1.4x to 4.1x over a model trained only on public data. We gradually rolled CodeCompose out to developers; at the time of this writing, 16K developers have used it, with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conducted a thematic analysis of the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.
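The hidden-line evaluation can be stated as a small protocol: hold out one line of a file, ask the model to fill it in given the surrounding prefix and suffix (the bi-directional setting), and count exact matches. The sketch below is our reading of that protocol, not Meta's evaluation harness; the suggest callback stands in for a model call.
\begin{verbatim}
# Sketch of a hidden-line exact-match evaluation (our reading of the
# protocol, not Meta's harness). suggest(prefix, suffix) -> str is a
# placeholder for a bi-directional model call.
def hidden_line_accuracy(files, suggest):
    hits = total = 0
    for src in files:
        lines = src.splitlines()
        for i, line in enumerate(lines):
            if not line.strip():
                continue              # skip blank lines
            prefix = "\n".join(lines[:i])
            suffix = "\n".join(lines[i + 1:])
            hits += suggest(prefix, suffix).strip() == line.strip()
            total += 1
    return hits / max(total, 1)
\end{verbatim}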
Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rights holder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design, namely watermark length, number of duplications, and interference, affect the power of the hypothesis test. Next, we study how a watermark's detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real-world use.
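The hypothesis-testing framing admits a compact sketch: score the published random-sequence watermark under the suspect model and compare it against null watermarks drawn from the same distribution, yielding a p-value that controls the false detection rate. The loglik callback and the lowercase-letter watermark alphabet below are illustrative assumptions, not the paper's exact test statistic.
\begin{verbatim}
# Sketch of watermark detection as hypothesis testing: rank the model's
# log-likelihood of the published watermark among null watermarks drawn
# from the same distribution. loglik is a placeholder for black-box
# scoring under the suspect model.
import random
import string

def detect(loglik, watermark, n_null=999, length=40):
    rng = random.Random(0)
    null_scores = [loglik("".join(rng.choices(string.ascii_lowercase,
                                              k=length)))
                   for _ in range(n_null)]
    rank = sum(s >= loglik(watermark) for s in null_scores)
    return (rank + 1) / (n_null + 1)  # p-value; small => likely trained on
\end{verbatim}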
In this paper, we propose and analyze two stream ciphers, one based on a skew tent map and the other on a modified logistic map. To improve the randomness of these systems, a single method for increasing the period length of the generated sequences is applied to both. The results show that the randomness of these systems can be considerably increased by this method, making them suitable for secure communications.
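For intuition, the sketch below turns a bare skew tent map orbit into a keystream and XORs it with the plaintext. It is a toy illustration only: the byte-extraction rule and the parameters are arbitrary, and it deliberately omits the period-extension method that the analysis shows is needed for adequate randomness.
\begin{verbatim}
# Toy skew tent map keystream XORed with the plaintext (illustration
# only; omits the period-extension method and uses arbitrary params).
def skew_tent(x, p):
    return x / p if x < p else (1.0 - x) / (1.0 - p)

def keystream(x0, p, n):
    x = x0
    for _ in range(n):
        x = skew_tent(x, p)
        yield int(x * 256) & 0xFF     # crude byte extraction

def xor_cipher(data, x0=0.37, p=0.61):
    return bytes(b ^ k for b, k in zip(data, keystream(x0, p, len(data))))

msg = b"attack at dawn"
assert xor_cipher(xor_cipher(msg)) == msg  # XOR cipher is its own inverse
\end{verbatim}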
We propose a simple empirical representation of expectations such that, for sample sizes above a certain threshold, drawn from any probability distribution with a finite fourth moment, the proposed estimator outperforms the empirical average when tested against the actual population with respect to the quadratic loss. For datasets smaller than this threshold the result still holds, but only for a class of distributions determined by their first four moments. Our approach leverages the duality between distributionally robust and risk-averse optimization.
Recently, pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks, including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the ``ideal'' document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training: given an input document, we sample a pair of word sets according to the document language model, where the set with the higher likelihood is deemed more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. By further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP achieves strong performance in both zero- and low-resource IR settings. The code and pre-trained models are available at https://github.com/Albert-Ma/PROP.
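A minimal sketch of the ROP pair construction is given below: sample two small word sets from a document, score each under a unigram document language model, and label the higher-likelihood set as the more representative one. Sampling the sets uniformly and scoring afterwards is our simplification for brevity; PROP samples according to the document language model itself.
\begin{verbatim}
# Simplified ROP pair construction: sample two word sets, score them
# under a unigram document language model, and return (more, less)
# representative. Uniform sampling + scoring is our simplification.
import math, random
from collections import Counter

def rop_pair(doc_tokens, set_size=5, seed=0):
    rng = random.Random(seed)
    counts = Counter(doc_tokens)
    vocab, total = list(counts), sum(counts.values())

    def sample_set():
        s = rng.sample(vocab, set_size)
        return s, sum(math.log(counts[w] / total) for w in s)

    (s1, l1), (s2, l2) = sample_set(), sample_set()
    return (s1, s2) if l1 >= l2 else (s2, s1)
\end{verbatim}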
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks, including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored, and there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluate our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate that it achieves state-of-the-art performance on all 12 downstream datasets as measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally, we validate our results using human evaluation and show that our model's summaries achieve human-level performance on multiple datasets.
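The gap-sentence objective can be sketched in a few lines: score each sentence by its overlap with the rest of the document, mask the top-scoring ones in the input, and concatenate them as the generation target. The word-overlap score below is a simplified stand-in for the ROUGE-based selection, assumed here for brevity.
\begin{verbatim}
# Simplified gap-sentence selection: mask the sentences with the
# highest word overlap against the rest of the document and use them
# as the target (overlap stands in for ROUGE-based scoring).
def gap_sentences(sentences, ratio=0.3):
    n_select = max(1, int(len(sentences) * ratio))
    def score(i):
        rest = set(w for j, s in enumerate(sentences)
                   if j != i for w in s.split())
        words = sentences[i].split()
        return sum(w in rest for w in words) / max(len(words), 1)
    picked = sorted(sorted(range(len(sentences)),
                           key=score, reverse=True)[:n_select])
    inputs = [s if i not in picked else "<mask>"
              for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in picked)
    return " ".join(inputs), target
\end{verbatim}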