We propose an iterative variable selection method for the accelerated failure time model using high-dimensional survival data. Our method pioneers the use of the recently proposed structured screen-and-select framework for survival analysis. We use the marginal utility as the measure of association to inform the structured screening process. For the selection steps, we use Bayesian model selection based on non-local priors. We compare the proposed method with a few well-known methods. Assessment in terms of true positive rate and false discovery rate shows the usefulness of our method. We have implemented the method within the R package GWASinlps.
Automated variable selection is widely applied in statistical model development. Algorithms like forward, backward or stepwise selection are available in statistical software packages like R and SAS. Many researchers have criticized the use of these algorithms because the models resulting from automated selection algorithms are not based on theory and tend to be unstable. Furthermore, simulation studies have shown that they often select incorrect variables due to random effects which makes these model building strategies unreliable. In this article, a comprehensive stepwise selection algorithm tailored to logistic regression is proposed. It uses multiple criteria in variable selection instead of relying on one single measure only, like a $p$-value or Akaike's information criterion, which ensures robustness and soundness of the final outcome. The result of the selection process might not be unambiguous. It might select multiple models that could be considered as statistically equivalent. A simulation study demonstrates the superiority of the proposed variable selection method over available alternatives.
Specifications of complex, large scale, computer software and hardware systems can be radically simplified by using simple maps from input sequences to output values. These "state machine maps" provide an alternative representation of classical Moore type state machines. Composition of state machine maps corresponds to state machine products and can be used to specify essentially any type of interconnection as well as parallel and distributed computation. State machine maps can also specify abstract properties of systems and are significantly more concise and scalable than traditional representations of automata. Examples included here include specifications of producer/consumer software, network distributed consensus, real-time digital circuits, and operating system scheduling. The motivation for this work comes from experience designing and developing operating systems and real-time software where weak methods for understanding and exploring designs is a well known handicap. The methods introduced here are based on ordinary discrete mathematics, primitive recursive functions and deterministic state machines and are intended, initially, to aid the intuition and understanding of the system developers. Staying strictly within the boundaries of classical deterministic state machines anchors the methods to the algebraic structures of automata and semigroups, obviates any need for axiomatic deduction systems, "formal methods", or extensions to the model, and makes the specifications more faithful to engineering practice. While state machine maps are obvious representations of state machines, the techniques introduced here for defining and composing them are novel. To illustrate applications, the paper includes a fairly detailed specification and verification of the well-known "Paxos" distributed consensus algorithm plus a number of simpler examples including a digital PID controller.
Image classification is one of the most fundamental tasks in Computer Vision. In practical applications, the datasets are usually not as abundant as those in the laboratory and simulation, which is always called as Data Hungry. How to extract the information of data more completely and effectively is very important. Therefore, an Adaptive Data Augmentation Framework based on the tensor T-product Operator is proposed in this paper, to triple one image data to be trained and gain the result from all these three images together with only less than 0.1% increase in the number of parameters. At the same time, this framework serves the functions of column image embedding and global feature intersection, enabling the model to obtain information in not only spatial but frequency domain, and thus improving the prediction accuracy of the model. Numerical experiments have been designed for several models, and the results demonstrate the effectiveness of this adaptive framework. Numerical experiments show that our data augmentation framework can improve the performance of original neural network model by 2%, which provides competitive results to state-of-the-art methods.
Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting use of green tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
This paper is motivated by medical studies in which the same patients with multiple sclerosis are examined at several successive visits and described by fractional anisotropy tract profiles, which can be represented as functions. Since the observations for each patient are dependent random processes, they follow a repeated measures design for functional data. To compare the results for different visits, we thus consider functional repeated measures analysis of variance. For this purpose, a pointwise test statistic is constructed by adapting the classical test statistic for one-way repeated measures analysis of variance to the functional data framework. By integrating and taking the supremum of the pointwise test statistic, we create two global test statistics. Apart from verifying the general null hypothesis on the equality of mean functions corresponding to different objects, we also propose a simple method for post hoc analysis. We illustrate the finite sample properties of permutation and bootstrap testing procedures in an extensive simulation study. Finally, we analyze a motivating real data example in detail.
In classical computer vision, rectification is an integral part of multi-view depth estimation. It typically includes epipolar rectification and lens distortion correction. This process simplifies the depth estimation significantly, and thus it has been adopted in CNN approaches. However, rectification has several side effects, including a reduced field of view (FOV), resampling distortion, and sensitivity to calibration errors. The effects are particularly pronounced in case of significant distortion (e.g., wide-angle fisheye cameras). In this paper, we propose a generic scale-aware self-supervised pipeline for estimating depth, euclidean distance, and visual odometry from unrectified monocular videos. We demonstrate a similar level of precision on the unrectified KITTI dataset with barrel distortion comparable to the rectified KITTI dataset. The intuition being that the rectification step can be implicitly absorbed within the CNN model, which learns the distortion model without increasing complexity. Our approach does not suffer from a reduced field of view and avoids computational costs for rectification at inference time. To further illustrate the general applicability of the proposed framework, we apply it to wide-angle fisheye cameras with 190$^\circ$ horizontal field of view. The training framework UnRectDepthNet takes in the camera distortion model as an argument and adapts projection and unprojection functions accordingly. The proposed algorithm is evaluated further on the KITTI rectified dataset, and we achieve state-of-the-art results that improve upon our previous work FisheyeDistanceNet. Qualitative results on a distorted test scene video sequence indicate excellent performance //youtu.be/K6pbx3bU4Ss.
Including a large number of predictors in the imputation model underlying a multiple imputation (MI) procedure is one of the most challenging tasks imputers face. A variety of high-dimensional MI techniques can help, but there has been limited research on their relative performance. In this study, we investigated a wide range of extant high-dimensional MI techniques that can handle a large number of predictors in the imputation models and general missing data patterns. We assessed the relative performance of seven high-dimensional MI methods with a Monte Carlo simulation study and a resampling study based on real survey data. The performance of the methods was defined by the degree to which they facilitate unbiased and confidencevalid estimates of the parameters of complete data analysis models. We found that using lasso penalty or forward selection to select the predictors used in the MI model and using principal component analysis to reduce the dimensionality of auxiliary data produce the best results.
Matrix factor model is drawing growing attention for simultaneous two-way dimension reduction of well-structured matrix-valued observations. This paper focuses on robust statistical inference for matrix factor model in the ``diverging dimension" regime. We derive the convergence rates of the robust estimators for loadings, factors and common components under finite second moment assumption of the idiosyncratic errors. In addition, the asymptotic distributions of the estimators are also derived under mild conditions. We propose a rank minimization and an eigenvalue-ratio method to estimate the pair of factor numbers consistently. Numerical studies confirm the iterative Huber regression algorithm is a practical and reliable approach for the estimation of matrix factor model, especially under the cases with heavy-tailed idiosyncratic errors . We illustrate the practical usefulness of the proposed methods by two real datasets, one on financial portfolios and one on the macroeconomic indices of China.
Knowledge graphs represent factual knowledge about the world as relationships between concepts and are critical for intelligent decision making in enterprise applications. New knowledge is inferred from the existing facts in the knowledge graphs by encoding the concepts and relations into low-dimensional feature vector representations. The most effective representations for this task, called Knowledge Graph Embeddings (KGE), are learned through neural network architectures. Due to their impressive predictive performance, they are increasingly used in high-impact domains like healthcare, finance and education. However, are the black-box KGE models adversarially robust for use in domains with high stakes? This thesis argues that state-of-the-art KGE models are vulnerable to data poisoning attacks, that is, their predictive performance can be degraded by systematically crafted perturbations to the training knowledge graph. To support this argument, two novel data poisoning attacks are proposed that craft input deletions or additions at training time to subvert the learned model's performance at inference time. These adversarial attacks target the task of predicting the missing facts in knowledge graphs using KGE models, and the evaluation shows that the simpler attacks are competitive with or outperform the computationally expensive ones. The thesis contributions not only highlight and provide an opportunity to fix the security vulnerabilities of KGE models, but also help to understand the black-box predictive behaviour of KGE models.
Federated learning enables multiple parties to collaboratively train a machine learning model without communicating their local data. A key challenge in federated learning is to handle the heterogeneity of local data distribution across parties. Although many studies have been proposed to address this challenge, we find that they fail to achieve high performance in image datasets with deep learning models. In this paper, we propose MOON: model-contrastive federated learning. MOON is a simple and effective federated learning framework. The key idea of MOON is to utilize the similarity between model representations to correct the local training of individual parties, i.e., conducting contrastive learning in model-level. Our extensive experiments show that MOON significantly outperforms the other state-of-the-art federated learning algorithms on various image classification tasks.