This survey is meant to provide an introduction to linear models and the theories behind them. Our goal is to give a rigorous introduction to the readers with prior exposure to ordinary least squares. In machine learning, the output is usually a nonlinear function of the input. Deep learning even aims to find a nonlinear dependence with many layers which require a large amount of computation. However, most of these algorithms build upon simple linear models. We then describe linear models from different views and find the properties and theories behind the models. The linear model is the main technique in regression problems and the primary tool for it is the least squares approximation which minimizes a sum of squared errors. This is a natural choice when we're interested in finding the regression function which minimizes the corresponding expected squared error. This survey is primarily a summary of purpose, significance of important theories behind linear models, e.g., distribution theory, minimum variance estimator. We first describe ordinary least squares from three different points of view upon which we disturb the model with random noise and Gaussian noise. By Gaussian noise, the model gives rise to the likelihood so that we introduce a maximum likelihood estimator. It also develops some distribution theories via this Gaussian disturbance. The distribution theory of least squares will help us answer various questions and introduce related applications. We then prove least squares is the best unbiased linear model in the sense of mean squared error and most importantly, it actually approaches the theoretical limit. We end up with linear models with the Bayesian approach and beyond.
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
Dynamic time warping distance (DTW) is a widely used distance measure between time series $x, y \in \Sigma^n$. It was shown by Abboud, Backurs, and Williams that in the \emph{binary case}, where $|\Sigma| = 2$, DTW can be computed in time $O(n^{1.87})$. We improve this running time $O(n)$. Moreover, if $x$ and $y$ are run-length encoded, then there is an algorithm running in time $\tilde{O}(k + \ell)$, where $k$ and $\ell$ are the number of runs in $x$ and $y$, respectively. This improves on the previous best bound of $O(k\ell)$ due to Dupont and Marteau.
Deep learning has transformed the way we think of software and what it can do. But deep neural networks are fragile and their behaviors are often surprising. In many settings, we need to provide formal guarantees on the safety, security, correctness, or robustness of neural networks. This book covers foundational ideas from formal verification and their adaptation to reasoning about neural networks and deep learning.
Variable selection is commonly used to arrive at a parsimonious model when relating an outcome to high-dimensional covariates. Oftentimes a selection rule that prescribes the permissible variable combinations in the final model is desirable due to the inherent structural constraints among the candidate variables. Penalized regression methods can integrate these restrictions (which we call "selection rules") by assigning the covariates to different (possibly overlapping) groups and then applying different penalties to the groups of variables. However, no general framework has yet been proposed to formalize selection rules and their application. In this work, we develop a mathematical language for constructing selection rules in variable selection, where the resulting combination of permissible sets of selected covariates, called a "selection dictionary", is formally defined. We show that all selection rules can be represented as a combination of operations on constructs, which we refer to as "unit rules", and these can be used to identify the related selection dictionary. One may then apply some criteria to select the best model. We also present a necessary and sufficient condition for a grouping structure used with (latent) overlapping group Lasso to carry out variable selection under an arbitrary selection rule.
Unlike its intercept, a linear classifier's weight vector cannot be tuned by a simple grid search. Hence, this paper proposes weight vector tuning of a generic binary linear classifier through the parameterization of a decomposition of the discriminant by a scalar which controls the trade-off between conflicting informative and noisy terms. By varying this parameter, the original weight vector is modified in a meaningful way. Applying this method to a number of linear classifiers under a variety of data dimensionality and sample size settings reveals that the classification performance loss due to non-optimal native hyperparameters can be compensated for by weight vector tuning. This yields computational savings as the proposed tuning method reduces to tuning a scalar compared to tuning the native hyperparameter, which may involve repeated weight vector generation along with its burden of optimization, dimensionality reduction, etc., depending on the classifier. It is also found that weight vector tuning significantly improves the performance of Linear Discriminant Analysis (LDA) under high estimation noise. Proceeding from this second finding, an asymptotic study of the misclassification probability of the parameterized LDA classifier in the growth regime where the data dimensionality and sample size are comparable is conducted. Using random matrix theory, the misclassification probability is shown to converge to a quantity that is a function of the true statistics of the data. Additionally, an estimator of the misclassification probability is derived. Finally, computationally efficient tuning of the parameter using this estimator is demonstrated on real data.
In this paper, derivation of different forms of dynamic formulation of spherical parallel robots (SPRs) is investigated. These formulations include the explicit dynamic forms, linear regressor, and Slotine-Li (SL) regressor, which are required for the design and implementation of the vast majority of model-based controllers and dynamic parameters identification schemes. To this end, the implicit dynamic of SPRs is first formulated using the principle of virtual work in task-space, and then by using an extension, their explicit dynamic formulation is derived. The dynamic equation is then analytically reformulated into linear and S-L regression form with respect to the inertial parameters, and by using the Gauss-Jordan procedure, it is reduced to a unique and closed-form structure. Finally, to illustrate the effectiveness of the proposed method, two different SPRs, namely, the ARAS-Diamond, and the 3-RRR, are examined as the case studies. The obtained results are verified by using the MSC-ADAMS software, and are shared to interested audience for public access.
Mars has been a prime candidate for planetary exploration of the solar system because of the science discoveries that support chances of future habitation on this planet. Martian caves and lava tubes like terrains, which consists of uneven ground, poor visibility and confined space, makes it impossible for wheel based rovers to navigate through these areas. In order to address these limitations and advance the exploration capability in a Martian terrain, this article presents the design and control of a novel coaxial quadrotor Micro Aerial Vehicle (MAV). As it will be presented, the key contributions on the design and control architecture of the proposed Mars coaxial quadrotor, are introducing an alternative and more enhanced, from a control point of view concept, when compared in terms of autonomy to Ingenuity. Based on the presented design, the article will introduce the mathematical modelling and automatic control framework of the vehicle that will consist of a linearised model of a co-axial quadrotor and a corresponding Model Predictive Controller (MPC) for the trajectory tracking. Among the many models, proposed for the aerial flight on Mars, a reliable control architecture lacks in the related state of the art. The MPC based closed loop responses of the proposed MAV will be verified in different conditions during the flight with additional disturbances, induced to replicate a real flight scenario. In order to further validate the proposed control architecture and prove the efficacy of the suggested design, the introduced Mars coaxial quadrotor and the MPC scheme will be compared to a PID-type controller, similar to the Ingenuity helicopter's control architecture for the position and the heading.
This paper proposes an algorithm to estimate the parameters of a censored linear regression model when the regression errors are autocorrelated, and the innovations follow a Student-$t$ distribution. The Student-$t$ distribution is widely used in statistical modeling of datasets involving errors with outliers and a more substantial possibility of extreme values. The maximum likelihood (ML) estimates are obtained throughout the SAEM algorithm [1]. This algorithm is a stochastic approximation of the EM algorithm, and it is a tool for models in which the E-step does not have an analytic form. There are also provided expressions to compute the observed Fisher information matrix [2]. The proposed model is illustrated by the analysis of a real dataset that has left-censored and missing observations. We also conducted two simulations studies to examine the asymptotic properties of the estimates and the robustness of the model.
In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiation of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are dealt through convex surrogate losses and through randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. These notes do not require prior knowledge of convex analysis and all the required mathematical tools are rigorously explained. Moreover, all the proofs have been carefully chosen to be as simple and as short as possible.
This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and controls might be combined to approach these challenges.