Economists frequently estimate average treatment effects (ATEs) for transformations of the outcome that are well-defined at zero but behave like $\log(y)$ when $y$ is large (e.g., $\log(1+y)$, $\mathrm{arcsinh}(y)$). We show that these ATEs depend arbitrarily on the units of the outcome, and thus cannot be interpreted as percentage effects. Moreover, we prove that when the outcome can equal zero, there is no parameter of the form $E_P[g(Y(1),Y(0))]$ that is point-identified and unit-invariant. We discuss sensible alternative target parameters for settings with zero-valued outcomes that relax at least one of these requirements.
Past work exploring adversarial vulnerability have focused on situations where an adversary can perturb all dimensions of model input. On the other hand, a range of recent works consider the case where either (i) an adversary can perturb a limited number of input parameters or (ii) a subset of modalities in a multimodal problem. In both of these cases, adversarial examples are effectively constrained to a subspace $V$ in the ambient input space $\mathcal{X}$. Motivated by this, in this work we investigate how adversarial vulnerability depends on $\dim(V)$. In particular, we show that the adversarial success of standard PGD attacks with $\ell^p$ norm constraints behaves like a monotonically increasing function of $\epsilon (\frac{\dim(V)}{\dim \mathcal{X}})^{\frac{1}{q}}$ where $\epsilon$ is the perturbation budget and $\frac{1}{p} + \frac{1}{q} =1$, provided $p > 1$ (the case $p=1$ presents additional subtleties which we analyze in some detail). This functional form can be easily derived from a simple toy linear model, and as such our results land further credence to arguments that adversarial examples are endemic to locally linear models on high dimensional spaces.
Understanding variable dependence, particularly eliciting their statistical properties given a set of covariates, provides the mathematical foundation in practical operations management such as risk analysis and decision making given observed circumstances. This article presents an estimation method for modeling the conditional joint distribution of bivariate outcomes based on the distribution regression and factorization methods. This method is considered semiparametric in that it allows for flexible modeling of both the marginal and joint distributions conditional on covariates without imposing global parametric assumptions across the entire distribution. In contrast to existing parametric approaches, our method can accommodate discrete, continuous, or mixed variables, and provides a simple yet effective way to capture distributional dependence structures between bivariate outcomes and covariates. Various simulation results confirm that our method can perform similarly or better in finite samples compared to the alternative methods. In an application to the study of a motor third-part liability insurance portfolio, the proposed method effectively estimates risk measures such as the conditional Value-at-Risks and Expexted Sortfall. This result suggests that this semiparametric approach can serve as an alternative in insurance risk management.
Tyler's and Maronna's M-estimators, as well as their regularized variants, are popular robust methods to estimate the scatter or covariance matrix of a multivariate distribution. In this work, we study the non-asymptotic behavior of these estimators, for data sampled from a distribution that satisfies one of the following properties: 1) independent sub-Gaussian entries, up to a linear transformation; 2) log-concave distributions; 3) distributions satisfying a convex concentration property. Our main contribution is the derivation of tight non-asymptotic concentration bounds of these M-estimators around a suitably scaled version of the data sample covariance matrix. Prior to our work, non-asymptotic bounds were derived only for Elliptical and Gaussian distributions. Our proof uses a variety of tools from non asymptotic random matrix theory and high dimensional geometry. Finally, we illustrate the utility of our results on two examples of practical interest: sparse covariance and sparse precision matrix estimation.
Informative cluster size (ICS) arises in situations with clustered data where a latent relationship exists between the number of participants in a cluster and the outcome measures. Although this phenomenon has been sporadically reported in statistical literature for nearly two decades now, further exploration is needed in certain statistical methodologies to avoid potentially misleading inferences. For inference about population quantities without covariates, inverse cluster size reweightings are often employed to adjust for ICS. Further, to study the effect of covariates on disease progression described by a multistate model, the pseudo-value regression technique has gained popularity in time-to-event data analysis. We seek to answer the question: "How to apply pseudo-value regression to clustered time-to-event data when cluster size is informative?" ICS adjustment by the reweighting method can be performed in two steps; estimation of marginal functions of the multistate model and fitting the estimating equations based on pseudo-value responses, leading to four possible strategies. We present theoretical arguments and thorough simulation experiments to ascertain the correct strategy for adjusting for ICS. A further extension of our methodology is implemented to include informativeness induced by the intra-cluster group size. We demonstrate the methods in two real-world applications: (i) to determine predictors of tooth survival in a periodontal study, and (ii) to identify indicators of ambulatory recovery in spinal cord injury patients who participated in locomotor-training rehabilitation.
Motivated by applications in unmanned aerial based ground penetrating radar for detecting buried landmines, we consider the problem of imaging small point like scatterers situated in a lossy medium below a random rough surface. Both the random rough surface and the absorption in the lossy medium significantly impede the target detection and imaging process. Using principal component analysis we effectively remove the reflection from the air-soil interface. We then use a modification of the classical synthetic aperture radar imaging functional to image the targets. This imaging method introduces a user-defined parameter, $\delta$, which scales the resolution by $\sqrt{\delta}$ allowing for target localization with sub wavelength accuracy. Numerical results in two dimensions illustrate the robustness of the approach for imaging multiple targets. However, the depth at which targets are detectable is limited due to the absorption in the lossy medium.
Pre-training is prevalent in nowadays deep learning to improve the learned model's performance. However, in the literature on federated learning (FL), neural networks are mostly initialized with random weights. These attract our interest in conducting a systematic study to explore pre-training for FL. Across multiple visual recognition benchmarks, we found that pre-training can not only improve FL, but also close its accuracy gap to the counterpart centralized learning, especially in the challenging cases of non-IID clients' data. To make our findings applicable to situations where pre-trained models are not directly available, we explore pre-training with synthetic data or even with clients' data in a decentralized manner, and found that they can already improve FL notably. Interestingly, many of the techniques we explore are complementary to each other to further boost the performance, and we view this as a critical result toward scaling up deep FL for real-world applications. We conclude our paper with an attempt to understand the effect of pre-training on FL. We found that pre-training enables the learned global models under different clients' data conditions to converge to the same loss basin, and makes global aggregation in FL more stable. Nevertheless, pre-training seems to not alleviate local model drifting, a fundamental problem in FL under non-IID data.
Algorithm aversion occurs when humans are reluctant to use algorithms despite their superior performance. Studies show that giving users outcome control by providing agency over how models' predictions are incorporated into decision-making mitigates algorithm aversion. We study whether algorithm aversion is mitigated by process control, wherein users can decide what input factors and algorithms to use in model training. We conduct a replication study of outcome control, and test novel process control study conditions on Amazon Mechanical Turk (MTurk) and Prolific. Our results partly confirm prior findings on the mitigating effects of outcome control, while also forefronting reproducibility challenges. We find that process control in the form of choosing the training algorithm mitigates algorithm aversion, but changing inputs does not. Furthermore, giving users both outcome and process control does not reduce algorithm aversion more than outcome or process control alone. This study contributes to design considerations around mitigating algorithm aversion.
Correlated outcomes are common in many practical problems. In some settings, one outcome is of particular interest, and others are auxiliary. To leverage information shared by all the outcomes, traditional multi-task learning (MTL) minimizes an averaged loss function over all the outcomes, which may lead to biased estimation for the target outcome, especially when the MTL model is mis-specified. In this work, based on a decomposition of estimation bias into two types, within-subspace and against-subspace, we develop a robust transfer learning approach to estimating a high-dimensional linear decision rule for the outcome of interest with the presence of auxiliary outcomes. The proposed method includes an MTL step using all outcomes to gain efficiency, and a subsequent calibration step using only the outcome of interest to correct both types of biases. We show that the final estimator can achieve a lower estimation error than the one using only the single outcome of interest. Simulations and real data analysis are conducted to justify the superiority of the proposed method.
Classic algorithms and machine learning systems like neural networks are both abundant in everyday life. While classic computer science algorithms are suitable for precise execution of exactly defined tasks such as finding the shortest path in a large graph, neural networks allow learning from data to predict the most likely answer in more complex tasks such as image classification, which cannot be reduced to an exact algorithm. To get the best of both worlds, this thesis explores combining both concepts leading to more robust, better performing, more interpretable, more computationally efficient, and more data efficient architectures. The thesis formalizes the idea of algorithmic supervision, which allows a neural network to learn from or in conjunction with an algorithm. When integrating an algorithm into a neural architecture, it is important that the algorithm is differentiable such that the architecture can be trained end-to-end and gradients can be propagated back through the algorithm in a meaningful way. To make algorithms differentiable, this thesis proposes a general method for continuously relaxing algorithms by perturbing variables and approximating the expectation value in closed form, i.e., without sampling. In addition, this thesis proposes differentiable algorithms, such as differentiable sorting networks, differentiable renderers, and differentiable logic gate networks. Finally, this thesis presents alternative training strategies for learning with algorithms.
This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be magnificent. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is strikingly substantial if the causal effects are accounted for correctly.