This paper proposes a Lasso-type estimator for a high-dimensional sparse parameter identified by a single index conditional moment restriction (CMR). In addition to this parameter, the moment function can also depend on a nuisance function, such as the propensity score or the conditional choice probability, which we estimate by modern machine learning tools. We first adjust the moment function so that the gradient of the future loss function is insensitive (formally, Neyman-orthogonal) with respect to the first-stage regularization bias, preserving the single index property. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single index. The proposed Lasso estimator converges at the oracle rate, where the oracle knows the nuisance function and solves only the parametric problem. We demonstrate our method by estimating the short-term heterogeneous impact of Connecticut's Jobs First welfare reform experiment on women's welfare participation decision.
Support Vector Machine (SVM) is one of the most popular classification methods, and a de facto reference for many Machine Learning approaches. Its performance is determined by parameter selection, which is usually achieved by a time-consuming grid search cross-validation procedure. There exist, however, several unsupervised heuristics that take advantage of the characteristics of the dataset for selecting parameters instead of using class label information. Unsupervised heuristics, while an order of magnitude faster, are scarcely used under the assumption that their results are significantly worse than those of grid search. To challenge that assumption we have conducted a wide study of various heuristics for SVM parameter selection on over thirty datasets, in both supervised and semi-supervised scenarios. In most cases, the cross-validation grid search did not achieve a significant advantage over the heuristics. In particular, heuristic parameter selection may be preferable for high-dimensional and unbalanced datasets or when only a small number of examples is available. Our results also show that using a heuristic to determine the starting point of further cross-validation does not yield significantly better results than the default start.
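As a minimal illustration of label-free parameter selection, the sketch below sets the RBF kernel width from the median pairwise distance (the so-called median heuristic). This is an assumed example and not necessarily one of the heuristics evaluated in the study; the function name `median_heuristic_gamma` is hypothetical.

```python
import numpy as np

def median_heuristic_gamma(X):
    """Return gamma = 1 / (2 * d_med**2) for the kernel exp(-gamma * ||x - y||^2),
    where d_med is the median Euclidean distance over all distinct sample pairs.
    No class labels are used. (Illustrative helper, not taken from the paper.)"""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    upper = np.triu_indices(len(X), k=1)  # each distinct pair once
    median_dist = np.sqrt(np.median(sq_dists[upper]))
    return 1.0 / (2.0 * median_dist ** 2)
```

The returned value could then be passed to an RBF SVM, for example as the `gamma` argument of scikit-learn's `SVC(kernel="rbf", gamma=...)`, in place of a grid-searched value.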
We develop a dynamic trading strategy in the Linear Quadratic Regulator (LQR) framework. By incorporating a price mean-reversion signal into the optimization program, in a trading environment where market impact is linear and stage costs are quadratic, we obtain an optimal trading curve that reacts opportunistically to price changes while retaining its ability to satisfy smooth or hard completion constraints. The optimal allocation is affine in the spot price and in the number of outstanding shares at any time, and it can be fully derived iteratively. It is also aggressive in the money, meaning that it accelerates whenever the price is favorable, with an intensity that can be calibrated by the practitioner. Since the LQR may yield locally negative participation rates (i.e., round-trip trades), which are often undesirable, we show that the aforementioned optimization problem can be improved and solved under positivity constraints following a Model Predictive Control (MPC) approach. In particular, the MPC solution is smoother and more consistent with the completion constraint than putting a hard floor on the participation rate. We finally examine how the LQR can be simplified in the continuous trading context, which allows us to derive a closed formula for the trading curve under further assumptions, and we document a two-step strategy for the case where trades can also occur in an additional dark pool.
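To make the iterative derivation concrete, here is a minimal sketch of a backward Riccati recursion for a deliberately simplified liquidation problem. It assumes scalar dynamics x_{t+1} = x_t - u_t, a stage cost impact_cost * u_t^2 + inventory_cost * x_t^2, and a soft completion constraint encoded as a large terminal penalty. The mean-reversion price signal is omitted, so the policy here is linear in the outstanding shares only, not affine in the price as in the paper. All names and parameters are hypothetical.

```python
import numpy as np

def lqr_liquidation_schedule(x0, n_steps, impact_cost=1.0,
                             inventory_cost=0.0, terminal_penalty=1e9):
    """Solve the scalar finite-horizon LQR by backward recursion, then roll
    the resulting feedback policy forward from the initial inventory x0.
    Returns (trades, remaining_inventory)."""
    P = terminal_penalty  # value function at the horizon is P * x**2
    gains = np.empty(n_steps)
    for t in reversed(range(n_steps)):
        # Minimizing impact_cost*u^2 + P*(x - u)^2 over u gives u = gain * x
        gains[t] = P / (impact_cost + P)
        # Riccati update of the value-function coefficient
        P = inventory_cost + impact_cost * P / (impact_cost + P)
    x = float(x0)
    trades = np.empty(n_steps)
    for t in range(n_steps):
        trades[t] = gains[t] * x
        x -= trades[t]
    return trades, x
```

With no inventory cost and a very large terminal penalty, this sketch reproduces an evenly split schedule; a positive inventory cost front-loads the trades.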
The function-on-function linear regression model, in which the response and predictors consist of random curves, has become a general framework to investigate the relationship between the functional response and functional predictors. Existing methods to estimate the model parameters may be sensitive to outlying observations, which are common in empirical applications, and may therefore yield undesirable estimation and prediction results. A robust estimation method, based on iteratively reweighted simple partial least squares, is introduced to improve the prediction accuracy of the function-on-function linear regression model in the presence of outliers. The performance of the proposed method depends on the number of partial least squares components used to estimate the function-on-function linear regression model. Thus, the optimum number of components is determined via a data-driven error criterion. The finite-sample performance of the proposed method is investigated via several Monte Carlo experiments and an empirical data analysis. In addition, a nonparametric bootstrap method is applied to construct pointwise prediction intervals for the response function. The results are compared with some of the existing methods to illustrate the improvement potentially gained by the proposed method.
We propose three test criteria each of which is appropriate for testing, respectively, the equivalence hypotheses of symmetry, of homogeneity, and of independence, with multivariate data. All quantities have the common feature of involving weighted--type distances between characteristic functions and are convenient from the computational point of view if the weight function is properly chosen. The asymptotic behavior of the tests under the null hypothesis is investigated, and numerical studies are conducted in order to examine the performance of the criteria in finite samples.
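The computational convenience of a well-chosen weight function can be illustrated for the two-sample (homogeneity) case. The sketch below assumes a Gaussian weight, namely the N(0, sigma^2 I) density; under that assumption the integral of |phi_X(t) - phi_Y(t)|^2 against the weight reduces to averages of exp(-sigma^2 ||a - b||^2 / 2) over sample pairs, so no numerical integration is needed. The weight choice, the function names, and the restriction to homogeneity are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def _mean_gauss_kernel(A, B, sigma):
    """Average of exp(-sigma^2 * ||a - b||^2 / 2) over all pairs (a in A, b in B)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.mean(np.exp(-0.5 * sigma ** 2 * np.maximum(sq, 0.0)))

def cf_distance(X, Y, sigma=1.0):
    """Weighted L2 distance between the empirical characteristic functions of
    samples X (n x d) and Y (m x d), with a N(0, sigma^2 I) weight density."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return (_mean_gauss_kernel(X, X, sigma)
            + _mean_gauss_kernel(Y, Y, sigma)
            - 2.0 * _mean_gauss_kernel(X, Y, sigma))
```

Larger values indicate that the two samples are less likely to come from the same distribution; turning the statistic into a test would still require a null distribution, for instance by permutation.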
In this paper, we establish minimax optimal rates of convergence for prediction in a semi-functional linear model that consists of a functional component and a less smooth nonparametric component. Our results reveal that the smoother functional component can be learned with the minimax rate as if the nonparametric component were known. More specifically, a double-penalized least squares method is adopted to estimate both the functional and nonparametric components within the framework of reproducing kernel Hilbert spaces. By virtue of the representer theorem, an efficient algorithm that requires no iterations is proposed to solve the corresponding optimization problem, where the regularization parameters are selected by the generalized cross validation criterion. Numerical studies are provided to demonstrate the effectiveness of the method and to verify the theoretical analysis.
This paper studies task adaptive pre-trained model selection, an \emph{underexplored} problem of assessing pre-trained models so that models suitable for the task can be selected from the model zoo without fine-tuning. A pilot work~\cite{nguyen_leep:_2020} addressed the problem in transferring supervised pre-trained models to classification tasks, but it cannot handle emerging unsupervised pre-trained models or regression tasks. In pursuit of a practical assessment method, we propose to estimate the maximum evidence (marginalized likelihood) of labels given features extracted by pre-trained models. The maximum evidence is \emph{less prone to over-fitting} than the likelihood, and its \emph{expensive computation can be dramatically reduced} by our carefully designed algorithm. The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning: a pre-trained model with high LogME is likely to have good transfer performance. LogME is fast, accurate, and general, characterizing it as \emph{the first practical assessment method for transfer learning}. Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time. It outperforms prior methods by a large margin in their setting and is applicable to new settings that prior methods cannot deal with. It generalizes across diverse pre-trained models (supervised pre-trained and unsupervised pre-trained), downstream tasks (classification and regression), and modalities (vision and language). Code is at \url{//github.com/thuml/LogME}.
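For intuition, the sketch below maximizes the evidence of a Bayesian linear model on extracted features using the classical fixed-point updates for the prior precision alpha and the noise precision beta. It handles a single real-valued target and does not reproduce the accelerated algorithm the abstract refers to; the function name, the iteration count, and the small `eps` guard are assumptions of this example.

```python
import numpy as np

def log_evidence_per_sample(F, y, n_iter=50, eps=1e-12):
    """Approximate max over (alpha, beta) of log p(y | F, alpha, beta) / n for
    y ~ N(F w, 1/beta) with prior w ~ N(0, I/alpha). F is n x d, y has length n."""
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = F.shape
    FtF, Fty = F.T @ F, F.T @ y
    eigvals = np.clip(np.linalg.eigvalsh(FtF), 0.0, None)

    def posterior(alpha, beta):
        A = alpha * np.eye(d) + beta * FtF    # posterior precision
        m = beta * np.linalg.solve(A, Fty)    # posterior mean
        residual = np.sum((F @ m - y) ** 2)
        return A, m, residual

    alpha, beta = 1.0, 1.0
    for _ in range(n_iter):
        _, m, residual = posterior(alpha, beta)
        # Effective number of well-determined parameters
        gamma = np.sum(beta * eigvals / (alpha + beta * eigvals))
        alpha = gamma / (m @ m + eps)
        beta = (n - gamma) / (residual + eps)

    A, m, residual = posterior(alpha, beta)
    _, logdet = np.linalg.slogdet(A)
    log_ev = 0.5 * (d * np.log(alpha) + n * np.log(beta) - beta * residual
                    - alpha * (m @ m) - logdet - n * np.log(2.0 * np.pi))
    return log_ev / n
```

Features that explain the target well should receive a higher score than uninformative features of the same size, which is the ranking behavior the abstract describes.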
This paper presents a hardness-aware deep metric learning (HDML) framework. Most previous deep metric learning methods employ the hard negative mining strategy to alleviate the lack of informative samples for training. However, this mining strategy only utilizes a subset of training data, which may not be enough to characterize the global geometry of the embedding space comprehensively. To address this problem, we perform linear interpolation on embeddings to adaptively manipulate their hard levels and generate corresponding label-preserving synthetics for recycled training, so that information buried in all samples can be fully exploited and the metric is always challenged with proper difficulty. Our method achieves very competitive performance on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets.
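The interpolation idea can be sketched as follows: a negative embedding is pulled toward the anchor along the connecting segment, which shrinks its distance to the anchor and so makes it a harder negative. The fixed factor `lam` and the helper name are hypothetical; the adaptive choice of hardness level and the generation of label-preserving synthetics described above are not reproduced here.

```python
import numpy as np

def harden_negative(anchor, negative, lam):
    """Return anchor + lam * (negative - anchor) for lam in (0, 1].
    The result lies on the segment between the two embeddings, at lam times
    the original anchor-negative distance; lam = 1 leaves the negative unchanged."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    anchor = np.asarray(anchor, dtype=float)
    negative = np.asarray(negative, dtype=float)
    return anchor + lam * (negative - anchor)
```

A smaller `lam` yields a harder synthetic negative; in a training loop it could be tied to the current loss so that difficulty grows as the metric improves, which is one possible reading of "proper difficulty".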
Learning embedding functions, which map semantically related inputs to nearby locations in a feature space, supports a variety of classification and information retrieval tasks. In this work, we propose a novel, generalizable and fast method to define a family of embedding functions that can be used as an ensemble to give improved results. Each embedding function is learned by randomly bagging the training labels into small subsets. We show experimentally that these embedding ensembles create effective embedding functions. The ensemble output defines a metric space that improves state-of-the-art performance for image retrieval on CUB-200-2011, Cars-196, In-Shop Clothes Retrieval and VehicleID.
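One possible reading of the label-bagging step is sketched below: the original class labels are randomly dealt into a small number of groups, and each sample is relabeled by the group of its class. Calling the function with different seeds would give one relabeling per ensemble member. The exact grouping scheme used by the authors may differ; the function name and its parameters are hypothetical.

```python
import numpy as np

def random_label_grouping(labels, n_groups, seed=None):
    """Map each original class to one of n_groups groups at random and return
    the group id of every sample. Classes are dealt round-robin after a random
    shuffle, so group sizes differ by at most one class. Expects 1-D labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, inverse = np.unique(labels, return_inverse=True)
    group_of_class = np.empty(len(classes), dtype=int)
    # Assign shuffled class positions to groups 0, 1, ..., n_groups-1, 0, 1, ...
    group_of_class[rng.permutation(len(classes))] = np.arange(len(classes)) % n_groups
    return group_of_class[inverse]
```

Each ensemble member would then be trained as a classifier or embedding on its own relabeled copy of the data, and the member outputs concatenated at retrieval time.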
Clustering and classification critically rely on distance metrics that provide meaningful comparisons between data points. We present mixed-integer optimization approaches to find optimal distance metrics that generalize the Mahalanobis metric extensively studied in the literature. Additionally, we generalize and improve upon leading methods by removing reliance on pre-designated "target neighbors," "triplets," and "similarity pairs." Another salient feature of our method is its ability to enable active learning by recommending precise regions to sample after an optimal metric is computed to improve classification performance. This targeted acquisition can significantly reduce computational burden by ensuring training data completeness, representativeness, and economy. We demonstrate classification and computational performance of the algorithms through several simple and intuitive examples, followed by results on real image and medical datasets.
Learning similarity functions between image pairs with deep neural networks yields highly correlated activations of embeddings. In this work, we show how to improve the robustness of such embeddings by exploiting the independence within ensembles. To this end, we divide the last embedding layer of a deep network into an embedding ensemble and formulate training this ensemble as an online gradient boosting problem. Each learner receives a reweighted training sample from the previous learners. Further, we propose two loss functions which increase the diversity in our ensemble. These loss functions can be applied either for weight initialization or during training. Together, our contributions leverage large embedding sizes more effectively by significantly reducing correlation of the embedding and consequently increase retrieval accuracy of the embedding. Our method works with any differentiable loss function and does not introduce any additional parameters during test time. We evaluate our metric learning method on image retrieval tasks and show that it improves over state-of-the-art methods on the CUB 200-2011, Cars-196, Stanford Online Products, In-Shop Clothes Retrieval and VehicleID datasets.