This letter proposes a regression model for nonnegative signals. The proposed regression estimates the mean of Rayleigh distributed signals by a structure which includes a set of regressors and a link function. For the proposed model, we present: (i)~parameter estimation; (ii)~large data record results; and (iii)~a detection technique. In this letter, we present closed-form expressions for the score vector and Fisher information matrix. The proposed model is submitted to extensive Monte Carlo simulations and to measured data. The Monte Carlo simulations are used to evaluate the performance of maximum likelihood estimators. Also, an application is performed comparing the detection results of the proposed model with Gaussian-, Gamma-, and Weibull-based regression models in SAR images.
We propose a data segmentation methodology for the high-dimensional linear regression problem where the regression parameters are allowed to undergo multiple changes. The proposed methodology, MOSEG, proceeds in two stages where the data is first scanned for multiple change points using a moving window-based procedure, which is followed by a location refinement stage. MOSEG enjoys computational efficiency thanks to the adoption of a coarse grid in the first stage, as well as achieving theoretical consistency in estimating both the total number and the locations of the change points without requiring independence or sub-Gaussianity. In particular, it nearly matches minimax optimal rates when Gaussianity is assumed. We also propose MOSEG.MS, a multiscale extension of MOSEG which, while comparable to MOSEG in terms of computational complexity, achieves theoretical consistency for a broader parameter space that permits multiscale change points. We demonstrate good performance of the proposed methods in comparative simulation studies and also in applications to climate science and economic datasets.
This paper introduces a dataset for training and evaluating methods for 6D pose estimation of hand-held tools in task demonstrations captured by a standard RGB camera. Despite the significant progress of 6D pose estimation methods, their performance is usually limited for heavily occluded objects, which is a common case in imitation learning where the object is typically partially occluded by the manipulating hand. Currently, there is a lack of datasets that would enable the development of robust 6D pose estimation methods for these conditions. To overcome this problem, we collect a new dataset (Imitrob) aimed at 6D pose estimation in imitation learning and other applications where a human holds a tool and performs a task. The dataset contains image sequences of three different tools and six manipulation tasks with two camera viewpoints, four human subjects, and left/right hand. Each image is accompanied by an accurate ground truth measurement of the 6D object pose, obtained by the HTC Vive motion tracking device. The use of the dataset is demonstrated by training and evaluating a recent 6D object pose estimation method (DOPE) in various setups. The dataset and code are publicly available at //imitrob.ciirc.cvut.cz/imitrobdataset.php.
Interval-censored multi-state data arise in many studies of chronic diseases, where the health status of a subject can be characterized by a finite number of disease states and the transition between any two states is only known to occur over a broad time interval. We formulate the effects of potentially time-dependent covariates on multi-state processes through semiparametric proportional intensity models with random effects. We adopt nonparametric maximum likelihood estimation (NPMLE) under general interval censoring and develop a stable expectation-maximization (EM) algorithm. We show that the resulting parameter estimators are consistent and that the finite-dimensional components are asymptotically normal with a covariance matrix that attains the semiparametric efficiency bound and can be consistently estimated through profile likelihood. In addition, we demonstrate through extensive simulation studies that the proposed numerical and inferential procedures perform well in realistic settings. Finally, we provide an application to a major epidemiologic cohort study.
We show how to take a regression function $\hat{f}$ that is appropriately ``multicalibrated'' and efficiently post-process it into an approximately error minimizing classifier satisfying a large variety of fairness constraints. The post-processing requires no labeled data, and only a modest amount of unlabeled data and computation. The computational and sample complexity requirements of computing $\hat f$ are comparable to the requirements for solving a single fair learning task optimally, but it can in fact be used to solve many different downstream fairness-constrained learning problems efficiently. Our post-processing method easily handles intersecting groups, generalizing prior work on post-processing regression functions to satisfy fairness constraints that only applied to disjoint groups. Our work extends recent work showing that multicalibrated regression functions are ``omnipredictors'' (i.e. can be post-processed to optimally solve unconstrained ERM problems) to constrained optimization.
Main subjects usually exist in the images or videos, as they are the objects that the photographer wants to highlight. Human viewers can easily identify them but algorithms often confuse them with other objects. Detecting the main subjects is an important technique to help machines understand the content of images and videos. We present a new dataset with the goal of training models to understand the layout of the objects and the context of the image then to find the main subjects among them. This is achieved in three aspects. By gathering images from movie shots created by directors with professional shooting skills, we collect the dataset with strong diversity, specifically, it contains 107\,700 images from 21\,540 movie shots. We labeled them with the bounding box labels for two classes: subject and non-subject foreground object. We present a detailed analysis of the dataset and compare the task with saliency detection and object detection. ImageSubject is the first dataset that tries to localize the subject in an image that the photographer wants to highlight. Moreover, we find the transformer-based detection model offers the best result among other popular model architectures. Finally, we discuss the potential applications and conclude with the importance of the dataset.
We investigate $L_2$ boosting in the context of kernel regression. Kernel smoothers, in general, lack appealing traits like symmetry and positive definiteness, which are critical not only for understanding theoretical aspects but also for achieving good practical performance. We consider a projection-based smoother (Huang and Chen, 2008) that is symmetric, positive definite, and shrinking. Theoretical results based on the orthonormal decomposition of the smoother reveal additional insights into the boosting algorithm. In our asymptotic framework, we may replace the full-rank smoother with a low-rank approximation. We demonstrate that the smoother's low-rank ($d(n)$) is bounded above by $O(h^{-1})$, where $h$ is the bandwidth. Our numerical findings show that, in terms of prediction accuracy, low-rank smoothers may outperform full-rank smoothers. Furthermore, we show that the boosting estimator with low-rank smoother achieves the optimal convergence rate. Finally, to improve the performance of the boosting algorithm in the presence of outliers, we propose a novel robustified boosting algorithm which can be used with any smoother discussed in the study. We investigate the numerical performance of the proposed approaches using simulations and a real-world case.
Although deep neural networks have achieved reasonable accuracy in solving face alignment, it is still a challenging task, specifically when we deal with facial images, under occlusion, or extreme head poses. Heatmap-based Regression (HBR) and Coordinate-based Regression (CBR) are among the two mainly used methods for face alignment. CBR methods require less computer memory, though their performance is less than HBR methods. In this paper, we propose an Adaptive Coordinate-based Regression (ACR) loss to improve the accuracy of CBR for face alignment. Inspired by the Active Shape Model (ASM), we generate Smooth-Face objects, a set of facial landmark points with less variations compared to the ground truth landmark points. We then introduce a method to estimate the level of difficulty in predicting each landmark point for the network by comparing the distribution of the ground truth landmark points and the corresponding Smooth-Face objects. Our proposed ACR Loss can adaptively modify its curvature and the influence of the loss based on the difficulty level of predicting each landmark point in a face. Accordingly, the ACR Loss guides the network toward challenging points than easier points, which improves the accuracy of the face alignment task. Our extensive evaluation shows the capabilities of the proposed ACR Loss in predicting facial landmark points in various facial images.
Diffusion models are a new class of generative models that mark a milestone in high-quality image generation while relying on solid probabilistic principles. This makes them promising candidate models for neural image compression. This paper outlines an end-to-end optimized framework based on a conditional diffusion model for image compression. Besides latent variables inherent to the diffusion process, the model introduces an additional per-instance "content" latent variable to condition the denoising process. Upon decoding, the diffusion process conditionally generates/reconstructs an image using ancestral sampling. Our experiments show that this approach outperforms one of the best-performing conventional image codecs (BPG) and one neural codec on two compression benchmarks, where we focus on rate-perception tradeoffs. Qualitatively, our approach shows fewer decompression artifacts than the classical approach.
Substantial efforts have been devoted more recently to presenting various methods for object detection in optical remote sensing images. However, the current survey of datasets and deep learning based methods for object detection in optical remote sensing images is not adequate. Moreover, most of the existing datasets have some shortcomings, for example, the numbers of images and object categories are small scale, and the image diversity and variations are insufficient. These limitations greatly affect the development of deep learning based object detection methods. In the paper, we provide a comprehensive review of the recent deep learning based object detection progress in both the computer vision and earth observation communities. Then, we propose a large-scale, publicly available benchmark for object DetectIon in Optical Remote sensing images, which we name as DIOR. The dataset contains 23463 images and 192472 instances, covering 20 object classes. The proposed DIOR dataset 1) is large-scale on the object categories, on the object instance number, and on the total image number; 2) has a large range of object size variations, not only in terms of spatial resolutions, but also in the aspect of inter- and intra-class size variability across objects; 3) holds big variations as the images are obtained with different imaging conditions, weathers, seasons, and image quality; and 4) has high inter-class similarity and intra-class diversity. The proposed benchmark can help the researchers to develop and validate their data-driven methods. Finally, we evaluate several state-of-the-art approaches on our DIOR dataset to establish a baseline for future research.
Deep neural network architectures have traditionally been designed and explored with human expertise in a long-lasting trial-and-error process. This process requires huge amount of time, expertise, and resources. To address this tedious problem, we propose a novel algorithm to optimally find hyperparameters of a deep network architecture automatically. We specifically focus on designing neural architectures for medical image segmentation task. Our proposed method is based on a policy gradient reinforcement learning for which the reward function is assigned a segmentation evaluation utility (i.e., dice index). We show the efficacy of the proposed method with its low computational cost in comparison with the state-of-the-art medical image segmentation networks. We also present a new architecture design, a densely connected encoder-decoder CNN, as a strong baseline architecture to apply the proposed hyperparameter search algorithm. We apply the proposed algorithm to each layer of the baseline architectures. As an application, we train the proposed system on cine cardiac MR images from Automated Cardiac Diagnosis Challenge (ACDC) MICCAI 2017. Starting from a baseline segmentation architecture, the resulting network architecture obtains the state-of-the-art results in accuracy without performing any trial-and-error based architecture design approaches or close supervision of the hyperparameters changes.