The emergence of powerful image editing software has substantially facilitated digital image tampering, leading to many security issues. Hence, it is urgent to identify tampered images and localize tampered regions. Although much attention has been devoted to image tampering localization in recent years, it remains challenging in practical forensic applications. The reasons include the difficulty of learning discriminative representations of tampering traces and the lack of realistic tampered images for training. Since Photoshop is widely used for image tampering in practice, this paper addresses tampering localization by focusing on the detection of commonly used Photoshop editing tools and operations. To capture tampering traces effectively, a fully convolutional encoder-decoder architecture is designed, in which dense connections and dilated convolutions are adopted to achieve better localization performance. To train a model effectively when tampered images are insufficient, we design a training data generation strategy based on Photoshop scripting, which imitates human manipulations and generates large-scale training samples. Extensive experimental results show that the proposed approach outperforms state-of-the-art competitors when the model is trained with only generated images or fine-tuned with a small number of realistic tampered images. The proposed method is also robust to common post-processing operations.
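A minimal sketch of the kind of architecture the abstract describes, assuming illustrative layer counts, channel widths, and dilation rates (none of these are given in the abstract): a densely connected block with dilated convolutions inside a fully convolutional encoder-decoder that outputs a per-pixel tampering mask.

```python
# Illustrative sketch only: layer counts, widths, and dilation rates are assumptions.
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """A small densely connected block whose convolutions use dilation
    to enlarge the receptive field without losing resolution."""
    def __init__(self, in_ch, growth=16, n_layers=3, dilation=2):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=dilation, dilation=dilation),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True)))
            ch += growth          # dense connectivity: concatenate all earlier features
        self.out_ch = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class TamperLocalizer(nn.Module):
    """Fully convolutional encoder-decoder that predicts a per-pixel tampering mask."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, padding=1)
        self.enc = DilatedDenseBlock(32)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = DilatedDenseBlock(self.enc.out_ch, dilation=4)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.head = nn.Conv2d(self.bottleneck.out_ch, 1, 1)  # per-pixel tampering logit

    def forward(self, x):
        x = self.stem(x)
        x = self.enc(x)
        x = self.bottleneck(self.down(x))
        return self.head(self.up(x))

mask_logits = TamperLocalizer()(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
```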
During the COVID-19 pandemic, online meetings have become an indispensable part of our lives. This trend is likely to continue due to their convenience and broad reach. However, background noise from other family members, roommates, or office-mates not only degrades voice quality but also raises serious privacy issues. In this paper, we develop a novel system, called Spatial Aware Multi-task learning-based Separation (SAMS), to extract audio signals from the target user during teleconferencing. Our solution consists of three novel components: (i) generating fine-grained location embeddings from the user's voice and an inaudible tracking sound, which contain the user's position and rich multipath information, (ii) developing a source separation neural network that uses multi-task learning to jointly optimize source separation and localization, and (iii) significantly speeding up inference to provide a real-time guarantee. Our testbed experiments demonstrate the effectiveness of our approach.
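A hedged sketch of the multi-task idea only: the abstract does not specify the separator backbone, the location-embedding size, or the loss weighting, so the small GRU separator, the 64-dimensional embedding, and the weight `alpha` below are illustrative assumptions. The point is that one shared network is trained on a separation loss plus an auxiliary location loss.

```python
# Sketch: network shape, embedding size, and loss weight are assumptions.
import torch
import torch.nn as nn

class JointSeparator(nn.Module):
    """Conditions a mask-based separator on a location embedding and, from the same
    shared features, also predicts the talker's location (the auxiliary task)."""
    def __init__(self, n_freq=257, loc_dim=64):
        super().__init__()
        self.shared = nn.GRU(n_freq + loc_dim, 256, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(256, n_freq), nn.Sigmoid())
        self.loc_head = nn.Linear(256, 2)        # e.g. (azimuth, distance)

    def forward(self, spec_mag, loc_emb):
        # spec_mag: (B, T, F) mixture magnitude; loc_emb: (B, loc_dim)
        loc = loc_emb.unsqueeze(1).expand(-1, spec_mag.size(1), -1)
        h, _ = self.shared(torch.cat([spec_mag, loc], dim=-1))
        return self.mask_head(h) * spec_mag, self.loc_head(h.mean(dim=1))

def multitask_loss(est_spec, tgt_spec, est_loc, tgt_loc, alpha=0.1):
    # Joint objective: separation loss plus a weighted location-regression loss.
    return nn.functional.l1_loss(est_spec, tgt_spec) + \
           alpha * nn.functional.mse_loss(est_loc, tgt_loc)

model = JointSeparator()
est_spec, est_loc = model(torch.rand(2, 100, 257), torch.rand(2, 64))
loss = multitask_loss(est_spec, torch.rand(2, 100, 257), est_loc, torch.rand(2, 2))
```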
Teleoperation is a widely adopted strategy for controlling robotic manipulators that execute complex tasks requiring highly dexterous movements and critical high-level intelligence. Classical teleoperation schemes are based either on joystick control or on more intuitive interfaces that map the user's arm motions directly onto a single robot arm's motions. These approaches reach their limits when executing a given task requires reconfigurable multi-arm robotic systems. Indeed, the simultaneous teleoperation of two or more robot arms can extend the workspace of the manipulation cell, increase its total payload, or afford other advantages. In different phases of a reconfigurable multi-arm system, each robot can act as an independent arm, as one of a pair of cooperating arms, or as one of the fingers of a virtual, large robot hand. This manuscript proposes a novel telemanipulation framework that enables both the individual and combined control of any number of robotic arms. Thanks to the designed control architecture, the human operator can intuitively choose, through the user interface, the control modality and the manipulators that make the task convenient to execute. Moreover, through the tele-impedance paradigm, the system can address complex tasks that require physical interaction by letting the robots mimic the arm impedance and position references of the human operator. The proposed framework is validated with 8 subjects controlling four 7-DoF Franka Emika Panda robots in a telemanipulation task. Qualitative results of the experiments demonstrate the promising applicability of our framework.
Unlike standard cameras, event cameras interpret the world in an entirely different manner: as a collection of asynchronous events. Despite the event camera's unique data output, many event feature detection and tracking algorithms have made significant progress by taking detours through frame-based data representations. This paper questions the need to do so and proposes a novel event-data-friendly method that achieves simultaneous feature detection and tracking, called event Clustering-based Detection and Tracking (eCDT). Our method employs a novel clustering method, named k-NN Classifier-based Spatial Clustering and Applications with Noise (KCSCAN), to cluster adjacent polarity events to retrieve event trajectories. With the aid of a Head and Tail Descriptor Matching process, event clusters that reappear with a different polarity are continually tracked, elongating the feature tracks. Thanks to our clustering approach in spatio-temporal space, our method automatically solves feature detection and feature tracking simultaneously. Also, eCDT can extract feature tracks at any frequency with an adjustable time window, which does not corrupt the high temporal resolution of the original event data. Our method achieves 30% longer feature tracking ages than the state-of-the-art approach while maintaining a comparably low error.
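An illustrative stand-in only: KCSCAN itself is not specified in the abstract, so this sketch clusters events of one polarity at a time in (x, y, t) space with plain DBSCAN to convey the general idea of grouping adjacent events into candidate feature tracks. The per-polarity split, the time rescaling, and all parameter values are assumptions.

```python
# Stand-in sketch: plain DBSCAN in (x, y, scaled t), not the actual KCSCAN algorithm.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_events(events, time_scale=1e-3, eps=3.0, min_samples=10):
    """events: array with columns (x, y, t_seconds, polarity); returns a label per event.
    Time is rescaled so spatial and temporal distances are comparable."""
    labels = np.full(len(events), -1, dtype=int)
    next_id = 0
    for pol in (0, 1):                              # cluster each polarity separately
        idx = np.where(events[:, 3] == pol)[0]
        if idx.size == 0:
            continue
        pts = np.column_stack([events[idx, 0],
                               events[idx, 1],
                               events[idx, 2] / time_scale])
        lab = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
        keep = lab >= 0
        labels[idx[keep]] = lab[keep] + next_id     # offset so cluster ids stay unique
        next_id = labels.max() + 1
    return labels

# Example: 500 random events with columns (x, y, t_seconds, polarity).
ev = np.column_stack([np.random.randint(0, 64, 500),
                      np.random.randint(0, 64, 500),
                      np.sort(np.random.rand(500) * 0.05),
                      np.random.randint(0, 2, 500)]).astype(float)
print(cluster_events(ev)[:20])
```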
Manipulation of biomedical images to misrepresent experimental results has plagued the biomedical community for years. Recent interest in the problem led to the curation of a dataset and associated tasks to promote the development of biomedical forensic methods. Of these, the largest manipulation detection task focuses on detecting duplicated regions between images. Traditional computer-vision-based forensic models trained on natural images are not designed to overcome the challenges presented by biomedical images. We propose a multi-scale overlap detection model to detect duplicated image regions. Our model is structured to find duplications hierarchically, so as to reduce the number of patch operations. It achieves state-of-the-art performance overall and on multiple biomedical image categories.
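A conceptual sketch of the hierarchical idea only: the paper's model is learned, whereas this hedged example uses simple normalized cross-correlation at a coarse scale to shortlist candidate blocks before comparing fine patches, which is one way to reduce the number of patch operations. Patch sizes and the threshold are arbitrary.

```python
# Conceptual sketch: coarse-to-fine pruning with NCC, not the paper's learned model.
import numpy as np

def patch_grid(img, size):
    h, w = img.shape[:2]
    return [(y, x, img[y:y + size, x:x + size])
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

def ncc(a, b):
    a = a - a.mean(); b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8
    return float((a * b).sum() / denom)

def find_duplicates(img_a, img_b, coarse=64, fine=16, thr=0.9):
    """Return (pos_a, pos_b) pairs whose fine patches are near-duplicates, but only
    inside coarse blocks that already matched (hierarchical pruning)."""
    matches = []
    for ya, xa, pa in patch_grid(img_a, coarse):
        for yb, xb, pb in patch_grid(img_b, coarse):
            if ncc(pa, pb) < thr:                  # prune most coarse pairs cheaply
                continue
            for fya, fxa, fa in patch_grid(pa, fine):
                for fyb, fxb, fb in patch_grid(pb, fine):
                    if ncc(fa, fb) >= thr:
                        matches.append(((ya + fya, xa + fxa), (yb + fyb, xb + fxb)))
    return matches

a = np.random.rand(128, 128); b = a.copy()         # b trivially duplicates a
print(len(find_duplicates(a, b)))
```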
Object manipulation in cluttered scenes is a difficult and important problem in robotics. To manipulate objects efficiently, it is crucial to understand their surroundings, especially when multiple objects are stacked on top of one another, preventing effective grasping. Here we present DUQIM-Net, a decision-making approach for object manipulation in a setting of stacked objects. In DUQIM-Net, the hierarchical stacking relationship is assessed using Adj-Net, a model that extends existing Transformer encoder-decoder object detectors with an adjacency head. The output of this head provides a probabilistic estimate of the underlying hierarchical structure of the objects in the scene. DUQIM-Net then exploits the properties of the resulting adjacency matrix to perform decision making and assist with object-grasping tasks. Our experimental results show that Adj-Net surpasses the state of the art in object-relationship inference on the Visual Manipulation Relationship Dataset (VMRD), and that DUQIM-Net outperforms comparable approaches in bin-clearing tasks.
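A hedged sketch: the abstract only states that an adjacency head is added to a Transformer encoder-decoder detector and that its output is read as a stacking hierarchy. The pairwise MLP head below and the "pick the object with nothing on top of it" rule are illustrative assumptions, not the authors' exact design.

```python
# Illustrative adjacency head over per-query detector embeddings; design is assumed.
import torch
import torch.nn as nn

class AdjacencyHead(nn.Module):
    """Predicts P(object i rests on top of object j) from per-query embeddings."""
    def __init__(self, d_model=256):
        super().__init__()
        self.pair_mlp = nn.Sequential(nn.Linear(2 * d_model, d_model),
                                      nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, queries):                    # queries: (N, d_model)
        n, d = queries.shape
        a = queries.unsqueeze(1).expand(n, n, d)
        b = queries.unsqueeze(0).expand(n, n, d)
        return torch.sigmoid(self.pair_mlp(torch.cat([a, b], dim=-1))).squeeze(-1)

def pick_graspable(adj, thr=0.5):
    """Choose the object with the fewest objects stacked on top of it."""
    on_top_of_me = (adj > thr).float().sum(dim=0)  # column j: how many rest on j
    return int(on_top_of_me.argmin())

queries = torch.randn(5, 256)                      # embeddings of 5 detected objects
adj = AdjacencyHead()(queries)                     # (5, 5) stacking probabilities
print(pick_graspable(adj))
```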
Knowledge Distillation (KD) is a widely used technique for transferring information from cumbersome teacher models to compact student models, thereby realizing model compression and acceleration. Compared with image classification, object detection is a more complex task, and designing specific KD methods for object detection is non-trivial. In this work, we carefully study the behaviour difference between teacher and student detection models and obtain two intriguing observations. First, the teacher and student rank their detected candidate boxes quite differently, which leads to a discrepancy in their precision. Second, there is a considerable gap between the feature response differences and the prediction differences of teacher and student, indicating that equally imitating all of the teacher's feature maps is a sub-optimal choice for improving the student's accuracy. Based on these two observations, we propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors. RM takes the teacher's ranking of candidate boxes as a new form of knowledge to distill, which consistently outperforms traditional soft-label distillation. PFI correlates feature differences with prediction differences, making feature imitation directly help to improve the student's accuracy. On the MS COCO and PASCAL VOC benchmarks, extensive experiments are conducted on various detectors with different backbones to validate the effectiveness of our method. Specifically, RetinaNet with ResNet50 achieves 40.4% mAP on MS COCO, which is 3.5% higher than its baseline and also outperforms previous KD methods.
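A hedged sketch of the two distillation signals: the paper's exact formulations are not given in the abstract, so the listwise softmax matching for rank mimicking and the prediction-gap weighting for feature imitation below are simplified illustrations of the stated ideas.

```python
# Simplified illustrations of RM- and PFI-style losses; exact forms are assumptions.
import torch
import torch.nn.functional as F

def rank_mimicking_loss(student_scores, teacher_scores, tau=1.0):
    """Encourage the student to rank its candidate boxes like the teacher by matching
    softmax (listwise) distributions over per-box confidences."""
    p_t = F.softmax(teacher_scores / tau, dim=-1)
    log_p_s = F.log_softmax(student_scores / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean')

def prediction_guided_feature_imitation(student_feat, teacher_feat,
                                        student_pred, teacher_pred):
    """Imitate teacher features mainly where the two models' predictions differ,
    instead of copying every feature-map location equally."""
    with torch.no_grad():
        gap = (teacher_pred - student_pred).abs().mean(dim=1, keepdim=True)
        weight = gap / (gap.sum() + 1e-8)          # spatial weighting from the gap
    return (weight * (teacher_feat - student_feat) ** 2).sum()

# Toy shapes: 100 candidate boxes, and 8x8 feature/prediction maps with 64/20 channels.
rm = rank_mimicking_loss(torch.randn(2, 100), torch.randn(2, 100))
pfi = prediction_guided_feature_imitation(torch.randn(2, 64, 8, 8), torch.randn(2, 64, 8, 8),
                                          torch.rand(2, 20, 8, 8), torch.rand(2, 20, 8, 8))
```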
The key challenge of image manipulation detection is how to learn features that are generalizable, i.e., sensitive to manipulations in novel data, whilst specific enough to prevent false alarms on authentic images. Current research emphasizes sensitivity, with specificity overlooked. In this paper, we address both aspects by multi-view feature learning and multi-scale supervision. By exploiting the noise distribution and the boundary artifacts surrounding tampered regions, the former aims to learn semantic-agnostic and thus more generalizable features. The latter allows us to learn from authentic images, which are nontrivial for current semantic-segmentation-based methods to take into account. These ideas are realized in a new network, which we term MVSS-Net. Extensive experiments on five benchmark sets justify the viability of MVSS-Net for both pixel-level and image-level manipulation detection.
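A hedged sketch of the multi-view idea: the abstract names a noise-sensitive view and boundary supervision but not their exact designs, so a fixed Laplacian-style high-pass filter stands in for the noise view, and an edge head supplies the extra boundary supervision. This is not MVSS-Net itself, only the two-view, two-head pattern.

```python
# Two-view sketch: fixed high-pass residual branch plus RGB branch, mask and edge heads.
import torch
import torch.nn as nn

class TwoViewNet(nn.Module):
    def __init__(self):
        super().__init__()
        # RGB view: ordinary convolutions over the image content.
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        # Noise view: a fixed Laplacian-style high-pass filter exposes local residuals.
        hp = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.highpass = nn.Conv2d(3, 3, 3, padding=1, groups=3, bias=False)
        self.highpass.weight.data = hp.repeat(3, 1, 1, 1)
        self.highpass.weight.requires_grad = False
        self.noise_branch = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        # Heads: a pixel-level tampering mask plus a boundary map for extra supervision.
        self.mask_head = nn.Conv2d(32, 1, 1)
        self.edge_head = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        fused = torch.cat([self.rgb_branch(x), self.noise_branch(self.highpass(x))], dim=1)
        return self.mask_head(fused), self.edge_head(fused)

mask, edge = TwoViewNet()(torch.rand(1, 3, 128, 128))   # both (1, 1, 128, 128)
```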
Object detectors usually achieve promising results with the supervision of complete instance annotations. However, their performance is far from satisfactory with sparse instance annotations. Most existing methods for sparsely annotated object detection either re-weight the loss of hard negative samples or convert the unlabeled instances into ignored regions to reduce the interference of false negatives. We argue that these strategies are insufficient, since they can at most alleviate the negative effect caused by missing annotations. In this paper, we propose a simple but effective mechanism, called Co-mining, for sparsely annotated object detection. In Co-mining, the two branches of a Siamese network predict pseudo-label sets for each other. To enhance multi-view learning and better mine unlabeled instances, the original image and a corresponding augmented image are used as the inputs of the two branches of the Siamese network, respectively. Co-mining can serve as a general training mechanism applicable to most modern object detectors. Experiments are performed on the MS COCO dataset with three different sparsely annotated settings using two typical frameworks: the anchor-based detector RetinaNet and the anchor-free detector FCOS. Experimental results show that Co-mining with RetinaNet achieves 1.4%-2.1% improvements over different baselines and surpasses existing methods under the same sparsely annotated setting.
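A schematic sketch of one training step only: `detector_a`, `detector_b`, `augment`, `merge_targets`, and `detection_loss` are hypothetical placeholders standing in for a real detector and its loss, since the abstract does not expose an API. The point is simply the exchange of pseudo-labels between the two Siamese branches.

```python
# Schematic only: all callables passed in are hypothetical placeholders.
def co_mining_step(detector_a, detector_b, image, sparse_targets,
                   augment, merge_targets, detection_loss, score_thr=0.5):
    view_a, view_b = image, augment(image)          # two views of the same image

    # Each branch mines confident, unlabeled detections for the *other* branch.
    pseudo_for_b = [d for d in detector_a.predict(view_a) if d.score > score_thr]
    pseudo_for_a = [d for d in detector_b.predict(view_b) if d.score > score_thr]

    # Train each branch on the sparse ground truth plus the other branch's pseudo-labels.
    loss_a = detection_loss(detector_a(view_a), merge_targets(sparse_targets, pseudo_for_a))
    loss_b = detection_loss(detector_b(view_b), merge_targets(sparse_targets, pseudo_for_b))
    return loss_a + loss_b
```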
Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to transfer to aerial imagery, not only because of the huge variation in the scale, orientation, and shape of object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect $2806$ aerial images from different sensors and platforms. Each image is approximately 4000-by-4000 pixels in size and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using $15$ common object categories. The fully annotated DOTA images contain $188,282$ instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral. To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and is quite challenging.
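A small illustration of what an 8-d.o.f. quadrilateral label amounts to, assuming made-up corner coordinates: four (x, y) corner points per instance, which can always be reduced to an axis-aligned box when a conventional detector needs one.

```python
# Illustrative only: the corner values are invented for the example.
quad = [(120.0, 45.5), (180.0, 50.0), (175.5, 110.0), (118.0, 105.0)]  # 4 corners, 8 d.o.f.

xs = [p[0] for p in quad]
ys = [p[1] for p in quad]
axis_aligned_box = (min(xs), min(ys), max(xs), max(ys))   # (x_min, y_min, x_max, y_max)
print(axis_aligned_box)    # (118.0, 45.5, 180.0, 110.0)
```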
Salient object detection is a fundamental problem that has received a great deal of attention in computer vision. Recently, deep learning models have become a powerful tool for image feature extraction. In this paper, we propose a multi-scale deep neural network (MSDNN) for salient object detection. The proposed model first extracts global high-level features and context information over the whole source image with a recurrent convolutional neural network (RCNN). Then, several stacked deconvolutional layers are adopted to obtain a multi-scale feature representation and a series of saliency maps. Finally, we investigate a fusion convolution module (FCM) to build the final pixel-level saliency map. The proposed model is extensively evaluated on four salient object detection benchmark datasets. Results show that our deep model significantly outperforms 12 other state-of-the-art approaches.
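A hedged sketch of the described pipeline only: the abstract gives neither the RCNN depth nor the deconvolution or fusion settings, so the tiny network below (a plain convolutional encoder standing in for the RCNN, two stacked deconvolutions producing saliency maps at two scales, and a small fusion convolution) just illustrates the encoder -> stacked deconvolutions -> fusion-convolution structure.

```python
# Minimal pipeline sketch; all layer choices are assumptions, not the paper's MSDNN.
import torch
import torch.nn as nn

class MSDNNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                     # stands in for the RCNN encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.deconv1 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)
        self.sal1 = nn.Conv2d(32, 1, 1)                   # coarse saliency map
        self.sal2 = nn.Conv2d(16, 1, 1)                   # fine saliency map
        self.fusion = nn.Conv2d(2, 1, 3, padding=1)       # fusion convolution module (FCM)

    def forward(self, x):
        h = self.encoder(x)
        d1 = torch.relu(self.deconv1(h))
        d2 = torch.relu(self.deconv2(d1))
        s1 = nn.functional.interpolate(self.sal1(d1), size=x.shape[2:], mode='bilinear',
                                       align_corners=False)
        s2 = self.sal2(d2)                                # already at input resolution
        return torch.sigmoid(self.fusion(torch.cat([s1, s2], dim=1)))

saliency = MSDNNSketch()(torch.rand(1, 3, 224, 224))      # (1, 1, 224, 224)
```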