Tactile sensing is an essential perception for robots to complete dexterous tasks. As a promising tactile sensing technique, vision-based tactile sensors have been developed to improve robot manipulation performance. Here we propose a new design of our vision-based tactile sensor, DelTact, with its high-resolution sensing abilities of multiple modality surface contact information. The sensor adopts an improved dense random color pattern based on previous version, using a modular hardware architecture to achieve higher accuracy of contact deformation tracking whilst at the same time maintaining a compact and robust overall design. In particular, we optimized the color pattern generation process and selected the appropriate pattern for coordinating with a dense optical flow in a real-world experimental sensory setting using varied contact objects. A dense tactile flow was obtained from the raw image in order to determine shape and force distribution on the contact surface. This sensor can be easily integrated with a parallel gripper where experimental results using qualitative and quantitative analysis demonstrated that the sensor is capable of providing tactile measurements with high temporal and spatial resolution.
Appearance-based gaze estimation aims to predict the 3D eye gaze direction from a single image. While recent deep learning-based approaches have demonstrated excellent performance, they usually assume one calibrated face in each input image and cannot output multi-person gaze in real time. However, simultaneous gaze estimation for multiple people in the wild is necessary for real-world applications. In this paper, we propose the first one-stage end-to-end gaze estimation method, GazeOnce, which is capable of simultaneously predicting gaze directions for multiple faces (>10) in an image. In addition, we design a sophisticated data generation pipeline and propose a new dataset, MPSGaze, which contains full images of multiple people with 3D gaze ground truth. Experimental results demonstrate that our unified framework not only offers a faster speed, but also provides a lower gaze estimation error compared with state-of-the-art methods. This technique can be useful in real-time applications with multiple users.
Industrial Control Systems (ICSs) rely on insecure protocols and devices to monitor and operate critical infrastructure. Prior work has demonstrated that powerful attackers with detailed system knowledge can manipulate exchanged sensor data to deteriorate performance of the process, even leading to full shutdowns of plants. Identifying those attacks requires iterating over all possible sensor values, and running detailed system simulation or analysis to identify optimal attacks. That setup allows adversaries to identify attacks that are most impactful when applied on the system for the first time, before the system operators become aware of the manipulations. In this work, we investigate if constrained attackers without detailed system knowledge and simulators can identify comparable attacks. In particular, the attacker only requires abstract knowledge on general information flow in the plant, instead of precise algorithms, operating parameters, process models, or simulators. We propose an approach that allows single-shot attacks, i.e., near-optimal attacks that are reliably shutting down a system on the first try. The approach is applied and validated on two use cases, and demonstrated to achieve comparable results to prior work, which relied on detailed system information and simulations.
Many adaptations of transformers have emerged to address the single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve the performance, yet the inner-modal attentive weights may also be diluted, which could thus undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images.
Vision-based tactile sensors have been widely studied in the robotics field for high spatial resolution and compatibility with machine learning algorithms. However, the currently employed sensor's imaging system is bulky limiting its further application. Here we present a micro lens array (MLA) based vison system to achieve a low thickness format of the sensor package with high tactile sensing performance. Multiple micromachined micro lens units cover the whole elastic touching layer and provide a stitched clear tactile image, enabling high spatial resolution with a thin thickness of 5 mm. The thermal reflow and soft lithography method ensure the uniform spherical profile and smooth surface of micro lens. Both optical and mechanical characterization demonstrated the sensor's stable imaging and excellent tactile sensing, enabling precise 3D tactile information, such as displacement mapping and force distribution with an ultra compact-thin structure.
This paper addresses the color image completion problem in accordance with low-rank quatenrion matrix optimization that is characterized by sparse regularization in a transformed domain. This research was inspired by an appreciation of the fact that different signal types, including audio formats and images, possess structures that are inherently sparse in respect of their respective bases. Since color images can be processed as a whole in the quaternion domain, we depicted the sparsity of the color image in the quaternion discrete cosine transform (QDCT) domain. In addition, the representation of a low-rank structure that is intrinsic to the color image is a vital issue in the quaternion matrix completion problem. To achieve a more superior low-rank approximation, the quatenrion-based truncated nuclear norm (QTNN) is employed in the proposed model. Moreover, this model is facilitated by a competent alternating direction method of multipliers (ADMM) based on the algorithm. Extensive experimental results demonstrate that the proposed method can yield vastly superior completion performance in comparison with the state-of-the-art low-rank matrix/quaternion matrix approximation methods tested on color image recovery.
The most common sensing modalities found in a robot perception system are vision and touch, which together can provide global and highly localized data for manipulation. However, these sensing modalities often fail to adequately capture the behavior of target objects during the critical moments as they transition out of static, controlled contact with an end-effector to dynamic and uncontrolled motion. In this work, we present a novel multimodal visuotactile sensor that provides simultaneous visuotactile and proximity depth data. The sensor integrates an RGB camera and air pressure sensor to sense touch with an infrared time-of-flight (ToF) camera to sense proximity by leveraging a selectively transmissive soft membrane to enable the dual sensing modalities. We present the mechanical design, fabrication techniques, algorithm implementations, and evaluation of the sensor's tactile and proximity modalities. The sensor is demonstrated in three open-loop robotic tasks: approaching and contacting an object, catching, and throwing. The fusion of tactile and proximity data could be used to capture key information about a target object's transition behavior for sensor-based control in dynamic manipulation.
The accurate diagnosis and molecular profiling of colorectal cancers are critical for planning the best treatment options for patients. Microsatellite instability (MSI) or mismatch repair (MMR) status plays a vital role inappropriate treatment selection, has prognostic implications and is used to investigate the possibility of patients having underlying genetic disorders (Lynch syndrome). NICE recommends that all CRC patients should be offered MMR/microsatellite instability (MSI) testing. Immunohistochemistry is commonly used to assess MMR status with subsequent molecular testing performed as required. This incurs significant extra costs and requires additional resources. The introduction of automated methods that can predict MSI or MMR status from a target image could substantially reduce the cost associated with MMR testing. Unlike previous studies on MSI prediction involving training a CNN using coarse labels (Microsatellite Instable vs Microsatellite Stable), we have utilised fine-grain MMR labels for training purposes. In this paper, we present our work on predicting MSI status in a two-stage process using a single target slide either stained with CK8/18 or H\&E. First, we trained a multi-headed convolutional neural network model where each head was responsible for predicting one of the MMR protein expressions. To this end, we performed the registration of MMR stained slides to the target slide as a pre-processing step. In the second stage, statistical features computed from the MMR prediction maps were used for the final MSI prediction. Our results demonstrated that MSI classification can be improved by incorporating fine-grained MMR labels in comparison to the previous approaches in which only coarse labels were utilised.
Nanodrone swarm is formulated by multiple light-weight and low-cost nanodrones to perform the tasks in very challenging environments. Therefore, it is essential to estimate the relative position of nanodrones in the swarm for accurate and safe platooning in inclement indoor environment. However, the vision and infrared sensors are constrained by the line-of-sight perception, and instrumenting extra motion sensors on drone's body is constrained by the nanodrone's form factor and energy-efficiency. This paper presents the design, implementation and evaluation of RFDrone, a system that can sense the relative position of nanodrone in the swarm using wireless signals, which can naturally identify each individual nanodrone. To do so, each light-weight nanodrone is attached with a RF sticker (i.e., called RFID tag), which will be localized by the external RFID reader in the inclement indoor environment. Instead of accurately localizing each RFID-tagged nanodrone, we propose to estimate the relative position of all the RFID-tagged nanodrones in the swarm based on the spatial-temporal phase profiling. We implement an end-to-end physical prototype of RFDrone. Our experimental results show that RFDrone can accurately estimate the relative position of nanodrones in the swarm with average relative localization accuracy of around 0.95 across x, y and z axis, and average accuracy of around 0.93 for nanodrone swarm's geometry estimation.
This article presents an overview of image transformation with a secret key and its applications. Image transformation with a secret key enables us not only to protect visual information on plain images but also to embed unique features controlled with a key into images. In addition, numerous encryption methods can generate encrypted images that are compressible and learnable for machine learning. Various applications of such transformation have been developed by using these properties. In this paper, we focus on a class of image transformation referred to as learnable image encryption, which is applicable to privacy-preserving machine learning and adversarially robust defense. Detailed descriptions of both transformation algorithms and performances are provided. Moreover, we discuss robustness against various attacks.
In this article, we aim to detect the double compression of MPEG-4, a universal video codec that is built into surveillance systems and shooting devices. Double compression is accompanied by various types of video manipulation, and its traces can be exploited to determine whether a video is a forgery. To this end, we present a neural network-based approach with discriminant features for capturing peculiar artifacts in the discrete cosine transform (DCT) domain caused by double MPEG-4 compression. By analyzing the intra-coding process of MPEG-4, which performs block-DCT-based quantization, we exploit multiple DCT histograms as features to focus on the statistical properties of DCT coefficients on multiresolution blocks. Furthermore, we improve detection performance using a vectorized feature of the quantization table on dense layers as auxiliary information. Compared with neural network-based approaches suitable for exploring subtle manipulations, the experimental results reveal that this work achieves high performance.