An outfit visualization method generates an image of a person wearing real garments from images of those garments. Current methods can produce images that look realistic and preserve garment identity, captured in details such as collar, cuffs, texture, hem, and sleeve length. However, no current method can both control how the garment is worn (tucked or untucked, open or closed, high or low on the waist, etc.) and generate realistic images that accurately preserve the properties of the original garment. We describe an outfit visualization method that controls drape while preserving garment identity. Our system allows instance-independent editing of garment drape, which means a user can construct an edit (e.g. tucking a shirt in a specific way) that can be applied to all shirts in a garment collection. Garment detail is preserved by using a warping procedure to place the garment on the body; a generator then supplies fine shading detail. To achieve instance-independent control, we use control points with garment category-level semantics to guide the warp. The method produces state-of-the-art quality images while allowing creative ways to style garments: tops can be tucked or untucked, jackets worn open or closed, skirts worn higher or lower on the waist, and so on. The method also allows interactive control to correct errors in individual renderings. Because the edits are instance independent, they can be applied to large pools of garments automatically and can be conditioned on garment metadata (e.g. all cropped jackets are worn closed or all bomber jackets are worn closed).
Text-to-image generative models have recently exploded in popularity and accessibility. Yet so far, use of these models in creative tasks that bridge the 2D digital world and the creation of physical artefacts has been understudied. We conduct a pilot study to investigate if and how text-to-image models can be used to assist in upstream tasks within the creative process, such as ideation and visualization, prior to a sculpture-making activity. Thirty participants selected sculpture-making materials and generated three images using the Stable Diffusion text-to-image generator, each with text prompts of their choice, with the aim of informing and then creating a physical sculpture. The majority of participants (23/30) reported that the generated images informed their sculptures, and 28/30 reported interest in using text-to-image models to help them in a creative task in the future. We identify several prompt engineering strategies and find that a participant's prompting strategy relates to their stage in the creative process. We discuss how our findings can inform support for users at different stages of the design process and for using text-to-image models for physical artefact design.
Autonomous robots are required to reason about the behaviour of dynamic agents in their environment. To this end, many approaches assume that causal models describing the interactions of agents are given a priori. However, in many application domains such models do not exist or cannot be engineered. Hence, the learning (or discovery) of high-level causal structures from low-level, temporal observations is a key problem in AI and robotics. However, the application of causal discovery methods to scenarios involving autonomous agents remains in the early stages of research. While a number of methods exist for performing causal discovery on time series data, these usually rely upon assumptions such as sufficiency and stationarity, which cannot be guaranteed in inter-agent behavioural interactions in the real world. In this paper we apply contemporary observation-based temporal causal discovery techniques to real-world and synthetic driving scenarios from multiple datasets. Our evaluation demonstrates and highlights the limitations of state-of-the-art approaches by comparing and contrasting their performance on real and synthetically generated data. Finally, based on our analysis, we discuss open issues related to causal discovery in autonomous robotics scenarios and propose future research directions for overcoming current limitations in the field.
In this paper, we consider a semiconducting device with an active zone made of a single-layer material. The associated Poisson equation for the electrostatic potential (to be solved in order to perform self-consistent computations) is characterized by a surface particle density and an out-of-plane dielectric permittivity in the region surrounding the single layer. To avoid mesh refinements in such a region, we propose an interface problem based on the natural domain decomposition suggested by the physical device. Two different interface continuity conditions are discussed. Then, we write the corresponding variational formulations, adapting the so-called three-fields formulation for domain decomposition, and we approximate them using a suitable finite element method. Finally, numerical experiments are performed to illustrate some specific features of this interface approach.
The extremely large-scale array (XL-array) is envisioned to achieve super-high spectral efficiency in future wireless networks. Different from existing works that mostly focus on near-field communications, we consider in this paper a new and practical scenario, called mixed near- and far-field communications, where both near- and far-field users exist in the network. For this scenario, we first use Fresnel functions to obtain a closed-form expression for the inter-user interference at the near-field user (NU) caused by the far-field beam. Based on this expression, we analyze the effects of the number of base station (BS) antennas, the far-field user (FU) angle, and the NU angle and distance. We show that strong interference exists when the number of BS antennas and the NU distance are relatively small, and/or when the angle difference between the NU and the FU is small. Then, we further obtain the achievable rate of the NU as well as its rate loss caused by the FU interference. Last, numerical results are provided to corroborate our analytical results.
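The interference described above can be explored numerically with a minimal sketch. It does not reproduce the closed-form Fresnel expression, which the abstract does not state; instead it evaluates the normalized correlation between a far-field beam and a near-field channel directly. The uniform linear array geometry, the angle measured from broadside, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np

def antenna_offsets(num_antennas, spacing):
    """Antenna positions along the array, centered at the origin (assumed ULA)."""
    idx = np.arange(num_antennas) - (num_antennas - 1) / 2.0
    return idx * spacing

def far_field_vector(num_antennas, spacing, wavelength, angle_rad):
    """Planar-wave steering vector toward angle_rad (measured from broadside)."""
    x = antenna_offsets(num_antennas, spacing)
    phase = 2.0 * np.pi * x * np.sin(angle_rad) / wavelength
    return np.exp(1j * phase) / np.sqrt(num_antennas)

def near_field_vector(num_antennas, spacing, wavelength, angle_rad, distance):
    """Spherical-wave steering vector using the exact per-antenna distance."""
    x = antenna_offsets(num_antennas, spacing)
    dist_n = np.sqrt(distance**2 + x**2 - 2.0 * distance * x * np.sin(angle_rad))
    phase = -2.0 * np.pi * (dist_n - distance) / wavelength
    return np.exp(1j * phase) / np.sqrt(num_antennas)

def normalized_interference(num_antennas, spacing, wavelength,
                            fu_angle_rad, nu_angle_rad, nu_distance):
    """Magnitude of the correlation between the NU channel and the FU beam (0 to 1)."""
    beam = far_field_vector(num_antennas, spacing, wavelength, fu_angle_rad)
    channel = near_field_vector(num_antennas, spacing, wavelength,
                                nu_angle_rad, nu_distance)
    return float(np.abs(np.vdot(channel, beam)))

if __name__ == "__main__":
    # Illustrative values only: 256 antennas, 1 cm wavelength, half-wavelength spacing.
    for distance in (5.0, 20.0, 100.0):
        value = normalized_interference(256, 0.005, 0.01, 0.0, 0.05, distance)
        print(f"NU distance {distance:6.1f} m -> normalized interference {value:.3f}")
```

Sweeping the number of antennas, the NU distance, and the angle difference with this function gives a quick way to compare numerical behaviour against the trends the abstract reports.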
Modern video streaming services require quality assurance of the presented audiovisual material. Quality assurance mechanisms allow streaming platforms to provide quality levels that are considered sufficient to yield user satisfaction, with the least possible amount of data transferred. A variety of measures and approaches have been developed to control video quality, e.g., by adapting it to network conditions. These include objective metrics of quality and thresholds identified by means of subjective perceptual judgments. The former group of metrics has recently gained the attention of (multi)media researchers, who call this area of study "Quality of Experience" (QoE). In this paper, we present a review of QoE's theoretical models together with a discussion of their properties and implications for the field. We argue that most of them represent the bottom-up approach to modeling. Such models focus on describing as many variables as possible but have a limited ability to investigate the causal relationships between them; the applicability of their findings in practice is therefore limited. To advance the field, we propose a structural, top-down model of video QoE that describes causal relationships among variables. We hope that our framework will facilitate designing comparable experiments in the domain.
We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.
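The idea of turning a non-linear path-to-target problem into a linear regression on signature features can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimator analyzed in the abstract: the signature is truncated at depth 2, each path is treated as piecewise linear between its (possibly irregular) samples, and the helper names `signature_depth2` and `ridge_fit` are hypothetical.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as an (n_points, dim) array.

    Segments are combined with Chen's identity; a single linear segment with
    increment delta has level-2 term 0.5 * outer(delta, delta).
    """
    path = np.asarray(path, dtype=float)
    dim = path.shape[1]
    level1 = np.zeros(dim)
    level2 = np.zeros((dim, dim))
    for delta in np.diff(path, axis=0):
        # Update level 2 first, because it needs level 1 of the path so far.
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    # Leading 1.0 is the level-0 term and acts as an intercept in the regression.
    return np.concatenate(([1.0], level1, level2.ravel()))

def ridge_fit(features, targets, penalty):
    """Ridge regression coefficients for targets ~ features @ coefficients."""
    gram = features.T @ features + penalty * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    paths = []
    for _ in range(50):
        # Irregular sampling times form the first channel of each path.
        times = np.sort(rng.uniform(0.0, 1.0, size=8))
        values = np.cumsum(rng.normal(size=8))
        paths.append(np.column_stack([times, values]))
    features = np.stack([signature_depth2(p) for p in paths])
    targets = np.array([p[-1, 1] - p[0, 1] for p in paths])  # toy target
    coefficients = ridge_fit(features, targets, penalty=1e-6)
    print("max abs error:", np.max(np.abs(features @ coefficients - targets)))
```

Including time as a channel is one simple way to let the features reflect each individual's sampling scheme; deeper truncations enlarge the feature vector quickly, which is why a penalized regression is used here.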
As knowledge graphs have the potential to bridge the gap between commonsense knowledge and reasoning over the actionable capabilities of mobile robotic platforms, incorporating knowledge graphs into robotic systems has attracted increasing attention in recent years. Graph visualization has been widely used by developers to make sense of knowledge representations. However, because they lack a link between abstract knowledge of the real-world environment and the robot's actions, traditional visualization tools are ill-suited for expert users who need to understand, test, supervise, and modify a graph-based reasoning system embodied in a robot. We therefore developed an interface that enables robotic experts to send commands to the robot in natural language. The interface then visualizes how the robot maps the command to functions for querying the commonsense knowledge database, links the results to real-world instances in a 3D map, and demonstrates the robot's execution from the robot's first-person perspective. After robotic experts used the system for 3 weeks in their daily development, we collected feedback that provides insight for designing such systems.
Despite recent advances in data-independent and deep-learning algorithms, instance segmentation of unstained live adherent cells remains a long-standing challenge in cell image processing. The inherent visual characteristics of adherent cells, such as low-contrast structures, fading edges, and irregular morphology, make the cells difficult to distinguish from one another, even for human experts, let alone computational methods. In this study, we developed a novel deep-learning algorithm called the dual-view selective instance segmentation network (DVSISN) for segmenting unstained adherent cells in differential interference contrast (DIC) images. First, we used a dual-view segmentation (DVS) method with pairs of original and rotated images to predict the bounding box and its corresponding mask for each cell instance. Second, we used a mask selection (MS) method to filter the cell instances predicted by the DVS, keeping only the masks closest to the ground truth. The developed algorithm was trained and validated on our dataset containing 520 images and 12,198 cells. Experimental results demonstrate that our algorithm achieves an AP_segm of 0.555, which outperforms a benchmark by a margin of 23.6%. This study's success opens up a new possibility of using rotated images as input for better prediction in cell images.
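The two steps above can be illustrated with a small sketch. It is one possible reading of the abstract, not the DVSISN implementation: the rotation is assumed to be a multiple of 90 degrees, `predict_masks` is a hypothetical callable that returns boolean masks for an image, and "closest to the ground truth" is interpreted here as highest intersection over union (IoU) with a reference mask.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks of the same shape."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

def dual_view_candidates(image, predict_masks, quarter_turns=1):
    """Collect masks predicted on the original image and on a rotated copy.

    Masks from the rotated view are rotated back so that every candidate
    lives in the coordinate frame of the original image.
    """
    original_view = list(predict_masks(image))
    rotated_image = np.rot90(image, k=quarter_turns)
    rotated_view = [np.rot90(mask, k=-quarter_turns)
                    for mask in predict_masks(rotated_image)]
    return original_view + rotated_view

def select_closest(candidates, reference_masks):
    """For each reference mask, keep the candidate with the highest IoU.

    Assumes at least one candidate is available.
    """
    return [max(candidates, key=lambda mask: mask_iou(mask, reference))
            for reference in reference_masks]

if __name__ == "__main__":
    # Toy predictor (assumption): a single mask obtained by thresholding.
    toy_predictor = lambda img: [img > 0.5]
    image = np.zeros((4, 6))
    image[1:3, 2:5] = 1.0
    candidates = dual_view_candidates(image, toy_predictor)
    chosen = select_closest(candidates, [image > 0.5])
    print(len(candidates), mask_iou(chosen[0], image > 0.5))
```

With a real segmentation model the two views generally disagree, and the selection step is what decides which of the competing masks is kept for each cell.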
Inspired by the human cognitive system, attention is a mechanism that imitates human cognitive awareness of specific information, amplifying critical details to focus more on the essential aspects of data. Deep learning has employed attention to boost performance in many applications. Interestingly, the same attention design can suit the processing of different data modalities and can easily be incorporated into large networks. Furthermore, multiple complementary attention mechanisms can be incorporated in one network. Hence, attention techniques have become extremely attractive. However, the literature lacks a comprehensive survey specific to attention techniques to guide researchers in employing attention in their deep models. Note that, besides being demanding in terms of training data and computational resources, transformers cover only a single category, self-attention, out of the many categories available. We fill this gap and provide an in-depth survey of 50 attention techniques, categorizing them by their most prominent features. We begin our discussion by introducing the fundamental concepts behind the success of attention mechanisms. Next, we cover essentials such as the strengths and limitations of each attention category, and we describe their fundamental building blocks, basic formulations with primary usage, and applications, specifically for computer vision. We also discuss the challenges and open questions related to attention mechanisms in general. Finally, we recommend possible future research directions for deep attention.
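As a concrete instance of the self-attention category mentioned above, scaled dot-product attention can be sketched in a few lines. This shows only the widely used formulation softmax(QK^T / sqrt(d)) V for a single head without masking or learned projections; it is an illustration, not a summary of the 50 techniques the survey covers, and the function names are chosen here for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention.

    queries: (n_queries, d), keys: (n_keys, d), values: (n_keys, d_value).
    Returns the attended outputs (n_queries, d_value) and the attention
    weights (n_queries, n_keys), whose rows sum to one.
    """
    scale = np.sqrt(keys.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(5, 8))
    # Self-attention: queries, keys, and values all come from the same tokens.
    outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
    print(outputs.shape, weights.sum(axis=-1))
```

The weights make the "amplifying critical details" idea explicit: each output row is a weighted average of the value rows, with larger weights for keys that match the query more closely.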
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to Computer Vision (CV) and demonstrated their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO, and ADE20k as compared with modern Convolutional Neural Networks (CNNs). In this paper, we provide a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and we propose a taxonomy to organize these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and targeted tasks, we also evaluate these methods on different configurations, rather than only on various benchmarks, for easy and intuitive comparison. Furthermore, we reveal a series of essential but unexploited aspects that may empower Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.