We present a non-supervised approach to optimize and evaluate the synthesis of non-speech audio effects from a speech production model. We use the Pink Trombone synthesizer as a case study of a simplified production model of the vocal tract to target non-speech human audio signals --yawnings. We selected and optimized the control parameters of the synthesizer to minimize the difference between real and generated audio. We validated the most common optimization techniques reported in the literature and a specifically designed neural network. We evaluated several popular quality metrics as error functions. These include both objective quality metrics and subjective-equivalent metrics. We compared the results in terms of total error and computational demand. Results show that genetic and swarm optimizers outperform least squares algorithms at the cost of executing slower and that specific combinations of optimizers and audio representations offer significantly different results. The proposed methodology could be used in benchmarking other physical models and audio types.
The current body of research on terahertz (THz) wireless communications predominantly focuses on its application for single-user backhaul/fronthaul connectivity at sub-THz frequencies. First, we develop a generalized statistical model for signal propagation at THz frequencies encompassing physical layer impairments, including random path-loss with Gamma distribution for the molecular absorption coefficient, short-term fading characterized by the $\alpha$-$\eta$-$\kappa$-$\mu$ distribution, antenna misalignment errors, and transceiver hardware impairments. Next, we propose random access protocols for a cell-free wireless network, ensuring successful transmission for multiple users with limited delay and energy loss, exploiting the combined effect of random atmospheric absorption, non-linearity of fading, hardware impairments, and antenna misalignment errors. We consider two schemes: a fixed transmission probability (FTP) scheme where the transmission probability (TP) of each user is updated at the beginning of the data transmission and an adaptive transmission probability (ATP) scheme where the TP is updated with each successful reception of the data. We analyze the performance of both protocols using delay, energy consumption, and outage probability with scaling laws for the transmission of a data frame consisting of a single packet from users at a predefined quality of service (QoS).
Subjective assessment tests are often employed to evaluate image processing systems, notably image and video compression, super-resolution among others and have been used as an indisputable way to provide evidence of the performance of an algorithm or system. While several methodologies can be used in a subjective quality assessment test, pairwise comparison tests are nowadays attracting a lot of attention due to their accuracy and simplicity. However, the number of comparisons in a pairwise comparison test increases quadratically with the number of stimuli and thus often leads to very long tests, which is impractical for many cases. However, not all the pairs contribute equally to the final score and thus, it is possible to reduce the number of comparisons without degrading the final accuracy. To do so, pairwise sampling methods are often used to select the pairs which provide more information about the quality of each stimuli. In this paper, a reliable and much-needed evaluation procedure is proposed and used for already available methods in the literature, especially considering the case of subjective evaluation of image and video codecs. The results indicate that an appropriate selection of the pairs allows to achieve very reliable scores while requiring the comparison of a much lower number of pairs.
Multimodal recommendation aims to model user and item representations comprehensively with the involvement of multimedia content for effective recommendations. Existing research has shown that it is beneficial for recommendation performance to combine (user- and item-) ID embeddings with multimodal salient features, indicating the value of IDs. However, there is a lack of a thorough analysis of the ID embeddings in terms of feature semantics in the literature. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study regarding its semantics, which we recognize as subtle features of content and structures. Then, we propose a novel recommendation model by incorporating ID embeddings to enhance the semantic features of both content and structures. Specifically, we put forward a hierarchical attention mechanism to incorporate ID embeddings in modality fusing, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolutional network for each modality to amalgamate neighborhood and ID embeddings for improving structural representations. Finally, the content and structure representations are combined to form the ultimate item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings.
In the high performance computing (HPC) domain, performance variability is a major scalability issue for parallel computing applications with heavy synchronization and communication. In this paper, we present an experimental performance analysis of OpenMP benchmarks regarding the variation of execution time, and determine the potential factors causing performance variability. Our work offers some understanding of performance distributions and directions for future work on how to mitigate variability for OpenMP-based applications. Two representative OpenMP benchmarks from the EPCC OpenMP micro-benchmark suite and BabelStream are run across two x86 multicore platforms featuring up to 256 threads. From the obtained results, we characterize and explain the execution time variability as a function of thread-pinning, simultaneous multithreading (SMT) and core frequency variation.
We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of two examples to be compared. We report results on the Voxceleb1-E test set that suggest averaging the output embeddings improves error rate in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems which use shared layers and discuss possible reasons for this behavior.
We tackle the challenge of robotic bin packing with irregular objects, such as groceries. Given the diverse physical attributes of these objects and the complex constraints governing their placement and manipulation, employing preprogrammed strategies becomes unfeasible. Our approach is to learn directly from expert demonstrations in order to extract implicit task knowledge and strategies to ensure safe object positioning, efficient use of space, and the generation of human-like behaviors that enhance human-robot trust. We rely on human demonstrations to learn a Markov chain for predicting the object packing sequence for a given set of items and then compare it with human performance. Our experimental results show that the model outperforms human performance by generating sequence predictions that humans classify as human-like more frequently than human-generated sequences. The human demonstrations were collected using our proposed VR platform, BoxED, which is a box packaging environment for simulating real-world objects and scenarios for fast and streamlined data collection with the purpose of teaching robots. We collected data from 43 participants packing a total of 263 boxes with supermarket-like objects, yielding 4644 object manipulations. Our VR platform can be easily adapted to new scenarios and objects, and is publicly available, alongside our dataset, at //github.com/andrejfsantos4/BoxED.
The fundamental diagram serves as the foundation of traffic flow modeling for almost a century. With the increasing availability of road sensor data, deterministic parametric models have proved inadequate in describing the variability of real-world data, especially in congested area of the density-flow diagram. In this paper we estimate the stochastic density-flow relation introducing a nonparametric method called convex quantile regression. The proposed method does not depend on any prior functional form assumptions, but thanks to the concavity constraints, the estimated function satisfies the theoretical properties of the density-flow curve. The second contribution is to develop the new convex quantile regression with bags (CQRb) approach to facilitate practical implementation of CQR to the real-world data. We illustrate the CQRb estimation process using the road sensor data from Finland in years 2016-2018. Our third contribution is to demonstrate the excellent out-of-sample predictive power of the proposed CQRb method in comparison to the standard parametric deterministic approach.
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called Scientific Information Extractor (SciIE) for with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task has been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by huge margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.