The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can be used as a proxy for the difficulty of the test condition. This allows to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, the performance degrades the most in speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
In this work, our goal is to develop a theoretical framework that can eventually be used for analyzing the effectiveness of visual stories such as feature films to comic books. To develop this theoretical framework, we introduce a new story element called moments. Our conjecture is that any linear story such as the story of a feature film can be decomposed into a set of moments that follow each other. Moments are defined as the perception of the actions, interactions, and expressions of all characters or a single character during a given time period. We categorize the moments into two major types: story moments and discourse moments. Each type of moment can further be classified into three types, which we call universal storytelling moments. We believe these universal moments foster or deteriorate the emotional attachment of the audience to a particular character or the story. We present a methodology to catalog the occurrences of these universal moments as they are found in the story. The cataloged moments can be represented using curves or color strips. Therefore, we can visualize a character's journey through the story as either a 3D curve or a color strip. We also demonstrated that both story and discourse moments can be transformed into one lump-sum attraction parameter. The attraction parameter in time provides a function that can be plotted graphically onto a timeline illustrating changes in the emotional attachment of audience to a character or the story. By inspecting these functions the story analyst can analytically decipher the moments in the story where the attachment is being established, maintained, strengthened, or conversely where it is languishing.
Beamforming is a signal processing technique where an array of antenna elements can be steered to transmit and receive radio signals in a specific direction. The usage of millimeter wave (mmWave) frequencies and multiple input multiple output (MIMO) beamforming are considered as the key innovations of 5th Generation (5G) and beyond communication systems. The technique initially performs a beam alignment procedure, followed by data transfer in the aligned directions between the transmitter and the receiver. Traditionally, beam alignment involves periodical and exhaustive beam sweeping at both transmitter and the receiver, which is a slow process causing extra communication overhead with MIMO and massive MIMO radio units. In applications such as beam tracking, angular velocity, beam steering etc., the beam alignment procedure is optimized by estimating the beam directions using first order polynomial approximations. Recent learning-based SOTA strategies for fast mmWave beam alignment also require exploration over exhaustive beam pairs during the training procedure, causing overhead to learning strategies for higher antenna configurations. In this work, we first optimize the beam alignment cost functions e.g. the data rate, to reduce the beam sweeping overhead by applying polynomial approximations of its partial derivatives which can then be solved as a system of polynomial equations using well-known tools from algebraic geometry. At this point, a question arises: 'what is a good polynomial approximation?' In this work, we attempt to obtain a 'good polynomial approximation'. Preliminary experiments indicate that our estimated polynomial approximations attain a so-called sweet-spot in terms of the solver speed and accuracy, when evaluated on test beamforming problems.
Temporal relation extraction models have thus far been hindered by a number of issues in existing temporal relation-annotated news datasets, including: (1) low inter-annotator agreement due to the lack of specificity of their annotation guidelines in terms of what counts as a temporal relation; (2) the exclusion of long-distance relations within a given document (those spanning across different paragraphs); and (3) the exclusion of events that are not centred on verbs. This paper aims to alleviate these issues by presenting a new annotation scheme that clearly defines the criteria based on which temporal relations should be annotated. Additionally, the scheme includes events even if they are not expressed as verbs (e.g., nominalised events). Furthermore, we propose a method for annotating all temporal relations -- including long-distance ones -- which automates the process, hence reducing time and manual effort on the part of annotators. The result is a new dataset, the TIMELINE corpus, in which improved inter-annotator agreement was obtained, in comparison with previously reported temporal relation datasets. We report the results of training and evaluating baseline temporal relation extraction models on the new corpus, and compare them with results obtained on the widely used MATRES corpus.
Research in psychopathology has shown that, at an aggregate level, the patterns of emotional change over time -- emotion dynamics -- are indicators of one's mental health. One's patterns of emotion change have traditionally been determined through self-reports of emotions; however, there are known issues with accuracy, bias, and convenience. Recent approaches to determining emotion dynamics from one's everyday utterances, addresses many of these concerns, but it is not yet known whether these measures of utterance emotion dynamics (UED) correlate with mental health diagnoses. Here, for the first time, we study the relationship between tweet emotion dynamics and mental health disorders. We find that each of the UED metrics studied varied by the user's self-disclosed diagnosis. For example: average valence was significantly higher (i.e., more positive text) in the control group compared to users with ADHD, MDD, and PTSD. Valence variability was significantly lower in the control group compared to ADHD, depression, bipolar disorder, MDD, PTSD, and OCD but not PPD. Rise and recovery rates of valence also exhibited significant differences from the control. This work provides important early evidence for how linguistic cues pertaining to emotion dynamics can play a crucial role as biosocial markers for mental illnesses and aid in the understanding, diagnosis, and management of mental health disorders.
Background: Given that VR is applied in multiple domains, understanding the effects of cyber-sickness on human cognition and motor skills and the factors contributing to cybersickness gains urgency. This study aimed to explore the predictors of cybersickness and its interplay with cognitive and motor skills. Methods: 30 participants, 20-45 years old, completed the MSSQ and the CSQ-VR, and were immersed in VR. During immersion, they were exposed to a roller coaster ride. Before and after the ride, participants responded to CSQ-VR and performed VR-based cognitive and psychomotor tasks. Post-VR session, participants completed the CSQ-VR again. Results: Motion sickness susceptibility, during adulthood, was the most prominent predictor of cybersickness. Pupil dilation emerged as a significant predictor of cybersickness. Experience in videogaming was a significant predictor of both cybersickness and cognitive/motor functions. Cybersickness negatively affected visuospatial working memory and psychomotor skills. Overall cybersickness', nausea and vestibular symptoms' intensities significantly decreased after removing the VR headset. Conclusions: In order of importance, motion sickness susceptibility and gaming experience are significant predictors of cybersickness. Pupil dilation appears as a cybersickness' biomarker. Cybersickness negatively affects visuospatial working memory and psychomotor skills. Cybersickness and its effects on performance should be examined during and not after immersion.
Matroid intersection is a classical optimization problem where, given two matroids over the same ground set, the goal is to find the largest common independent set. In this paper, we show that there exists a certain "sparsifer": a subset of elements, of size $O(|S^{opt}| \cdot 1/\varepsilon)$, where $S^{opt}$ denotes the optimal solution, that is guaranteed to contain a $3/2 + \varepsilon$ approximation, while guaranteeing certain robustness properties. We call such a small subset a Density Constrained Subset (DCS), which is inspired by the Edge-Degree Constrained Subgraph (EDCS) [Bernstein and Stein, 2015], originally designed for the maximum cardinality matching problem in a graph. Our proof is constructive and hinges on a greedy decomposition of matroids, which we call the density-based decomposition. We show that this sparsifier has certain robustness properties that can be used in one-way communication and random-order streaming models.
Recommender systems are used to provide relevant suggestions on various matters. Although these systems are a classical research topic, knowledge is still limited regarding the public opinion about these systems. Public opinion is also important because the systems are known to cause various problems. To this end, this paper presents a qualitative analysis of the perceptions of ordinary citizens, civil society groups, businesses, and others on recommender systems in Europe. The dataset examined is based on the answers submitted to a consultation about the Digital Services Act (DSA) recently enacted in the European Union (EU). Therefore, not only does the paper contribute to the pressing question about regulating new technologies and online platforms, but it also reveals insights about the policy-making of the DSA. According to the qualitative results, Europeans have generally negative opinions about recommender systems and the quality of their recommendations. The systems are widely seen to violate privacy and other fundamental rights. According to many Europeans, these also cause various societal problems, including even threats to democracy. Furthermore, existing regulations in the EU are commonly seen to have failed due to a lack of proper enforcement. Numerous suggestions were made by the respondents to the consultation for improving the situation, but only a few of these ended up to the DSA.
Graph Convolutional Networks (GCNs) have been widely applied in various fields due to their significant power on processing graph-structured data. Typical GCN and its variants work under a homophily assumption (i.e., nodes with same class are prone to connect to each other), while ignoring the heterophily which exists in many real-world networks (i.e., nodes with different classes tend to form edges). Existing methods deal with heterophily by mainly aggregating higher-order neighborhoods or combing the immediate representations, which leads to noise and irrelevant information in the result. But these methods did not change the propagation mechanism which works under homophily assumption (that is a fundamental part of GCNs). This makes it difficult to distinguish the representation of nodes from different classes. To address this problem, in this paper we design a novel propagation mechanism, which can automatically change the propagation and aggregation process according to homophily or heterophily between node pairs. To adaptively learn the propagation process, we introduce two measurements of homophily degree between node pairs, which is learned based on topological and attribute information, respectively. Then we incorporate the learnable homophily degree into the graph convolution framework, which is trained in an end-to-end schema, enabling it to go beyond the assumption of homophily. More importantly, we theoretically prove that our model can constrain the similarity of representations between nodes according to their homophily degree. Experiments on seven real-world datasets demonstrate that this new approach outperforms the state-of-the-art methods under heterophily or low homophily, and gains competitive performance under homophily.
Substantial efforts have been devoted more recently to presenting various methods for object detection in optical remote sensing images. However, the current survey of datasets and deep learning based methods for object detection in optical remote sensing images is not adequate. Moreover, most of the existing datasets have some shortcomings, for example, the numbers of images and object categories are small scale, and the image diversity and variations are insufficient. These limitations greatly affect the development of deep learning based object detection methods. In the paper, we provide a comprehensive review of the recent deep learning based object detection progress in both the computer vision and earth observation communities. Then, we propose a large-scale, publicly available benchmark for object DetectIon in Optical Remote sensing images, which we name as DIOR. The dataset contains 23463 images and 192472 instances, covering 20 object classes. The proposed DIOR dataset 1) is large-scale on the object categories, on the object instance number, and on the total image number; 2) has a large range of object size variations, not only in terms of spatial resolutions, but also in the aspect of inter- and intra-class size variability across objects; 3) holds big variations as the images are obtained with different imaging conditions, weathers, seasons, and image quality; and 4) has high inter-class similarity and intra-class diversity. The proposed benchmark can help the researchers to develop and validate their data-driven methods. Finally, we evaluate several state-of-the-art approaches on our DIOR dataset to establish a baseline for future research.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc, and 2) the instance-level shift, such as object appearance, size, etc. We build our approach based on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.