Recent years have witnessed an exponential increase in the demand for face video compression, and the success of artificial intelligence has expanded the boundaries beyond traditional hybrid video coding. Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs, leveraging the statistical priors of face videos. However, the great diversity of distortion types in the spatial and temporal domains, ranging from those of traditional hybrid coding frameworks to those of generative models, presents grand challenges in compressed face video quality assessment (VQA). In this paper, we introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos. The database contains 3,240 compressed face video clips at multiple compression levels, which are derived from 135 source videos with diversified content using six representative video codecs, including two traditional methods based on hybrid coding frameworks, two end-to-end methods, and two generative methods. In addition, we develop a FAce VideO IntegeRity (FAVOR) index for face video compression to measure perceptual quality, considering the distinct content characteristics and temporal priors of face videos. Experimental results demonstrate its superior performance on the proposed CFVQA dataset. The benchmark is publicly available at: //github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment.
Eye-typing interfaces enable a person to enter text using only their eyes. Despite the inherent advantages of touchless operation and intuitive design, such interfaces often suffer from slow typing speeds, measured in words per minute (WPM). In this study, we add word and letter prediction to the eye-typing interface and investigate users' typing performance as well as their subjective experience while using the interface. In experiment 1, we compared three typing interfaces with letter prediction (LP), letter+word prediction (L+WP), and no prediction (NoP), respectively. We found that the interface with L+WP achieved the highest average text entry speed (5.48 WPM), followed by the interface with LP (3.42 WPM) and the interface with NoP (3.39 WPM). Participants were able to quickly understand the procedural design for word prediction and perceived this function as very helpful. Compared to LP and NoP, participants needed more time to familiarize themselves with L+WP before their text entry speed reached a plateau. Experiment 2 explored training effects in L+WP interfaces. Two moving speeds were implemented: slow (6.4°/s, the same speed as in experiment 1) and fast (10°/s). The study employed a mixed experimental design, with moving speed as a between-subjects factor, to evaluate its influence on typing performance throughout 10 consecutive training sessions. The results showed that the typing speed reached 6.17 WPM for the slow group and 7.35 WPM for the fast group after practice. Overall, the two experiments show that adding letter and word prediction to eye-typing interfaces increases typing speeds. We also find that more extended training is required to achieve these high typing speeds.
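For readers who want to relate the WPM figures above to raw measurements, the following is a minimal sketch of the conventional text-entry speed calculation. The abstract does not state how WPM was computed, so the formula used here (five characters per word, with the first character excluded because timing starts at the first entry) is an assumption, and the function name `words_per_minute` is hypothetical.

```python
def words_per_minute(transcribed: str, seconds: float) -> float:
    """Text entry speed under the common five-characters-per-word convention.

    Assumption: timing starts when the first character is entered, so only
    len(transcribed) - 1 characters are counted against the elapsed time.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if len(transcribed) < 2:
        raise ValueError("need at least two characters to measure speed")
    chars_per_second = (len(transcribed) - 1) / seconds
    # 60 seconds per minute, 5 characters per "word".
    return chars_per_second * 60.0 / 5.0


# Example: an 11-character phrase entered in 12 seconds.
speed = words_per_minute("hello world", 12.0)
print(f"{speed:.2f} WPM")
```

Under this convention, a phrase of 11 characters entered in 12 seconds corresponds to 10 WPM; the speeds reported above would imply considerably longer entry times per phrase.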
The resilience of internet service is crucial for ensuring consistent communication and facilitating emergency response in a digitally dependent society. Due to empirical data constraints, there has been limited research on internet service disruptions during extreme weather events. To bridge this gap, this study utilizes observational datasets on internet performance to quantitatively assess the extent of internet disruption during two recent extreme weather events. Taking Harris County in the United States as the study region, we jointly analyzed the hazard severity and the associated internet disruptions in these two events. The results show that hazard events significantly impacted regional internet connectivity. There is a pronounced temporal synchronicity between the magnitude of disruption and hazard severity: as the severity of hazards intensifies, internet disruptions correspondingly escalate, and they eventually return to baseline levels after the event. Spatial analyses show that internet service disruptions can happen even in areas not directly impacted by hazards, demonstrating that the repercussions of hazards extend beyond the immediate area of impact. This interplay of temporal synchronization and spatial variance underscores the complex relationships between hazard severity and internet disruption. Socio-demographic analysis suggests that vulnerable communities, already grappling with myriad challenges, face exacerbated service disruptions during hazard events, emphasizing the need for prioritized disaster mitigation strategies for improving the resilience of internet services. To the best of our knowledge, this research is among the first studies to examine internet disruptions during hazardous events using a quantitative observational dataset. The insights obtained hold significant implications for city administrators, guiding them towards more resilient and equitable infrastructure planning.
The integration of advanced video codecs into the streaming pipeline is growing in response to the increasing demand for high quality video content. However, the significant computational demand for advanced codecs like Versatile Video Coding (VVC) poses challenges for service providers, including longer encoding time and higher encoding cost. This challenge becomes even more pronounced in streaming, as the same content needs to be encoded at multiple bitrates (also known as representations) to accommodate different network conditions. To accelerate the encoding process of multiple representations of the same content in VVC, we employ the encoding map of a single representation, known as the reference representation, and utilize its partitioning structure to accelerate the encoding of the remaining representations, referred to as dependent representations. To ensure compatibility with parallel processing, we designate the lowest bitrate representation as the reference representation. The experimental results indicate a substantial improvement in the encoding time for the dependent representations, achieving an average reduction of 40%, while maintaining a minimal average quality drop of only 0.43 in Video Multi-method Assessment Fusion (VMAF). This improvement is observed when utilizing Versatile Video Encoder (VVenC), an open and optimized VVC encoder implementation.
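The idea of reusing the reference representation's partitioning structure can be illustrated with a small sketch. The abstract does not specify how the encoding map constrains the dependent encodes, so the rule below is an assumption: each block's partition depth in the lowest-bitrate reference is treated as a lower bound for the dependent (higher-bitrate) representations, and shallower depths are skipped. The names `candidate_depths` and `searched_fraction`, the block keys, and the depth-only view of partitioning are all hypothetical simplifications and do not reflect the VVenC API.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (x, y) position of a coding block in the frame.
Block = Tuple[int, int]


def candidate_depths(ref_depth: int, max_depth: int) -> List[int]:
    """Depths a dependent representation still has to evaluate.

    Assumption: a higher-bitrate encode splits at least as deep as the
    lowest-bitrate reference, so depths below ref_depth are skipped.
    """
    if not 0 <= ref_depth <= max_depth:
        raise ValueError("ref_depth must lie in [0, max_depth]")
    return list(range(ref_depth, max_depth + 1))


def searched_fraction(ref_map: Dict[Block, int], max_depth: int) -> float:
    """Fraction of the exhaustive depth search left after pruning."""
    if not ref_map:
        raise ValueError("ref_map must not be empty")
    full = len(ref_map) * (max_depth + 1)
    pruned = sum(len(candidate_depths(d, max_depth)) for d in ref_map.values())
    return pruned / full


# Toy encoding map of the reference representation (values are depths).
reference_map = {(0, 0): 0, (64, 0): 2, (0, 64): 3}
print(candidate_depths(2, 3))
print(searched_fraction(reference_map, max_depth=3))
```

Because the dependent encodes only read the finished reference map, they do not depend on each other and can run in parallel once the reference is done, which is one plausible reading of why the lowest-bitrate representation, presumably the quickest to encode, is chosen as the reference.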
Pilot studies are an essential cornerstone of the design of crowdsourcing campaigns, yet they are often only mentioned in passing in the scholarly literature. A lack of details surrounding pilot studies in crowdsourcing research hinders the replication of studies and the reproduction of findings, stalling potential scientific advances. We conducted a systematic literature review on the current state of pilot study reporting at the intersection of crowdsourcing and HCI research. Our review of ten years of literature included 171 articles published in the proceedings of the Conference on Human Computation and Crowdsourcing (AAAI HCOMP) and the ACM Digital Library. We found that pilot studies in crowdsourcing research (i.e., crowd pilot studies) are often under-reported in the literature. Important details, such as the number of workers and rewards to workers, are often not reported. On the basis of our findings, we reflect on the current state of practice and formulate a set of best practice guidelines for reporting crowd pilot studies in crowdsourcing research. We also provide implications for the design of crowdsourcing platforms and make practical suggestions for supporting crowd pilot study reporting.
The advent of large language models marks a revolutionary breakthrough in artificial intelligence. With the unprecedented scale of training and model parameters, the capability of large language models has improved dramatically, leading to human-like performance in language understanding, language synthesis, and common-sense reasoning, among others. Such a major leap forward in general AI capacity will change how personalization is conducted. First, it will reform the way humans interact with personalization systems. Instead of serving as a passive medium of information filtering, large language models provide the foundation for active user engagement. On top of this new foundation, user requests can be proactively explored, and the information users need can be delivered in a natural and explainable way. Second, it will considerably expand the scope of personalization, growing it from the sole function of collecting personalized information to the compound function of providing personalized services. By leveraging large language models as a general-purpose interface, personalization systems may compile user requests into plans, call the functions of external tools to execute the plans, and integrate the tools' outputs to complete end-to-end personalization tasks. Today, large language models are still being developed, and their application in personalization is largely unexplored. Therefore, we consider it the right time to review the challenges in personalization and the opportunities to address them with LLMs. In particular, we dedicate this perspective paper to the discussion of the following aspects: the development of and challenges for existing personalization systems, the newly emerged capabilities of large language models, and the potential ways of using large language models for personalization.
Face recognition technology has advanced significantly in recent years due largely to the availability of large and increasingly complex training datasets for use in deep learning models. These datasets, however, typically comprise images scraped from news sites or social media platforms and, therefore, have limited utility in more advanced security, forensics, and military applications. These applications involve lower resolution, longer ranges, and elevated viewpoints. To meet these critical needs, we collected and curated the first and second subsets of a large multi-modal biometric dataset designed for use in the research and development (R&D) of biometric recognition technologies under extremely challenging conditions. Thus far, the dataset includes more than 350,000 still images and over 1,300 hours of video footage of approximately 1,000 subjects. To collect this data, we used Nikon DSLR cameras, a variety of commercial surveillance cameras, specialized long-range R&D cameras, and Group 1 and Group 2 UAV platforms. The goal is to support the development of algorithms capable of accurately recognizing people at ranges up to 1,000 m and from high angles of elevation. These advances will include improvements to the state of the art in face recognition and will support new research in the area of whole-body recognition using methods based on gait and anthropometry. This paper describes the methods used to collect and curate the dataset, and the dataset's characteristics at the current stage.
Graph neural networks (GNNs) have demonstrated a significant boost in prediction performance on graph data. At the same time, the predictions made by these models are often hard to interpret. In that regard, many efforts have been made to explain the prediction mechanisms of these models, with methods such as GNNExplainer, XGNN, and PGExplainer. Although such works present systematic frameworks to interpret GNNs, a holistic review of explainable GNNs is unavailable. In this survey, we present a comprehensive review of explainability techniques developed for GNNs. We focus on explainable graph neural networks and categorize them based on the explanation methods they use. We further provide the common performance metrics for GNN explanations and point out several future research directions.
With the advent of 5G commercialization, the need for more reliable, faster, and more intelligent telecommunication systems is envisaged for the next generation of beyond-5G (B5G) radio access technologies. Artificial Intelligence (AI) and Machine Learning (ML) are not only immensely popular in service-layer applications but have also been proposed as essential enablers in many aspects of B5G networks, from IoT devices and edge computing to cloud-based infrastructures. However, most of the existing surveys in B5G security focus on the performance of AI/ML models and their accuracy, and they often overlook the accountability and trustworthiness of the models' decisions. Explainable AI (XAI) methods are promising techniques that would allow system developers to identify the internal workings of AI/ML black-box models. The goal of using XAI in the security domain of B5G is to make the decision-making processes of system security transparent and comprehensible to stakeholders, making the systems accountable for automated actions. This survey emphasizes the role of XAI in every facet of the forthcoming B5G era, including B5G technologies such as RAN, zero-touch network management, and E2E slicing, as well as the use cases that general users would ultimately enjoy. Furthermore, we present the lessons learned from recent efforts and future research directions, building on ongoing projects involving XAI.
Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the past two decades, focusing especially on state-of-the-art methods with respect to three main challenges: the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and a description of the available datasets for evaluation, with a quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches on (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss some challenges and promising future research directions, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching the text labels of stickers with previous utterances. However, because stickers are so numerous, it is impractical to require text labels for all of them. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context history without any external labels. This task poses two main challenges. One is to learn the semantic meaning of stickers without corresponding text labels. The other is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain the representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies between all interaction results with a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance on all commonly used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 340K multi-turn dialog and sticker pairs.