亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

By converting the entire 3D space around the user into a screen, Extended Reality (XR) can ameliorate traditional displays' space limitations and facilitate the consumption of multiple pieces of information at a time. However, if designed inappropriately, these XR interfaces can overwhelm the user and complicate information access. In this work, we explored the design dimensions that can be adapted to enable suitable presentation and interaction within an XR interface. To investigate a specific use case of context-aware adaptations within our proposed design space, we concentrated on the spatial layout of the XR content and investigated non-adaptive and adaptive placement strategies. In this paper, we (1) present a comprehensive design space for XR interfaces, (2) propose Environment-referenced, an adaptive placement strategy that uses a relevant intermediary from the environment within a Hybrid Frame of Reference (FoR) for each XR object, and (3) evaluate the effectiveness of this adaptive placement strategy and a non-adaptive Body-Fixed placement strategy in four contextual scenarios varying in terms of social setting and user mobility in the environment. The performance of these placement strategies from our within-subjects user study emphasized the importance of intermediaries' relevance to the user's focus. These findings underscore the importance of context-aware interfaces, indicating that the appropriate use of an adaptive content placement strategy in a context can significantly improve task efficiency, accuracy, and usability.

相關內容

設計是對現有狀的一種重新認識和打破重組的過程,設計讓一切變得更美。

Recent AI-based video editing has enabled users to edit videos through simple text prompts, significantly simplifying the editing process. However, recent zero-shot video editing techniques primarily focus on global or single-object edits, which can lead to unintended changes in other parts of the video. When multiple objects require localized edits, existing methods face challenges, such as unfaithful editing, editing leakage, and lack of suitable evaluation datasets and metrics. To overcome these limitations, we propose a zero-shot $\textbf{M}$ulti-$\textbf{I}$nstance $\textbf{V}$ideo $\textbf{E}$diting framework, called MIVE. MIVE is a general-purpose mask-based framework, not dedicated to specific objects (e.g., people). MIVE introduces two key modules: (i) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and (ii) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing. Additionally, we present our new MIVE Dataset featuring diverse video scenarios and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user study evaluations demonstrate that MIVE significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing. The project page is available at //kaist-viclab.github.io/mive-site/

DNN-based language models perform excellently on various tasks, but even SOTA LLMs are susceptible to textual adversarial attacks. Adversarial texts play crucial roles in multiple subfields of NLP. However, current research has the following issues. (1) Most textual adversarial attack methods target rich-resourced languages. How do we generate adversarial texts for less-studied languages? (2) Most textual adversarial attack methods are prone to generating invalid or ambiguous adversarial texts. How do we construct high-quality adversarial robustness benchmarks? (3) New language models may be immune to part of previously generated adversarial texts. How do we update adversarial robustness benchmarks? To address the above issues, we introduce HITL-GAT, a system based on a general approach to human-in-the-loop generation of adversarial texts. HITL-GAT contains four stages in one pipeline: victim model construction, adversarial example generation, high-quality benchmark construction, and adversarial robustness evaluation. Additionally, we utilize HITL-GAT to make a case study on Tibetan script which can be a reference for the adversarial research of other less-studied languages.

Given recent advancements of Large Language Models (LLMs), code generation tasks attract immense attention for wide application in different domains. In an effort to evaluate and select a best model to automatically remediate system incidents discovered by Application Performance Monitoring (APM) platforms, it is crucial to verify if the generated code is syntactically and semantically correct, and whether it can be executed correctly as intended. However, current methods for evaluating the quality of code generated by LLMs heavily rely on surface form similarity metrics (e.g. BLEU, ROUGE, and exact/partial match) which have numerous limitations. In contrast, execution based evaluation focuses more on code functionality and does not constrain the code generation to any fixed solution. Nevertheless, designing and implementing such execution-based evaluation platform is not a trivial task. There are several works creating execution-based evaluation platforms for popular programming languages such as SQL, Python, Java, but limited or no attempts for scripting languages such as Bash and PowerShell. In this paper, we present the first execution-based evaluation platform in which we created three test suites (total 125 handcrafted test cases) to evaluate Bash (both single-line commands and multiple-line scripts) and PowerShell codes generated by LLMs. We benchmark seven closed and open-source LLMs using our platform with different techniques (zero-shot vs. few-shot learning).

Sequential recommendation (SR) systems predict user preferences by analyzing time-ordered interaction sequences. A common challenge for SR is data sparsity, as users typically interact with only a limited number of items. While contrastive learning has been employed in previous approaches to address the challenges, these methods often adopt binary labels, missing finer patterns and overlooking detailed information in subsequent behaviors of users. Additionally, they rely on random sampling to select negatives in contrastive learning, which may not yield sufficiently hard negatives during later training stages. In this paper, we propose Future data utilization with Enduring Negatives for contrastive learning in sequential Recommendation (FENRec). Our approach aims to leverage future data with time-dependent soft labels and generate enduring hard negatives from existing data, thereby enhancing the effectiveness in tackling data sparsity. Experiment results demonstrate our state-of-the-art performance across four benchmark datasets, with an average improvement of 6.16\% across all metrics.

Website fingerprint (WF) attacks, which covertly monitor user communications to identify the web pages they visit, pose a serious threat to user privacy. Existing WF defenses attempt to reduce the attacker's accuracy by disrupting unique traffic patterns; however, they often suffer from the trade-off between overhead and effectiveness, resulting in less usefulness in practice. To overcome this limitation, we introduce Controllable Website Fingerprint Defense (CWFD), a novel defense perspective based on backdoor learning. CWFD exploits backdoor vulnerabilities in neural networks to directly control the attacker's model by designing trigger patterns based on network traffic. Specifically, CWFD injects only incoming packets on the server side into the target web page's traffic, keeping overhead low while effectively poisoning the attacker's model during training. During inference, the defender can influence the attacker's model through a 'red pill, blue pill' choice: traces with the trigger (red pill) lead to misclassification as the target web page, while normal traces (blue pill) are classified correctly, achieving directed control over the defense outcome. We use the Fast Levenshtein-like distance as the optimization objective to compute trigger patterns that can be effectively associated with our target page. Experiments show that CWFD significantly reduces RF's accuracy from 99% to 6% with 74% data overhead. In comparison, FRONT reduces accuracy to only 97% at similar overhead, while Palette achieves 32% accuracy with 48% more overhead. We further validate the practicality of our method in a real Tor network environment.

Text-driven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure the consistency between the movement trajectory and the textual description. (ii) how to improve the subjective quality of generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise control and high-quality video generation based on textual-described motion for different objects. Concretely, we enable our TIV-Diffuion model to perceive the textual-described objects and their motion trajectory by incorporating the fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate the problems of object disappearance and misaligned objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning textual features with each object individually. Based on the above innovations, our TIV-Diffusion achieves state-of-the-art high-quality video generation compared with existing TI2V methods.

Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing $7$B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.

Visual Place Recognition (VPR) aims to robustly identify locations by leveraging image retrieval based on descriptors encoded from environmental images. However, drastic appearance changes of images captured from different viewpoints at the same location pose incoherent supervision signals for descriptor learning, which severely hinder the performance of VPR. Previous work proposes classifying images based on manually defined rules or ground truth labels for viewpoints, followed by descriptor training based on the classification results. However, not all datasets have ground truth labels of viewpoints and manually defined rules may be suboptimal, leading to degraded descriptor performance.To address these challenges, we introduce the mutual learning of viewpoint self-classification and VPR. Starting from coarse classification based on geographical coordinates, we progress to finer classification of viewpoints using simple clustering techniques. The dataset is partitioned in an unsupervised manner while simultaneously training a descriptor extractor for place recognition. Experimental results show that this approach almost perfectly partitions the dataset based on viewpoints, thus achieving mutually reinforcing effects. Our method even excels state-of-the-art (SOTA) methods that partition datasets using ground truth labels.

Tool-calling has changed Large Language Model (LLM) applications by integrating external tools, significantly enhancing their functionality across diverse tasks. However, this integration also introduces new security vulnerabilities, particularly in the tool scheduling mechanisms of LLM, which have not been extensively studied. To fill this gap, we present ToolCommander, a novel framework designed to exploit vulnerabilities in LLM tool-calling systems through adversarial tool injection. Our framework employs a well-designed two-stage attack strategy. Firstly, it injects malicious tools to collect user queries, then dynamically updates the injected tools based on the stolen information to enhance subsequent attacks. These stages enable ToolCommander to execute privacy theft, launch denial-of-service attacks, and even manipulate business competition by triggering unscheduled tool-calling. Notably, the ASR reaches 91.67% for privacy theft and hits 100% for denial-of-service and unscheduled tool calling in certain cases. Our work demonstrates that these vulnerabilities can lead to severe consequences beyond simple misuse of tool-calling systems, underscoring the urgent need for robust defensive strategies to secure LLM Tool-calling systems.

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.

北京阿比特科技有限公司