Visual content and accompanied audio signals naturally formulate a joint representation to improve audio-visual (AV) related applications. While studies develop various AV representation learning frameworks, the importance of AV data alignment is usually undermined for achieving high-quality representation. We observe that an audio signal may contain background noise interference. Also, non-synchronization may appear between audio and video streams. These non-strict data alignment limits representation quality and downgrade application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi-modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use). Then, AVAgent reasons whether this paired data is aligned well and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how modified audio signals match the visual content and provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically to become an agentic workflow where audio signals are gradually aligned to visual content. To this end, existing methods can directly leverage the aligned AV data via our agentic workflow to improve AV joint representations. The experimental results comprehensively demonstrate the state-of-the-art performance of the proposed approach against previous baselines in diverse downstream tasks.
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1~2s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that there are generally various degrees of temporal redundancy in dynamic scenes, which consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during the training or rendering regardless of the video length. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality. Our project page is available at: //zju3dv.github.io/longvolcap.
Motor imagery (MI) based brain-computer interfaces (BCIs) enable the direct control of external devices through the imagined movements of various body parts. Unlike previous systems that used fixed-length EEG trials for MI decoding, asynchronous BCIs aim to detect the user's MI without explicit triggers. They are challenging to implement, because the algorithm needs to first distinguish between resting-states and MI trials, and then classify the MI trials into the correct task, all without any triggers. This paper proposes a sliding window prescreening and classification (SWPC) approach for MI-based asynchronous BCIs, which consists of two modules: a prescreening module to screen MI trials out of the resting-state, and a classification module for MI classification. Both modules are trained with supervised learning followed by self-supervised learning, which refines the feature extractors. Within-subject and cross-subject asynchronous MI classifications on four different EEG datasets validated the effectiveness of SWPC, i.e., it always achieved the highest average classification accuracy, and outperformed the best state-of-the-art baseline on each dataset by about 2%.
Large language models (LLMs) show promise in code translation due to their ability to generate idiomatic code. However, a significant limitation when using LLMs for code translation is scalability: existing works have shown a drop in translation success rates for code exceeding around 100 lines. We overcome this limitation by developing a modular approach to translation, where we partition the code into small code fragments which can be translated independently and semantically validated (that is, checking I/O equivalence). When this approach is applied naively, we discover that LLMs are unreliable when translating features of the source language that do not have a direct mapping to the target language, and that the LLM often gets stuck in repair loops when attempting to fix errors. To address these issues, we introduce two key concepts: (1) feature mapping, which integrates predefined translation rules with LLM-based translation to guide the LLM in navigating subtle language differences and producing semantically accurate code; and (2) type-compatibility, which facilitates localized checks at the function signature level to detect errors early, thereby narrowing the scope of potential repairs. We apply our approach to translating real-world Go codebases to Rust, demonstrating that we can consistently generate reliable Rust translations for projects up to 6,600 lines of code and 369 functions, with an average of 73% of functions successfully validated for I/O equivalence, considerably higher than any existing work.
Frontier models are currently developed and distributed primarily through two channels: centralized proprietary APIs or open-sourcing of pre-trained weights. We identify a third paradigm - Protocol Learning - where models are trained across decentralized networks of incentivized participants. This approach has the potential to aggregate orders of magnitude more computational resources than any single centralized entity, enabling unprecedented model scales and capabilities. However, it also introduces novel challenges: heterogeneous and unreliable nodes, malicious participants, the need for unextractable models to preserve incentives, and complex governance dynamics. To date, no systematic analysis has been conducted to assess the feasibility of Protocol Learning or the associated risks, particularly the 'No-Off Problem' arising from the inability to unilaterally halt a collectively trained model. We survey recent technical advances that suggest decentralized training may be feasible - covering emerging communication-efficient strategies and fault-tolerant methods - while highlighting critical open problems that remain. Contrary to the notion that decentralization inherently amplifies frontier risks, we argue that Protocol Learning's transparency, distributed governance, and democratized access ultimately reduce these risks compared to today's centralized regimes.
Communication over a classical multiple-access channel (MAC) with entanglement resources is considered, whereby two transmitters share entanglement resources a priori before communication begins. Leditzky et al. (2020) presented an example of a classical MAC, defined in terms of a pseudo telepathy game, such that the sum rate with entangled transmitters is strictly higher than the best achievable sum rate without such resources. Here, we establish inner and outer bounds on the capacity region for the general MAC with entangled transmitters, and show that the previous result can be obtained as a special case. It has long been known that the capacity region of the classical MAC under a message-average error criterion can be strictly larger than with a maximal error criterion (Dueck, 1978). We observe that given entanglement resources, the regions coincide. Furthermore, we address the combined setting of entanglement resources and conferencing, where the transmitters can also communicate with each other over rate-limited links. Using superdense coding, entanglement can double the conferencing rate.
Algebraic methods applied to the reconstruction of Sparse-view Computed Tomography (CT) can provide both a high image quality and a decrease in the dose received by patients, although with an increased reconstruction time since their computational costs are higher. In our work, we present a new algebraic implementation that obtains an exact solution to the system of linear equations that models the problem and based on single-precision floating-point arithmetic. By applying Out-Of-Core (OOC) techniques, the dimensions of the system can be increased regardless of the main memory size and as long as there is enough secondary storage (disk). These techniques have allowed to process images of 768 x 768$ pixels. A comparative study of our method on a GPU using both single-precision and double-precision arithmetic has been carried out. The goal is to assess the single-precision arithmetic implementation both in terms of time improvement and quality of the reconstructed images to determine if it is sufficient to consider it a viable option. Results using single-precision arithmetic approximately halves the reconstruction time of the double-precision implementation, whereas the obtained images retain all internal structures despite having higher noise levels.
Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.
We propose a novel single shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, by a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16 based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.