Copy-move forgery on speech (CMF), coupled with post-processing techniques, presents a great challenge to the forensic detection and localization of tampered areas. Most of the existing CMF detection approaches necessitate pre-segmentation of speech to facilitate similarity calculations among these segments. However, these approaches usually suffer from the problems of uncontrollable computational complexity and sensitivity to the presence of a word that is read multiple times within a speech recording. To address these issues, we propose a local feature tensors-based CMF detection algorithm that can transform duplicate detection and localization problems into a special tensor-matching procedure, accompanied by complete theoretical analysis as support. Through extensive experimentation, we have demonstrated that our method exhibits computational efficiency and robustness against post-processing techniques. Notably, it can effectively and blindly detect tampered segments, even those as short as a fractional second. These advantages highlight the promising potential of our approach for practical applications.
Imitation learning from human demonstrations can teach robots complex manipulation skills, but is time-consuming and labor intensive. In contrast, Task and Motion Planning (TAMP) systems are automated and excel at solving long-horizon tasks, but they are difficult to apply to contact-rich tasks. In this paper, we present Human-in-the-Loop Task and Motion Planning (HITL-TAMP), a novel system that leverages the benefits of both approaches. The system employs a TAMP-gated control mechanism, which selectively gives and takes control to and from a human teleoperator. This enables the human teleoperator to manage a fleet of robots, maximizing data collection efficiency. The collected human data is then combined with an imitation learning framework to train a TAMP-gated policy, leading to superior performance compared to training on full task demonstrations. We compared HITL-TAMP to a conventional teleoperation system -- users gathered more than 3x the number of demos given the same time budget. Furthermore, proficient agents (75\%+ success) could be trained from just 10 minutes of non-expert teleoperation data. Finally, we collected 2.1K demos with HITL-TAMP across 12 contact-rich, long-horizon tasks and show that the system often produces near-perfect agents. Videos and additional results at //hitltamp.github.io .
5G New Radio (NR) has stringent demands on both performance and complexity for the design of low-density parity-check (LDPC) decoding algorithms and corresponding VLSI implementations. Furthermore, decoders must fully support the wide range of all 5G NR blocklengths and code rates, which is a significant challenge. In this paper, we present a high-performance and low-complexity LDPC decoder, tailor-made to fulfill the 5G requirements. First, to close the gap between belief propagation (BP) decoding and its approximations in hardware, we propose an extension of adjusted min-sum decoding, called generalized adjusted min-sum (GA-MS) decoding. This decoding algorithm flexibly truncates the incoming messages at the check node level and carefully approximates the non-linear functions of BP decoding to balance the error-rate and hardware complexity. Numerical results demonstrate that the proposed fixed-point GAMS has only a minor gap of 0.1 dB compared to floating-point BP under various scenarios of 5G standard specifications. Secondly, we present a fully reconfigurable 5G NR LDPC decoder implementation based on GA-MS decoding. Given that memory occupies a substantial portion of the decoder area, we adopt multiple data compression and approximation techniques to reduce 42.2% of the memory overhead. The corresponding 28nm FD-SOI ASIC decoder has a core area of 1.823 mm2 and operates at 895 MHz. It is compatible with all 5G NR LDPC codes and achieves a peak throughput of 24.42 Gbps and a maximum area efficiency of 13.40 Gbps/mm2 at 4 decoding iterations.
For graph self-supervised learning (GSSL), masked autoencoder (MAE) follows the generative paradigm and learns to reconstruct masked graph edges or node features. Contrastive Learning (CL) maximizes the similarity between augmented views of the same graph and is widely used for GSSL. However, MAE and CL are considered separately in existing works for GSSL. We observe that the MAE and CL paradigms are complementary and propose the graph contrastive masked autoencoder (GCMAE) framework to unify them. Specifically, by focusing on local edges or node features, MAE cannot capture global information of the graph and is sensitive to particular edges and features. On the contrary, CL excels in extracting global information because it considers the relation between graphs. As such, we equip GCMAE with an MAE branch and a CL branch, and the two branches share a common encoder, which allows the MAE branch to exploit the global information extracted by the CL branch. To force GCMAE to capture global graph structures, we train it to reconstruct the entire adjacency matrix instead of only the masked edges as in existing works. Moreover, a discrimination loss is proposed for feature reconstruction, which improves the disparity between node embeddings rather than reducing the reconstruction error to tackle the feature smoothing problem of MAE. We evaluate GCMAE on four popular graph tasks (i.e., node classification, node clustering, link prediction, and graph classification) and compare with 14 state-of-the-art baselines. The results show that GCMAE consistently provides good accuracy across these tasks, and the maximum accuracy improvement is up to 3.2% compared with the best-performing baseline.
Scaling dense PCFGs to thousands of nonterminals via a low-rank parameterization of the rule probability tensor has been shown to be beneficial for unsupervised parsing. However, PCFGs scaled this way still perform poorly as a language model, and even underperform similarly-sized HMMs. This work introduces \emph{SimplePCFG}, a simple PCFG formalism with independent left and right productions. Despite imposing a stronger independence assumption than the low-rank approach, we find that this formalism scales more effectively both as a language model and as an unsupervised parser. As an unsupervised parser, our simple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language model, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank PCFGs. We further introduce \emph{FlashInside}, a hardware IO-aware implementation of the inside algorithm for efficiently scaling simple PCFGs.
Current approaches in paraphrase generation and detection heavily rely on a single general similarity score, ignoring the intricate linguistic properties of language. This paper introduces two new tasks to address this shortcoming by considering paraphrase types - specific linguistic perturbations at particular text positions. We name these tasks Paraphrase Type Generation and Paraphrase Type Detection. Our results suggest that while current techniques perform well in a binary classification scenario, i.e., paraphrased or not, the inclusion of fine-grained paraphrase types poses a significant challenge. While most approaches are good at generating and detecting general semantic similar content, they fail to understand the intrinsic linguistic variables they manipulate. Models trained in generating and identifying paraphrase types also show improvements in tasks without them. In addition, scaling these models further improves their ability to understand paraphrase types. We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.
The challenge of low-latency speech translation has recently draw significant interest in the research community as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows to automatically evaluate the translation quality as well as latency and also provides a web interface to show the low-latency model outputs to the user.
For a language model (LM) to faithfully model human language, it must compress vast, potentially infinite information into relatively few dimensions. We propose analyzing compression in (pre-trained) LMs from two points of view: geometric and information-theoretic. We demonstrate that the two views are highly correlated, such that the intrinsic geometric dimension of linguistic data predicts their coding length under the LM. We then show that, in turn, high compression of a linguistic dataset predicts rapid adaptation to that dataset, confirming that being able to compress linguistic information is an important part of successful LM performance. As a practical byproduct of our analysis, we evaluate a battery of intrinsic dimension estimators for the first time on linguistic data, showing that only some encapsulate the relationship between information-theoretic compression, geometric compression, and ease-of-adaptation.
With the advancement of data-driven techniques, addressing continuous con-trol challenges has become more efficient. However, the reliance of these methods on historical data introduces the potential for unexpected decisions in novel scenarios. To enhance performance in autonomous driving and collision avoidance, we propose a symbiotic fusion of policy gradient with safety-based control. In this study, we em-ploy the Deep Deterministic Policy Gradient (DDPG) algorithm to enable autono-mous driving in the absence of surrounding vehicles. By training the vehicle's driving policy within a stable and familiar environment, a robust and efficient learning pro-cess is achieved. Subsequently, an artificial potential field approach is utilized to formulate a collision avoidance algorithm, accounting for the presence of surround-ing vehicles. Furthermore, meticulous consideration is given to path tracking meth-ods. The amalgamation of these approaches demonstrates substantial performance across diverse scenarios, underscoring its potential for advancing autonomous driving while upholding safety standards.
Conventional methods for object detection typically require a substantial amount of training data and preparing such high-quality training data is very labor-intensive. In this paper, we propose a novel few-shot object detection network that aims at detecting objects of unseen categories with only a few annotated examples. Central to our method are our Attention-RPN, Multi-Relation Detector and Contrastive Training strategy, which exploit the similarity between the few shot support set and query set to detect novel objects while suppressing false detection in the background. To train our network, we contribute a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is one of the first datasets specifically designed for few-shot object detection. Once our few-shot network is trained, it can detect objects of unseen categories without further training or fine-tuning. Our method is general and has a wide range of potential applications. We produce a new state-of-the-art performance on different datasets in the few-shot setting. The dataset link is //github.com/fanq15/Few-Shot-Object-Detection-Dataset.
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.