This paper describes the DeepZen text-to-speech (TTS) system for the Blizzard Challenge 2023. The goal of this challenge is to synthesise natural and high-quality speech in French, from a large single-speaker dataset (hub task) and from a smaller dataset by speaker adaptation (spoke task). We participated in both tasks with the same model architecture. Our approach is to use an auto-regressive model, which retains an advantage for generating natural-sounding speech, while improving prosodic control in several ways. Similar to Non-Attentive Tacotron, the model uses a duration predictor and Gaussian upsampling at inference, but with a simpler, unsupervised training procedure. We also model speaking style at both the sentence and word levels by extracting global and local style tokens from the reference speech. At inference, the global and local style tokens are predicted by a BERT model run on the text. This BERT model is also used to predict specific pronunciation features such as schwa elision and optional liaisons. Finally, a modified version of HiFi-GAN, trained on a large public dataset and fine-tuned on the target voices, is used to generate the speech waveform. Our team is identified as O in the Blizzard evaluation, and the MUSHRA test results show that our system ties for second place in both the hub task (median score of 0.75) and the spoke task (median score of 0.68), among 18 and 14 participants, respectively.
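The duration-based alignment mentioned above can be made concrete with a short sketch. The following is a minimal illustration of Gaussian upsampling in the style of Non-Attentive Tacotron, not DeepZen's implementation; the tensor shapes and the fixed sigma are assumptions.

```python
# Minimal sketch of Gaussian upsampling (Non-Attentive Tacotron style).
# Shapes and the fixed sigma are illustrative assumptions, not DeepZen's code.
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """h: (B, N, D) encoder states; durations: (B, N) predicted frames per token."""
    durations = durations.to(h.dtype)
    ends = torch.cumsum(durations, dim=1)              # (B, N) cumulative frame ends
    centers = ends - 0.5 * durations                   # Gaussian centre of each token
    T = int(ends[:, -1].max().item())                  # total number of output frames
    t = torch.arange(T, dtype=h.dtype).view(1, T, 1)   # frame index grid
    # Weight of every output frame for every token, normalised across tokens: (B, T, N)
    w = torch.softmax(-((t - centers.unsqueeze(1)) ** 2) / (2 * sigma**2), dim=2)
    return w @ h                                       # (B, T, D) upsampled states
```

At inference, `durations` would come from the duration predictor the abstract describes.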
Trajectories are optimized for a simplified two-dimensional skateboarding system to allow it to perform a fundamental skateboarding trick called an "ollie". A methodology for generating trick trajectories by controlling the position of a point mass relative to a board is presented and demonstrated over a range of peak jump heights. A hybrid dynamics approach is taken to perform this optimization, with contact constraints applied along a sequence of discrete timesteps based on the board's position throughout designated sections of the trick. These constraints introduce explicit and implicit discontinuities between the chosen sections of the trick sequence. The approach is shown to be successful for a set of realistic system parameters.
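To make the phase-structured idea concrete, here is a deliberately simplified, hedged sketch: a vertical point mass that pushes on the ground during a stance phase and then flies ballistically, optimized by single shooting with SciPy. All parameters and the effort objective are illustrative assumptions; the paper's hybrid formulation with board kinematics is considerably richer.

```python
# Toy phase-structured trajectory optimization: a vertical point mass pushes
# on the ground during a stance phase (contact force f >= 0), then flies
# ballistically (f = 0). Parameters and the single-shooting formulation are
# illustrative assumptions, far simpler than the paper's hybrid formulation.
import numpy as np
from scipy.optimize import minimize

m, g, dt = 70.0, 9.81, 0.02   # mass [kg], gravity [m/s^2], timestep [s]
Ns, Nf = 25, 15               # number of stance / flight timesteps
h_target = 0.5                # desired height at the flight apex [m]

def simulate(f_stance):
    """Semi-implicit Euler rollout; y is mass height relative to its rest pose."""
    y, v = 0.0, 0.0
    for k in range(Ns + Nf):
        f = f_stance[k] if k < Ns else 0.0   # contact force exists only in stance
        v += dt * (f / m - g)
        y += dt * v
    return y, v                              # state at the end of flight

res = minimize(
    lambda f: dt * np.sum(f ** 2),           # minimise actuation effort
    x0=np.full(Ns, m * g),                   # initial guess: hover force
    bounds=[(0.0, None)] * Ns,               # unilateral contact: f >= 0
    constraints=[{"type": "eq",              # end exactly at the apex: y = h, v = 0
                  "fun": lambda f: np.array(simulate(f)) - [h_target, 0.0]}],
    method="SLSQP",
)
print("final (height, velocity):", simulate(res.x))
```

The explicit discontinuity between the stance and flight sections shows up here as the hard switch in the force term at timestep `Ns`.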
In this paper, we provide simpler reductions from Exact Triangle to two important problems in fine-grained complexity: Exact Triangle with Few Zero-Weight $4$-Cycles and All-Edges Sparse Triangle. Exact Triangle instances with few zero-weight $4$-cycles were considered by Jin and Xu [STOC 2023], who used them as an intermediate problem to show $3$SUM hardness of All-Edges Sparse Triangle with few $4$-cycles (independently obtained by Abboud, Bringmann and Fischer [STOC 2023]), which is in turn used to show $3$SUM hardness of a variety of problems, including $4$-Cycle Enumeration, Offline Approximate Distance Oracle, Dynamic Approximate Shortest Paths and All-Nodes Shortest Cycles. We provide a simple reduction from Exact Triangle to Exact Triangle with few zero-weight $4$-cycles. Our new reduction not only simplifies Jin and Xu's previous reduction, but also strengthens the conditional lower bounds from holding under the $3$SUM hypothesis to holding under the even more believable Exact Triangle hypothesis. As a result, all conditional lower bounds shown by Jin and Xu [STOC 2023] and by Abboud, Bringmann and Fischer [STOC 2023] using All-Edges Sparse Triangle with few $4$-cycles as an intermediate problem now also hold under the Exact Triangle hypothesis. We also provide two alternative proofs of the conditional lower bound for the All-Edges Sparse Triangle problem under the Exact Triangle hypothesis, which was originally proved by Vassilevska Williams and Xu [FOCS 2020]. Both of our new reductions are simpler, and one of them is also deterministic -- all previous reductions from Exact Triangle or $3$SUM to All-Edges Sparse Triangle (including Pătraşcu's seminal work [STOC 2010]) were randomized.
Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained on our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large-scale synthetic dataset (1.4M examples) generated using TrueTeacher, together with a checkpoint trained on this data.
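A rough sketch of the data-generation loop the abstract describes might look as follows; the prompt, the `call_llm` client, and the binary label convention are hypothetical stand-ins, not the released pipeline.

```python
# Illustrative sketch of TrueTeacher-style synthetic data generation: an LLM
# labels model-generated summaries for factual consistency, yielding NLI-style
# examples for training a smaller student. The prompt, the `call_llm` client,
# and the label convention are hypothetical stand-ins, not the released pipeline.
PROMPT = (
    "Document:\n{doc}\n\nSummary:\n{summary}\n\n"
    "Is the summary factually consistent with the document? Answer yes or no."
)

def generate_examples(docs, summarizers, call_llm):
    """docs: iterable of source documents; summarizers: diverse summarization
    models (callables); call_llm: a text-in/text-out LLM client."""
    examples = []
    for doc in docs:
        for summarize in summarizers:          # diverse model-generated summaries
            summary = summarize(doc)
            answer = call_llm(PROMPT.format(doc=doc, summary=summary))
            label = 1 if answer.strip().lower().startswith("yes") else 0
            # NLI convention: premise = document, hypothesis = summary
            examples.append({"premise": doc, "hypothesis": summary, "label": label})
    return examples
```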
Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, making a quick switch between users troublesome. Even for the same speaker, these models perform poorly across sessions, i.e., after dismounting and re-mounting the recording equipment. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module may reduce the MSE by 88% on average, compared to retraining the whole network. The improvement is even larger (around 92%) when adapting the network to different recording sessions of the same speaker.
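As a rough illustration of such an STN front-end (layer sizes are our assumptions, not the authors' architecture), a PyTorch module might look like this:

```python
# Minimal sketch of a spatial transformer (STN) front-end that applies a
# learned affine transform to input ultrasound frames; layer sizes are
# illustrative assumptions, not the authors' architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(                 # localisation network
            nn.Conv2d(1, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),                     # 6 affine parameters
        )
        # Initialise the regressor to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                         # x: (B, 1, H, W) input frames
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

For session adaptation in the spirit of the experiment above, one would freeze the downstream synthesis network and fine-tune only `stn.parameters()`.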
Known simulations of random access machines (RAMs) or parallel RAMs (PRAMs) by Boolean circuits incur significant polynomial blowup, due to the need to repeatedly simulate accesses to a large main memory. Consider a single modification to Boolean circuits that removes the restriction that circuit graphs be acyclic. We call this the cyclic circuit model. Note that cyclic circuits remain combinational, as they do not allow wire values to change over time. We simulate PRAMs with cyclic circuits, and the blowup from our simulation is only polylogarithmic. Consider a PRAM program $P$ that, on a length-$n$ input, uses an arbitrary number of processors to manipulate words of size $\Theta(\log n)$ bits and then halts within $W(n)$ work. We construct a size-$O(W(n)\cdot \log^4 n)$ cyclic circuit that simulates $P$. Suppose that on a particular input, $P$ halts in time $T$; our circuit computes the same output within $T \cdot O(\log^3 n)$ gate delay. This implies the theoretical feasibility of powerful parallel machines: cyclic circuits can be implemented in hardware, and our circuit achieves performance within polylogarithmic factors of the PRAM. Our simulated PRAM synchronizes processors via logical dependencies between wires.
This paper aims to explore the potential of combining Deep Reinforcement Learning (DRL) with Knowledge Distillation (KD) by distilling various DRL algorithms and studying their distillation effects. In doing so, the computational burden of deep models can be reduced while maintaining performance. The primary objective is to provide a benchmark for evaluating the performance of different DRL algorithms that have been refined using KD techniques, with the goal of developing efficient and fast DRL models. By exploring the combination of DRL and KD, this work aims to promote the development of models that require fewer GPU resources, learn more quickly, and make faster decisions in complex environments. The results of this research could significantly advance the field of DRL and pave the way for the future deployment of resource-efficient, decision-making intelligent systems.
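One concrete instance of DRL combined with KD is policy distillation, where a compact student imitates a trained teacher's action distribution. The sketch below is a generic, hedged illustration: the KL objective, temperature, and data source are common choices, not a method taken from this paper.

```python
# Hedged sketch of policy distillation: a small student network is trained to
# match a trained teacher's action distribution via KL divergence. The
# temperature and the source of the state batches are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student, teacher, states, optimizer, tau=1.0):
    """One distillation update on a batch of observed states."""
    with torch.no_grad():
        teacher_logits = teacher(states)       # teacher stays frozen
    student_logits = student(states)
    # KL(teacher || student) over softened action distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For a discrete action space, `student` could be as small as `nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))`, trained on state batches collected from the teacher's rollouts.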
Recently, Large Language Models (LLMs) have shown impressive results in code generation. However, existing decoding strategies are designed for Natural Language (NL) generation, overlooking the differences between NL and programming languages (PL). Due to this oversight, a better decoding strategy for code generation remains an open question. In this paper, we conduct the first systematic study to explore a decoding strategy specialized for code generation. Through an analysis of the loss distributions of code tokens, we find that code tokens can be divided into two categories: challenging tokens that are difficult to predict and confident tokens that can be easily inferred. Among them, the challenging tokens mainly appear at the beginning of a code block. Inspired by these findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling, which dynamically adjusts the temperature coefficient when decoding different tokens. We apply a larger temperature when sampling challenging tokens, allowing LLMs to explore diverse choices, and a smaller temperature for confident tokens, avoiding the influence of tail randomness noise. We apply AdapT sampling to LLMs of different sizes and conduct evaluations on two popular datasets. The results show that AdapT sampling significantly outperforms state-of-the-art decoding strategies.
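The idea is easy to sketch. The snippet below is our illustrative approximation: it switches between two temperatures based on the model's own confidence, whereas the paper identifies challenging tokens via loss analysis (e.g., positions at the start of a code block), so treat the thresholding rule and constants as assumptions.

```python
# Illustrative sketch of adaptive-temperature sampling in the spirit of AdapT:
# uncertain next-token predictions get a higher temperature than confident
# ones. The confidence measure (max probability) and all thresholds are our
# assumptions, not the paper's token-categorisation rule.
import torch

def adapt_sample(logits, t_high=1.2, t_low=0.4, conf_threshold=0.8):
    """logits: (vocab,) next-token logits from the LLM; returns a token id."""
    probs = torch.softmax(logits, dim=-1)
    confident = probs.max().item() >= conf_threshold
    tau = t_low if confident else t_high          # smaller temperature when confident
    scaled = torch.softmax(logits / tau, dim=-1)  # re-sharpened distribution
    return torch.multinomial(scaled, num_samples=1).item()
```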
Semantic role labeling (SRL) has multiple disjoint label sets, e.g., VerbNet and PropBank. Creating these datasets is challenging, so a natural question is how to use each one to help the other. Prior work has shown that cross-task interaction helps, but has only explored multitask learning so far. A common issue with the multitask setup is that argument sequences are still decoded separately, running the risk of generating structurally inconsistent label sequences (as per lexicons like Semlink). In this paper, we eliminate this issue with a framework that jointly models VerbNet and PropBank labels as one sequence. In this setup, we show that enforcing Semlink constraints during decoding consistently improves the overall F1. With special input constructions, our joint model infers VerbNet arguments from given PropBank arguments with over 99 F1. For learning, we propose a constrained marginal model that learns with knowledge defined in Semlink to further benefit from the large amounts of PropBank-only data. On the joint benchmark based on CoNLL05, our models achieve state-of-the-art F1 scores, outperforming the prior best in-domain model by 3.5 (VerbNet) and 0.8 (PropBank). For out-of-domain generalization, our models surpass the prior best by 3.4 (VerbNet) and 0.2 (PropBank).
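The constraint mechanism can be sketched as score masking: if (VerbNet, PropBank) role pairs not licensed by a Semlink-style mapping receive a score of negative infinity, decoding can never produce an inconsistent pair. The shapes and the mapping below are illustrative assumptions, not the paper's exact decoder.

```python
# Toy sketch of Semlink-constrained joint decoding: VerbNet and PropBank labels
# are predicted as one joint sequence, and pairs not licensed by a Semlink-style
# mapping are masked out before the argmax. Shapes and inputs are assumptions.
import torch

def constrained_decode(pair_scores, pair_vocab, allowed_pairs):
    """
    pair_scores:   (seq_len, num_pairs) scores for joint (VerbNet, PropBank) labels
    pair_vocab:    list of (verbnet_label, propbank_label) per pair index
    allowed_pairs: set of (verbnet_label, propbank_label) licensed by Semlink
    """
    mask = torch.tensor(
        [pair in allowed_pairs for pair in pair_vocab], dtype=torch.bool
    )
    pair_scores = pair_scores.masked_fill(~mask, float("-inf"))
    return pair_scores.argmax(dim=-1)             # best licensed pair per token
```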
This paper explores meta-learning in sequential recommendation to alleviate the item cold-start problem. Sequential recommendation aims to capture users' dynamic preferences based on historical behavior sequences and acts as a key component of most online recommendation scenarios. However, most previous methods have trouble recommending cold-start items, which are prevalent in those scenarios. As there is generally no side information in the sequential recommendation setting, previous cold-start methods cannot be applied when only user-item interactions are available. Thus, we propose a Meta-learning-based Cold-Start Sequential Recommendation Framework, namely Mecos, to mitigate the item cold-start problem in sequential recommendation. This task is non-trivial, as it targets an important problem in a novel and challenging context. Mecos effectively extracts user preferences from limited interactions and learns to match the target cold-start item with potential users. Moreover, our framework can be painlessly integrated with neural network-based models. Extensive experiments conducted on three real-world datasets verify the superiority of Mecos, with average improvements of up to 99%, 91%, and 70% in HR@10 over state-of-the-art baseline methods.
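A rough sketch of the episodic, matching-style setup that such a framework implies: each episode holds a few support interactions for a cold-start item, and the model scores how well a candidate user's sequence embedding matches the item representation aggregated from the support set. The encoders and scoring rule below are our assumptions, not the paper's design.

```python
# Hedged sketch of episodic matching for item cold-start: the cold item is
# represented by aggregating the encoded histories of the few users who
# interacted with it (the support set), then matched against a query user.
# Encoder choices and the bilinear scorer are illustrative assumptions.
import torch
import torch.nn as nn

class ColdStartMatcher(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.seq_encoder = nn.GRU(dim, dim, batch_first=True)  # user history encoder
        self.score = nn.Bilinear(dim, dim, 1)                  # user-item match score

    def forward(self, support_seqs, query_seq):
        # support_seqs: (K, L, dim) histories of K users who consumed the cold item
        _, h_support = self.seq_encoder(support_seqs)
        item_repr = h_support[-1].mean(dim=0)        # aggregate into an item vector
        _, h_query = self.seq_encoder(query_seq)     # query_seq: (1, L, dim)
        return self.score(h_query[-1], item_repr.unsqueeze(0))
```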
We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, via a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on the VOC2007 test set and an mAP of 32.8 on COCO test-dev, with an inference time of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference time of 13.0 milliseconds per image.
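The abstract does not detail the global activation module; a squeeze-and-excitation-style channel gating is one plausible reading, sketched below. It captures the "global activation" flavor (globally pooled, per-channel reweighting) but omits the self-supervised class relationship the paper describes; the reduction ratio is an assumption.

```python
# Hedged sketch of a global activation module in the spirit of DES: global
# pooling summarises each channel, and a small bottleneck produces per-channel
# gates that reweight the detection features. The paper's exact design may
# differ; this is a squeeze-and-excitation-style stand-in.
import torch
import torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # global channel summary
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, C, H, W) detection features
        g = self.gate(x).view(x.size(0), -1, 1, 1)
        return x * g                       # channel-wise reweighting
```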