We introduce SHARCS for adaptive inference that takes into account the hardness of input samples. SHARCS can train a router on any transformer network, enabling the model to direct different samples to sub-networks with varying widths. Our experiments demonstrate that: (1) SHARCS outperforms or complements existing per-sample adaptive inference methods across various classification tasks in terms of accuracy vs. FLOPs; (2) SHARCS generalizes across different architectures and can be even applied to compressed and efficient transformer encoders to further improve their efficiency; (3) SHARCS can provide a 2 times inference speed up at an insignificant drop in accuracy.
Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can improve the quality and efficiency of these systems as well. We focus on text--audio and introduce Whisbert, which is inspired by the text--image approach of FLAVA \citep{singh_flava_2022}. In accordance with Babylm \citep{warstadt2023papers} guidelines, we pretrain Whisbert on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset \citep{galvez_peoples_2021}. To assess the impact of multimodality, we compare versions of the model that are trained on text only and on both audio and text simultaneously. We find that while Whisbert is able to perform well on multimodal masked modeling and surpasses the Babylm baselines in most benchmark tasks, it struggles to optimize its complex objective and outperform its text-only Whisbert baseline.
Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.
Recent work in NLP has shown promising results in training models on large amounts of tasks to achieve better generalization. However, it is not well-understood how tasks are related, and how helpful training tasks can be chosen for a new task. In this work, we investigate whether knowing task relationships via pairwise task transfer improves choosing one or more source tasks that help to learn a new target task. We provide TaskWeb, a large-scale benchmark of pairwise task transfers for 22 NLP tasks using three different model types, sizes, and adaptation methods, spanning about 25,000 experiments. Then, we design a new method TaskShop based on our analysis of TaskWeb. TaskShop uses TaskWeb to estimate the benefit of using a source task for learning a new target task, and to choose a subset of helpful training tasks for multi-task training. Our method improves overall rankings and top-k precision of source tasks by 10% and 38%, respectively. We also use TaskShop to build much smaller multi-task training sets that improve zero-shot performances across 11 different target tasks by at least 4.3%.
Various heuristic objectives for modeling hand-object interaction have been proposed in past work. However, due to the lack of a cohesive framework, these objectives often possess a narrow scope of applicability and are limited by their efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes by leveraging recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks along with physics priors to mitigate penetration and relative-sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
Neuromorphic computing and, in particular, spiking neural networks (SNNs) have become an attractive alternative to deep neural networks for a broad range of signal processing applications, processing static and/or temporal inputs from different sensory modalities, including audio and vision sensors. In this paper, we start with a description of recent advances in algorithmic and optimization innovations to efficiently train and scale low-latency, and energy-efficient spiking neural networks (SNNs) for complex machine learning applications. We then discuss the recent efforts in algorithm-architecture co-design that explores the inherent trade-offs between achieving high energy-efficiency and low latency while still providing high accuracy and trustworthiness. We then describe the underlying hardware that has been developed to leverage such algorithmic innovations in an efficient way. In particular, we describe a hybrid method to integrate significant portions of the model's computation within both memory components as well as the sensor itself. Finally, we discuss the potential path forward for research in building deployable SNN systems identifying key challenges in the algorithm-hardware-application co-design space with an emphasis on trustworthiness.
In recent years, circuit simulators and Boolean satisfiability (SAT) solvers have been tightly integrated to provide efficient logic synthesis and verification. Circuit simulation can generate highly expressive simulation patterns that can either enumerate or filter out most candidates for synthesis. Subsequently, SAT solvers are employed to check those that remain, thereby making the logic synthesis process more efficient. This paper introduces a novel circuit simulator of k-input lookup table (k-LUT) networks, based on semi-tensor product (STP). STP-based simulators use computation of logic matrices, the primitives of logic networks, as opposed to relying on bitwise logic operations for simulation of k-LUT networks. Experimental results show that our STP-based simulator reduces the runtime by an average of 7.2x. Furthermore, we integrate this proposed simulator into a SAT-sweeping engine known as SAT sweeper. Through a combination of structural hashing, simulation, and SAT queries, SAT sweeper simplifies logic networks by systematically merging graph vertices from input to output. To enhance the efficiency, we used STP-based exhaustive simulation, which significantly reduces the number of false equivalence class candidates, thereby improving the computational efficiency by reducing the number of SAT calls required. When compared to the SOTA SAT sweeper, our method demonstrates an average 35% runtime reduction.
We show that sharp thresholds for Boolean functions directly imply average-case circuit lower bounds. More formally we show that any Boolean function exhibiting a sharp enough threshold at \emph{arbitrary} critical density cannot be computed by Boolean circuits of bounded depth and polynomial size. Our general result implies new average-case bounded depth circuit lower bounds in a variety of settings. (a) ($k$-cliques) For $k=\Theta(n)$, we prove that any circuit of depth $d$ deciding the presence of a size $k$ clique in a random graph requires exponential-in-$n^{\Theta(1/d)}$ size. To the best of our knowledge, this is the first average-case exponential size lower bound for bounded depth (not necessarily monotone) circuits solving the fundamental $k$-clique problem (for any $k=k_n$). (b)(random 2-SAT) We prove that any circuit of depth $d$ deciding the satisfiability of a random 2-SAT formula requires exponential-in-$n^{\Theta(1/d)}$ size. To the best of our knowledge, this is the first bounded depth circuit lower bound for random $k$-SAT for any value of $k \geq 2.$ Our results also provide the first rigorous lower bound in agreement with a conjectured, but debated, ``computational hardness'' of random $k$-SAT around its satisfiability threshold. (c)(Statistical estimation -- planted $k$-clique) Over the recent years, multiple statistical estimation problems have also been proven to exhibit a ``statistical'' sharp threshold, called the All-or-Nothing (AoN) phenomenon. We show that AoN also implies circuit lower bounds for statistical problems. As a simple corollary of that, we prove that any circuit of depth $d$ that solves to information-theoretic optimality a ``dense'' variant of the celebrated planted $k$-clique problem requires exponential-in-$n^{\Theta(1/d)}$ size.
Recently pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training. Given an input document, we sample a pair of word sets according to the document language model, where the set with higher likelihood is deemed as more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. By further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP can achieve exciting performance under both the zero- and low-resource IR settings. The code and pre-trained models are available at //github.com/Albert-Ma/PROP.
Deep learning techniques have received much attention in the area of image denoising. However, there are substantial differences in the various types of deep learning methods dealing with image denoising. Specifically, discriminative learning based on deep learning can ably address the issue of Gaussian noise. Optimization models based on deep learning are effective in estimating the real noise. However, there has thus far been little related research to summarize the different deep learning techniques for image denoising. In this paper, we offer a comparative study of deep techniques in image denoising. We first classify the deep convolutional neural networks (CNNs) for additive white noisy images; the deep CNNs for real noisy images; the deep CNNs for blind denoising and the deep CNNs for hybrid noisy images, which represents the combination of noisy, blurred and low-resolution images. Then, we analyze the motivations and principles of the different types of deep learning methods. Next, we compare the state-of-the-art methods on public denoising datasets in terms of quantitative and qualitative analysis. Finally, we point out some potential challenges and directions of future research.
With the rapid growth of knowledge bases (KBs), question answering over knowledge base, a.k.a. KBQA has drawn huge attention in recent years. Most of the existing KBQA methods follow so called encoder-compare framework. They map the question and the KB facts to a common embedding space, in which the similarity between the question vector and the fact vectors can be conveniently computed. This, however, inevitably loses original words interaction information. To preserve more original information, we propose an attentive recurrent neural network with similarity matrix based convolutional neural network (AR-SMCNN) model, which is able to capture comprehensive hierarchical information utilizing the advantages of both RNN and CNN. We use RNN to capture semantic-level correlation by its sequential modeling nature, and use an attention mechanism to keep track of the entities and relations simultaneously. Meanwhile, we use a similarity matrix based CNN with two-directions pooling to extract literal-level words interaction matching utilizing CNNs strength of modeling spatial correlation among data. Moreover, we have developed a new heuristic extension method for entity detection, which significantly decreases the effect of noise. Our method has outperformed the state-of-the-arts on SimpleQuestion benchmark in both accuracy and efficiency.