We present ALFRED: a virtual memory abstraction that resolves the dichotomy between volatile and non-volatile memory in intermittent computing. Mixed-volatile microcontrollers allow programmers to allocate part of the application state onto non-volatile main memory. Programmers are therefore to explore manually the trade-off between simpler management of persistent state against the energy overhead for non-volatile memory operations and intermittence anomalies due to re-execution of non-idempotent code. This approach is laborious and yields sub-optimal performance. We take a different stand with ALFRED: we provide programmers with a virtual memory abstraction detached from the specific volatile nature of memory and automatically determine an efficient mapping from virtual to volatile or non-volatile memory. Unlike existing works, ALFRED does not require programmers to learn a new programming model or language syntax, while the mapping is entirely resolved at compile-time, reducing the run-time energy overhead. We implement ALFRED through a series of program machine-level code transformations. Compared to existing systems, we demonstrate that ALFRED reduces energy consumption by up to two orders of magnitude given a fixed workload. This enables the workloads to finish sooner, as the use of available energy shifts from ensuring forward progress to useful application processing.
We show how to translate a subset of RISC-V machine code compiled from a subset of C to quadratic unconstrained binary optimization (QUBO) models that may be solved by a quantum annealing machine: given a bound $n$, there is input $I$ to a program $P$ such that $P$ runs into a given program state $E$ executing no more than $n$ machine instructions if and only if the QUBO model of $P$ for $n$ evaluates to 0 on $I$. Thus, with more qubits on the machine than variables in the QUBO model, quantum annealing the model reaches 0 (ground) energy in constant time with high probability on some input $I$ that is part of the ground state if and only if $P$ runs into $E$ on $I$ executing no more than $n$ instructions. Translation takes $\mathcal{O}(n^2)$ time effectively turning a quantum annealer into a polynomial-time symbolic execution engine and bounded model checker, eliminating their path and state explosion problems. Here, we take advantage of the fact that any machine instruction may only increase the size of the program state by a constant amount of bits. Translation time comes down from $\mathcal{O}(n^2)$ to $\mathcal{O}(n\cdot|P|)$ if memory consumption of $P$ is bounded by a constant, establishing a linear (quadratic) upper bound on quantum space, in number of qubits on a quantum annealer, in terms of algorithmic time (space) in classical computing. The construction provides a non-relativizing argument for $NP\subseteq BQP$, without violating the optimality of Grover's algorithm, also on gate-model quantum machines, and motivates a temporal and spatial metric of quantum advantage. Our prototypical open-source toolchain translates machine code that runs on real RISC-V hardware to models that can be solved by real quantum annealing hardware, as shown in our experiments.
PCM is a popular backing memory for DRAM main memory in tiered memory systems. PCM has asymmetric access energy; writes dominate reads. MLC asymmetry can vary by an order of magnitude. Many schemes have been developed to take advantage of the asymmetric patterns of 0s and 1s in the data to reduce write energy. Because the memory is non-volatile, data can be recovered via physical attack or across system reboot cycles. To protect information stored in PCM against these attacks requires encryption. Unfortunately, most encryption algorithms scramble 0s and 1s in the data, effectively removing any patterns and negatively impacting schemes that leverage data bias and similarity to reduce write energy. In this paper, we introduce Virtual Coset Coding (VCC) as a workload-independent approach that reduces costly symbol transitions for storing encrypted data. VCC is based on two ideas. First, using coset encoding with random coset candidates, it is possible to effectively reduce the frequency of costly bit/symbol transitions when writing encrypted data. Second, a small set of random substrings can be used to achieve the same encoding efficiency as a large number of random coset candidates, but at a much lower encoding/decoding cost. Additionally, we demonstrate how VCC can be leveraged for energy reduction in combination with fault-mitigation and fault-tolerance to dramatically increase the lifetimes of endurance-limited NVMs, such as PCM. We evaluate the design of VCC and demonstrate that it can be implemented on-chip with only a nominal area overhead. VCC reduces dynamic energy by 22-28% while maintaining the same performance. Using our multi-objective optimization approach achieves at least a 36% improvement in lifetime over the state-of-the-art and at least a 50% improvement in lifetime vs. an unencoded memory, while maintaining its energy savings and system performance.
We investigate several rank-based change-point procedures for the covariance operator in a sequence of observed functions, called FKWC change-point procedures. Our methods allow the user to test for one change-point, to test for an epidemic period, or to detect an unknown amount of change-points in the data. Our methodology combines functional data depth values with the traditional Kruskal Wallis test statistic. By taking this approach we have no need to estimate the covariance operator, which makes our methods computationally cheap. For example, our procedure can identify multiple change-points in $O(n\log n)$ time. Our procedure is fully non-parametric and is robust to outliers through the use of data depth ranks. We show that when $n$ is large, our methods have simple behaviour under the null hypothesis.We also show that the FKWC change-point procedures are $n^{-1/2}$-consistent. In addition to asymptotic results, we provide a finite sample accuracy result for our at-most-one change-point estimator. In simulation, we compare our methods against several others. We also present an application of our methods to intraday asset returns and f-MRI scans.
We study constrained reinforcement learning (CRL) from a novel perspective by setting constraints directly on state density functions, rather than the value functions considered by previous works. State density has a clear physical and mathematical interpretation, and is able to express a wide variety of constraints such as resource limits and safety requirements. Density constraints can also avoid the time-consuming process of designing and tuning cost functions required by value function-based constraints to encode system specifications. We leverage the duality between density functions and Q functions to develop an effective algorithm to solve the density constrained RL problem optimally and the constrains are guaranteed to be satisfied. We prove that the proposed algorithm converges to a near-optimal solution with a bounded error even when the policy update is imperfect. We use a set of comprehensive experiments to demonstrate the advantages of our approach over state-of-the-art CRL methods, with a wide range of density constrained tasks as well as standard CRL benchmarks such as Safety-Gym.
Federated learning is a new distributed machine learning framework, where a bunch of heterogeneous clients collaboratively train a model without sharing training data. In this work, we consider a practical and ubiquitous issue in federated learning: intermittent client availability, where the set of eligible clients may change during the training process. Such an intermittent client availability model would significantly deteriorate the performance of the classical Federated Averaging algorithm (FedAvg for short). We propose a simple distributed non-convex optimization algorithm, called Federated Latest Averaging (FedLaAvg for short), which leverages the latest gradients of all clients, even when the clients are not available, to jointly update the global model in each iteration. Our theoretical analysis shows that FedLaAvg attains the convergence rate of $O(1/(N^{1/4} T^{1/2}))$, achieving a sublinear speedup with respect to the total number of clients. We implement and evaluate FedLaAvg with the CIFAR-10 dataset. The evaluation results demonstrate that FedLaAvg indeed reaches a sublinear speedup and achieves 4.23% higher test accuracy than FedAvg.
We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation. In the proposed framework, we differentiate two types of proximity relations: direct proximity and k-th order neighborhood proximity. While learning from the former exploits direct user-item associations observable from the graph, learning from the latter makes use of implicit associations such as user-user similarities and item-item similarities, which can provide valuable information especially when the graph is sparse. Moreover, for improving scalability and flexibility, we propose a sampling technique that is specifically designed to capture the two types of proximity relations. Extensive experiments on eight benchmark datasets show that CSE yields significantly better performance than state-of-the-art recommendation methods.
Deep embedding learning becomes more attractive for discriminative feature learning, but many methods still require hard-class mining, which is computationally complex and performance-sensitive. To this end, we propose Adaptive Large Margin N-Pair loss (ALMN) to address the aforementioned issues. Instead of exploring hard example-mining strategy, we introduce the concept of large margin constraint. This constraint aims at encouraging local-adaptive large angular decision margin among dissimilar samples in multimodal feature space so as to significantly encourage intraclass compactness and interclass separability. And it is mainly achieved by a simple yet novel geometrical Virtual Point Generating (VPG) method, which converts artificially setting a fixed margin into automatically generating a boundary training sample in feature space and is an open question. We demonstrate the effectiveness of our method on several popular datasets for image retrieval and clustering tasks.
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input an MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters
When deploying resource-intensive signal processing applications in wireless sensor or mesh networks, distributing processing blocks over multiple nodes becomes promising. Such distributed applications need to solve the placement problem (which block to run on which node), the routing problem (which link between blocks to map on which path between nodes), and the scheduling problem (which transmission is active when). We investigate a variant where the application graph may contain feedback loops and we exploit wireless networks? inherent multicast advantage. Thus, we propose Multicast-Aware Routing for Virtual network Embedding with Loops in Overlays (MARVELO) to find efficient solutions for scheduling and routing under a detailed interference model. We cast this as a mixed integer quadratically constrained optimisation problem and provide an efficient heuristic. Simulations show that our approach handles complex scenarios quickly.
Video caption refers to generating a descriptive sentence for a specific short video clip automatically, which has achieved remarkable success recently. However, most of the existing methods focus more on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first one explores the impact on cross-modalities feature fusion from low to high order. The second establishes the visual-audio short-term dependency by sharing weights of corresponding front-end networks. The third extends the temporal dependency to long-term through sharing multimodal memory across visual and audio modalities. Extensive experiments have validated the effectiveness of our three cross-modalities fusion strategies on two benchmark datasets, including Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). It is worth mentioning that sharing weight can coordinate visual-audio feature fusion effectively and achieve the state-of-art performance on both BELU and METEOR metrics. Furthermore, we first propose a dynamic multimodal feature fusion framework to deal with the part modalities missing case. Experimental results demonstrate that even in the audio absence mode, we can still obtain comparable results with the aid of the additional audio modality inference module.