Simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) is a novel technology which enables the full-space coverage. In this letter, a multi STAR-RIS-aided system using non-orthogonal multiple access in an uplink transmission is considered, where the multi-order reflections among multiple STAR-RISs assist the transmission from the single-antenna users to the multi-antenna base station. Specifically, the total sum rate maximization problem is solved by jointly optimizing the active beamforming, power allocation, transmission and reflection beamforming at the STAR-RIS, and user-STAR-RIS assignment. To solve the non-convex optimization problem, a novel deep reinforcement learning algorithm is proposed which integrates meta-learning and deep deterministic policy gradient (DDPG), denoted by Meta-DDPG. Numerical results demonstrate that our proposed Meta-DDPG algorithm outperforms the conventional DDPG algorithm with $53\%$ improvement, while multi-order reflections among multi STAR-RISs yields to $14.1\%$ enhancement in the total data rate.
Retrieval-augmented Large Language Models (LLMs) have reshaped traditional query-answering systems, offering unparalleled user experiences. However, existing retrieval techniques often struggle to handle multi-modal query contexts. In this paper, we present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index, integrated with cutting-edge LLMs. It comprises five core components: Data Preprocessing, Vector Representation, Index Construction, Query Execution, and Answer Generation, all orchestrated by a dedicated coordinator to ensure smooth data flow from input to answer generation. One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities, facilitating precise measurement of multi-modal information similarity. Furthermore, the system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques. Another highlight of our system is its pluggable processing framework, allowing seamless integration of embedding models, graph indexes, and LLMs. This flexibility provides users diverse options for gaining insights from their multi-modal knowledge base. A preliminary video introduction of MQA is available at //youtu.be/xvUuo2ZIqWk.
This paper introduces a new wheel-legged robot and develops motion controllers based on central pattern generators (CPGs) for the robot to navigate over a range of terrains. A transformable leg-wheel design is considered and characterized in terms of key locomotion characteristics as a function of the design. Kinematic analysis is conducted based on a generalized four-bar mechanism driven by a coaxial hub arrangement. The analysis is used to inform the design of a central pattern generator to control the robot by mapping oscillator states to wheel-leg trajectories and implementing differential steering within the oscillator network. Three oscillator models are used as the basis of the CPGs, and their performance is compared over a range of inputs. The CPG-based controller is used to drive the developed robot prototype on level ground and over obstacles. Additional simulated tests are performed for uneven terrain negotiation and obstacle climbing. Results demonstrate the effectiveness of CPG control in transformable wheel-legged robots.
Electroencephalogram (EEG) technology, particularly high-density EEG (HD EEG) devices, is widely used in fields such as neuroscience. HD EEG devices improve the spatial resolution of EEG by placing more electrodes on the scalp, meeting the requirements of clinical diagnostic applications such as epilepsy focus localization. However, this technique faces challenges such as high acquisition costs and limited usage scenarios. In this paper, spatio-temporal adaptive diffusion models (STADMs) are proposed to pioneer the use of diffusion models for achieving spatial SR reconstruction from low-resolution (LR, 64 channels or fewer) EEG to high-resolution (HR, 256 channels) EEG. Specifically, a spatio-temporal condition module is designed to extract the spatio-temporal features of LR EEG, which then serve as conditional inputs to guide the reverse denoising process of diffusion models. Additionally, a multi-scale Transformer denoising module is constructed to leverage multi-scale convolution blocks and cross-attention-based diffusion Transformer blocks for conditional guidance to generate subject-adaptive SR EEG. Experimental results demonstrate that the proposed method effectively enhances the spatial resolution of LR EEG and quantitatively outperforms existing methods. Furthermore, STADMs demonstrate their value by applying synthetic SR EEG to classification and source localization tasks of epilepsy patients, indicating their potential to significantly improve the spatial resolution of LR EEG.
In a recent work, Gryaznov, Pudl\'{a}k, and Talebanfard (CCC' 22) introduced a stronger version of affine extractors known as directional affine extractors, together with a generalization of $\mathsf{ROBP}$s where each node can make linear queries, and showed that the former implies strong lower bound for a certain type of the latter known as strongly read-once linear branching programs ($\mathsf{SROLBP}$s). Their main result gives explicit constructions of directional affine extractors for entropy $k > 2n/3$, which implies average-case complexity $2^{n/3-o(n)}$ against $\mathsf{SROLBP}$s with exponentially small correlation. A follow-up work by Chattopadhyay and Liao (ECCC' 22) improves the hardness to $2^{n-o(n)}$ at the price of increasing the correlation to polynomially large. In this paper we show: An explicit construction of directional affine extractors with $k=o(n)$ and exponentially small error, which gives average-case complexity $2^{n-o(n)}$ against $\mathsf{SROLBP}$s with exponentially small correlation, thus answering the two open questions raised in previous works. An explicit function in $\mathsf{AC}^0$ that gives average-case complexity $2^{(1-\delta)n}$ against $\mathsf{ROBP}$s with negligible correlation, for any constant $\delta>0$. Previously, no such average-case hardness is known, and the best size lower bound for any function in $\mathsf{AC}^0$ against $\mathsf{ROBP}$s is $2^{\Omega(n)}$. One of the key ingredients in our constructions is a new linear somewhere condenser for affine sources, which is based on dimension expanders. The condenser also leads to an unconditional improvement of the entropy requirement of explicit affine extractors with negligible error. We further show that the condenser also works for general weak random sources, under the Polynomial Freiman-Ruzsa Theorem in $\mathsf{F}_2^n$.
While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, their effectiveness relies on substantial training data, often unavailable in real-world applications. This paper addresses this challenge by proposing a novel methodology for generating realistic and reliable synthetic tabular data with DGMs in limited real-data environments. Our approach proposes several ways to generate an artificial inductive bias in a DGM through transfer learning and meta-learning techniques. We explore and compare four different methods within this framework, demonstrating that transfer learning strategies like pre-training and model averaging outperform meta-learning approaches, like Model-Agnostic Meta-Learning, and Domain Randomized Search. We validate our approach using two state-of-the-art DGMs, namely, a Variational Autoencoder and a Generative Adversarial Network, to show that our artificial inductive bias fuels superior synthetic data quality, as measured by Jensen-Shannon divergence, achieving relative gains of up to 50\% when using our proposed approach. This methodology has broad applicability in various DGMs and machine learning tasks, particularly in areas like healthcare and finance, where data scarcity is often a critical issue.
With the emergence of the Software 3.0 era, there is a growing trend of compressing and integrating large models into software systems, with significant societal implications. Regrettably, in numerous instances, model compression techniques impact the fairness performance of these models and thus the ethical behavior of DNN-powered software. One of the most notable example is the Lottery Ticket Hypothesis (LTH), a prevailing model pruning approach. This paper demonstrates that fairness issue of LTHbased pruning arises from both its subnetwork selection and training procedures, highlighting the inadequacy of existing remedies. To address this, we propose a novel pruning framework, Ballot, which employs a novel conflict-detection-based subnetwork selection to find accurate and fair subnetworks, coupled with a refined training process to attain a high-performance model, thereby improving the fairness of DNN-powered software. By means of this procedure, Ballot improves the fairness of pruning by 38.00%, 33.91%, 17.96%, and 35.82% compared to state-of-the-art baselines, namely Magnitude Pruning, Standard LTH, SafeCompress, and FairScratch respectively, based on our evaluation of five popular datasets and three widely used models. Our code is available at //anonymous.4open.science/r/Ballot-506E.
In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. We view the expression information as the combination of the shared information (expression similarities) across different expressions and the unique information (expression-specific variations) for each expression. More specifically, FDRL mainly consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN). In particular, FDN first decomposes the basic features extracted from a backbone network into a set of facial action-aware latent features to model expression similarities. Then, FRN captures the intra-feature and inter-feature relationships for latent features to characterize expression-specific variations, and reconstructs the expression feature. To this end, two modules including an intra-feature relation modeling module and an inter-feature relation modeling module are developed in FRN. Experimental results on both the in-the-lab databases (including CK+, MMI, and Oulu-CASIA) and the in-the-wild databases (including RAF-DB and SFEW) show that the proposed FDRL method consistently achieves higher recognition accuracy than several state-of-the-art methods. This clearly highlights the benefit of feature decomposition and reconstruction for classifying expressions.
Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.