Ultrasound imaging serves as a pivotal tool for diagnosing cervical lymph node lesions. However, the diagnosis of these images largely hinges on the expertise of medical practitioners, rendering the process susceptible to misdiagnosis. Although rapidly developing deep learning has substantially improved the diagnosis of diverse ultrasound images, there remains a conspicuous research gap concerning cervical lymph nodes. The objective of our work is to accurately diagnose cervical lymph node lesions by leveraging a deep learning model. To this end, we first collected 3392 images containing normal lymph nodes, benign lymph node lesions, malignant primary lymph node lesions, and malignant metastatic lymph node lesions. Given that ultrasound images are generated by the reflection and scattering of sound waves across varied bodily tissues, we proposed the Conv-FFT Block, which integrates convolutional operations with the fast Fourier transform to model the images more effectively. Building upon this foundation, we designed a novel architecture, named US-SFNet. This architecture not only discerns variances in ultrasound images in the spatial domain but also adeptly captures microstructural alterations across various lesions in the frequency domain. To ascertain the potential of US-SFNet, we benchmarked it against 12 popular architectures through five-fold cross-validation. The results show that US-SFNet achieves state-of-the-art performance, reaching 92.89% accuracy, 90.46% precision, 89.95% sensitivity, and 97.49% specificity.
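As a concrete illustration, the following is a minimal sketch of a convolution-plus-FFT block of the kind described above, assuming PyTorch; the module name, channel handling, and the residual fusion of the spatial and frequency branches are illustrative assumptions rather than the authors' exact design.

```python
# Minimal sketch of a Conv-FFT style block (assumed design, for illustration).
import torch
import torch.nn as nn


class ConvFFTBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: ordinary convolution.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Frequency branch: 1x1 convolution over the real/imaginary parts of the
        # 2D FFT, so the block can reweight individual frequency components.
        self.freq = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.spatial(x)                                # spatial-domain features
        f = torch.fft.rfft2(x, norm="ortho")               # (B, C, H, W//2+1), complex
        f = torch.cat([f.real, f.imag], dim=1)             # (B, 2C, H, W//2+1)
        f = self.freq(f)
        real, imag = torch.chunk(f, 2, dim=1)
        f = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
        return x + s + f                                   # fuse both domains


if __name__ == "__main__":
    block = ConvFFTBlock(channels=32)
    print(block(torch.randn(2, 32, 224, 224)).shape)       # torch.Size([2, 32, 224, 224])
```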
Tongue segmentation serves as the primary step in automated TCM tongue diagnosis and plays a significant role in the diagnostic results. Currently, numerous deep learning based methods have achieved promising results. However, when confronted with tongue images that differ from the training set or possess challenging backgrounds, these methods demonstrate limited performance. To address this issue, this paper proposes a universal tongue segmentation model named TongueSAM, based on SAM (Segment Anything Model). SAM is a large-scale pretrained interactive segmentation model known for its powerful zero-shot generalization capability. Applying SAM to tongue segmentation leverages the prior knowledge it has learned from natural images, enabling zero-shot segmentation of various types of tongue images. In this study, a Prompt Generator based on object detection is integrated into SAM to enable an end-to-end automated tongue segmentation method. Experiments demonstrate that TongueSAM achieves exceptional performance across various tongue segmentation datasets, particularly under zero-shot conditions. Even when dealing with tongue images that have challenging backgrounds, TongueSAM achieves a mIoU of 95.23\% under zero-shot conditions, surpassing other segmentation methods. As far as we know, this is the first application of a large-scale pretrained model to tongue segmentation. The project and pretrained model will be made public when the paper is accepted.
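To make the pipeline concrete, here is a sketch of how a detection-based Prompt Generator could feed SAM a box prompt; the SamPredictor calls follow the public segment-anything package, while detect_tongue_box is a hypothetical placeholder for the paper's detector, not its actual implementation.

```python
# Sketch of a detector-prompted SAM pipeline (wiring is an assumption).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor


def detect_tongue_box(image: np.ndarray) -> np.ndarray:
    """Hypothetical Prompt Generator returning [x0, y0, x1, y1] for the tongue."""
    h, w = image.shape[:2]
    return np.array([w * 0.25, h * 0.4, w * 0.75, h * 0.95])  # placeholder box


def segment_tongue(image: np.ndarray, checkpoint: str) -> np.ndarray:
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)                      # RGB uint8, HxWx3
    box = detect_tongue_box(image)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]                                 # binary tongue mask
```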
Medical image segmentation aims to delineate the anatomical or pathological structures of interest and plays a crucial role in clinical diagnosis. A substantial amount of high-quality annotated data is essential for constructing high-precision deep segmentation models. However, medical annotation is highly cumbersome and time-consuming, especially for medical videos or 3D volumes, due to the huge labeling space and poor inter-frame consistency. Recently, a fundamental task named Moving Object Segmentation (MOS) has made significant advancements in natural images. Its objective is to delineate moving objects from the background within image sequences, requiring only minimal annotations. In this paper, we propose the first foundation model, named iMOS, for MOS in medical images. Extensive experiments on a large multi-modal medical dataset validate the effectiveness of the proposed iMOS. Specifically, with only a small number of images annotated in the sequence, iMOS can achieve satisfactory tracking and segmentation of moving objects throughout the entire sequence in both directions. We hope that the proposed iMOS can help accelerate the annotation speed of experts and boost the development of medical foundation models.
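A schematic of the sparse-annotation, bi-directional workflow described above might look as follows; propagate_mask is a hypothetical stand-in for the iMOS propagation model and is not part of the paper's released code.

```python
# Schematic of bi-directional mask propagation from a few annotated frames.
from typing import Dict, List
import numpy as np


def propagate_mask(prev_mask: np.ndarray, prev_frame: np.ndarray,
                   next_frame: np.ndarray) -> np.ndarray:
    """Hypothetical one-step propagation (e.g. a memory-based tracker)."""
    return prev_mask  # placeholder


def segment_sequence(frames: List[np.ndarray],
                     annotations: Dict[int, np.ndarray]) -> Dict[int, np.ndarray]:
    masks = dict(annotations)                    # only a few frames are annotated
    # Forward pass: push annotations toward later frames.
    for t in range(1, len(frames)):
        if t not in masks and (t - 1) in masks:
            masks[t] = propagate_mask(masks[t - 1], frames[t - 1], frames[t])
    # Backward pass: push annotations toward earlier frames.
    for t in range(len(frames) - 2, -1, -1):
        if t not in masks and (t + 1) in masks:
            masks[t] = propagate_mask(masks[t + 1], frames[t + 1], frames[t])
    return masks
```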
Deep neural networks have become the method of choice for solving many image classification tasks, largely because they can fit very complex functions defined over raw images. The downside of such powerful learners is the danger of overfitting the training set, leading to poor generalization, which is usually avoided by regularization and "early stopping" of the training. In this paper, we propose a new deep network ensemble classifier that is very effective against overfit. We begin with a theoretical analysis of a regression model, whose prediction - that the variance among classifiers increases when overfit occurs - is then demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method designed to combat overfit, where the prediction is determined by the most consensual prediction throughout the training. On multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. Accordingly, it is a practical and useful tool to overcome overfit.
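One plausible reading of the consensus rule is sketched below as a toy example: the final label for each sample is the one the ensemble members agree on most strongly at any checkpoint during training. The exact selection procedure used by the authors may differ.

```python
# Toy sketch of consensus-based ensemble prediction (illustrative reading only).
import numpy as np


def most_consensual_prediction(preds: np.ndarray, num_classes: int) -> np.ndarray:
    """
    preds: integer array of shape (checkpoints, members, samples) with the class
           predicted by each ensemble member at each training checkpoint.
    Returns one label per sample: the label with the highest agreement reached
    at any point during training.
    """
    n_ckpt, n_members, n_samples = preds.shape
    final = np.empty(n_samples, dtype=np.int64)
    for i in range(n_samples):
        best_count, best_label = -1, 0
        for t in range(n_ckpt):
            counts = np.bincount(preds[t, :, i], minlength=num_classes)
            if counts.max() > best_count:
                best_count, best_label = counts.max(), counts.argmax()
        final[i] = best_label
    return final
```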
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to $3.8\times$, $2.6\times$, and $1.7\times$, respectively.
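Ansor is shipped as TVM's auto_scheduler; a minimal usage sketch following the public TVM tutorials is shown below, where the workload and trial budget are illustrative and details may vary across TVM versions.

```python
# Minimal Ansor (tvm.auto_scheduler) usage sketch for a matrix-multiply workload.
import tvm
from tvm import te, auto_scheduler


@auto_scheduler.register_workload
def matmul(N, M, K):
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]


target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024), target=target)

log_file = "matmul_ansor.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,                 # illustrative budget; real tuning uses more
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option)                      # sampled programs + evolutionary search + cost model
sch, args = task.apply_best(log_file)       # recover the best schedule found
func = tvm.build(sch, args, target)         # compile to a deployable kernel
```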
The forthcoming 6G systems are expected to address a wide range of non-stationary tasks. This poses challenges to traditional medium access control (MAC) protocols that are static and predefined. In response, data-driven MAC protocols have recently emerged, offering the ability to tailor their signaling messages to specific tasks. This article presents a novel categorization of these data-driven MAC protocols into three levels: Level 1 MAC: task-oriented neural protocols constructed using multi-agent deep reinforcement learning (MADRL); Level 2 MAC: neural network-oriented symbolic protocols developed by converting Level 1 MAC outputs into explicit symbols; and Level 3 MAC: language-oriented semantic protocols harnessing large language models (LLMs) and generative models. With this categorization, we aim to explore the opportunities and challenges of each level by delving into their foundational techniques. Drawing from information theory and associated principles as well as selected case studies, this study provides insights into the trajectory of data-driven MAC protocols and sheds light on future research directions.
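As an illustration of the Level 1 to Level 2 transition, one common way to symbolize learned signaling is to cluster the continuous message vectors emitted by a trained MADRL agent into a discrete codebook; the sketch below uses k-means on synthetic messages and is not necessarily the construction adopted in the article.

```python
# Illustrative symbolization of learned signaling via k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
messages = rng.normal(size=(10_000, 8))     # synthetic stand-in for Level 1 message vectors

codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(messages)
symbols = codebook.predict(messages)        # discrete symbol index per message

# The 16 cluster centroids now act as an explicit, inspectable signaling alphabet.
print(symbols[:10], codebook.cluster_centers_.shape)
```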
Recently, learned image compression has achieved remarkable performance. The entropy model, which estimates the distribution of the latent representation, plays a crucial role in boosting rate-distortion performance. However, most entropy models only capture correlations in one dimension, while the latent representation contains channel-wise, local spatial, and global spatial correlations. To tackle this issue, we propose the Multi-Reference Entropy Model (MEM) and its advanced version, MEM$^+$. These models capture the different types of correlations present in the latent representation. Specifically, we first divide the latent representation into slices. When decoding the current slice, we use previously decoded slices as context and employ the attention map of the previously decoded slice to predict global correlations in the current slice. To capture local contexts, we introduce two enhanced checkerboard context capturing techniques that avoid performance degradation. Based on MEM and MEM$^+$, we propose the image compression models MLIC and MLIC$^+$. Extensive experimental evaluations demonstrate that our MLIC and MLIC$^+$ models achieve state-of-the-art performance, reducing BD-rate by $8.05\%$ and $11.39\%$ on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Our code will be available at //github.com/JiangWeibeta/MLIC.
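The two-pass checkerboard idea can be sketched as follows in PyTorch: anchor positions are decoded first without spatial context, and non-anchor positions then condition on the decoded anchors. The layer shapes and context network are illustrative rather than MLIC's exact modules.

```python
# Sketch of two-pass checkerboard context modeling (illustrative shapes).
import torch
import torch.nn as nn


def checkerboard_mask(h: int, w: int) -> torch.Tensor:
    grid = (torch.arange(h).view(-1, 1) + torch.arange(w).view(1, -1)) % 2
    return (grid == 0).float()                 # 1 at anchor positions, 0 elsewhere


class CheckerboardContext(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ctx = nn.Conv2d(channels, 2 * channels, kernel_size=5, padding=2)

    def forward(self, y_hat: torch.Tensor):
        b, c, h, w = y_hat.shape
        anchor = checkerboard_mask(h, w).to(y_hat)
        # Pass 1: anchors see no spatial context (zeros), only side information.
        params_anchor = self.ctx(torch.zeros_like(y_hat))
        # Pass 2: non-anchors condition on the already decoded anchor half.
        params_non_anchor = self.ctx(y_hat * anchor)
        return params_anchor, params_non_anchor
```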
Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and$\textbf{-In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adapting representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages modify only $0.36\%$ of the parameters of the pre-trained representation in total, ensuring that the knowledge from pre-training is maintained to the full extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and recent visual foundation models for motor control. Code is available at //yanjieze.com/H-InDex.
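Stage (iii) can be sketched as follows, assuming PyTorch: the pre-trained encoder is kept frozen except that BatchNorm running statistics continue to drift via an exponential moving average during reinforcement learning. The momentum value and the freezing policy below are assumptions, not the paper's exact configuration.

```python
# Sketch of exponential-moving-average BatchNorm adaptation during RL.
import torch.nn as nn


def prepare_encoder_for_rl(encoder: nn.Module, bn_momentum: float = 0.01) -> nn.Module:
    # Freeze every learned weight of the pre-trained representation.
    for p in encoder.parameters():
        p.requires_grad = False
    # Let only the BatchNorm running mean/var drift, with a slow EMA:
    # running_stat = (1 - momentum) * running_stat + momentum * batch_stat
    for m in encoder.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.momentum = bn_momentum
            m.track_running_stats = True
    encoder.train()  # keep running statistics updating on RL data
    return encoder
```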
Pancreatic diseases are difficult to treat with high doses of radiation because the affected organs often present both periodic and aperiodic deformations. Nevertheless, we expect that these difficulties can be overcome, and treatment results may be improved, with the practical use of a device that can capture 2D slices of organs during irradiation. However, since only a few 2D slices can be taken, the 3D motion needs to be estimated from partially observed information. In this study, we propose a physics-based framework for estimating the 3D motion of organs, regardless of periodicity, from motion information obtained by 2D slices in one or more directions, together with a regression model that estimates the accuracy of the proposed framework in order to select the optimal slice. Using information obtained by slice-to-slice registration and setting the surrounding organs as boundaries, the framework drives the physical models for estimating 3D motion. The R$^2$ score of the proposed regression model was greater than 0.9, and the RMSE was 0.357 mm. The mean errors were 5.11 $\pm$ 1.09 mm using an axial slice and 2.13 $\pm$ 0.598 mm using concurrent axial, sagittal, and coronal slices. Our results suggest that the proposed framework is comparable to volume-to-volume registration and is feasible.
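The slice-selection idea can be sketched with a generic regressor that predicts the framework's estimation error from per-slice features and keeps the slice with the lowest predicted error; the feature set, the synthetic data, and the choice of random forest below are assumptions for illustration only.

```python
# Sketch of regression-based slice selection (synthetic data, assumed features).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))        # per-slice features (e.g. motion amplitude, coverage)
y = rng.uniform(1.0, 8.0, size=200)  # measured 3D estimation error in mm

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:150], y[:150])
pred = reg.predict(X[150:])
print("R2:", r2_score(y[150:], pred),
      "RMSE (mm):", mean_squared_error(y[150:], pred) ** 0.5)

# Slice selection: evaluate candidate slice orientations and keep the best one.
candidates = {"axial": X[0], "sagittal": X[1], "coronal": X[2]}
best = min(candidates, key=lambda k: reg.predict(candidates[k].reshape(1, -1))[0])
print("selected slice:", best)
```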
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy-, computation-, and memory-intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically with regard to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. For the techniques in each category, we analyze their accuracy, advantages, and disadvantages, as well as potential solutions to their problems. We also discuss new evaluation metrics as a guideline for future research.
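As a quick illustration of category (1), the snippet below applies magnitude pruning and post-training dynamic quantization to a toy model using stock PyTorch utilities; the model and sparsity level are placeholders.

```python
# Magnitude pruning + post-training dynamic quantization on a toy model.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Prune 50% of the smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")       # make the pruning permanent

# Dynamic quantization: weights stored as int8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```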
Object tracking is challenging as target objects often undergo drastic appearance changes over time. Recently, adaptive correlation filters have been successfully applied to object tracking. However, tracking algorithms relying on highly adaptive correlation filters are prone to drift due to noisy updates. Moreover, as these algorithms do not maintain long-term memory of target appearance, they cannot recover from tracking failures caused by heavy occlusion or the target moving out of the camera view. In this paper, we propose to learn multiple adaptive correlation filters with both long-term and short-term memory of target appearance for robust object tracking. First, we learn a kernelized correlation filter with an aggressive learning rate for locating target objects precisely. We take into account the appropriate size of surrounding context and the feature representations. Second, we learn a correlation filter over a feature pyramid centered at the estimated target position for predicting scale changes. Third, we learn a complementary correlation filter with a conservative learning rate to maintain long-term memory of target appearance. We use the output responses of this long-term filter to determine if tracking failure occurs. In the case of tracking failures, we apply an incrementally learned detector to recover the target position in a sliding window fashion. Extensive experimental results on large-scale benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods in terms of efficiency, accuracy, and robustness.
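To make the correlation-filter mechanics concrete, a minimal MOSSE-style filter in NumPy is sketched below: it is learned and applied in the Fourier domain and updated online with a learning rate that trades adaptivity against drift. This is a generic illustration, not the paper's exact long-term/short-term filter design.

```python
# Minimal MOSSE-style correlation filter on grayscale patches (illustrative).
import numpy as np


class CorrelationFilter:
    def __init__(self, lr: float = 0.025, lam: float = 1e-3):
        self.lr, self.lam = lr, lam
        self.A = self.B = None  # numerator / denominator of the filter

    def _target(self, shape, sigma: float = 2.0) -> np.ndarray:
        h, w = shape
        ys, xs = np.mgrid[0:h, 0:w]
        g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
        return np.fft.fft2(g)                       # desired Gaussian response

    def update(self, patch: np.ndarray) -> None:
        F = np.fft.fft2(patch)
        G = self._target(patch.shape)
        A_new, B_new = G * np.conj(F), F * np.conj(F) + self.lam
        if self.A is None:
            self.A, self.B = A_new, B_new
        else:  # higher lr = more adaptive, but more prone to drift
            self.A = (1 - self.lr) * self.A + self.lr * A_new
            self.B = (1 - self.lr) * self.B + self.lr * B_new

    def respond(self, patch: np.ndarray) -> np.ndarray:
        H = self.A / self.B
        return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))  # peak = target location
```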