Radiotherapy (RT) is a key component in the treatment of various cancers, including Acute Lymphocytic Leukemia (ALL) and Acute Myelogenous Leukemia (AML). Precise delineation of organs at risk (OARs) and target areas is essential for effective treatment planning. Intensity Modulated Radiotherapy (IMRT) techniques, such as Total Marrow Irradiation (TMI) and Total Marrow and Lymph node Irradiation (TMLI), provide more precise radiation delivery compared to Total Body Irradiation (TBI). However, these techniques require time-consuming manual segmentation of structures in Computerized Tomography (CT) scans by the Radiation Oncologist (RO). In this paper, we present a deep learning-based auto-contouring method for segmenting Planning Target Volume (PTV) for TMLI treatment using the U-Net architecture. We trained and compared two segmentation models with two different loss functions on a dataset of 100 patients treated with TMLI at the Humanitas Research Hospital between 2011 and 2021. Despite challenges in lymph node areas, the best model achieved an average Dice score of 0.816 for PTV segmentation. Our findings are a preliminary but significant step towards developing a segmentation model that has the potential to save radiation oncologists a considerable amount of time. This could allow for the treatment of more patients, resulting in improved clinical practice efficiency and more reproducible contours.
An important issue in medical image processing is to be able to estimate not only the performances of algorithms but also the precision of the estimation of these performances. Reporting precision typically amounts to reporting standard-error of the mean (SEM) or equivalently confidence intervals. However, this is rarely done in medical image segmentation studies. In this paper, we aim to estimate what is the typical confidence that can be expected in such studies. To that end, we first perform experiments for Dice metric estimation using a standard deep learning model (U-net) and a classical task from the Medical Segmentation Decathlon. We extensively study precision estimation using both Gaussian assumption and bootstrapping (which does not require any assumption on the distribution). We then perform simulations for other test set sizes and performance spreads. Overall, our work shows that small test sets lead to wide confidence intervals (e.g. $\sim$8 points of Dice for 20 samples with $\sigma \simeq 10$).
Diversification of recommendation results is a promising approach for coping with the uncertainty associated with users' information needs. Of particular importance in diversified recommendation is to define and optimize an appropriate diversity objective. In this study, we revisit the most popular diversity objective called intra-list distance (ILD), defined as the average pairwise distance between selected items, and a similar but lesser known objective called dispersion, which is the minimum pairwise distance. Owing to their simplicity and flexibility, ILD and dispersion have been used in a plethora of diversified recommendation research. Nevertheless, we do not actually know what kind of items are preferred by them. We present a critical reexamination of ILD and dispersion from theoretical and experimental perspectives. Our theoretical results reveal that these objectives have potential drawbacks: ILD may select duplicate items that are very close to each other, whereas dispersion may overlook distant item pairs. As a competitor to ILD and dispersion, we design a diversity objective called Gaussian ILD, which can interpolate between ILD and dispersion by tuning the bandwidth parameter. We verify our theoretical results by experimental results using real-world data and confirm the extreme behavior of ILD and dispersion in practice.
The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that there is an inevitable mismatch between their training criterion and evaluation metric. In response to this unfavorable situation, we propose a training and inference strategy that additionally uses enhanced speech as a target by improving the previously proposed noisy-target training (NyTT). Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing 1) the teacher model's estimated speech and noise for enhanced-target training or 2) raw noisy speech and the teacher model's estimated noise for noisy-target training. Experimental results show that our proposed method outperforms several baselines, especially with the teacher/student inference, where predicted clean speech is derived successively through the teacher and final student models.
High-quality data is crucial for the success of machine learning, but labeling large datasets is often a time-consuming and costly process. While semi-supervised learning can help mitigate the need for labeled data, label quality remains an open issue due to ambiguity and disagreement among annotators. Thus, we use proposal-guided annotations as one option which leads to more consistency between annotators. However, proposing a label increases the probability of the annotators deciding in favor of this specific label. This introduces a bias which we can simulate and remove. We propose a new method CleverLabel for Cost-effective LabEling using Validated proposal-guidEd annotations and Repaired LABELs. CleverLabel can reduce labeling costs by up to 30.0%, while achieving a relative improvement in Kullback-Leibler divergence of up to 29.8% compared to the previous state-of-the-art on a multi-domain real-world image classification benchmark. CleverLabel offers a novel solution to the challenge of efficiently labeling large datasets while also improving the label quality.
We study the problem of multi-agent coordination in unpredictable and partially observable environments, that is, environments whose future evolution is unknown a priori and that can only be partially observed. We are motivated by the future of autonomy that involves multiple robots coordinating actions in dynamic, unstructured, and partially observable environments to complete complex tasks such as target tracking, environmental mapping, and area monitoring. Such tasks are often modeled as submodular maximization coordination problems due to the information overlap among the robots. We introduce the first submodular coordination algorithm with bandit feedback and bounded tracking regret -- bandit feedback is the robots' ability to compute in hindsight only the effect of their chosen actions, instead of all the alternative actions that they could have chosen instead, due to the partial observability; and tracking regret is the algorithm's suboptimality with respect to the optimal time-varying actions that fully know the future a priori. The bound gracefully degrades with the environments' capacity to change adversarially, quantifying how often the robots should re-select actions to learn to coordinate as if they fully knew the future a priori. The algorithm generalizes the seminal Sequential Greedy algorithm by Fisher et al. to the bandit setting, by leveraging submodularity and algorithms for the problem of tracking the best action. We validate our algorithm in simulated scenarios of multi-target tracking.
In recent years, Field-Programmable Gate Arrays (FPGA) have evolved rapidly paving the way for a whole new range of computing paradigms. On the other hand, computer applications are evolving. There is a rising demand for a system that is general-purpose and yet has the processing abilities to accommodate current trends in application processing. This work proposes a design and implementation of a tightly-coupled FPGA-based dual-processor platform. We architect a platform that optimizes the utilization of FPGA resources and allows for the investigation of practical implementation issues such as cache design. The performance of the proposed prototype is then evaluated, as different configurations of a uniprocessor and a dual-processor system are studied and compared against each other and against published results for common industry-standard CPU platforms. The proposed implementation utilizes the Nios II 32-bit embedded soft-core processor architecture designed for the Altera Cyclone III family of FPGAs.
Recent advances in logic schemes and fabrication processes have renewed interest in using superconductor electronics for energy-efficient computing and quantum control processors. However, scalable superconducting memory still poses a challenge. To address this issue, we present an alternative to approaches that solely emphasize storage cell miniaturization by exploiting the minimal attenuation and dispersion properties of superconducting passive transmission lines to develop a delay-line memory system. This fully superconducting design operates at speeds between 20 GHz and 100 GHz, with $\pm$24\% and $\pm$13\% bias margins, respectively, and demonstrates data densities in the 10s of Mbit/cm$^2$ with the MIT Lincoln Laboratory SC2 fabrication process. Additionally, the circulating nature of this design allows for minimal control circuitry, eliminates the need for data splitting and merging, and enables inexpensive implementations of sequential-access and content-addressable memories. Further advances in fabrication processes suggest data densities of 100s of Mbit/cm$^2$ and beyond.
The objective of Audio-Visual Segmentation (AVS) is to locate sounding objects within visual scenes by accurately predicting pixelwise segmentation masks. In this paper, we present the following contributions: (i), we propose a scalable and annotation-free pipeline for generating artificial data for the AVS task. We leverage existing image segmentation and audio datasets to draw links between category labels, image-mask pairs, and audio samples, which allows us to easily compose (image, audio, mask) triplets for training AVS models; (ii), we introduce a novel Audio-Aware Transformer (AuTR) architecture that features an audio-aware query-based transformer decoder. This architecture enables the model to search for sounding objects with the guidance of audio signals, resulting in more accurate segmentation; (iii), we present extensive experiments conducted on both synthetic and real datasets, which demonstrate the effectiveness of training AVS models with synthetic data generated by our proposed pipeline. Additionally, our proposed AuTR architecture exhibits superior performance and strong generalization ability on public benchmarks. The project page is //jinxiang-liu.github.io/anno-free-AVS/.
Egocentric temporal action segmentation in videos is a crucial task in computer vision with applications in various fields such as mixed reality, human behavior analysis, and robotics. Although recent research has utilized advanced visual-language frameworks, transformers remain the backbone of action segmentation models. Therefore, it is necessary to improve transformers to enhance the robustness of action segmentation models. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism to adaptively capture hierarchical representations in both local-to-global and global-to-local contexts. Second, we incorporate cross-connections between the encoder and decoder blocks to prevent the loss of local context by the decoder. Additionally, we utilize state-of-the-art visual-language representation learning techniques to extract richer and more compact features for our transformer. Our proposed approach outperforms other state-of-the-art methods on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools datasets, and we validate our introduced components with ablation studies. The source code and supplementary materials are publicly available on //www.sail-nu.com/dxformer.
Graph convolutional network (GCN) has been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that exponentially grows with number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as the following: at each step, it samples a block of nodes that associate with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while being able to achieve comparable test accuracy with previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M data with 2 million nodes and 61 million edges which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs 1961 seconds) and using much less memory (2.2GB vs 11.2GB). Furthermore, for training 4 layer GCN on this data, our algorithm can finish in around 36 minutes while all the existing GCN training algorithms fail to train due to the out-of-memory issue. Furthermore, Cluster-GCN allows us to train much deeper GCN without much time and memory overhead, which leads to improved prediction accuracy---using a 5-layer Cluster-GCN, we achieve state-of-the-art test F1 score 99.36 on the PPI dataset, while the previous best result was 98.71 by [16]. Our codes are publicly available at //github.com/google-research/google-research/tree/master/cluster_gcn.