Human landing, exploration and settlement on Mars will require local compute resources at the Mars edge. Landing such resources on Mars is an expensive endeavor. Instead, in this paper we lay out how concepts from low-Earth orbit edge computing may be applied to Mars edge computing. This could lower launching costs of compute resources for Mars while also providing Mars-wide networking and compute coverage. We propose a possible Mars compute constellation, discuss applications, analyze feasibility, and raise research questions for future work.
Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences in bfloat16 mixed-precision training, which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. Our code is available at //github.com/tml-epfl/why-weight-decay.
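For reference, a single SGD step with weight decay in its common L2-penalty form can be sketched as follows. This is an illustrative sketch only, not code from the authors' repository; the function name `sgd_step_with_weight_decay` and the parameter names `lr` and `weight_decay` are assumptions chosen for the example.

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=0.01):
    """Return updated weights after one SGD step with L2-style weight decay.

    Each weight moves against its loss gradient and is additionally
    pulled toward zero in proportion to its own magnitude.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# Hypothetical toy values: with zero loss gradient, only the decay term acts,
# so every weight shrinks by the factor (1 - lr * weight_decay).
weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]
shrunk = sgd_step_with_weight_decay(weights, zero_grads)
```

With a zero gradient the update reduces to a pure multiplicative shrinkage, which isolates the effect that weight decay adds on top of the ordinary gradient step.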
Data protection regulations, such as GDPR and CCPA, require websites and embedded third parties, especially advertisers, to seek user consent before they can collect and process user data. Only when users opt in can these entities collect, process, and share user data. Websites typically incorporate Consent Management Platforms (CMPs), such as OneTrust and CookieBot, to solicit and convey user consent to the embedded advertisers, with the expectation that the consent will be respected. However, neither the websites nor the regulators currently have any mechanism to audit advertisers' compliance with user consent, i.e., to determine whether advertisers indeed refrain from collecting, processing, and sharing user data when the user opts out. In this paper, we propose an auditing framework that leverages advertisers' bidding behavior to empirically assess violations of data protection regulations. Using our framework, we conduct a measurement study to evaluate four of the most widely deployed CMPs, i.e., Didomi, Quantcast, OneTrust, and CookieBot, as well as advertiser-offered opt-out controls, i.e., National Advertising Initiative's opt-out, under GDPR and CCPA. Our results indicate that in many cases user data is unfortunately still being collected, processed, and shared even when users opt out. We also find that some CMPs are better than others at conveying user consent and that several ad platforms ignore user consent. Our results also indicate that advertiser-offered opt-out controls are equally ineffective at protecting user privacy.
Over-the-Air (OtA) Federated Learning (FL) refers to an FL system where multiple agents apply OtA computation for transmitting model updates to a common edge server. Two important features of OtA computation, namely linear processing and signal-level superposition, motivate the use of linear compression with compressed sensing (CS) methods to reduce the number of data samples transmitted over the channel. Previous works on applying CS methods in OtA FL have primarily assumed that the original model update vectors are sparse, or that they have been sparsified before compression. However, it is unclear whether linear compression with CS-based reconstruction is more effective than directly sending the non-zero elements of the sparsified update vectors under the same total power constraint. In this study, we examine and compare several communication designs with or without sparsification. Our findings demonstrate that sparsification before compression is not necessary. Alternatively, sparsification without linear compression can also achieve better performance than the commonly considered setup that combines both.
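The two ingredients discussed in this abstract, sparsification and linear compression, can be illustrated with a small sketch. It assumes top-k magnitude sparsification, a random ±1 measurement matrix, and a noise-free channel that simply adds the transmitted signals; none of these choices is stated in the abstract, and the names `top_k_sparsify` and `linear_compress` are hypothetical. CS-based reconstruction at the server is not shown.

```python
import random


def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of `update` and zero the rest."""
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(update)]


def linear_compress(matrix, vector):
    """Project `vector` onto fewer measurements: y = A x."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


rng = random.Random(0)
dim, measurements = 8, 4
# Assumed measurement matrix with random +/-1 entries, shared by all agents.
A = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(measurements)]

agent_updates = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
sparse_updates = [top_k_sparsify(u, k=2) for u in agent_updates]

# Signal-level superposition: the (idealized) channel sums the compressed signals.
received = [sum(vals) for vals in zip(*(linear_compress(A, u) for u in sparse_updates))]

# Because compression is linear, this equals compressing the aggregated update.
aggregate = [sum(vals) for vals in zip(*sparse_updates)]
expected = linear_compress(A, aggregate)
```

The equality of `received` and `expected` (up to floating-point rounding) is what makes linear compression compatible with OtA aggregation: the server obtains measurements of the summed update without seeing any individual update.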
Metaverse as-a-Service (MaaS) enables Metaverse tenants to execute their APPlications (MetaAPP) by allocating Metaverse resources in the form of Metaverse service functions (MSF). Usually, each MSF is deployed in a virtual machine (VM) for better resiliency and security. However, these MSFs, along with the VMs and virtual machine monitors (VMM) running them, may encounter software aging after prolonged continuous operation. This aging decreases MetaAPP dependability, namely, the dependability of the MSF chain (MSFC) consisting of the MSFs allocated to the MetaAPP. This paper investigates the impact of both software aging and rejuvenation techniques on MetaAPP dependability in scenarios where both the active components (MSF, VM, and VMM) and their backup components are subject to software aging. We develop a hierarchical model to capture the behaviors of aging, failure, and recovery by applying a semi-Markov process and a reliability block diagram. Numerical analysis and simulation experiments are conducted to evaluate the approximation accuracy of the proposed model and dependability metrics. We then identify the key parameters for improving MetaAPP/MSFC dependability through sensitivity analysis. We also investigate the influence of various parameters on MetaAPP/MSFC dependability.
Recently there have been many algorithms proposed for the classification of very high resolution whole slide images (WSIs). These new algorithms are mostly focused on finding novel ways to combine the information from small local patches extracted from the slide, with an emphasis on effectively aggregating more global information for the final predictor. In this paper we thoroughly explore different key design choices for WSI classification algorithms to investigate what matters most for achieving high accuracy. Surprisingly, we found that capturing global context information does not necessarily mean better performance. A model that captures the most global information consistently performs worse than a model that captures less global information. In addition, a very simple multi-instance learning method that captures no global information performs almost as well as models that capture a lot of global information. These results suggest that the most important features for effective WSI classification are captured at the local small patch level, where cell and tissue micro-environment detail is most pronounced. Another surprising finding was that unsupervised pre-training on a larger set of 33 cancers gives significantly worse performance compared to pre-training on a smaller dataset of 7 cancers (including the target cancer). We posit that pre-training on a smaller, more focused dataset allows the feature extractor to make better use of the limited feature space to better discriminate between subtle differences in the input patch.
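To make the idea of a multi-instance learning method that captures no global information concrete, the sketch below scores every patch independently and pools the scores into a slide-level prediction. It is a generic illustration, not the method evaluated in the paper: the logistic patch scorer, the pooling options, and the names `patch_score` and `slide_score` are all assumptions.

```python
import math


def patch_score(features, weights, bias=0.0):
    """Logistic score for one patch, computed from that patch's features alone."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def slide_score(patches, weights, bias=0.0, pooling="max"):
    """Aggregate independent patch scores into a single slide-level score."""
    scores = [patch_score(p, weights, bias) for p in patches]
    if pooling == "max":
        return max(scores)
    if pooling == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown pooling: {pooling}")


# Hypothetical two-dimensional patch features for a three-patch slide.
patches = [[0.0, 0.0], [2.0, 1.0], [-1.0, 0.5]]
weights = [1.0, 1.0]
prediction = slide_score(patches, weights, pooling="max")
```

Because each patch is scored in isolation and the pooling step ignores patch order and position, the slide-level prediction depends only on local patch content.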
Real-world data tends to be heavily imbalanced, which severely skews data-driven deep neural networks and makes Long-Tailed Recognition (LTR) a massively challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrained weights of ViTs always lead to unfair comparisons. In this paper, we systematically investigate the performance of ViTs in LTR and propose LiVT to train ViTs from scratch with LT data only. Observing that ViTs suffer more severely from LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised approaches. In addition, Binary Cross Entropy (BCE) loss, which performs conspicuously well with ViTs, encounters difficulties in LTR. We further propose the balanced BCE (Bal-BCE) to ameliorate it with strong theoretical grounding. Specifically, we derive the unbiased extension of Sigmoid and compensate extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs without any additional data and significantly outperforms comparable state-of-the-art methods, e.g., our ViT-B achieves 81.0% Top-1 accuracy on iNaturalist 2018 without bells and whistles. Code is available at //github.com/XuZhengzhuo/LiVT.
Feature attribution methods are popular in interpretable machine learning. These methods compute the attribution of each input feature to represent its importance, but there is no consensus on the definition of "attribution", leading to many competing methods with little systematic evaluation, complicated in particular by the lack of ground truth attribution. To address this, we propose a dataset modification procedure to induce such ground truth. Using this procedure, we evaluate three common methods: saliency maps, rationales, and attentions. We identify several deficiencies and add new perspectives to the growing body of evidence questioning the correctness and reliability of these methods when applied to datasets in the wild. We further discuss possible avenues for remedy and recommend that new attribution methods be tested against ground truth before deployment. The code is available at //github.com/YilunZhou/feature-attribution-evaluation.
Non-convex optimization is ubiquitous in modern machine learning. Researchers devise non-convex objective functions and optimize them using off-the-shelf optimizers such as stochastic gradient descent and its variants, which leverage the local geometry and update iteratively. Even though optimizing non-convex functions is NP-hard in the worst case, the optimization quality in practice is often not an issue: optimizers are largely believed to find approximate global minima. Researchers hypothesize a unified explanation for this intriguing phenomenon: most of the local minima of the practically used objectives are approximately global minima. We rigorously formalize this hypothesis for concrete instances of machine learning problems.
Compared with the cheap addition operation, multiplication has much higher computational complexity. The widely used convolutions in deep neural networks are in fact cross-correlations that measure the similarity between the input feature and the convolution filters, which involves massive multiplications between float values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the $\ell_1$-norm distance between filters and input feature as the output response. The influence of this new similarity measure on the optimization of neural networks has been thoroughly analyzed. To achieve better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolution layers.
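The $\ell_1$-based output response described in this abstract can be sketched in one dimension as follows. The sketch assumes the response is the negative $\ell_1$ distance, so that a larger value means a closer match, and uses stride 1 with no padding; these details, and the names `adder_response` and `adder_conv1d`, are illustrative assumptions rather than the authors' implementation.

```python
def adder_response(patch, filt):
    """Negative l1 distance between an input patch and a filter.

    Only subtractions, absolute values, and additions are used;
    no multiplication between patch and filter values occurs.
    """
    return -sum(abs(x - f) for x, f in zip(patch, filt))


def adder_conv1d(signal, filt):
    """Slide `filt` over `signal` (stride 1, no padding) using the adder response."""
    width = len(filt)
    return [
        adder_response(signal[i:i + width], filt)
        for i in range(len(signal) - width + 1)
    ]


# Hypothetical toy input: the filter matches the middle window exactly.
signal = [1.0, 2.0, 3.0, 2.0]
filt = [2.0, 3.0]
outputs = adder_conv1d(signal, filt)
```

In this toy case the response peaks at zero for the window that equals the filter and is negative elsewhere, which mirrors how cross-correlation peaks where the input resembles the filter.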
In recent years, DBpedia, Freebase, OpenCyc, Wikidata, and YAGO have been published as noteworthy large, cross-domain, and freely available knowledge graphs. Although extensively in use, these knowledge graphs are hard to compare against each other in a given setting. Thus, it is a challenge for researchers and developers to pick the best knowledge graph for their individual needs. In our recent survey, we devised and applied data quality criteria to the above-mentioned knowledge graphs. Furthermore, we proposed a framework for finding the most suitable knowledge graph for a given setting. With this paper, we intend to ease access to our in-depth survey by presenting simplified rules that map individual data quality requirements to specific knowledge graphs. However, this paper does not intend to replace our previously introduced decision-support framework. For an informed decision on which knowledge graph is best for a given setting, we still refer readers to our in-depth survey.