Satellite imagery has emerged as an important tool to analyse demographic, health, and development indicators. While various deep learning models have been built for these tasks, each is specific to a particular problem, with few standard benchmarks available. We propose a new dataset pairing satellite imagery and high-quality survey data on child poverty to benchmark satellite feature representations. Our dataset consists of 33,608 images, each 10 km $\times$ 10 km, from 19 countries in Eastern and Southern Africa in the time period 1997-2022. As defined by UNICEF, multidimensional child poverty covers six dimensions and it can be calculated from the face-to-face Demographic and Health Surveys (DHS) Program . As part of the benchmark, we test spatial as well as temporal generalization, by testing on unseen locations, and on data after the training years. Using our dataset we benchmark multiple models, from low-level satellite imagery models such as MOSAIKS , to deep learning foundation models, which include both generic vision models such as Self-Distillation with no Labels (DINOv2) models and specific satellite imagery models such as SatMAE. We provide open source code for building the satellite dataset, obtaining ground truth data from DHS and running various models assessed in our work.
ReScript introduces a strongly typed language that targets JavaScript, as an alternative to gradually typed languages, such as TypeScript. In this paper, we present a type system for data-flow analysis for a subset of the ReScript language, more specific for a lambda-calculus with mutability and pattern matching. The type system is a local analysis that collects information about what variables are used and alias information.
A partially identified model, where the parameters can not be uniquely identified, often arises during statistical analysis. While researchers frequently use Bayesian inference to analyze the models, when Bayesian inference with an off-the-shelf MCMC sampling algorithm is applied to a partially identified model, the computational performance can be poor. It is found that using importance sampling with transparent reparameterization (TP) is one remedy. This method is preferable since the model is known to be rendered as identified with respect to the new parameterization, and at the same time, it may allow faster, i.i.d. Monte Carlo sampling by using conjugate convenience priors. In this paper, we explain the importance sampling method with the TP and a pseudo-TP. We introduce the pseudo-TP, an alternative to TP, since finding a TP is sometimes difficult. Then, we test the methods' performance in some scenarios and compare it to the performance of the off-the-shelf MCMC method - Gibbs sampling - applied in the original parameterization. While the importance sampling with TP (ISTP) shows generally better results than off-the-shelf MCMC methods, as seen in the compute time and trace plots, it is also seen that finding a TP which is necessary for the method may not be easy. On the other hand, the pseudo-TP method shows a mixed result and room for improvement since it relies on an approximation, which may not be adequate for a given model and dataset.
Existing methods for reconstructing objects and humans from a monocular image suffer from severe mesh collisions and performance limitations for interacting occluding objects. This paper introduces a method to obtain a globally consistent 3D reconstruction of interacting objects and people from a single image. Our contributions include: 1) an optimization framework, featuring a collision loss, tailored to handle human-object and human-human interactions, ensuring spatially coherent scene reconstruction; and 2) a novel technique to robustly estimate 6 degrees of freedom (DOF) poses, specifically for heavily occluded objects, exploiting image inpainting. Notably, our proposed method operates effectively on images from real-world scenarios, without necessitating scene or object-level 3D supervision. Extensive qualitative and quantitative evaluation against existing methods demonstrates a significant reduction in collisions in the final reconstructions of scenes with multiple interacting humans and objects and a more coherent scene reconstruction.
In view of the advantages of simplicity and effectiveness of the Kaczmarz method, which was originally employed to solve the large-scale system of linear equations $Ax=b$, we study the greedy randomized block Kaczmarz method (ME-GRBK) and its relaxation and deterministic versions to solve the matrix equation $AXB=C$, which is commonly encountered in the applications of engineering sciences. It is demonstrated that our algorithms converge to the unique least-norm solution of the matrix equation when it is consistent and their convergence rate is faster than that of the randomized block Kaczmarz method (ME-RBK). Moreover, the block Kaczmarz method (ME-BK) for solving the matrix equation $AXB=C$ is investigated and it is found that the ME-BK method converges to the solution $A^{+}CB^{+}+X^{0}-A^{+}AX^{0}BB^{+}$ when it is consistent. The numerical tests verify the theoretical results and the methods presented in this paper are applied to the color image restoration problem to obtain satisfactory restored images.
Detecting falls among the elderly and alerting their community responders can save countless lives. We design and develop a low-cost mobile robot that periodically searches the house for the person being monitored and sends an email to a set of designated responders if a fall is detected. In this project, we make three novel design decisions and contributions. First, our custom-designed low-cost robot has advanced features like omnidirectional wheels, the ability to run deep learning models, and autonomous wireless charging. Second, we improve the accuracy of fall detection for the YOLOv8-Pose-nano object detection network by 6% and YOLOv8-Pose-large by 12%. We do so by transforming the images captured from the robot viewpoint (camera height 0.15m from the ground) to a typical human viewpoint (1.5m above the ground) using a principally computed Homography matrix. This improves network accuracy because the training dataset MS-COCO on which YOLOv8-Pose is trained is captured from a human-height viewpoint. Lastly, we improve the robot controller by learning a model that predicts the robot velocity from the input signal to the motor controller.
Dynamic data visualizations can convey large amounts of information over time, such as using motion to depict changes in data values for multiple entities. Such dynamic displays put a demand on our visual processing capacities, yet our perception of motion is limited. Several techniques have been shown to improve the processing of dynamic displays. Staging the animation to sequentially show steps in a transition and tracing object movement by displaying trajectory histories can improve processing by reducing the cognitive load. In this paper, We examine the effectiveness of staging and tracing in dynamic displays. We showed participants animated line charts depicting the movements of lines and asked them to identify the line with the highest mean and variance. We manipulated the animation to display the lines with or without staging, tracing and history, and compared the results to a static chart as a control. Results showed that tracing and staging are preferred by participants, and improve their performance in mean and variance tasks respectively. They also preferred display time 3 times shorter when staging is used. Also, encoding animation speed with mean and variance in congruent tasks is associated with higher accuracy. These findings help inform real-world best practices for building dynamic displays. The supplementary materials can be found at //osf.io/8c95v/
Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis which currently hinder the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax -- a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnosis. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising of images, instructions, as well as disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open and close-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.