We present Ball-AR, an augmented reality (AR) game where two players in the same physical space attempt to hit each other with virtual dodgeballs overlaid on the physical world. Researchers have studied AR's potential for fostering co-located interaction and physical activity; however, they have not investigated the impacts of physical activity and physical environment on user experiences and interaction. We created an AR dodgeball game centered around encouraging physical activity and harnessing the physical environment. We then evaluated the game with five dyads to analyze the impacts of these design choices on the quality of gameplay and interaction between players. We found that physical activity and the shared physical space created memorable experiences and interactions among participants, although participants desired a more augmented and immersive experience.
In a two-player zero-sum graph game, the players move a token throughout a graph to produce an infinite play, which determines the winner of the game. \emph{Bidding games} are graph games in which in each turn, an auction (bidding) determines which player moves the token: the players have budgets, and in each turn, both players simultaneously submit bids that do not exceed their available budgets, the higher bidder moves the token, and pays the bid to the lower bidder (called {\em Richman} bidding). We focus on {\em discrete}-bidding games, in which, motivated by practical applications, the granularity of the players' bids is restricted, e.g., bids must be given in cents. A central quantity in bidding games is are {\em threshold budgets}: a necessary and sufficient initial budget for winning the game. Previously, thresholds were shown to exist in parity games, but their structure was only understood for reachability games. Moreover, the previously-known algorithms have a worst-case exponential running time for both reachability and parity objectives, and output strategies that use exponential memory. We describe two algorithms for finding threshold budgets in parity discrete-bidding games. The first is a fixed-point algorithm. It reveals, for the first time, the structure of threshold budgets in parity discrete-bidding games. Based on this structure, we develop a second algorithm that shows that the problem of finding threshold budgets is in \NP and co\NP for both reachability and parity objectives. Moreover, our algorithm constructs strategies that use only linear memory.
Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework, to establish accurate 2D-3D correspondence, we formulate 2D keypoints detection as a reverse diffusion (denoising) process. To facilitate such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.
Multimodal Large Language Models (MLLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) -- a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether MLLMs can perceive details as well as larger components in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject related to the question, declining up to $45.91\%$ with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. To scale up the usefulness of human cropping, we propose ViCrop, a general framework that utilizes automatic visual cropping to enhance zero-shot VQA of MLLMs. We construct five variants of ViCrop leveraging either external localization models or the decision process of the given MLLM itself. Our results show that ViCrop improves MLLMs' zero-shot accuracy across different VQA datasets, for example, enhances BLIP2-T5's performance by $32.23\%$ on the TextVQA test set. To facilitate further investigation of MLLMs' behaviors, our code is publicly released.
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. Most methods can be roughly classified as tracking-by-detection and joint-detection-association paradigms. Although the latter has elicited more attention and demonstrates comparable performance relative than the former, we claim that the tracking-by-detection paradigm is still the optimal solution in terms of tracking accuracy,such as ByteTrack,which achieves 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU.However, under complex perspectives such as vehicle and UAV acceleration, the performance of such a tracker using uniform Kalman filter will be greatly affected, resulting in tracking loss.In this paper, we propose a variable speed Kalman filter algorithm based on environmental feedback and improve the matching process, which can greatly improve the tracking effect in complex variable speed scenes while maintaining high tracking accuracy in relatively static scenes. Eventually, higher MOTA and IDF1 results can be achieved on MOT17 test set than ByteTrack
Bangla typing is mostly performed using English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Spelling correction of a misspelled word requires understanding of word typing pattern as well as the context of the word usage. A specialized BERT model named BSpell has been proposed in this paper targeted towards word for word correction in sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with specialized auxiliary loss. This allows BSpell to specialize in highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme has been proposed for BSpell that combines word level and character level masking. Comparison on two Bangla and one Hindi spelling correction dataset shows the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: //github.com/Hasiburshanto/Bangla-Spell-Checker
As an important component of the detector localization branch, bounding box regression loss plays a significant role in object detection tasks. The existing bounding box regression methods usually consider the geometric relationship between the GT box and the predicted box, and calculate the loss by using the relative position and shape of the bounding boxes, while ignoring the influence of inherent properties such as the shape and scale of the bounding boxes on bounding box regression. In order to make up for the shortcomings of existing research, this article proposes a bounding box regression method that focuses on the shape and scale of the bounding box itself. Firstly, we analyzed the regression characteristics of the bounding boxes and found that the shape and scale factors of the bounding boxes themselves will have an impact on the regression results. Based on the above conclusions, we propose the Shape IoU method, which can calculate the loss by focusing on the shape and scale of the bounding box itself, thereby making the bounding box regression more accurate. Finally, we validated our method through a large number of comparative experiments, which showed that our method can effectively improve detection performance and outperform existing methods, achieving state-of-the-art performance in different detection tasks.Code is available at //github.com/malagoutou/Shape-IoU
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at //voyager.minedojo.org/.
Pre-trained Language Models (PLMs) have achieved great success in various Natural Language Processing (NLP) tasks under the pre-training and fine-tuning paradigm. With large quantities of parameters, PLMs are computation-intensive and resource-hungry. Hence, model pruning has been introduced to compress large-scale PLMs. However, most prior approaches only consider task-specific knowledge towards downstream tasks, but ignore the essential task-agnostic knowledge during pruning, which may cause catastrophic forgetting problem and lead to poor generalization ability. To maintain both task-agnostic and task-specific knowledge in our pruned model, we propose ContrAstive Pruning (CAP) under the paradigm of pre-training and fine-tuning. It is designed as a general framework, compatible with both structured and unstructured pruning. Unified in contrastive learning, CAP enables the pruned model to learn from the pre-trained model for task-agnostic knowledge, and fine-tuned model for task-specific knowledge. Besides, to better retain the performance of the pruned model, the snapshots (i.e., the intermediate models at each pruning iteration) also serve as effective supervisions for pruning. Our extensive experiments show that adopting CAP consistently yields significant improvements, especially in extremely high sparsity scenarios. With only 3% model parameters reserved (i.e., 97% sparsity), CAP successfully achieves 99.2% and 96.3% of the original BERT performance in QQP and MNLI tasks. In addition, our probing experiments demonstrate that the model pruned by CAP tends to achieve better generalization ability.
Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of such perception system especially for the sake of path planning, motion prediction, collision avoidance, etc. Generally, stereo or monocular images with corresponding 3D point clouds are already standard layout for 3D object detection, out of which point clouds are increasingly prevalent with accurate depth information being provided. Despite existing efforts, 3D object detection on point clouds is still in its infancy due to high sparseness and irregularity of point clouds by nature, misalignment view between camera view and LiDAR bird's eye of view for modality synergies, occlusions and scale variations at long distances, etc. Recently, profound progress has been made in 3D object detection, with a large body of literature being investigated to address this vision task. As such, we present a comprehensive review of the latest progress in this field covering all the main topics including sensors, fundamentals, and the recent state-of-the-art detection methods with their pros and cons. Furthermore, we introduce metrics and provide quantitative comparisons on popular public datasets. The avenues for future work are going to be judiciously identified after an in-deep analysis of the surveyed works. Finally, we conclude this paper.
Multi-agent influence diagrams (MAIDs) are a popular form of graphical model that, for certain classes of games, have been shown to offer key complexity and explainability advantages over traditional extensive form game (EFG) representations. In this paper, we extend previous work on MAIDs by introducing the concept of a MAID subgame, as well as subgame perfect and trembling hand perfect equilibrium refinements. We then prove several equivalence results between MAIDs and EFGs. Finally, we describe an open source implementation for reasoning about MAIDs and computing their equilibria.