In this paper, we introduce a novel swarm application, swarm synergy, in which the robots in a swarm aim to form communities. Each robot makes independent decisions without any communication capability (a silent agent). The proposed algorithm relies only on parameters local to individual robots. We study scenarios in which the silent robots form communities without preset conditions on the number of communities, the community size, the goal location of each community, or the specific members of a community. Our approach allows silent robots to achieve this self-organized swarm behavior using only sensory inputs from the environment. The algorithm facilitates the formation of multiple swarm communities at arbitrary locations with unspecified goal locations. We further infer the behavior of swarm synergy to ensure the anonymity/untraceability of both robots and communities. The robots form a community by sensing their neighbors, creating synergy in a bounded environment. The time to achieve synergy depends on the environment boundary and the onboard sensor's field of view. Compared to state-of-the-art methods with similar objectives, the proposed communication-free swarm synergy achieves a comparable time to synergize while adding untraceability.
In this paper, we present an approach to the automated solving of triangle ruler-and-compass construction problems using finite-domain constraint solvers. The constraint model is described in the MiniZinc modeling language and is based on automated planning. The main benefit of using general constraint solvers for this purpose, instead of developing dedicated tools, is that we can rely on the efficient search already implemented within the solver, which lets us focus on the geometric aspects of the problem. We can also use the solver's built-in optimization capabilities to search for the shortest possible constructions. We evaluate our approach on 74 solvable problems from Wernick's list and compare it to the dedicated triangle construction solver ArgoTriCS. The results show that our approach is comparable to dedicated tools while requiring much less effort to implement. Moreover, our model often finds shorter constructions, thanks to the optimization capabilities offered by constraint solvers.
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at //github.com/LiheYoung/Depth-Anything.
In this paper, we study lossless analog compression for i.i.d. nonsingular signals via a polarization-based framework. We prove that for a nonsingular source, the error probability of maximum a posteriori (MAP) estimation polarizes under the Hadamard transform, which extends the polarization phenomenon to the analog domain. Building on this insight, we propose partial Hadamard compression and develop the corresponding analog successive cancellation (SC) decoder. The proposed scheme consists of deterministic measurement matrices and a non-iterative reconstruction algorithm, providing benefits in both space and computational complexity. Using the polarization of the error probability, we prove that our approach achieves the information-theoretic limit for lossless analog compression developed by Wu and Verdú.
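As a rough illustration of the measurement step only, the sketch below applies an unnormalized fast Walsh-Hadamard transform (natural/Sylvester ordering) and keeps a subset of the transformed coordinates. The function names and the index set `kept` are hypothetical; the abstract does not say how rows are selected (presumably via the polarization analysis), and the analog SC decoder is not modeled here.

```python
def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (Sylvester ordering)."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y


def partial_hadamard(x, kept):
    """Keep only the transformed coordinates listed in `kept`.

    `kept` is a hypothetical, caller-supplied index set (an assumption).
    """
    y = fwht(x)
    return [y[k] for k in kept]
```

For example, `partial_hadamard(signal, kept=[0, 3])` would return two measurements of a length-4 signal. The transform costs O(n log n) additions and needs no stored matrix, which is in the spirit of the space and complexity benefits the abstract mentions.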
In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arrival at each node, agents observe a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture the decreasing marginal reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$. Lastly, we numerically test our algorithm by comparing it to alternative methods.
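To make the UCB ingredient concrete, here is a minimal sketch of the classical single-agent UCB1 index applied to graph nodes. This is not Multi-G-UCB: the abstract does not specify its confidence term, its path planning over $G$, or the reward weights, so the exploration constant `c` and the function names are illustrative assumptions.

```python
import math


def ucb_index(mean_reward, visits, t, c=2.0):
    """Classical UCB1 index: empirical mean plus an exploration bonus.

    `c` is an assumed exploration constant, not taken from the paper.
    """
    if visits == 0:
        return math.inf  # unvisited nodes are tried first
    return mean_reward + math.sqrt(c * math.log(t) / visits)


def pick_node(means, visits, t):
    """Return the node with the largest index (ties go to the lowest id)."""
    return max(range(len(means)), key=lambda k: ucb_index(means[k], visits[k], t))
```

In the multi-agent setting, an algorithm would additionally have to route agents along graph edges and account for the diminishing weight when several agents sample the same node; the sketch leaves both out.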
Purpose: In this paper, we present a novel approach to the automatic evaluation of open surgery skills using depth cameras. This work is intended to show that depth cameras achieve results similar to those of RGB cameras, which are the common choice for the automatic evaluation of open surgery skills. Moreover, depth cameras offer advantages such as robustness to lighting variations, camera positioning, simplified data compression, and enhanced privacy, making them a promising alternative to RGB cameras. Methods: Expert and novice surgeons performed open suturing on two simulators. We focused on hand and tool detection and on action segmentation in suturing procedures. YOLOv8 was used for tool detection in RGB and depth videos. Furthermore, UVAST and MSTCN++ were used for action segmentation. Our study includes the collection and annotation of a dataset recorded with Azure Kinect. Results: We demonstrated that using depth cameras in object detection and action segmentation achieves results comparable to those of RGB cameras. Furthermore, we analyzed 3D hand path length, revealing significant differences between expert and novice surgeons and emphasizing the potential of depth cameras for capturing surgical skills. We also investigated the influence of camera angles on measurement accuracy, highlighting the advantages of 3D cameras in providing a more accurate representation of hand movements. Conclusion: Our research contributes to advancing the field of surgical skill assessment by leveraging depth cameras for more reliable and privacy-preserving evaluations. The findings suggest that depth cameras can be valuable in assessing surgical skills and provide a foundation for future research in this area.
We present VeriX, a first step towards verified explainability of machine learning models in safety-critical applications. Specifically, our sound and optimal explanations can guarantee prediction invariance against bounded perturbations. We utilise constraint solving techniques together with feature sensitivity ranking to efficiently compute these explanations. We evaluate our approach on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.
In this paper, we introduce a two-level attention schema, Poolingformer, for long document modeling. Its first level uses a smaller sliding window pattern to aggregate information from neighbors. Its second level employs a larger window to increase receptive fields with pooling attention to reduce both computational cost and memory consumption. We first evaluate Poolingformer on two long sequence QA tasks: the monolingual NQ and the multilingual TyDi QA. Experimental results show that Poolingformer sits atop three official leaderboards measured by F1, outperforming previous state-of-the-art models by 1.9 points (79.8 vs. 77.9) on NQ long answer, 1.9 points (79.5 vs. 77.6) on TyDi QA passage answer, and 1.6 points (67.6 vs. 66.0) on TyDi QA minimal answer. We further evaluate Poolingformer on a long sequence summarization task. Experimental results on the arXiv benchmark continue to demonstrate its superior performance.
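The two-level pattern described above can be sketched on a toy 1-D sequence. In this simplified illustration, scalars stand in for key/value vectors, mean pooling is assumed as the pooling operator, and the window radii `w1`, `w2` and `stride` are hypothetical parameters; the abstract does not fix any of these choices.

```python
def first_level_neighbors(i, n, w1):
    """Indices a token at position i attends to in the small sliding window."""
    return list(range(max(0, i - w1), min(n, i + w1 + 1)))


def second_level_pooled(values, i, w2, stride):
    """Mean-pool the larger window around i into groups of `stride` items.

    The token then attends to the pooled summaries instead of every
    position, so the number of keys drops by roughly a factor of `stride`.
    """
    n = len(values)
    window = values[max(0, i - w2):min(n, i + w2 + 1)]
    pooled = []
    for k in range(0, len(window), stride):
        group = window[k:k + stride]
        pooled.append(sum(group) / len(group))
    return pooled
```

With `w2 = 4` and `stride = 3`, a token in the interior sees 9 positions compressed into 3 pooled keys, which is the source of the reduced compute and memory cost.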
In this paper, we present a comprehensive review of the imbalance problems in object detection. To analyze the problems in a systematic manner, we introduce a problem-based taxonomy. Following this taxonomy, we discuss each problem in depth and present a unifying yet critical perspective on the solutions in the literature. In addition, we identify major open issues regarding the existing imbalance problems as well as imbalance problems that have not been discussed before. Moreover, in order to keep our review up to date, we provide an accompanying webpage which catalogs papers addressing imbalance problems, according to our problem-based taxonomy. Researchers can track newer studies on this webpage available at: //github.com/kemaloksuz/ObjectDetectionImbalance .
Salient object detection is a problem that has been considered in detail, and many solutions have been proposed. In this paper, we argue that work to date has addressed a problem that is relatively ill-posed. Specifically, there is no universal agreement about what constitutes a salient object when multiple observers are queried. This implies that some objects are more likely to be judged salient than others and that a relative rank exists among salient objects. The solution presented in this paper addresses this more general problem, which considers relative rank, and we propose data and metrics suitable for measuring success in a relative object saliency landscape. A novel deep learning solution is proposed based on a hierarchical representation of relative saliency and stage-wise refinement. We also show that the problem of salient object subitizing can be addressed with the same network, and that our approach exceeds the performance of all prior work across all metrics considered (both traditional and newly proposed).
In this paper, we propose a conceptually simple and geometrically interpretable objective function, i.e. additive margin Softmax (AM-Softmax), for deep face verification. In general, the face verification task can be viewed as a metric learning problem, so learning large-margin face features whose intra-class variation is small and inter-class difference is large is of great importance in order to achieve good performance. Recently, Large-margin Softmax and Angular Softmax have been proposed to incorporate the angular margin in a multiplicative manner. In this work, we introduce a novel additive angular margin for the Softmax loss, which is intuitively appealing and more interpretable than the existing works. We also emphasize and discuss the importance of feature normalization in the paper. Most importantly, our experiments on LFW BLUFR and MegaFace show that our additive margin softmax loss consistently performs better than the current state-of-the-art methods using the same network architecture and training dataset. Our code has also been made available at //github.com/happynear/AMSoftmax
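The additive margin can be illustrated with a minimal single-sample sketch. It assumes the commonly cited form of the loss, in which features and class weights are L2-normalized, the target-class cosine is reduced by a margin `m`, and all logits are scaled by `s`. The default values of `s` and `m` and all function names are illustrative assumptions, not taken from the abstract.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def am_softmax_loss(feature, class_weights, label, s=30.0, m=0.35):
    """Cross-entropy on scaled cosines, with margin m subtracted
    from the target class only (s and m are assumed defaults)."""
    f = l2_normalize(feature)
    cosines = [
        sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights
    ]
    logits = [
        s * (c - m) if j == label else s * c for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]
```

Setting `m=0` recovers a plain softmax loss on normalized features, so the margin's effect can be seen by comparing the two values for the same input: a positive `m` always increases the loss, which pushes the target cosine higher during training.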