{mayi_des}
Self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. In this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through identification and abstraction. We develop a taxonomy of 19 self-disclosure categories, and curate a large corpus consisting of 4.8K annotated disclosure spans. We then fine-tune a language model for identification, achieving over 75% in Token F$_1$. We further conduct a HCI user study, with 82\% of participants viewing the model positively, highlighting its real world applicability. Motivated by the user feedback, we introduce the task of self-disclosure abstraction. We experiment with both one-span abstraction and three-span abstraction settings, and explore multiple fine-tuning strategies. Our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation.
In this paper, we raise a new issue on Unidentified Foreground Object (UFO) detection in 3D point clouds, which is a crucial technology in autonomous driving in the wild. UFO detection is challenging in that existing 3D object detectors encounter extremely hard challenges in both 3D localization and Out-of-Distribution (OOD) detection. To tackle these challenges, we suggest a new UFO detection framework including three tasks: evaluation protocol, methodology, and benchmark. The evaluation includes a new approach to measure the performance on our goal, i.e. both localization and OOD detection of UFOs. The methodology includes practical techniques to enhance the performance of our goal. The benchmark is composed of the KITTI Misc benchmark and our additional synthetic benchmark for modeling a more diverse range of UFOs. The proposed framework consistently enhances performance by a large margin across all four baseline detectors: SECOND, PointPillars, PV-RCNN, and PartA2, giving insight for future work on UFO detection in the wild.
The performance of local feature descriptors degrades in the presence of large rotation variations. To address this issue, we present an efficient approach to learning rotation invariant descriptors. Specifically, we propose Rotated Kernel Fusion (RKF) which imposes rotations on the convolution kernel to improve the inherent nature of CNN. Since RKF can be processed by the subsequent re-parameterization, no extra computational costs will be introduced in the inference stage. Moreover, we present Multi-oriented Feature Aggregation (MOFA) which aggregates features extracted from multiple rotated versions of the input image and can provide auxiliary knowledge for the training of RKF by leveraging the distillation strategy. We refer to the distilled RKF model as DRKF. Besides the evaluation on a rotation-augmented version of the public dataset HPatches, we also contribute a new dataset named DiverseBEV which is collected during the drone's flight and consists of bird's eye view images with large viewpoint changes and camera rotations. Extensive experiments show that our method can outperform other state-of-the-art techniques when exposed to large rotation variations.
In this paper we show how tensor networks help in developing explainability of machine learning algorithms. Specifically, we develop an unsupervised clustering algorithm based on Matrix Product States (MPS) and apply it in the context of a real use-case of adversary-generated threat intelligence. Our investigation proves that MPS rival traditional deep learning models such as autoencoders and GANs in terms of performance, while providing much richer model interpretability. Our approach naturally facilitates the extraction of feature-wise probabilities, Von Neumann Entropy, and mutual information, offering a compelling narrative for classification of anomalies and fostering an unprecedented level of transparency and interpretability, something fundamental to understand the rationale behind artificial intelligence decisions.
In this paper, we provide a novel enumeration algorithm for the set of all walks of a given length within a directed graph. Our algorithm has worst-case constant delay between outputting succinct representations of such walks, after a preprocessing step requiring linear time relative to the size of the graph. We apply these results to the problem of enumerating succinct representations of the strings of a given length from a prefix-closed regular language (languages accepted by a finite automaton which has final states only).
In this paper, we study the technical problem of developing conversational agents that can quickly adapt to unseen tasks, learn task-specific communication tactics, and help listeners finish complex, temporally extended tasks. We find that the uncertainty of language learning can be decomposed to an entropy term and a mutual information term, corresponding to the structural and functional aspect of language, respectively. Combined with reinforcement learning, our method automatically requests human samples for training when adapting to new tasks and learns communication protocols that are succinct and helpful for task completion. Human and simulation test results on a referential game and a 3D navigation game prove the effectiveness of the proposed method.
In this paper, we provide the first convergence guarantee for the factorization approach. Specifically, to avoid the scaling ambiguity and to facilitate theoretical analysis, we optimize over the so-called left-orthogonal TT format which enforces orthonormality among most of the factors. To ensure the orthonormal structure, we utilize the Riemannian gradient descent (RGD) for optimizing those factors over the Stiefel manifold. We first delve into the TT factorization problem and establish the local linear convergence of RGD. Notably, the rate of convergence only experiences a linear decline as the tensor order increases. We then study the sensing problem that aims to recover a TT format tensor from linear measurements. Assuming the sensing operator satisfies the restricted isometry property (RIP), we show that with a proper initialization, which could be obtained through spectral initialization, RGD also converges to the ground-truth tensor at a linear rate. Furthermore, we expand our analysis to encompass scenarios involving Gaussian noise in the measurements. We prove that RGD can reliably recover the ground truth at a linear rate, with the recovery error exhibiting only polynomial growth in relation to the tensor order. We conduct various experiments to validate our theoretical findings.
In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance degradation when faced with datasets with varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications, hindering the preservation of SAM's rich prior knowledge. Besides, such task-specific tuning necessitates a complete retraining of the model, which is cost-expensive and unacceptable for deployment in the downstream tasks. In this paper, we reformulate this issue as a length extrapolation problem, where token sequence length varies while maintaining a consistent patch size for images of different sizes. To this end, we propose Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications. Firstly, we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes. Secondly, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously.
In this paper, we study Discretized Neural Networks (DNNs) composed of low-precision weights and activations, which suffer from either infinite or zero gradients due to the non-differentiable discrete function during training. Most training-based DNNs in such scenarios employ the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete values. However, the use of STE introduces the problem of gradient mismatch, arising from perturbations in the approximated gradient. To address this problem, this paper reveals that this mismatch can be interpreted as a metric perturbation in a Riemannian manifold, viewed through the lens of duality theory. Building on information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold for DNNs, providing a background for addressing perturbations. By introducing a partial differential equation on metrics, i.e., the Ricci flow, we establish the dynamical stability and convergence of the LNE metric with the $L^2$-norm perturbation. In contrast to previous perturbation theories with convergence rates in fractional powers, the metric perturbation under the Ricci flow exhibits exponential decay in the LNE manifold. Experimental results across various datasets demonstrate that our method achieves superior and more stable performance for DNNs compared to other representative training-based methods.
In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, which exploits a strategy named GradUally Enriching SyntheSis (GUESS as its abbreviation). The strategy sets up generation objectives by grouping body joints of detailed skeletons in close semantic proximity together and then replacing each of such joint group with a single body-part node. Such an operation recursively abstracts a human pose to coarser and coarser skeletons at multiple granularity levels. With gradually increasing the abstraction level, human motion becomes more and more concise and stable, significantly benefiting the cross-modal motion synthesis task. The whole text-driven human motion synthesis problem is then divided into multiple abstraction levels and solved with a multi-stage generation framework with a cascaded latent diffusion model: an initial generator first generates the coarsest human motion guess from a given text description; then, a series of successive generators gradually enrich the motion details based on the textual description and the previous synthesized results. Notably, we further integrate GUESS with the proposed dynamic multi-condition fusion mechanism to dynamically balance the cooperative effects of the given textual condition and synthesized coarse motion prompt in different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realisticness, and diversity. Code is available at //github.com/Xuehao-Gao/GUESS.
In this paper, we adopt 3D Convolutional Neural Networks to segment volumetric medical images. Although deep neural networks have been proven to be very effective on many 2D vision tasks, it is still challenging to apply them to 3D tasks due to the limited amount of annotated 3D data and limited computational resources. We propose a novel 3D-based coarse-to-fine framework to effectively and efficiently tackle these challenges. The proposed 3D-based framework outperforms the 2D counterpart to a large margin since it can leverage the rich spatial infor- mation along all three axes. We conduct experiments on two datasets which include healthy and pathological pancreases respectively, and achieve the current state-of-the-art in terms of Dice-S{\o}rensen Coefficient (DSC). On the NIH pancreas segmentation dataset, we outperform the previous best by an average of over 2%, and the worst case is improved by 7% to reach almost 70%, which indicates the reliability of our framework in clinical applications.