The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of Large Multi-Modal Models (LMMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to imbue an LMM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. Our method comprises the development of a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. We designed an LMM, which has high capabilities on region awareness to address the intricate requirements of image-text alignment. The model undergoes a three-stage training phase, starting with large-scale image-text alignment using a large-scale datasets, followed by instruction tuning, and fine-tuning with a focus on chain-of-thought reasoning. The results demonstrate a stride toward a more robust, accurate, and interpretable LMM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.
As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage the capabilities of these accelerators. This approach saves time and reduces the likelihood of errors that can occur during manual implementation. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom accelerators for linear algebra problems. By leveraging specific compiler optimizations, we can further increase accelerator utilization. In this work we offer two key observations through a MatMul accelerator case study. First, the accelerator's compute core utilization is less than 10%, and second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers. We identify a set of missing host code optimizations to improve the under-utilization and the latency bottleneck. Therefore, we propose three key host-code data-movement-related optimizations, extending AXI4MLIR. The optimizations provide DMA-based data allocation, coalescing of DMA transfers, and pipelining of the accelerator's load, compute, and store stages.
Monocular 3D detection (M3D) aims for precise 3D object localization from a single-view image which usually involves labor-intensive annotation of 3D detection boxes. Weakly supervised M3D has recently been studied to obviate the 3D annotation process by leveraging many existing 2D annotations, but it often requires extra training data such as LiDAR point clouds or multi-view images which greatly degrades its applicability and usability in various applications. We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D with a single-view image exclusively without any 3D annotations or other training data. One key design in SKD-WM3D is a self-knowledge distillation framework, which transforms image features into 3D-like representations by fusing depth information and effectively mitigates the inherent depth ambiguity in monocular scenarios with little computational overhead in inference. In addition, we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy which facilitate knowledge acquisition and knowledge transfer, respectively. Extensive experiments show that SKD-WM3D surpasses the state-of-the-art clearly and is even on par with many fully supervised methods.
We address the problem of active online assortment optimization problem with preference feedback, which is a framework for modeling user choices and subsetwise utility maximization. The framework is useful in various real-world applications including ad placement, online retail, recommender systems, fine-tuning language models, amongst many. The problem, although has been studied in the past, lacks an intuitive and practical solution approach with simultaneously efficient algorithm and optimal regret guarantee. E.g., popularly used assortment selection algorithms often require the presence of a `strong reference' which is always included in the choice sets, further they are also designed to offer the same assortments repeatedly until the reference item gets selected -- all such requirements are quite unrealistic for practical applications. In this paper, we designed efficient algorithms for the problem of regret minimization in assortment selection with \emph{Plackett Luce} (PL) based user choices. We designed a novel concentration guarantee for estimating the score parameters of the PL model using `\emph{Pairwise Rank-Breaking}', which builds the foundation of our proposed algorithms. Moreover, our methods are practical, provably optimal, and devoid of the aforementioned limitations of the existing methods. Empirical evaluations corroborate our findings and outperform the existing baselines.
This study aims to investigate the comprehensive characterization of information content in multimedia (videos), particularly on YouTube. The research presents a multi-method framework for characterizing multimedia content by clustering signals from various modalities, such as audio, video, and text. With a focus on South China Sea videos as a case study, this approach aims to enhance our understanding of online content, especially on YouTube. The dataset includes 160 videos, and our findings offer insights into content themes and patterns within different modalities of a video based on clusters. Text modality analysis revealed topical themes related to geopolitical countries, strategies, and global security, while video and audio modality analysis identified distinct patterns of signals related to diverse sets of videos, including news analysis/reporting, educational content, and interviews. Furthermore, our findings uncover instances of content repurposing within video clusters, which were identified using the barcode technique and audio similarity assessments. These findings indicate potential content amplification techniques. In conclusion, this study uniquely enhances our current understanding of multimedia content information based on modality clustering techniques.
Social media platforms such as Twitter have a fundamental role in facilitating the spread and discussion of ideas online through the concept of retweeting and replying. However, these features also contribute to the spread of mis/disinformation during the vaccine rollout of the COVID-19 pandemic. Using COVID-19 vaccines as a case study, we analyse multiple social network representation derived from three message-based interactions on Twitter (quote retweets, mentions and replies) based upon a set of known anti-vax hashtags and keywords. Each network represents a certain hashtag or keyword which were labelled as "controversial" and "non-controversial" according to a small group of participants. For each network, we extract a combination of global and local network-based metrics which are used as feature vectors for binary classification. Our results suggest that it is possible to detect controversial from non-controversial terms with high accuracy using simple network-based metrics. Furthermore, these results demonstrate the potential of network representations as language-agnostic models for detecting mis/disinformation at scale, irrespective of content and across multiple social media platforms.
In response to growing concerns about user privacy, legislators have introduced new regulations and laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) that force websites to obtain user consent before activating personal data collection, fundamental to providing targeted advertising. The cornerstone of this consent-seeking process involves the use of Privacy Banners, the technical mechanism to collect users' approval for data collection practices. Consent management platforms (CMPs) have emerged as practical solutions to make it easier for website administrators to properly manage consent, allowing them to outsource the complexities of managing user consent and activating advertising features. This paper presents a detailed and longitudinal analysis of the evolution of CMPs spanning nine years. We take a twofold perspective: Firstly, thanks to the HTTP Archive dataset, we provide insights into the growth, market share, and geographical spread of CMPs. Noteworthy observations include the substantial impact of GDPR on the proliferation of CMPs in Europe. Secondly, we analyse millions of user interactions with a medium-sized CMP present in thousands of websites worldwide. We observe how even small changes in the design of Privacy Banners have a critical impact on the user's giving or denying their consent to data collection. For instance, over 60% of users do not consent when offered a simple "one-click reject-all" option. Conversely, when opting out requires more than one click, about 90% of users prefer to simply give their consent. The main objective is in fact to eliminate the annoying privacy banner rather the make an informed decision. Curiously, we observe iOS users exhibit a higher tendency to accept cookies compared to Android users, possibly indicating greater confidence in the privacy offered by Apple devices.
Progressing digitalization and increasing demand and use of software cause rises in energy- and resource consumption from information and communication technologies (ICT). This raises the issue of sustainability in ICT, which increasingly includes the sustainability of the software products themselves and the art of creating sustainable software. To this end, we conducted an analysis to gather and present existing literature on three research questions relating to the production of ecologically sustainable software ("Green Coding") and to provide orientation for stakeholders approaching the subject. We compile the approaches to Green Coding and Green Software Engineering (GSE) that have been published since 2010. Furthermore, we considered ways to integrate the findings into existing industrial processes and higher education curricula to influence future development in an environmentally friendly way.
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.