In the field of explainable Artificial Intelligence (XAI), sequential counterfactual (SCF) examples are often used to alter the decision of a trained classifier by implementing a sequence of modifications to the input instance. Although certain test-time algorithms aim to optimize for each new instance individually, recently Reinforcement Learning (RL) methods have been proposed that seek to learn policies for discovering SCFs, thereby enhancing scalability. As is typical in RL, the formulation of the RL problem, including the specification of state space, actions, and rewards, can often be ambiguous. In this work, we identify shortcomings in existing methods that can result in policies with undesired properties, such as a bias towards specific actions. We propose to use the output probabilities of the classifier to create a more informative reward, to mitigate this effect.
Causal Structure Learning (CSL), amounting to extracting causal relations among the variables in a dataset, is widely perceived as an important step towards robust and transparent models. Constraint-based CSL leverages conditional independence tests to perform causal discovery. We propose Shapley-PC, a novel method to improve constraint-based CSL algorithms by using Shapley values over the possible conditioning sets to decide which variables are responsible for the observed conditional (in)dependences. We prove soundness and asymptotic consistency and demonstrate that it can outperform state-of-the-art constraint-based, search-based and functional causal model-based methods, according to standard metrics in CSL.
Small obstacles on the ground often lead to a fall when caught with commercial prosthetic feet. Despite some recently developed feet can actively control the ankle angle, for instance over slopes, their flat and rigid sole remains a cause of instability on uneven grounds. Soft robotic feet were recently proposed to tackle that issue; however, they lack consistent experimental validation. Therefore, this paper describes the experimental setup realized to test soft and rigid prosthetic feet with lower-limb prosthetic users. It includes a wooden walkway and differently shaped obstacles. It was preliminary validated with an able-bodied subject, the same subject walking on commercial prostheses through modified walking boots, and with a prosthetic user. They performed walking firstly on even ground, and secondly on even ground stepping on one of the obstacles. Results in terms of vertical ground reaction force and knee moments in both the sagittal and frontal planes show how the poor performance of commonly used prostheses is exacerbated in case of obstacles. The prosthetic user, indeed, noticeably relies on the sound leg to compensate for the stiff and unstable interaction of the prosthetic limb with the obstacle. Therefore, since the limitations of non-adaptive prosthetic feet in obstacle-dealing emerge from the experiments, as expected, this study justifies the use of the setup for investigating the performance of soft feet on uneven grounds and obstacle negotiation.
Compared to conventional semantic segmentation with pixel-level supervision, Weakly Supervised Semantic Segmentation (WSSS) with image-level labels poses the challenge that it always focuses on the most discriminative regions, resulting in a disparity between fully supervised conditions. A typical manifestation is the diminished precision on the object boundaries, leading to a deteriorated accuracy of WSSS. To alleviate this issue, we propose to adaptively partition the image content into deterministic regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing. For uncertain cues, we employ an activation-based masking strategy and seek to recover the local information with self-distilled knowledge. We further assume that the unmasked confident regions should be robust enough to preserve the global semantics. Building upon this, we introduce a complementary self-enhancement method that constrains the semantic consistency between these confident regions and an augmented image with the same class labels. Extensive experiments conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed single-stage approach for WSSS not only outperforms state-of-the-art benchmarks remarkably but also surpasses multi-stage methodologies that trade complexity for accuracy. The code can be found at \url{//github.com/Jessie459/feature-self-reinforcement}.
The uniqueness of an optimal solution to a combinatorial optimization problem attracts many fields of researchers' attention because it has a wide range of applications, it is related to important classes in computational complexity, and an instance with only one solution is often critical for algorithm designs in theory. However, as the authors know, there is no major benchmark set consisting of only instances with unique solutions, and no algorithm generating instances with unique solutions is known; a systematic approach to getting a problem instance guaranteed having a unique solution would be helpful. A possible approach is as follows: Given a problem instance, we specify a small part of a solution in advance so that only one optimal solution meets the specification. This paper formulates such a ``pre-assignment'' approach for the vertex cover problem as a typical combinatorial optimization problem and discusses its computational complexity. First, we show that the problem is $\Sigma^P_2$-complete in general, while the problem becomes NP-complete when an input graph is bipartite. We then present an $O(2.1996^n)$-time algorithm for general graphs and an $O(1.9181^n)$-time algorithm for bipartite graphs, where $n$ is the number of vertices. The latter is based on an FPT algorithm with $O^*(3.6791^{\tau})$ time for vertex cover number $\tau$. Furthermore, we show that the problem for trees can be solved in $O(1.4143^n)$ time.
Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: //github.com/yorkerlin/StructuredNGD-DL
Recent advances in generative compression methods have demonstrated remarkable progress in enhancing the perceptual quality of compressed data, especially in scenarios with low bitrates. However, their efficacy and applicability to achieve extreme compression ratios ($<0.05$ bpp) remain constrained. In this work, we propose a simple yet effective coding framework by introducing vector quantization (VQ)--based generative models into the image compression domain. The main insight is that the codebook learned by the VQGAN model yields a strong expressive capacity, facilitating efficient compression of continuous information in the latent space while maintaining reconstruction quality. Specifically, an image can be represented as VQ-indices by finding the nearest codeword, which can be encoded using lossless compression methods into bitstreams. We propose clustering a pre-trained large-scale codebook into smaller codebooks through the K-means algorithm, yielding variable bitrates and different levels of reconstruction quality within the coding framework. Furthermore, we introduce a transformer to predict lost indices and restore images in unstable environments. Extensive qualitative and quantitative experiments on various benchmark datasets demonstrate that the proposed framework outperforms state-of-the-art codecs in terms of perceptual quality-oriented metrics and human perception at extremely low bitrates ($\le 0.04$ bpp). Remarkably, even with the loss of up to $20\%$ of indices, the images can be effectively restored with minimal perceptual loss.
In Autonomous Driving (AD), real-time perception is a critical component responsible for detecting surrounding objects to ensure safe driving. While researchers have extensively explored the integrity of AD perception due to its safety and security implications, the aspect of availability (real-time performance) or latency has received limited attention. Existing works on latency-based attack have focused mainly on object detection, i.e., a component in camera-based AD perception, overlooking the entire camera-based AD perception, which hinders them to achieve effective system-level effects, such as vehicle crashes. In this paper, we propose SlowTrack, a novel framework for generating adversarial attacks to increase the execution time of camera-based AD perception. We propose a novel two-stage attack strategy along with the three new loss function designs. Our evaluation is conducted on four popular camera-based AD perception pipelines, and the results demonstrate that SlowTrack significantly outperforms existing latency-based attacks while maintaining comparable imperceptibility levels. Furthermore, we perform the evaluation on Baidu Apollo, an industry-grade full-stack AD system, and LGSVL, a production-grade AD simulator, with two scenarios to compare the system-level effects of SlowTrack and existing attacks. Our evaluation results show that the system-level effects can be significantly improved, i.e., the vehicle crash rate of SlowTrack is around 95% on average while existing works only have around 30%.
Contrastive loss has been increasingly used in learning representations from multiple modalities. In the limit, the nature of the contrastive loss encourages modalities to exactly match each other in the latent space. Yet it remains an open question how the modality alignment affects the downstream task performance. In this paper, based on an information-theoretic argument, we first prove that exact modality alignment is sub-optimal in general for downstream prediction tasks. Hence we advocate that the key of better performance lies in meaningful latent modality structures instead of perfect modality alignment. To this end, we propose three general approaches to construct latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. Extensive experiments are conducted on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We test our model on a variety of tasks including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of our proposed approach on latent modality structure regularization.
Denoising diffusion models represent a recent emerging topic in computer vision, demonstrating remarkable results in the area of generative modeling. A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked at recovering the original input data by learning to gradually reverse the diffusion process, step by step. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens, i.e. low speeds due to the high number of steps involved during sampling. In this survey, we provide a comprehensive review of articles on denoising diffusion models applied in vision, comprising both theoretical and practical contributions in the field. First, we identify and present three generic diffusion modeling frameworks, which are based on denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. We further discuss the relations between diffusion models and other deep generative models, including variational auto-encoders, generative adversarial networks, energy-based models, autoregressive models and normalizing flows. Then, we introduce a multi-perspective categorization of diffusion models applied in computer vision. Finally, we illustrate the current limitations of diffusion models and envision some interesting directions for future research.
The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this effect, our paper proposes a novel attention based feature embedding that captures both entity and relation features in any given entity's neighborhood. Additionally, we also encapsulate relation clusters and multihop relations in our model. Our empirical study offers insights into the efficacy of our attention based model and we show marked performance gains in comparison to state of the art methods on all datasets.