The spread of false and misleading information is receiving significant attention from legislative and regulatory bodies. Consumers place trust in specific sources of information, so a scalable, interoperable method for determining the provenance and authenticity of information is needed. In this paper we analyze the posting of broadcast news content to a social media platform, the role of open standards, the interplay of cryptographic metadata and watermarks in validating provenance, and likely success and failure scenarios. We conclude that the open standards for cryptographically authenticated metadata developed by the Coalition for Content Provenance and Authenticity (C2PA) and for audio and video watermarking developed by the Advanced Television Systems Committee (ATSC) are well suited to addressing broadcast provenance. We suggest methods for applying these standards that maximize their chances of success.
Learning representative, robust and discriminative information from images is essential for effective person re-identification (Re-Id). In this paper, we propose a compound approach to end-to-end discriminative deep feature learning for person Re-Id based on both body and hand images. We carefully design the Local-Aware Global Attention Network (LAGA-Net), a multi-branch deep network architecture consisting of one branch for spatial attention, one branch for channel attention, one branch for global feature representations and another branch for local feature representations. The attention branches focus on the relevant features of the image while suppressing irrelevant background. To overcome a weakness of attention mechanisms, namely their equivariance to pixel shuffling, we integrate relative positional encodings into the spatial attention module to capture the spatial positions of pixels. The global branch aims to preserve the global context and structural information. For the local branch, which aims to capture fine-grained information, we perform uniform horizontal partitioning on the conv-layer to generate stripes. We retrieve the parts by conducting a soft partition, without explicitly partitioning the images or requiring external cues such as pose estimation. A set of ablation studies shows that each component contributes to the improved performance of LAGA-Net. Extensive evaluations on four popular body-based person Re-Id benchmarks and two publicly available hand datasets demonstrate that our proposed method consistently outperforms existing state-of-the-art methods.
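A minimal PyTorch sketch of the four-branch design described in this abstract: spatial attention with a relative positional bias, channel attention, a global branch, and a local branch built from horizontal stripes. Layer sizes, the form of the positional encoding, and the pooling choices are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a LAGA-Net-style multi-branch head; all dimensions are assumed.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                               # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                 # squeeze-and-excite style weights
        return x * w[:, :, None, None]


class SpatialAttention(nn.Module):
    """Spatial attention with a learned relative positional bias (assumed additive)."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        n = height * width
        self.rel_pos = nn.Parameter(torch.zeros(n, n))  # positional bias over pixel pairs

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (B, HW, C//8)
        k = self.key(x).flatten(2)                      # (B, C//8, HW)
        attn = torch.softmax(q @ k + self.rel_pos, dim=-1)
        v = self.value(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)


class LAGANetSketch(nn.Module):
    def __init__(self, channels: int = 256, height: int = 24, width: int = 8,
                 num_stripes: int = 6, embed_dim: int = 128):
        super().__init__()
        self.spatial = SpatialAttention(channels, height, width)
        self.channel = ChannelAttention(channels)
        self.global_head = nn.Linear(channels, embed_dim)
        self.local_heads = nn.ModuleList(
            nn.Linear(channels, embed_dim) for _ in range(num_stripes))
        self.num_stripes = num_stripes

    def forward(self, feat):                            # feat: backbone conv feature map
        attended = self.channel(self.spatial(feat))
        global_emb = self.global_head(attended.mean(dim=(2, 3)))
        # Local branch: uniform horizontal stripes, each pooled and embedded independently.
        stripes = attended.chunk(self.num_stripes, dim=2)
        local_embs = [head(s.mean(dim=(2, 3)))
                      for head, s in zip(self.local_heads, stripes)]
        return global_emb, local_embs


feats = torch.randn(2, 256, 24, 8)                      # dummy backbone output
g, locals_ = LAGANetSketch()(feats)
print(g.shape, len(locals_), locals_[0].shape)          # (2, 128), 6, (2, 128)
```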
The demand for processing vast volumes of data has surged dramatically due to the advancement of machine learning technology. Large-scale data processing requires substantial computational resources, prompting individuals and enterprises to turn to cloud services. Accompanying this trend is a growing concern regarding data leakage and misuse. Homomorphic encryption (HE) is one solution for safeguarding data privacy, enabling encrypted data to be processed securely in the cloud. However, we observe that the encryption and decryption routines of some HE schemes require considerable computational resources, imposing a non-trivial burden on clients. In this paper, we propose an outsourced decryption protocol for RLWE-based HE schemes, which splits the original decryption into two routines, with the computationally intensive part executed remotely by the cloud. Its security relies on a variant of the NTRU search problem with a newly designed secret distribution. Cryptographic analyses are conducted to configure protocol parameters across varying security levels. Our experiments demonstrate that the proposed protocol achieves up to a $67\%$ acceleration in the client's local decryption, accompanied by a $50\%$ reduction in space usage.
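For context, a hedged sketch of where the cost sits in standard RLWE decryption, written in BFV-style notation for illustration; the paper's exact two-routine split is not reproduced here.

```latex
% BFV-style RLWE decryption (illustrative notation): the dominant cost is the
% degree-n polynomial product c_1 \cdot s over R_q, which is the natural
% candidate for the computationally intensive, outsourced routine.
\[
  m \;=\; \Big\lfloor \tfrac{t}{q}\,\big[\,c_0 + c_1 \cdot s\,\big]_q \Big\rceil \bmod t,
  \qquad c_0,\, c_1,\, s \in R_q = \mathbb{Z}_q[x]/(x^n + 1).
\]
```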
In survival analysis, frailty variables are often used to model the association in multivariate survival data. Identifiability is an important issue when working with such multivariate survival data, with or without competing risks. In this work, we consider bivariate survival data with competing risks and investigate identifiability results with non-parametric baseline cause-specific hazards and different types of Gamma frailty. Prior to that, we prove that the model is not identifiable when both the baseline cause-specific hazards and the frailty distribution are non-parametric. We also construct a non-identifiable model in which the baseline cause-specific hazards are non-parametric but the frailty distribution may be parametric. Thereafter, we consider four different Gamma frailty distributions, and the corresponding models are shown to be identifiable under fairly general assumptions.
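For concreteness, one common shared Gamma frailty formulation for bivariate competing-risks data is written below; the paper studies several Gamma frailty variants, so this should be read as a representative example rather than the exact model considered.

```latex
% Conditional on the frailty \epsilon, the cause-specific hazard of individual
% j (= 1, 2) for cause l is proportional to a non-parametric baseline hazard:
\[
  \lambda_{jl}(t \mid \epsilon) \;=\; \epsilon \,\lambda_{0jl}(t),
  \qquad \epsilon \sim \mathrm{Gamma}\!\left(\sigma^{-1}, \sigma^{-1}\right),
\]
% with E[\epsilon] = 1, so the frailty variance \sigma indexes the association
% between the two lifetimes while \lambda_{0jl}(\cdot) remain non-parametric.
```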
Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operations, which achieves the best trade-off between accuracy and efficiency. In addition, we design an efficient decoder that utilizes multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, and fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model on this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with a transformer model of the same scale, RWKV-SAM achieves more than a 2x speedup while delivering better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.
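A rough PyTorch sketch of a mixed backbone in the spirit described above: early stages use convolutions for local features, later stages use a linear-complexity token mixer. The `RWKVLikeMixer` below is a generic linear-attention placeholder (efficient-attention style), not the actual RWKV recurrence; stage depths and widths are assumptions.

```python
# Hedged sketch of a convolution + linear-attention mixed backbone; not the paper's architecture.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depthwise conv
        self.pw = nn.Conv2d(dim, dim, 1)                          # pointwise conv
        self.act = nn.GELU()

    def forward(self, x):                                         # (B, C, H, W)
        return x + self.pw(self.act(self.dw(x)))


class RWKVLikeMixer(nn.Module):
    """Linear-complexity token mixer (generic stand-in, O(N) in sequence length)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                         # (B, N, C)
        q = torch.softmax(self.q(x), dim=-1)                      # normalize over channels
        k = torch.softmax(self.k(x), dim=1)                       # normalize over tokens
        ctx = torch.einsum("bnc,bnd->bcd", k, self.v(x))          # (B, C, C) global context
        return x + self.out(torch.einsum("bnc,bcd->bnd", q, ctx))


class MixedBackboneSketch(nn.Module):
    def __init__(self, dim: int = 96):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 4, stride=4)
        self.conv_stage = nn.Sequential(ConvBlock(dim), ConvBlock(dim))
        self.down = nn.Conv2d(dim, dim * 2, 2, stride=2)
        self.mix_stage = nn.Sequential(RWKVLikeMixer(dim * 2), RWKVLikeMixer(dim * 2))

    def forward(self, img):                                       # (B, 3, H, W)
        x = self.conv_stage(self.stem(img))                       # local features via convs
        x = self.down(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                     # (B, N, C) tokens for the decoder
        return self.mix_stage(tokens), (h, w)


tokens, hw = MixedBackboneSketch()(torch.randn(1, 3, 256, 256))
print(tokens.shape, hw)                                           # (1, 1024, 192), (32, 32)
```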
Instrumental variable models are central to the inference of causal effects in many settings. We consider the instrumental variable model with discrete variables, where the instrument (Z), exposure (X) and outcome (Y) take Q, K, and M levels, respectively. We assume that the instrument is randomized and that there is no direct effect of Z on Y, so that Y(x,z) = Y(x). We first provide a simple characterization of the set of joint distributions of the potential outcomes P(Y(x=1), ..., Y(x=K)) compatible with a given observed distribution P(X, Y | Z). We then discuss the variation (in)dependence property of the marginal probability distributions of the potential outcomes P(Y(x=1)), ..., P(Y(x=K)), which has direct implications for partial identification of average causal effect contrasts such as E[Y(x=i) - Y(x=j)]. We also include simulation results on the volume of observed distributions that are not compatible with the IV model as K and Q change.
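As one concrete, well-known example of the kind of compatibility constraint that the observed law must satisfy under these assumptions, Pearl's instrumental inequality is stated below; it is given here only to illustrate what "not compatible with the IV model" means, and is not the characterization derived in the paper.

```latex
% Setup assumed above: Z takes Q levels, X takes K levels, Y takes M levels,
% with randomization of Z and exclusion Y(x,z) = Y(x). Pearl's instrumental
% inequality requires
\[
  \max_{x} \sum_{y} \max_{z} P(X = x,\, Y = y \mid Z = z) \;\le\; 1 ,
\]
% and observed distributions violating it cannot arise from any IV model.
```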
The recent proliferation of computers and the internet has opened new opportunities for collecting and processing data. However, such data are often obtained without a well-planned probability survey design. Such non-probability samples cannot automatically be regarded as representative of the population of interest. Several classes of methods for estimation and inference from non-probability samples have been developed in recent years. Quasi-randomization methods assume that non-probability sample selection is governed by an underlying latent random mechanism. The basic idea is to use information collected from a probability ("reference") sample to uncover the latent non-probability survey participation probabilities (also known as "propensity scores") and use them in the estimation of target finite population parameters. In this paper, we review and compare the theoretical properties of recently developed methods for estimating survey participation probabilities and study their relative performance in simulations.
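An illustrative sketch of one simple quasi-randomization estimator of the kind reviewed above: stack the non-probability sample with a reference probability sample, fit a weighted logistic model for participation, and reweight the non-probability observations by the inverse estimated propensities. This is a generic textbook-style variant with simulated data, not a reproduction of any specific method compared in the paper.

```python
# Hedged sketch: propensity-score weighting for a non-probability sample using a reference sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated non-probability sample whose selection depends on covariates ...
X_np = rng.normal(size=(2000, 2)) + 0.5
y_np = 1.0 + X_np @ np.array([0.8, -0.3]) + rng.normal(size=2000)
# ... and a reference probability sample with known design weights.
X_ref = rng.normal(size=(1000, 2))
w_ref = np.full(1000, 100.0)                     # each reference unit "represents" 100 population units

# Fit participation propensities on the stacked sample (reference units carry design weights).
X_all = np.vstack([X_np, X_ref])
z_all = np.concatenate([np.ones(len(X_np)), np.zeros(len(X_ref))])
w_all = np.concatenate([np.ones(len(X_np)), w_ref])
model = LogisticRegression().fit(X_all, z_all, sample_weight=w_all)
pi_hat = model.predict_proba(X_np)[:, 1]         # estimated participation probabilities

# Hajek-type estimate of the population mean of y from the non-probability sample.
w_ipw = 1.0 / pi_hat
print("propensity-weighted mean:", np.sum(w_ipw * y_np) / np.sum(w_ipw))
print("naive (unweighted) mean:", y_np.mean())
```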
Recently, Mutual Information (MI) has attracted attention for bounding the generalization error of Deep Neural Networks (DNNs). However, it is intractable to accurately estimate the MI in DNNs, so most previous works have had to relax the MI bound, which in turn weakens the information-theoretic explanation of generalization. To address this limitation, this paper introduces a probabilistic representation of DNNs for accurately estimating the MI. Leveraging the proposed MI estimator, we validate the information-theoretic explanation of generalization and derive a tighter generalization bound than the state-of-the-art relaxations.
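For reference, the canonical input-output mutual information bound (Xu and Raginsky, 2017) that this line of work builds on is recalled below; it is stated here as background, not as the tighter bound derived in the paper.

```latex
% For a \sigma-sub-Gaussian loss, a training set S of n i.i.d. samples, and
% learned weights W, the expected generalization gap satisfies
\[
  \bigl|\, \mathbb{E}\!\left[\, L_{\mu}(W) - L_{S}(W) \,\right] \bigr|
  \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S;\, W)} ,
\]
% so how tight the bound is in practice hinges on how accurately I(S; W)
% can be estimated for a DNN.
```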
Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, resorting only to pre-trained word embedding models is far from enough. A desirable model should exploit the rich information in the multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely the brand. Following this idea, we propose a novel VQA approach, the Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting the visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators that guide message passing from one graph to another to exploit the contexts in various modalities and thereby refine the features of nodes. The updated nodes provide better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents scene texts better and substantially improves performance on two VQA tasks that require reading scene texts.
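A minimal PyTorch sketch of a cross-modality aggregator in the spirit of the message passing described above: nodes of one sub-graph (e.g., semantic scene-text nodes) attend over nodes of another sub-graph (e.g., visual object nodes) and are refined with the aggregated context. Dimensions and the update function are illustrative assumptions, not the MM-GNN parameterization.

```python
# Hedged sketch of cross-sub-graph message passing for scene-text VQA features.
import torch
import torch.nn as nn


class CrossGraphAggregator(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, target_nodes, source_nodes):
        # target_nodes: (Nt, D) nodes being refined, e.g., scene-text nodes
        # source_nodes: (Ns, D) nodes providing context, e.g., visual object nodes
        attn = torch.softmax(self.query(target_nodes) @ self.key(source_nodes).T
                             / target_nodes.shape[-1] ** 0.5, dim=-1)   # (Nt, Ns)
        context = attn @ source_nodes                                   # aggregated cross-modal context
        return torch.relu(self.update(torch.cat([target_nodes, context], dim=-1)))


text_nodes = torch.randn(8, 256)       # e.g., OCR token nodes
visual_nodes = torch.randn(20, 256)    # e.g., detected object nodes
refined = CrossGraphAggregator()(text_nodes, visual_nodes)
print(refined.shape)                   # torch.Size([8, 256])
```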
The recent proliferation of knowledge graphs (KGs), coupled with incomplete or partial information in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to capture the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this end, our paper proposes a novel attention-based feature embedding that captures both entity and relation features in any given entity's neighborhood. Additionally, we encapsulate relation clusters and multi-hop relations in our model. Our empirical study offers insights into the efficacy of our attention-based model, and we show marked performance gains in comparison to state-of-the-art methods on all datasets.
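A rough PyTorch sketch of attention over a triple's local neighborhood: each neighboring (relation, entity) pair contributes a feature, and attention weights decide how much of that neighborhood flows into the central entity's embedding. Embedding sizes and the scoring function are illustrative assumptions rather than the paper's exact parameterization.

```python
# Hedged sketch of neighborhood attention for knowledge graph embeddings.
import torch
import torch.nn as nn


class NeighborhoodAttention(nn.Module):
    def __init__(self, dim: int = 100):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)     # (head, relation, tail) -> neighbor feature
        self.score = nn.Linear(dim, 1)          # unnormalized attention score per neighbor

    def forward(self, head_emb, rel_embs, tail_embs):
        # head_emb: (D,); rel_embs, tail_embs: (N, D) for N neighboring triples
        h = head_emb.expand_as(tail_embs)
        feats = torch.relu(self.proj(torch.cat([h, rel_embs, tail_embs], dim=-1)))
        alpha = torch.softmax(self.score(feats).squeeze(-1), dim=0)     # (N,) attention weights
        return (alpha.unsqueeze(-1) * feats).sum(dim=0)                 # refined head embedding


dim = 100
head = torch.randn(dim)
rels = torch.randn(5, dim)      # relations linking the head to 5 neighbors
tails = torch.randn(5, dim)     # the 5 neighboring entities
print(NeighborhoodAttention(dim)(head, rels, tails).shape)   # torch.Size([100])
```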
Many natural language processing tasks rely solely on sparse dependencies between a few tokens in a sentence. Soft attention mechanisms show promising performance in modeling local/global dependencies via soft probabilities between every pair of tokens, but they are neither effective nor efficient when applied to long sentences. By contrast, hard attention mechanisms directly select a subset of tokens but are difficult and inefficient to train due to their combinatorial nature. In this paper, we integrate both soft and hard attention into one context fusion model, "reinforced self-attention (ReSA)", so that the two benefit from each other. In ReSA, a hard attention trims a sequence for a soft self-attention to process, while the soft attention feeds reward signals back to facilitate the training of the hard one. For this purpose, we develop a novel hard attention mechanism called "reinforced sequence sampling (RSS)", which selects tokens in parallel and is trained via policy gradient. Using two RSS modules, ReSA efficiently extracts the sparse dependencies between each pair of selected tokens. We finally propose an RNN/CNN-free sentence-encoding model, "reinforced self-attention network (ReSAN)", based solely on ReSA. It achieves state-of-the-art performance on both the Stanford Natural Language Inference (SNLI) and Sentences Involving Compositional Knowledge (SICK) datasets.
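A simplified PyTorch sketch of the two-step idea described above: a hard selector samples a subset of tokens in parallel (one Bernoulli draw per token, with log-probabilities available for policy-gradient training), and a soft self-attention is then restricted to the selected tokens. This is a schematic illustration, not the exact RSS/ReSA parameterization, and it assumes at least one token is selected per sentence.

```python
# Hedged sketch of hard token selection followed by masked soft self-attention.
import torch
import torch.nn as nn


class HardSelector(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):                               # x: (B, N, D)
        probs = torch.sigmoid(self.gate(x)).squeeze(-1) # per-token keep probability
        mask = torch.bernoulli(probs)                   # (B, N) hard, parallel selection
        log_prob = (mask * torch.log(probs + 1e-8)
                    + (1 - mask) * torch.log(1 - probs + 1e-8)).sum(-1)
        return mask, log_prob                           # log_prob is what a policy gradient would scale by a reward


class MaskedSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x, mask):                         # soft attention restricted to selected tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) / x.shape[-1] ** 0.5
        scores = scores.masked_fill(mask.unsqueeze(1) == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v


x = torch.randn(2, 16, 64)
mask, log_prob = HardSelector(64)(x)
out = MaskedSelfAttention(64)(x, mask)
print(out.shape, log_prob.shape)                        # (2, 16, 64), (2,)
```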