Virtual reality (VR) telepresence applications and the so-called "metaverse" promise to be the next major medium of interaction with the internet. However, with numerous recent studies showing the ease at which VR users can be profiled, deanonymized, and data harvested, metaverse platforms carry all the privacy risks of the current internet and more while at present having none of the defensive privacy tools we are accustomed to using on the web. To remedy this, we present the first known method of implementing an "incognito mode" for VR. Our technique leverages local differential privacy to quantifiably obscure sensitive user data attributes, with a focus on intelligently adding noise when and where it is needed most to maximize privacy while minimizing usability impact. Moreover, our system is capable of flexibly adapting to the unique needs of each metaverse application to further optimize this trade-off. We implement our solution as a universal Unity (C#) plugin that we then evaluate using several popular VR applications. Upon faithfully replicating the most well-known VR privacy attack studies, we show a significant degradation of attacker capabilities when using our proposed solution.
Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, leading to a risk of overfitting. This study introduces a more efficient training strategy to mitigate overfitting and reduce computational requirements. We balance the training process with a mix of negative and positive samples from the outset, named as Joint learning with Negative samples (JN). Negative samples refer to scenarios where the object from the template is not present in the search region, which helps to prevent the model from simply memorizing the target, and instead encourages it to use the template for object location. To handle the negative samples effectively, we adopt a distribution-based head, which modeling the bounding box as distribution of distances to express uncertainty about the target's location in the presence of negative samples, offering an efficient way to manage the mixed sample training. Furthermore, our approach introduces a target-indicating token. It encapsulates the target's precise location within the template image. This method provides exact boundary details with negligible computational cost but improving performance. Our model, JN-256, exhibits superior performance on challenging benchmarks, achieving 75.8% AO on GOT-10k and 84.1% AUC on TrackingNet. Notably, JN-256 outperforms previous SOTA trackers that utilize larger models and higher input resolutions, even though it is trained with only half the number of data sampled used in those works.
Graph representations are the generalization of geometric graph drawings from the plane to higher dimensions. A method introduced by Tutte to optimize properties of graph drawings is to minimize their energy. We explore this minimization for spherical graph representations, where the vertices lie on a unit sphere such that the origin is their barycentre. We present a primal and dual semidefinite program which can be used to find such a spherical graph representation minimizing the energy. We denote the optimal value of this program by $\rho(G)$ for a given graph $G$. The value turns out to be related to the second largest eigenvalue of the adjacency matrix of $G$, which we denote by $\lambda_2$. We show that for $G$ regular, $\rho(G) \leq \frac{\lambda_{2}}{2} \cdot v(G)$, and that equality holds if and only if the $\lambda_{2}$ eigenspace contains a spherical 1-design. Moreover, if $G$ is a random $d$-regular graph, $\rho(G)=\left(\sqrt{(d-1)} +o(1)\right)\cdot v(G)$, asymptotically almost surely.
Intelligent metasurface has recently emerged as a promising technology that enables the customization of wireless environments by harnessing large numbers of inexpensive configurable scattering elements. However, prior studies have predominantly focused on single-layer metasurfaces, which have limitations in terms of the number of beam patterns they can steer accurately due to practical hardware restrictions. In contrast, this paper introduces a novel stacked intelligent metasurface (SIM) design. Specifically, we investigate the integration of SIM into the downlink of a multiuser multiple-input single-output (MISO) communication system, where a SIM, consisting of a multilayer metasurface structure, is deployed at the base station (BS) to facilitate transmit beamforming in the electromagnetic wave domain. This eliminates the need for conventional digital beamforming and high-resolution digital-to-analog converters at the BS. To this end, we formulate an optimization problem that aims to maximize the sum rate of all user equipments by jointly optimizing the transmit power allocation at the BS and the wave-based beamforming at the SIM, subject to both the transmit power budget and discrete phase shift constraints. Furthermore, we propose a computationally efficient algorithm for solving this joint optimization problem and elaborate on the potential benefits of employing SIM in wireless networks. Finally, the numerical results corroborate the effectiveness of the proposed SIM-enabled wave-based beamforming design and evaluate the performance improvement achieved by the proposed algorithm compared to various benchmark schemes. It is demonstrated that considering the same number of transmit antennas, the proposed SIM-based system achieves about 200\% improvement in terms of sum rate compared to conventional MISO systems.
We extend classical methods of computational complexity to the setting of distributed computing, where they prove even more effective in some respects than in their original context. Instead of a single computer, several networked computers communicate via synchronous message-passing to collectively solve some decision problem related to the network topology. Their running time is limited in two ways: the number of communication rounds is bounded by a constant, and the number of computation steps of each computer is polynomially bounded by the size of its local input and the messages it receives. By letting two players take turns assigning certificates to the computers, we obtain a generalization of the polynomial hierarchy (and hence of the complexity classes $\mathbf{P}$ and $\mathbf{NP}$). We then extend some key results of complexity theory to this setting, in particular the Cook-Levin theorem (which identifies Boolean satisfiability as a complete problem for $\mathbf{NP}$), and Fagin's theorem (which characterizes $\mathbf{NP}$ as the problems expressible in existential second-order logic). The original results can be recovered as the special case where the network consists of a single computer. But perhaps more surprisingly, the task of separating complexity classes becomes easier in the general case: we can show that our hierarchy is infinite, while it remains notoriously open whether the same is true in the case of a single computer. (By contrast, a collapse of our hierarchy would have implied a collapse of the polynomial hierarchy.) As an application, we propose quantifier alternation as a new approach to measuring the locality of problems in distributed computing.
Many research explore how well computers are able to examine emotions displayed by humans and use that data to perform different tasks. However, there have been very few research which evaluate the computers ability to generate emotion classification information in an attempt to help the user make decisions or perform tasks. This is a crucial area to explore as it is paramount to the two way communication between humans and computers. This research conducted an experiment to investigate the impact of different uncertainty information displays of emotion classification on the human decision making process. Results show that displaying more uncertainty information can help users to be more confident when making decisions.
We introduce a new methodology to conduct simultaneous inference of the nonparametric component in partially linear time series regression models where the nonparametric part is a multivariate unknown function. In particular, we construct a simultaneous confidence region (SCR) for the multivariate function by extending the high-dimensional Gaussian approximation to dependent processes with continuous index sets. Our results allow for a more general dependence structure compared to previous works and are widely applicable to a variety of linear and nonlinear autoregressive processes. We demonstrate the validity of our proposed methodology by examining the finite-sample performance in the simulation study. Finally, an application in time series, the forward premium regression, is presented, where we construct the SCR for the foreign exchange risk premium from the exchange rate and macroeconomic data.
The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy to compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon, and therefore, it requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue that the proposed approach is the most viable one on this basis.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.
Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but how to produce them with high perceptual quality and more efficiently requires more research efforts. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. For AdvGAN, once the generator is trained, it can generate adversarial perturbations efficiently for any instance, so as to potentially accelerate adversarial training as defenses. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models have high attack success rate under state-of-the-art defenses compared to other attacks. Our attack has placed the first with 92.76% accuracy on a public MNIST black-box attack challenge.