亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

In recent years, dynamic parameterization of acoustic environments has raised increasing attention in the field of audio processing. One of the key parameters that characterize the local room acoustics in isolation from orientation and directivity of sources and receivers is the geometric room volume. Convolutional neural networks (CNNs) have been widely selected as the main models for conducting blind room acoustic parameter estimation, which aims to learn a direct mapping from audio spectrograms to corresponding labels. With the recent trend of self-attention mechanisms, this paper introduces a purely attention-based model to blindly estimate room volumes based on single-channel noisy speech signals. We demonstrate the feasibility of eliminating the reliance on CNN for this task and the proposed Transformer architecture takes Gammatone magnitude spectral coefficients and phase spectrograms as inputs. To enhance the model performance given the task-specific dataset, cross-modality transfer learning is also applied. Experimental results demonstrate that the proposed model outperforms traditional CNN models across a wide range of real-world acoustics spaces, especially with the help of the dedicated pretraining and data augmentation schemes.

相關內容

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.

Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.

Subjective image quality assessment studies are used in many scenarios, such as the evaluation of compression, super-resolution, and denoising solutions. Among the available subjective test methodologies, pair comparison is attracting popularity due to its simplicity, reliability, and robustness to changes in the test conditions, e.g. display resolutions. The main problem that impairs its wide acceptance is that the number of pairs to compare by subjects grows quadratically with the number of stimuli that must be considered. Usually, the paired comparison data obtained is fed into an aggregation model to obtain a final score for each degraded image and thus, not every comparison contributes equally to the final quality score. In the past years, several solutions that sample pairs (from all possible combinations) have been proposed, from random sampling to active sampling based on the past subjects' decisions. This paper introduces a novel sampling solution called \textbf{P}redictive \textbf{S}ampling for \textbf{P}airwise \textbf{C}omparison (PS-PC) which exploits the characteristics of the input data to make a prediction of which pairs should be evaluated by subjects. The proposed solution exploits popular machine learning techniques to select the most informative pairs for subjects to evaluate, while for the other remaining pairs, it predicts the subjects' preferences. The experimental results show that PS-PC is the best choice among the available sampling algorithms with higher performance for the same number of pairs. Moreover, since the choice of the pairs is done \emph{a priori} before the subjective test starts, the algorithm is not required to run during the test and thus much more simple to deploy in online crowdsourcing subjective tests.

Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances -- for example to translation -- which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.

Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. Especially, it is the performance of the generator that directly influences the overall estimation quality. In this context, we explore an alternate generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across frequency, time, and quantization depth axes. This way, we address the standard blind estimation task and additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system is preferable to other baselines across various evaluation metrics.

Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.

Video anomaly detection deals with the recognition of abnormal events in videos. Apart from the visual signal, video anomaly detection has also been addressed with the use of skeleton sequences. We propose a holistic representation of skeleton trajectories to learn expected motions across segments at different times. Our approach uses multitask learning to reconstruct any continuous unobserved temporal segment of the trajectory allowing the extrapolation of past or future segments and the interpolation of in-between segments. We use an end-to-end attention-based encoder-decoder. We encode temporally occluded trajectories, jointly learn latent representations of the occluded segments, and reconstruct trajectories based on expected motions across different temporal segments. Extensive experiments on three trajectory-based video anomaly detection datasets show the advantages and effectiveness of our approach with state-of-the-art results on anomaly detection in skeleton trajectories.

Radar systems typically employ well-designed deterministic signals for target sensing, while integrated sensing and communications (ISAC) systems have to adopt random signals to convey useful information. This paper analyzes the sensing and ISAC performance relying on random signaling in a multiantenna system. Towards this end, we define a new sensing performance metric, namely, ergodic linear minimum mean square error (ELMMSE), which characterizes the estimation error averaged over random ISAC signals. Then, we investigate a data-dependent precoding (DDP) scheme to minimize the ELMMSE in sensing-only scenarios, which attains the optimized performance at the cost of high implementation overhead. To reduce the cost, we present an alternative data-independent precoding (DIP) scheme by stochastic gradient projection (SGP). Moreover, we shed light on the optimal structures of both sensing-only DDP and DIP precoders. As a further step, we extend the proposed DDP and DIP approaches to ISAC scenarios, which are solved via a tailored penalty-based alternating optimization algorithm. Our numerical results demonstrate that the proposed DDP and DIP methods achieve substantial performance gains over conventional ISAC signaling schemes that treat the signal sample covariance matrix as deterministic, which proves that random ISAC signals deserve dedicated precoding designs.

The framework of approximate differential privacy is considered, and augmented by introducing the notion of "the total variation of a (privacy-preserving) mechanism" (denoted by $\eta$-TV). With this refinement, an exact composition result is derived, and shown to be significantly tighter than the optimal bounds for differential privacy (which do not consider the total variation). Furthermore, it is shown that $(\varepsilon,\delta)$-DP with $\eta$-TV is closed under subsampling. The induced total variation of commonly used mechanisms are computed. Moreover, the notion of total variation of a mechanism is extended to the local privacy setting and privacy-utility tradeoffs are investigated. In particular, total variation distance and KL divergence are considered as utility functions and upper bounds are derived. Finally, the results are compared and connected to the (purely) locally differentially private setting.

Recent advance in fluorescence microscopy enables acquisition of 3D image volumes with better quality and deeper penetration into tissue. Segmentation is a required step to characterize and analyze biological structures in the images. 3D segmentation using deep learning has achieved promising results in microscopy images. One issue is that deep learning techniques require a large set of groundtruth data which is impractical to annotate manually for microscopy volumes. This paper describes a 3D nuclei segmentation method using 3D convolutional neural networks. A set of synthetic volumes and the corresponding groundtruth volumes are generated automatically using a generative adversarial network. Segmentation results demonstrate that our proposed method is capable of segmenting nuclei successfully in 3D for various data sets.

北京阿比特科技有限公司