
This paper proposes a new notion of Markov $\alpha$-potential games to study Markov games. Two practically significant classes of Markov games, Markov congestion games and perturbed Markov team games, are analyzed in this framework, with an explicit characterization of the upper bound on $\alpha$ and its relation to the game parameters. Moreover, any maximizer of the $\alpha$-potential function is shown to be an $\alpha$-stationary Nash equilibrium of the game. Furthermore, two algorithms, the projected gradient-ascent algorithm and the sequential maximum improvement algorithm, are presented together with their Nash-regret analysis and corroborated by numerical experiments.
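
A minimal sketch of the kind of condition such a definition imposes, written here as the natural lift of the static $\alpha$-potential condition to value functions (the paper's exact formulation, e.g. the treatment of the initial state, may differ):

```latex
% There exists a potential function \Phi on stationary policy profiles such that,
% for every player i, every pair of policies \pi_i, \pi_i', and every profile \pi_{-i}
% of the other players,
\[
\bigl| V_i(\pi_i', \pi_{-i}) - V_i(\pi_i, \pi_{-i})
       - \bigl( \Phi(\pi_i', \pi_{-i}) - \Phi(\pi_i, \pi_{-i}) \bigr) \bigr| \;\le\; \alpha ,
\]
% where V_i is player i's expected discounted value. Setting \alpha = 0 recovers Markov
% potential games. If \pi^* maximizes \Phi, then any unilateral deviation decreases \Phi,
% so by the displayed bound it can raise V_i by at most \alpha, i.e. \pi^* is an
% \alpha-stationary Nash equilibrium.
```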

Related content

In deep learning, classification tasks are formalized as optimization problems solved via the minimization of the cross-entropy. However, recent advancements in the design of objective functions allow the $f$-divergence measure to generalize the formulation of the optimization problem for classification. With this goal in mind, we adopt a Bayesian perspective and formulate the classification task as a maximum a posteriori probability problem. We propose a class of objective functions based on the variational representation of the $f$-divergence, from which we extract a list of five posterior probability estimators leveraging well-known $f$-divergences. In addition, driven by the challenge of improving the state-of-the-art approach, we propose a bottom-up method that leads us to the formulation of a new objective function (and posterior probability estimator) corresponding to a novel $f$-divergence referred to as shifted log (SL). First, we theoretically prove the convergence property of the posterior probability estimators. Then, we numerically test the set of proposed objective functions in three application scenarios: toy examples, image data sets, and signal detection/decoding problems. The analyzed tasks demonstrate the effectiveness of the proposed estimators and that the SL divergence achieves the highest classification accuracy in almost all the scenarios.
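
For reference, a brief sketch of the Fenchel-dual (variational) representation that such objectives build on; the notation is generic and not necessarily the paper's:

```latex
% Variational representation of an f-divergence (Nguyen--Wainwright--Jordan bound):
\[
D_f(P \,\|\, Q) \;=\; \sup_{T}\; \mathbb{E}_{x \sim P}\bigl[ T(x) \bigr]
                       \;-\; \mathbb{E}_{x \sim Q}\bigl[ f^{*}\bigl( T(x) \bigr) \bigr],
\]
% where f^{*}(t) = \sup_{u} \{ ut - f(u) \} is the convex conjugate of f and the supremum
% ranges over measurable functions T. Parameterizing T with a neural network and maximizing
% the right-hand side gives a trainable objective; the maximizer encodes the density ratio
% dP/dQ, from which a posterior probability estimator can be read off.
```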

In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task, which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer from blurry outputs due to an encoder-only architecture that relies heavily on the limited reference image. On the other hand, recent diffusion-based image-to-3D methods show vivid, plausible results by distilling pre-trained 2D diffusion models into a 3D representation, yet require tedious per-scene optimization. Targeting these issues, we propose GD$^2$-NeRF, a Generative Detail compensation framework via GAN and Diffusion that is inference-time finetuning-free while producing vivid, plausible details. In detail, following a coarse-to-fine strategy, GD$^2$-NeRF is mainly composed of a One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer (Diff3DE). At the coarse stage, OPP efficiently inserts a GAN model into the existing OG-NeRF pipeline to relieve the blurriness with in-distribution priors captured from the training dataset, achieving a good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at the fine stage, Diff3DE leverages pre-trained image diffusion models to add rich out-of-distribution details while maintaining decent 3D consistency. Extensive experiments on both synthetic and real-world datasets show that GD$^2$-NeRF noticeably improves the details without per-scene finetuning.

We propose a semi-supervised text classifier based on self-training that exploits one positive and one negative property of neural networks. One of the weaknesses of self-training is the semantic drift problem, where noisy pseudo-labels accumulate over iterations and the error rate consequently soars. To tackle this challenge, we reshape the role of pseudo-labels and create a hierarchical order of information. In addition, a crucial step in self-training is to use the classifier's confidence to select the best candidate pseudo-labels. This step cannot be done reliably with raw neural network confidences, because their outputs are known to be poorly calibrated. To overcome this challenge, we propose a hybrid metric to replace the plain confidence measurement. Our metric takes prediction uncertainty into account via a subsampling technique. We evaluate our model on five standard benchmarks and show that it significantly outperforms ten diverse baseline models. Furthermore, we show that the improvement achieved by our model is additive to language model pretraining, a widely used technique for exploiting unlabeled documents. Our code is available at //github.com/p-karisani/RST.
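
To illustrate the flavor of such a hybrid selection criterion, here is a small sketch that scores candidate pseudo-labels by ensemble confidence minus disagreement across models trained on random subsamples of the labeled data; the subsampling scheme, the combination rule, and the callbacks `train_fn`/`predict_proba_fn` are illustrative assumptions rather than the paper's exact metric:

```python
# Hedged sketch: hybrid confidence/uncertainty score for pseudo-label selection.
import numpy as np

def hybrid_scores(train_fn, predict_proba_fn, X_lab, y_lab, X_unlab,
                  n_subsamples=5, frac=0.8, lam=1.0, seed=0):
    """Return one score per unlabeled example; higher means a safer pseudo-label."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(X_lab), size=int(frac * len(X_lab)), replace=False)
        model = train_fn(X_lab[idx], y_lab[idx])         # train on a random subsample
        probs.append(predict_proba_fn(model, X_unlab))   # (n_unlab, n_classes)
    probs = np.stack(probs)                              # (n_subsamples, n_unlab, n_classes)
    confidence = probs.mean(axis=0).max(axis=1)          # plain confidence of the ensemble
    uncertainty = probs.max(axis=2).std(axis=0)          # spread of the top-class probability
    return confidence - lam * uncertainty

# Usage: rank unlabeled documents by this score, pseudo-label only the top-k,
# and re-train iteratively as in standard self-training.
```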

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px at low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: we devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: we incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) Highly informative data: we emphasize the significance of concept density in text-image pairs and leverage a large vision-language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models: PIXART-$\alpha$ takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and cutting CO2 emissions by 90%. Moreover, our training cost is merely 1% of that of RAPHAEL, a larger SOTA model. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and help startups accelerate building their own high-quality yet low-cost generative models from scratch.
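
As a hedged illustration of design (2), the snippet below shows how a cross-attention sub-layer can be inserted into a DiT-style block to inject text conditions; dimensions and module names are assumptions, and PIXART-$\alpha$'s actual block also includes timestep modulation (adaLN), which is omitted here:

```python
# Illustrative DiT-style block with an added cross-attention sub-layer for text conditioning.
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, text_dim=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb):
        # x: (B, N_patches, dim) noisy-latent tokens; text_emb: (B, N_tokens, text_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb)[0]  # inject text condition
        x = x + self.mlp(self.norm3(x))
        return x

block = CrossAttnDiTBlock()
out = block(torch.randn(2, 256, 512), torch.randn(2, 77, 512))  # -> (2, 256, 512)
```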

The minimum set cover (MSC) problem admits two classic algorithms: a greedy $\ln n$-approximation and a primal-dual $f$-approximation, where $n$ is the universe size and $f$ is the maximum frequency of an element. Both algorithms are simple and efficient, and remarkably, known hardness results imply that neither approximation can be improved by more than a factor of $(1+\epsilon)$, for any constant $\epsilon > 0$. In their pioneering work, Gupta et al. [STOC'17] showed that the greedy algorithm can be dynamized to achieve an $O(\log n)$-approximation with update time $O(f \log n)$. Building on this result, Hjuler et al. [STACS'18] dynamized the greedy minimum dominating set (MDS) algorithm, achieving a similar approximation with update time $O(\Delta \log n)$ (the analog of $O(f \log n)$), albeit for unweighted instances. The approximations of both algorithms, which are the state of the art, exceed the static $\ln n$-approximation by a rather large constant factor. In sharp contrast, the current best dynamic primal-dual MSC algorithms achieve fast update times together with an approximation that exceeds the static $f$-approximation by a factor of at most $1+\epsilon$, for any $\epsilon > 0$. This paper aims to bridge the gap between the best approximation factor of the dynamic greedy MSC and MDS algorithms and the static $\ln n$ bound. We present dynamic algorithms for weighted greedy MSC and MDS with approximation $(1+\epsilon)\ln n$ for any $\epsilon > 0$, while achieving the same update time (ignoring dependencies on $\epsilon$) as the best previous algorithms (whose approximation is significantly larger than $\ln n$). Moreover, [...]
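
For context, here is the classic static greedy procedure for weighted set cover that these dynamic algorithms aim to match in approximation quality (within $H_n \approx \ln n + 1$ of optimal); this is the textbook algorithm, not the paper's dynamic data structure:

```python
# Static greedy weighted set cover: repeatedly pick the set with the smallest
# cost per newly covered element.
def greedy_weighted_set_cover(universe, sets, costs):
    """universe: iterable of elements; sets: dict name -> set; costs: dict name -> cost > 0."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (s for s in sets if sets[s] & uncovered),
            key=lambda s: costs[s] / len(sets[s] & uncovered),
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

U = {1, 2, 3, 4, 5}
S = {"a": {1, 2, 3}, "b": {2, 4}, "c": {3, 4, 5}, "d": {5}}
c = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.5}
print(greedy_weighted_set_cover(U, S, c))  # e.g. ['a', 'c']
```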

We propose a new representation of functions in Sobolev spaces on an $N$-dimensional hyper-rectangle, expressing such functions in terms of their admissible derivatives, evaluated along lower boundaries of the domain. These boundary values are either finite-dimensional or exist in the space $L_{2}$ of square-integrable functions -- free of the continuity constraints inherent to Sobolev space. Moreover, we show that the map from this space of boundary values to the Sobolev space is given by an integral operator with polynomial kernel, and we prove that this map is invertible. Using this result, we propose a method for polynomial approximation of functions in Sobolev space, reconstructing such an approximation from polynomial projections of the boundary values. We prove that this approximation is optimal with respect to a discrete-continuous Sobolev norm, and show through numerical examples that it exhibits better convergence behavior than direct projection of the function. Finally, we show that this approach may also be adapted to use a basis of step functions, to construct accurate piecewise-polynomial approximations that do not suffer from, e.g., the Gibbs phenomenon.
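
To make the flavor of the representation concrete, here is a minimal one-dimensional instance (second-order Sobolev space on an interval); the paper's construction generalizes this to $N$ dimensions and arbitrary admissible derivatives:

```latex
% For u in H^2(a,b), Taylor's theorem with integral remainder gives
\[
u(x) \;=\; u(a) \;+\; u'(a)\,(x-a) \;+\; \int_{a}^{b} (x-s)\,\mathbb{1}_{\{s \le x\}}\, u''(s)\,\mathrm{d}s ,
\]
% so u is determined by the lower-boundary values (u(a), u'(a)) in R^2 together with
% u'' in L_2(a,b), which carries no continuity constraint. The map from these boundary
% data back to u is an integral operator whose kernel (x-s) 1_{s<=x} is polynomial on its
% support, and it is invertible: differentiating twice recovers u'', and evaluating u and
% u' at x = a recovers the finite-dimensional boundary values.
```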

Few-shot object detection (FSOD) aims to detect objects using only a few training samples of each novel class. Most existing methods adopt a transfer-learning strategy that constructs the novel class distribution by transferring base class knowledge. However, this direct transfer easily causes confusion between the novel class and similar categories in the decision space. To address this problem, we propose generating local reverse samples (LRSamples) in Prototype Reference Frames to adaptively adjust the center position and boundary range of the novel class distribution, thereby learning more discriminative novel class samples for FSOD. First, we propose a Center Calibration Variance Augmentation (CCVA) module, which comprises a selection rule for LRSamples, a generator of LRSamples, and augmentation of the calibrated distribution centers. Specifically, we design an intra-class feature converter (IFC) as the generator of CCVA to learn the selection rule. By transferring the knowledge of the IFC from base training to fine-tuning, the IFC generates plentiful novel samples to calibrate the novel class distribution. Moreover, we propose a Feature Density Boundary Optimization (FDBO) module to adaptively adjust the importance of samples depending on their distance from the decision boundary. It emphasizes high-density regions of similar classes (areas closer to the decision boundary) and down-weights low-density regions (areas farther from the decision boundary), thus yielding a clearer decision boundary for each category. We conduct extensive experiments to demonstrate the effectiveness of our proposed method. Our method achieves consistent improvements on the Pascal VOC and MS COCO datasets over the DeFRCN and MFDC baselines.
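
As a hedged, generic illustration of boundary-aware re-weighting in the spirit of FDBO, the sketch below up-weights samples of a similar base class whose classifier margin against the novel class is small (i.e., samples near the decision boundary); the margin-based proxy and the exponential weighting are illustrative assumptions, not the paper's formulation:

```python
# Generic boundary-aware sample weighting: small margin (near the boundary) -> large weight.
import torch

def boundary_weights(logits, novel_cls, similar_cls, tau=1.0):
    """logits: (N, n_classes) classifier scores for samples of the similar base class."""
    margin = (logits[:, similar_cls] - logits[:, novel_cls]).clamp(min=0.0)
    return torch.exp(-margin / tau)

w = boundary_weights(torch.randn(8, 20), novel_cls=3, similar_cls=7)  # (8,) weights in (0, 1]
```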

We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video-language pre-training using narrated instructional videos. Unlike these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone. We first pre-train our model to learn universal representations for both video and language on a large instructional video dataset. We then fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves the performance of both understanding and generation tasks and achieves state-of-the-art results.
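
A minimal sketch of the four-component layout described above (two single-modal encoders, a cross encoder, and a decoder, all Transformer-based); layer counts, dimensions, and the concatenation-based fusion are assumptions, not the authors' configuration:

```python
# Illustrative unified video-language model: encode each modality, fuse with a cross
# encoder, and decode a caption.
import torch
import torch.nn as nn

class UnifiedVideoLanguageModel(nn.Module):
    def __init__(self, dim=512, n_heads=8, vocab=30522, video_feat_dim=1024):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.text_embed = nn.Embedding(vocab, dim)
        self.video_proj = nn.Linear(video_feat_dim, dim)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)   # single-modal
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # single-modal
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # joint encoder
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)        # generation
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, video_feats, text_ids, caption_ids):
        t = self.text_encoder(self.text_embed(text_ids))         # (B, Lt, dim)
        v = self.video_encoder(self.video_proj(video_feats))     # (B, Lv, dim)
        joint = self.cross_encoder(torch.cat([t, v], dim=1))     # fused representation
        dec = self.decoder(self.text_embed(caption_ids), joint)  # caption decoding
        return self.lm_head(dec)                                 # (B, Lc, vocab)

model = UnifiedVideoLanguageModel()
logits = model(torch.randn(2, 32, 1024),
               torch.randint(0, 30522, (2, 16)),
               torch.randint(0, 30522, (2, 20)))  # -> (2, 20, 30522)
```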

We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector via a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationships between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev, with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
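
As a hedged sketch of what a global activation module of this kind can look like, the block below performs squeeze-and-excitation-style channel re-weighting; the actual DES module additionally ties channel activations to object classes through a self-supervised objective, which is not reproduced here:

```python
# Channel re-weighting block: global pooling -> bottleneck MLP -> per-channel gates.
import torch
import torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, C, H, W) detection feature map
        gates = self.fc(x.mean(dim=(2, 3)))           # global average pool -> channel gates
        return x * gates.unsqueeze(-1).unsqueeze(-1)  # re-weight channels globally

feat = torch.randn(2, 512, 38, 38)
print(GlobalActivation()(feat).shape)  # torch.Size([2, 512, 38, 38])
```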

Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers while disregarding explanations. We argue that the explanation for an answer is as important as, or even more important than, the answer itself, since it makes the question-answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), in which models are required to generate an explanation along with the predicted answer. We first construct a new dataset and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of the explanations synthesized by our method. We quantitatively show that the additional supervision from explanations not only produces insightful textual sentences to justify the answers, but also improves the performance of answer prediction. Our model outperforms state-of-the-art methods by a clear margin on the VQA v2 dataset.
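
A minimal sketch of the multi-task objective implied above, combining an answer-classification loss with an explanation-generation loss; the weighting scheme and tensor shapes are illustrative assumptions, not the paper's exact architecture:

```python
# Joint loss: answer cross-entropy + weighted token-level explanation cross-entropy.
import torch
import torch.nn.functional as F

def vqa_e_loss(answer_logits, answer_labels, expl_logits, expl_tokens, lam=1.0, pad_id=0):
    """answer_logits: (B, n_answers); expl_logits: (B, L, vocab); expl_tokens: (B, L)."""
    ans_loss = F.cross_entropy(answer_logits, answer_labels)
    expl_loss = F.cross_entropy(expl_logits.flatten(0, 1), expl_tokens.flatten(),
                                ignore_index=pad_id)  # ignore padded explanation tokens
    return ans_loss + lam * expl_loss

loss = vqa_e_loss(torch.randn(4, 3000), torch.randint(0, 3000, (4,)),
                  torch.randn(4, 12, 10000), torch.randint(0, 10000, (4, 12)))
```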
