亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes, and Waymo Open dataset. Our source code is available at //github.com/L-KID/Videoobject-detection-by-location-anticipation.

相關內容

讓 iOS 8 和 OS X Yosemite 無縫切換的一個新特性。 > Apple products have always been designed to work together beautifully. But now they may really surprise you. With iOS 8 and OS X Yosemite, you’ll be able to do more wonderful things than ever before.

Source:

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.

Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations useful for decision taking and planning. In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction resulting in some form of map, usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that the blindness property is important and forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, significantly improving performance.

The IceCube Neutrino Observatory is a cubic-kilometer high-energy neutrino detector deployed in the Antarctic ice. Two major event classes are charged-current electron and muon neutrino interactions. In this contribution, we discuss the inference of direction and energy for these classes using conditional normalizing flows. They allow to derive a posterior distribution for each individual event based on the raw data that can include systematic uncertainties, which makes them very promising for next-generation reconstructions. For each normalizing flow we use the differential entropy and the KL-divergence to its maximum entropy approximation to interpret the results. The normalizing flows correctly incorporate complex optical properties of the Antarctic ice and their relation to the embedded detector. For showers, the differential entropy increases in regions of high photon absorption and decreases in clear ice. For muons, the differential entropy strongly correlates with the contained track length. Coverage is maintained, even for low photon counts and highly asymmetrical contour shapes. For high-photon counts, the distributions get narrower and become more symmetrical, as expected from the asymptotic theorem of Bernstein-von-Mises. For shower directional reconstruction, we find the region between 1 TeV and 100 TeV to potentially benefit the most from normalizing flows because of azimuth-zenith asymmetries which have been neglected in previous analyses by assuming symmetrical contours. Events in this energy range play a vital role in the recent discovery of the galactic plane diffuse neutrino emission.

Projection-based Reduced Order Models minimize the discrete residual of a "full order model" (FOM) while constraining the unknowns to a reduced dimension space. For problems with symmetric positive definite (SPD) Jacobians, this is optimally achieved by projecting the full order residual onto the approximation basis (Galerkin Projection). This is sub-optimal for non-SPD Jacobians as it only minimizes the projection of the residual, not the residual itself. An alternative is to directly minimize the 2-norm of the residual, achievable using QR factorization or the method of the normal equations (LSPG). The first approach involves constructing and factorizing a large matrix, while LSPG avoids this but requires constructing a product element by element, necessitating a complementary mesh and adding complexity to the hyper-reduction process. This work proposes an alternative based on Petrov-Galerkin minimization. We choose a left basis for a least-squares minimization on a reduced problem, ensuring the discrete full order residual is minimized. This is applicable to both SPD and non-SPD Jacobians, allowing element-by-element assembly, avoiding the use of a complementary mesh, and simplifying finite element implementation. The technique is suitable for hyper-reduction using the Empirical Cubature Method and is applicable in nonlinear reduction procedures.

The problem of multi-object tracking (MOT) consists in detecting and tracking all the objects in a video sequence while keeping a unique identifier for each object. It is a challenging and fundamental problem for robotics. In precision agriculture the challenge of achieving a satisfactory solution is amplified by extreme camera motion, sudden illumination changes, and strong occlusions. Most modern trackers rely on the appearance of objects rather than motion for association, which can be ineffective when most targets are static objects with the same appearance, as in the agricultural case. To this end, on the trail of SORT [5], we propose AgriSORT, a simple, online, real-time tracking-by-detection pipeline for precision agriculture based only on motion information that allows for accurate and fast propagation of tracks between frames. The main focuses of AgriSORT are efficiency, flexibility, minimal dependencies, and ease of deployment on robotic platforms. We test the proposed pipeline on a novel MOT benchmark specifically tailored for the agricultural context, based on video sequences taken in a table grape vineyard, particularly challenging due to strong self-similarity and density of the instances. Both the code and the dataset are available for future comparisons.

Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We investigate the impact of generated data on speech recognition in two scenarios: using in-domain CS text and a zero-shot approach with synthesized CS text. Empirical results highlight up to 34.4% and 16.2% relative reductions in Mixed-Error Rate and Word-Error Rate for in-domain and zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation bolsters the model's code-switching inclination and reduces its monolingual bias.

Over the last fifty years, the United States have experienced hundreds of mass public shootings that resulted in thousands of victims. Characterized by their frequent occurrence and devastating nature, mass shootings have become a major public health hazard that dramatically impact safety and well-being of individuals and communities. Given the epidemic traits of this phenomenon, there have been concerted efforts to understand the root causes that lead to public mass shootings in order to implement effective prevention strategies. We propose a quantile mixed graphical model for investigating the intricacies of inter- and infra-domain relationships of this complex phenomenon, where conditional relations between discrete and continuous variables are modeled without stringent distributional assumptions using Parzen's definition of mid-quantile. To retrieve the graph structure and recover only the most relevant connections, we consider the neighborhood selection approach in which conditional mid-quantiles of each variable in the network are modeled as a sparse function of all others. We propose a two-step procedure to estimate the graph where, in the first step, conditional mid-probabilities are obtained semi-parametrically and, in the second step, the model parameters are estimated by solving an implicit equation with a LASSO penalty.

We show how to reduce the computational time of the practical implementation of the Raviart-Thomas mixed method for second-order elliptic problems. The implementation takes advantage of a recent result which states that certain local subspaces of the vector unknown can be eliminated from the equations by transforming them into stabilization functions; see the paper published online in JJIAM on August 10, 2023. We describe in detail the new implementation (in MATLAB and a laptop with Intel(R) Core (TM) i7-8700 processor which has six cores and hyperthreading) and present numerical results showing 10 to 20% reduction in the computational time for the Raviart-Thomas method of index $k$, with $k$ ranging from 1 to 20, applied to a model problem.

In recent years, object detection has experienced impressive progress. Despite these improvements, there is still a significant gap in the performance between the detection of small and large objects. We analyze the current state-of-the-art model, Mask-RCNN, on a challenging dataset, MS COCO. We show that the overlap between small ground-truth objects and the predicted anchors is much lower than the expected IoU threshold. We conjecture this is due to two factors; (1) only a few images are containing small objects, and (2) small objects do not appear enough even within each image containing them. We thus propose to oversample those images with small objects and augment each of those images by copy-pasting small objects many times. It allows us to trade off the quality of the detector on large objects with that on small objects. We evaluate different pasting augmentation strategies, and ultimately, we achieve 9.7\% relative improvement on the instance segmentation and 7.1\% on the object detection of small objects, compared to the current state of the art method on MS COCO.

Degradation of image quality due to the presence of haze is a very common phenomenon. Existing DehazeNet [3], MSCNN [11] tackled the drawbacks of hand crafted haze relevant features. However, these methods have the problem of color distortion in gloomy (poor illumination) environment. In this paper, a cardinal (red, green and blue) color fusion network for single image haze removal is proposed. In first stage, network fusses color information present in hazy images and generates multi-channel depth maps. The second stage estimates the scene transmission map from generated dark channels using multi channel multi scale convolutional neural network (McMs-CNN) to recover the original scene. To train the proposed network, we have used two standard datasets namely: ImageNet [5] and D-HAZY [1]. Performance evaluation of the proposed approach has been carried out using structural similarity index (SSIM), mean square error (MSE) and peak signal to noise ratio (PSNR). Performance analysis shows that the proposed approach outperforms the existing state-of-the-art methods for single image dehazing.

北京阿比特科技有限公司