High performance computing on low power devices can be used to speed up calculations on processors that run at a lower clock rate than computers for which energy efficiency is not an issue. In this study, different high performance techniques for Android devices have been compared, with a special focus on the use of the GPU. Although not officially supported, the OpenCL framework can be used on Android tablets. For the comparison of the different parallel programming paradigms, a benchmark was chosen that could be implemented easily with all frameworks. The Mandelbrot algorithm is computationally intensive and requires very few input and output operations. The algorithm has been implemented in Java, C, C with assembler, C with SIMD assembler, C with OpenCL using scalar instructions, and C with OpenCL using vector instructions. The implementations have been tested on all architectures currently supported by Android. High speedups can be achieved using SIMD and OpenCL, although the implementation is not straightforward for either one. Apps that use the GPU must account for the fact that they can be suspended by the user at any moment. By using the OpenCL framework on the GPU of Android devices, computational power comparable to that of modern high speed CPUs can be made available to the software developer.
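For context, a minimal sketch of the Mandelbrot escape-time iteration that serves as the benchmark kernel is given below. It is written in Python purely for illustration; the study's actual implementations are in Java, C, SIMD assembler and OpenCL, and the image bounds and iteration limit here are arbitrary choices.

```python
# Minimal Mandelbrot escape-time kernel (illustrative sketch only; not one of
# the Java/C/SIMD/OpenCL implementations compared in the study).
def mandelbrot(width, height, max_iter=256,
               re_min=-2.0, re_max=1.0, im_min=-1.5, im_max=1.5):
    image = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            c = complex(re_min + x * (re_max - re_min) / width,
                        im_min + y * (im_max - im_min) / height)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:   # escape test
                z = z * z + c                        # compute-bound inner loop
                n += 1
            image[y][x] = n                          # iteration count per pixel
    return image

# The kernel is purely compute-bound: the only output is the grid of iteration
# counts, which is why it suits a cross-framework performance comparison.
```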
Warded Datalog+- extends the logic-based language Datalog with existential quantifiers in rule heads. Existential rules are needed for advanced reasoning tasks, e.g., ontological reasoning. The theoretical efficiency guarantees of Warded Datalog+- do not cover extensions crucial for data analytics, such as arithmetic. Moreover, despite the significance of arithmetic for common data analytic scenarios, no decidable fragment of any Datalog+- language extended with arithmetic has been identified. We close this gap by defining a new language that extends Warded Datalog+- with arithmetic and prove its P-completeness. Furthermore, we present an efficient reasoning algorithm for our newly defined language and prove descriptive complexity results for a recently introduced Datalog fragment with integer arithmetic, thereby closing an open question. We lay the theoretical foundation for highly expressive Datalog+- languages that combine the power of advanced recursive rules and arithmetic while guaranteeing efficient reasoning algorithms for applications in modern AI systems, such as Knowledge Graphs.
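As an illustration only (the predicates, variables, and constant are hypothetical and not taken from the paper), an existential rule extended with arithmetic in this style can be written as
\[
\mathrm{Sale}(x, p) \wedge \mathrm{Tax}(x, t) \wedge c = p + t \;\rightarrow\; \exists y\, \mathrm{Invoice}(y, x, c),
\]
where the existential variable $y$ invents an invoice identifier and the arithmetic atom $c = p + t$ is the kind of extension whose decidability and complexity are at issue.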
Subspace recycling iterative methods and other subspace augmentation schemes are a successful extension to Krylov subspace methods, in which a Krylov subspace is augmented with a fixed subspace spanned by vectors deemed helpful in accelerating convergence or conveying knowledge of the solution. Recently, a survey was published in which a framework describing the vast majority of such methods was proposed [Soodhalter et al., GAMM-Mitt. 2020]. In many of these methods, the Krylov subspace is one generated by the system matrix composed with a projector that depends on the augmentation space. However, it is not a requirement that a projected Krylov subspace be used. There are augmentation methods built on using Krylov subspaces generated by the original system matrix, and these methods also fit into the general framework. In this note, we observe that one gains implementation benefits by considering such augmentation methods with unprojected Krylov subspaces in the general framework. We demonstrate this by applying the idea to the R$^3$GMRES method proposed in [Dong et al., ETNA 2014] to obtain a simplified implementation and to connect that algorithm to early augmentation schemes based on flexible preconditioning [Saad, SIMAX 1997].
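In the spirit of that framework (the symbols below are chosen for illustration and are not quoted from the survey), an augmented method seeks an approximation of the form
\[
x_j \in x_0 + \mathcal{U} + \mathcal{K}_j, \qquad
\mathcal{K}_j = \mathcal{K}_j\bigl((I-P)A,\, v\bigr) \ \text{(projected)} \quad \text{or} \quad
\mathcal{K}_j = \mathcal{K}_j(A,\, v) \ \text{(unprojected)},
\]
where $\mathcal{U}$ is the fixed augmentation space, $P$ is a projector determined by $\mathcal{U}$, and $v$ is a suitable starting vector; the first choice is the projected Krylov subspace used by many methods, while the second is the unprojected one whose implementation benefits this note highlights.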
In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models fitted on subsamples of the training data and then choosing a stable predictor set which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, while additionally requiring expert knowledge to suitably configure the hyperparameters. Since model selection depends on the loss function, i.e., predictor sets selected w.r.t. one loss function differ from those selected w.r.t. another, we propose a Stability Selection variant which respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, they can avoid the severe underfitting which affects the original Stability Selection on noisy high-dimensional data: our priority is not to avoid false positives at all costs but to obtain a sparse, stable model with which one can make predictions. Experiments on both regression and binary classification, using Boosting as the model selection algorithm, reveal a significant precision improvement over raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.
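A minimal sketch of the general idea is given below: selection frequencies are aggregated over subsamples, and the stable set is then chosen by its loss on out-of-sample validation data. This is an illustration under assumptions, not the authors' exact procedure; the base learner (Lasso here, Boosting in the paper), the subsampling scheme, the candidate thresholds, and the squared-error loss are all placeholder choices.

```python
# Generic stability-selection sketch with a loss-based validation step
# (illustrative; not the paper's algorithm).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

def selection_frequencies(X, y, n_subsamples=100, alpha=0.1, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)    # subsample the data
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += (model.coef_ != 0)                        # selected predictors
    return counts / n_subsamples

def stable_set_by_validation(X_tr, y_tr, X_val, y_val, freqs,
                             thresholds=(0.6, 0.7, 0.8, 0.9)):
    """Pick the frequency threshold whose stable predictor set minimizes the
    chosen loss (here: squared error) on out-of-sample validation data."""
    best_loss, best_set = np.inf, np.array([], dtype=int)
    for t in thresholds:
        support = np.flatnonzero(freqs >= t)
        if support.size == 0:
            continue
        refit = LinearRegression().fit(X_tr[:, support], y_tr)
        loss = mean_squared_error(y_val, refit.predict(X_val[:, support]))
        if loss < best_loss:
            best_loss, best_set = loss, support
    return best_set
```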
Inter-component communication (ICC) is a widely used mechanism in mobile apps, which enables message-based control flow transfer and data passing between components. Effective ICC resolution requires precisely identifying entry points, analyzing the data values of ICC fields, modeling related framework APIs, etc. Due to the variety of control-flow and data-value tracking characteristics involved and the lack of oracles for real-world apps, the practical evaluation of ICC resolution techniques is challenging. To fill this gap, we collect multiple-type benchmark suites with 4,104 apps, covering characteristic-specific, open-source, and commercial ones. Considering the differences among the benchmarks, we adopt evaluation metrics suited to each, based on counts, graph structure, and a reliable oracle. As no oracle for real-world apps is available, we design a dynamic analysis approach to extract the real ICC links triggered during GUI exploration. Overall, we manually confirm 1,680 ICCs and label 42,000 code characteristic tags to form a reliable oracle set. The evaluation performed on six off-the-shelf ICC resolution tools shows that the tools behave inconsistently across benchmarks and metrics. Using the reliable oracles, we find that 39%-85% of ICCs are missed by the six tools, owing to their inadequate analysis of specific code characteristics. With the help of the graph metrics, we identify many wrongly reported ICCs caused by conservative analysis or the transitivity of imprecision. Finally, we summarize eight FN/FP patterns in ICC resolution for further improvement.
A finite permutation group $G$ on $\Omega$ is called a rank 3 group if it has precisely three orbits in its induced action on $\Omega \times \Omega$. The largest permutation group on $\Omega$ having the same orbits as $G$ on $\Omega \times \Omega$ is called the 2-closure of $G$. We construct a polynomial-time algorithm which, given generators of a rank 3 group, computes generators of its 2-closure.
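For reference, one standard way to state the verbal definition above in symbols is
\[
G^{(2)} = \bigl\{\, h \in \mathrm{Sym}(\Omega) \ :\ (\alpha,\beta)^h \in (\alpha,\beta)^G \ \text{for all } (\alpha,\beta) \in \Omega \times \Omega \,\bigr\},
\]
i.e., the 2-closure $G^{(2)}$ is the largest subgroup of $\mathrm{Sym}(\Omega)$ whose orbits on $\Omega \times \Omega$ coincide with those of $G$.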
Many deep learning models developed in recent years reach higher ImageNet accuracy than ResNet50, with a lower or comparable FLOP count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering a better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks' accuracy while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-oriented optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieve better accuracy and efficiency than previous ConvNets. Using a TResNet model with GPU throughput similar to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). Implementation is available at: //github.com/mrT23/TResNet
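Since the argument rests on measured GPU throughput rather than FLOP counts, a hedged sketch of how such a measurement can be taken is shown below. It uses torchvision's resnet50 only as a stand-in model; the batch size, image resolution, and iteration counts are illustrative choices, not the paper's benchmark settings.

```python
# Sketch: measuring actual GPU inference throughput (images/sec) instead of
# relying on FLOP counts. Assumes PyTorch with a CUDA device available.
import time
import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
batch = torch.randn(64, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(batch)
    torch.cuda.synchronize()                 # make sure warm-up kernels finished
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()                 # wait for all timed kernels
    elapsed = time.time() - start

print(f"throughput: {iters * batch.shape[0] / elapsed:.1f} images/sec")
```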
With the emergence of edge computing, there is an increasing need to run convolutional neural network based object detection on small form factor edge computing devices with limited compute and thermal budgets, for applications such as video surveillance. To address this problem, efficient object detection frameworks such as YOLO and SSD were proposed. However, SSD based object detection using VGG16 as the backbone network is insufficient to achieve real-time speed on edge devices. To further improve detection speed, the backbone network is replaced by more efficient networks such as SqueezeNet and MobileNet. Although the speed is greatly improved, it comes at the price of lower accuracy. In this paper, we propose an efficient SSD named Fire SSD. Fire SSD achieves 70.7 mAP on the Pascal VOC 2007 test set. Fire SSD achieves a speed of 30.6 FPS on a low power mainstream CPU, is about 6 times faster than SSD300, and has about 4 times smaller model size. Fire SSD also achieves 22.2 FPS on an integrated GPU.
Recent years have witnessed significant progress in deep Reinforcement Learning (RL). Empowered with large-scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen "robustly": commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. These observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.
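One concrete form such an evaluation protocol can take is a train/test split over environment instances, sketched below. This is a generic illustration rather than the paper's protocol; `make_env(seed)` and `agent` are hypothetical placeholders (an environment factory and a trained policy) and do not refer to any specific library API.

```python
# Sketch of evaluating an RL agent on held-out environment instances to probe
# overfitting. `make_env` and `agent` are hypothetical placeholders.
import numpy as np

TRAIN_SEEDS = range(0, 100)        # instances the agent was trained on
TEST_SEEDS = range(100, 200)       # held-out instances, never seen in training

def evaluate(agent, seeds, episodes_per_seed=5):
    returns = []
    for seed in seeds:
        env = make_env(seed)                       # hypothetical environment factory
        for _ in range(episodes_per_seed):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done = env.step(agent.act(obs))
                total += reward
            returns.append(total)
    return float(np.mean(returns))

# A large gap between the two averages is one symptom of the overfitting
# described above:
# print(evaluate(agent, TRAIN_SEEDS), evaluate(agent, TEST_SEEDS))
```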
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable over a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparable to BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform or compete with its BN-based counterparts for object detection and segmentation on COCO, and for video classification on Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
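Since the computation is stated directly in the abstract (per-group mean and variance over the channel and spatial axes, independent of the batch size), a minimal NumPy sketch is given below. The group count, epsilon, and per-channel affine parameters follow the usual convention but are illustrative, not taken from a specific library implementation.

```python
# Minimal Group Normalization sketch for an NCHW tensor, NumPy only.
import numpy as np

def group_norm(x, gamma, beta, num_groups=32, eps=1e-5):
    # x: (N, C, H, W); gamma, beta: (1, C, 1, 1) learned scale and shift
    n, c, h, w = x.shape
    x = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)        # statistics per group,
    var = x.var(axis=(2, 3, 4), keepdims=True)          # independent of batch size
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(n, c, h, w) * gamma + beta

x = np.random.randn(2, 64, 8, 8)                         # works for any batch size
y = group_norm(x, gamma=np.ones((1, 64, 1, 1)), beta=np.zeros((1, 64, 1, 1)))
```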
Object detection is considered one of the most challenging problems in the field of computer vision, as it involves the combination of object classification and object localization within a scene. Recently, deep neural networks (DNNs) have been demonstrated to achieve superior object detection performance compared to other approaches, with YOLOv2 (an improved You Only Look Once model) being one of the state-of-the-art DNN-based object detection methods in terms of both speed and accuracy. Although YOLOv2 can achieve real-time performance on a powerful GPU, it remains very challenging to leverage this approach for real-time object detection in video on embedded computing devices with limited computational power and limited memory. In this paper, we propose a new framework called Fast YOLO, a fast You Only Look Once framework which accelerates YOLOv2 to perform object detection in video on embedded devices in a real-time manner. First, we leverage the evolutionary deep intelligence framework to evolve the YOLOv2 network architecture and produce an optimized architecture (referred to as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To further reduce power consumption on embedded devices while maintaining performance, a motion-adaptive inference method is introduced into the proposed Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2 based on temporal motion characteristics. Experimental results show that the proposed Fast YOLO framework can reduce the number of deep inferences by an average of 38.13%, with an average speedup of ~3.3X for object detection in video compared to the original YOLOv2, enabling Fast YOLO to run at an average of ~18 FPS on an Nvidia Jetson TX1 embedded system.
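To make the motion-adaptive idea concrete, a hedged sketch of motion-gated inference is shown below: the expensive detector runs only when frame-to-frame change exceeds a threshold, and previous detections are reused otherwise. Plain frame differencing stands in for the paper's motion-adaptive module; `detector` is a hypothetical callable and the threshold is an arbitrary choice.

```python
# Sketch of motion-gated deep inference over a video stream (illustrative;
# not the Fast YOLO motion-adaptive module). `detector` is a placeholder.
import numpy as np

def motion_adaptive_detection(frames, detector, motion_threshold=8.0):
    detections, prev_frame, prev_dets = [], None, None
    for frame in frames:                                  # frame: HxW grayscale array
        if prev_frame is None:
            prev_dets = detector(frame)                   # always run on the first frame
        else:
            motion = np.mean(np.abs(frame.astype(float) - prev_frame.astype(float)))
            if motion > motion_threshold:                 # enough change: run deep inference
                prev_dets = detector(frame)
            # else: reuse prev_dets and skip the expensive network pass
        detections.append(prev_dets)
        prev_frame = frame
    return detections
```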