The evolution of skyline and ranking queries has created new archetypes such as flexible skylines, which have proven to be an efficient way to select relevant data from large datasets using multi-objective optimization. This paper studies possible applications of the Poisson probability mass function as a monotonic scoring function in flexible skyline processes, especially those featuring schemas whose attributes can be translated to constant mean rates. Moreover, a method to express users' requirements by means of the F-dominant set of tuples is proposed using parametric variations in F[1]. Algorithm construction and potential applications are also studied.
Given its status as a classic problem and its importance to both theoreticians and practitioners, edit distance provides an excellent lens through which to understand how the theoretical analysis of algorithms impacts practical implementations. From an applied perspective, the goals of theoretical analysis are to predict the empirical performance of an algorithm and to serve as a yardstick to design novel algorithms that perform well in practice. In this paper, we systematically survey the types of theoretical analysis techniques that have been applied to edit distance and evaluate the extent to which each one has achieved these two goals. These techniques include traditional worst-case analysis, worst-case analysis parametrized by edit distance or entropy or compressibility, average-case analysis, semi-random models, and advice-based models. We find that the track record is mixed. On one hand, two algorithms widely used in practice have been born out of theoretical analysis and their empirical performance is captured well by theoretical predictions. On the other hand, all the algorithms developed using theoretical analysis as a yardstick since then have not had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
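As a point of reference for the worst-case analyses surveyed above, the textbook quadratic-time dynamic program for edit distance can be sketched as follows. This is a minimal illustration that assumes unit-cost insertions, deletions, and substitutions (Levenshtein distance); the function name `edit_distance` is hypothetical, and the sketch is not one of the specific algorithms evaluated in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the classic dynamic program.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, store rows for the shorter
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

The quadratic table is the baseline that the parametrized and average-case analyses in the survey try to improve on.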
Coordinated multiple views (CMVs) are a visualization technique that simultaneously presents multiple visualizations in separate but linked views. Many studies report the advantages (e.g., usefulness for finding hidden relationships) and disadvantages (e.g., cognitive load) of CMVs. However, little empirical work exists on the impact of the number of views on visual analysis results and processes, which leaves the relationship between the number of views and visual analysis uncertain. In this work, we investigate the relationship between the number of coordinated views and users' analytic processes and results. To this end, we implemented a CMV tool for visual analysis. We also provided visualization duplication in the tool to help users easily create a desired number of visualization views on the fly. We conducted a between-subjects study with 44 participants, in which we asked participants to solve five analytic problems using the visual tool. Through quantitative and qualitative analysis, we discovered a positive correlation between the number of views and analytic results. We also found that visualization duplication encourages users to create more views and to adopt various analysis strategies. Based on the results, we discuss the implications and limitations of our study.
Most methods for automated full-bore rock core image analysis (description, colour, properties distribution, etc.) are based on the analysis of separate core columns. The core is usually imaged in a box because imaging each core column individually takes a significant amount of time. This work presents an innovative method and algorithm for extracting core columns from core boxes. The conditions for core box imaging may differ tremendously. Such differences are disastrous for machine learning algorithms, which need a large dataset describing all possible data variations. Still, such images have some standard features: a box and core. Thus, we can emulate different environments with a unique augmentation described in this work, called template-like augmentation (TLA). The method is described and tested on various environments, and results are compared for an algorithm trained on 'traditional' data and one trained on a mix of traditional and TLA data. The algorithm trained with TLA data provides better metrics and can detect core on most new images, unlike the algorithm trained on data without TLA. The algorithm for core column extraction, implemented in an automated core description system, speeds up core box processing by a factor of 20.
Background: Transrectal ultrasound-guided systematic biopsy of the prostate is a routine procedure to establish a prostate cancer diagnosis. However, the 10-12 prostate core biopsies only sample a relatively small volume of the prostate, and tumour lesions in regions between biopsy cores can be missed, leading to a well-known low sensitivity to detect clinically relevant cancer. As a proof-of-principle, we developed and validated a deep convolutional neural network model to distinguish between morphological patterns in benign prostate biopsy whole slide images from men with and without established cancer. Methods: This study included 14,354 hematoxylin and eosin stained whole slide images from benign prostate biopsies from 1,508 men in two groups: men without an established prostate cancer (PCa) diagnosis and men with at least one core biopsy diagnosed with PCa. 80% of the participants were assigned as training data and used for model optimization (1,211 men), and the remaining 20% (297 men) as a held-out test set used to evaluate model performance. An ensemble of 10 deep convolutional neural network models was optimized for classification of biopsies from men with and without established cancer. Hyperparameter optimization and model selection were performed by cross-validation in the training data. Results: Area under the receiver operating characteristic curve (ROC-AUC) was estimated as 0.727 (bootstrap 95% CI: 0.708-0.745) on biopsy level and 0.738 (bootstrap 95% CI: 0.682-0.796) on man level. At a specificity of 0.9 the model had an estimated sensitivity of 0.348. Conclusion: The developed model has the ability to detect men with risk of missed PCa due to under-sampling of the prostate. The proposed model has the potential to reduce the number of false negative cases in routine systematic prostate biopsies and to indicate men who could benefit from MRI-guided re-biopsy.
Satellites and their instruments must maintain motion stability throughout their lifetimes. The reliability of large flexible space structures (LFSS) is particularly important for the motion stability of satellites and their instruments. In this paper, the reliability analysis of large flexible space structures is conducted based on Bayesian support vector regression (SVR). The kinematic model of a typical large flexible space structure is first established. Based on the kinematic model, a surrogate model of the motion of the large flexible space structure is then developed to reduce the computational cost. Finally, the reliability analysis is conducted using the surrogate model. The proposed method shows high accuracy and efficiency for the reliability assessment of the typical large flexible space structure and can be further developed for other LFSS.
Spectral efficiency improvement is a key focus in most wireless communication systems and is achieved by various means such as using large antenna arrays and/or advanced modulation schemes and signal formats. This work proposes to further improve spectral efficiency by combining non-orthogonal spectrally efficient frequency division multiplexing (SEFDM) systems with index modulation (IM), which can efficiently make use of the indices of activated subcarriers as communication information. Recent research has verified that IM may be used with SEFDM to alleviate inter-carrier interference (ICI) and improve error performance. This work proposes new SEFDM signal formats based on novel activation pattern designs, which limit the locations of activated subcarriers and enable a variable number of activated subcarriers in each SEFDM subblock. SEFDM-IM system designs are developed by jointly considering activation patterns, modulation schemes and signal waveform formats, with a set of solutions evaluated under different spectral efficiency scenarios. Detailed modelling of coded systems and simulation studies reveal that the proposed designs lead not only to a better bit error rate (BER) but also to a lower peak-to-average power ratio (PAPR) and reduced computational complexity relative to other reported index-modulated systems.
In this paper we study the finite-sample and asymptotic properties of various weighting estimators of the local average treatment effect (LATE), several of which are based on Abadie's (2003) kappa theorem. Our framework presumes a binary endogenous explanatory variable ("treatment") and a binary instrumental variable, which may only be valid after conditioning on additional covariates. We argue that one of the Abadie estimators, which we show is weight normalized, is likely to dominate the others in many contexts. A notable exception is in settings with one-sided noncompliance, where certain unnormalized estimators have the advantage of being based on a denominator that is bounded away from zero. We use a simulation study and three empirical applications to illustrate our findings. In applications to causal effects of college education using the college proximity instrument (Card, 1995) and causal effects of childbearing using the sibling sex composition instrument (Angrist and Evans, 1998), the unnormalized estimates are clearly unreasonable, with "incorrect" signs, magnitudes, or both. Overall, our results suggest that (i) the relative performance of different kappa weighting estimators varies with features of the data-generating process; and that (ii) the normalized version of Tan's (2006) estimator may be an attractive alternative in many contexts. Applied researchers with access to a binary instrumental variable should also consider covariate balancing or doubly robust estimators of the LATE.
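For orientation, the kappa weights behind these estimators can be written out as below. The notation is an assumption introduced here for illustration, not taken from the abstract: D is the binary treatment, Z the binary instrument, X the covariates, p(X) = P(Z = 1 | X) the instrument propensity score, D_1 and D_0 the potential treatments, and g any measurable function with finite expectation. Under the usual conditional instrument validity and monotonicity assumptions, Abadie's (2003) theorem identifies complier expectations as a kappa-weighted mean.

```latex
\kappa \;=\; 1 \;-\; \frac{D\,(1 - Z)}{1 - p(X)} \;-\; \frac{(1 - D)\,Z}{p(X)},
\qquad
\mathbb{E}\!\left[\, g(Y, D, X) \mid D_1 > D_0 \,\right]
\;=\; \frac{\mathbb{E}\!\left[\kappa\, g(Y, D, X)\right]}{\mathbb{E}\!\left[\kappa\right]} .
```

The denominator E[κ] equals the population share of compliers. When that share is small, a sample analogue of the ratio is sensitive to how its denominator is estimated.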
Tensor PCA is a stylized statistical inference problem introduced by Montanari and Richard to study the computational difficulty of estimating an unknown parameter from higher-order moment tensors. Unlike its matrix counterpart, Tensor PCA exhibits a statistical-computational gap, i.e., a sample size regime where the problem is information-theoretically solvable but conjectured to be computationally hard. This paper derives computational lower bounds on the run-time of memory-bounded algorithms for Tensor PCA using communication complexity. These lower bounds specify a trade-off among the number of passes through the data sample, the sample size, and the memory required by any algorithm that successfully solves Tensor PCA. While the lower bounds do not rule out polynomial-time algorithms, they do imply that many commonly used algorithms, such as gradient descent and the power method, must have a higher iteration count when the sample size is not large enough. Similar lower bounds are obtained for Non-Gaussian Component Analysis, a family of statistical estimation problems in which low-order moment tensors carry no information about the unknown parameter. Finally, stronger lower bounds are obtained for an asymmetric variant of Tensor PCA and related statistical estimation problems. These results explain why many estimators for these problems use a memory state that is significantly larger than the effective dimensionality of the parameter of interest.
One of the most important problems in system identification and statistics is how to estimate the unknown parameters of a given model. Optimization methods and specialized procedures, such as Expectation-Maximization (EM), can be used when the likelihood function can be computed. For situations where one can only simulate from a parametric model, but the likelihood is difficult or impossible to evaluate, a technique known as the Two-Stage (TS) Approach can be applied to obtain reliable parametric estimates. Unfortunately, there is currently a lack of theoretical justification for TS. In this paper, we propose a statistical decision-theoretical derivation of TS, which leads to Bayesian and Minimax estimators. We also show how to apply the TS approach to models for independent and identically distributed samples, by computing quantiles of the data as a first step and using a linear function as the second stage. The proposed method is illustrated via numerical simulations.
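The two-stage idea for i.i.d. samples (quantiles first, a linear map second) can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's construction: the model is taken to be exponential with mean `theta`, the first stage uses only the sample median, training parameters are drawn uniformly from [1, 5], and the second stage is fitted by ordinary least squares. All function names are hypothetical.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Draw n i.i.d. samples from the assumed model: exponential with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]


def first_stage(sample):
    """Compress the data to a quantile summary (here only the median)."""
    return statistics.median(sample)


def fit_second_stage(features, thetas):
    """Least-squares fit of theta ~ intercept + slope * feature."""
    mean_x = statistics.fmean(features)
    mean_y = statistics.fmean(thetas)
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, thetas))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def train_two_stage(n, n_train=200, seed=0):
    """Simulate training pairs (median, theta) and fit the linear second stage."""
    rng = random.Random(seed)
    thetas = [rng.uniform(1.0, 5.0) for _ in range(n_train)]  # assumed training range
    features = [first_stage(simulate(t, n, rng)) for t in thetas]
    return fit_second_stage(features, thetas)


def estimate(sample, coefficients):
    """Apply both stages to observed data."""
    intercept, slope = coefficients
    return intercept + slope * first_stage(sample)


if __name__ == "__main__":
    coefficients = train_two_stage(n=1000)
    observed = simulate(3.0, 1000, random.Random(1))
    print(estimate(observed, coefficients))
```

Only simulation from the model is used; the likelihood is never evaluated. Because the median of an exponential distribution is proportional to its mean, a linear second stage is a natural fit for this particular toy model.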
This paper addresses the difficulty of jointly forecasting multiple financial time series (TS) using deep neural networks (DNN). We investigate whether DNN-based models could forecast these TS more efficiently by learning their representation directly. To this end, we make use of the dynamic factor graph (DFG), which we enhance by proposing a novel variable-length attention-based mechanism to render it memory-augmented. Using this mechanism, we propose an unsupervised DNN architecture for multivariate TS forecasting that learns and takes advantage of the relationships between these TS. We test our model on two datasets covering 19 years of investment fund activities. Our experimental results show that the proposed approach significantly outperforms typical DNN-based and statistical models at forecasting their 21-day price trajectory.