In recent years, the amount of data available on the Internet and the number of Internet users have grown at an unparalleled pace. This exponential growth in digital information and in the number of users has created the potential for information overload, which impedes fast access to items of interest. Information retrieval systems such as Google, DevilFinder, and Altavista have partly overcome this challenge, but prioritization and personalization of information (where a system maps available content to a user's interests and preferences) were lacking. This has resulted in a higher-than-ever need for recommender systems. Recommender systems are information filtering systems that address information overload by filtering important information fragments out of a large volume of dynamically generated data according to the user's interests, preferences, and ratings of items. Based on a user's profile, a recommender system can predict whether that user would like a particular item.
Datasets collecting software mentions from scholarly publications can potentially be used for research into the software that has been used in the published research, as well as into the practice of software citation. Recently, new software mention datasets with different characteristics have been published. We present an approach to assess the usability of such datasets for research on research software. Our approach includes sampling and data preparation, manual annotation for quality and mention characteristics, and annotation analysis. We applied it to two software mention datasets for evaluation based on qualitative observation. In doing so, we identified challenges in using the selected datasets for research. The main issues concern the structure of the dataset, the quality of the extracted mentions (54% and 23% of mentions, respectively, do not refer to software), and software accessibility. While one dataset does not provide links to mentioned software at all, the other does so in a way that can impede quantitative research endeavors: (1) Links may come from different sources, each pointing to different software for the same mention. (2) The quality of the automatically retrieved links is generally poor (in our sample, 65.4% link to the wrong software). (3) Links exist only for a small subset of mentions (20.5% in our sample), which may lead to skewed or disproportionate samples. However, the greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation: Software should not merely be mentioned; it should be cited following the software citation principles.
We introduce the concept of Automated Causal Discovery (AutoCD), defined as any system that aims to fully automate the application of causal discovery and causal reasoning methods. AutoCD's goal is to deliver all the causal information that an expert human analyst would and to answer a user's causal queries. We describe the architecture of such a platform and illustrate its performance on synthetic data sets. As a case study, we apply it to temporal telecommunication data. The system is general and can be applied to a plethora of causal discovery problems.
Autoencoders (AE) are a simple yet powerful class of neural networks that compress data by projecting the input into a low-dimensional latent space (LS). Whereas the LS is formed by minimizing the loss function during training, its properties and topology are not controlled directly. In this paper we focus on AE LS properties and propose two methods for obtaining an LS with a desired topology, which we call LS configuration. The proposed methods are loss configuration, which uses a geometric loss term that acts directly in the LS, and encoder configuration. We show that the former makes it possible to reliably obtain an LS with a desired configuration by defining the positions and shapes of LS clusters for a supervised AE (SAE). Knowing the LS configuration makes it possible to define a similarity measure in the LS that predicts labels or estimates similarity for multiple inputs without using decoders or classifiers. We also show that this leads to more stable and interpretable training. We show that an SAE trained for clothes texture classification using the proposed method generalizes well to unseen data from the LIP, Market1501, and WildTrack datasets without fine-tuning, and even makes it possible to evaluate similarity for unseen classes. We further illustrate the advantages of similarity estimation in a pre-configured LS with cross-dataset searches and with text-based search that uses a text query without language models.
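To make the idea of a geometric loss term and decoder-free label prediction concrete, the following is a minimal sketch in plain Python. It rests on assumptions that are not taken from the abstract: each class is assigned a fixed target centre in the latent space, the geometric term is the mean squared distance to that centre, and similarity is Euclidean distance. The class names, centre positions, and the functions `geometric_loss` and `predict_label` are hypothetical.

```python
import math

def geometric_loss(latents, labels, centres):
    """Mean squared distance between each latent vector and its class centre.

    Assumed form of a geometric loss term; the actual term may differ.
    """
    total = 0.0
    for z, y in zip(latents, labels):
        centre = centres[y]
        total += sum((zi - ci) ** 2 for zi, ci in zip(z, centre))
    return total / len(latents)

def predict_label(z, centres):
    """Predict the label whose pre-defined centre is nearest to z.

    No decoder or classifier is involved, only the known LS layout.
    """
    return min(centres, key=lambda y: math.dist(z, centres[y]))

# Hypothetical two-class layout in a 2-D latent space.
centres = {"striped": (1.0, 0.0), "plain": (-1.0, 0.0)}
latents = [(0.9, 0.1), (-1.2, 0.0)]
labels = ["striped", "plain"]

loss = geometric_loss(latents, labels, centres)
pred = predict_label((0.8, -0.2), centres)
```

In a real training loop this term would be added to the reconstruction loss and minimized by gradient descent; here it is only evaluated on fixed points to show what it measures.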
Several mixed-effects models for longitudinal data have been proposed to accommodate the non-linearity of late-life cognitive trajectories and to assess the putative influence of covariates on them. No prior research provides a side-by-side examination of these models to offer guidance on their proper application and interpretation. In this work, we examined five statistical approaches previously used to answer research questions related to non-linear changes in cognitive aging: the linear mixed model (LMM) with a quadratic term, the LMM with splines, the functional mixed model, the piecewise linear mixed model, and the sigmoidal mixed model. We first described the models theoretically. Next, using data from two prospective cohorts with annual cognitive testing, we compared the interpretation of the models by investigating associations of education with cognitive change before death. Lastly, we performed a simulation study to empirically evaluate the models and provide practical recommendations. Except for the LMM-quadratic, the fit of all models was generally adequate to capture the non-linearity of cognitive change, and the models were relatively robust. Although spline-based models have no interpretable non-linearity parameters, their convergence was easier to achieve, and they allow graphical interpretation. In contrast, piecewise and sigmoidal models, which have interpretable non-linear parameters, may require more data to achieve convergence.
We propose a simple empirical representation of expectations such that: For a number of samples above a certain threshold, drawn from any probability distribution with finite fourth-order statistic, the proposed estimator outperforms the empirical average when tested against the actual population, with respect to the quadratic loss. For datasets smaller than this threshold, the result still holds, but for a class of distributions determined by their first four statistics. Our approach leverages the duality between distributionally robust and risk-averse optimization.
We propose a unified view of non-local methods for single-image denoising, of which BM3D is the most popular representative. These methods operate by gathering noisy patches together according to their similarities in order to process them collaboratively. Our general estimation framework is based on the minimization of the quadratic risk, which is approximated in two steps, and adapts to photon and electronic noise. Relying on unbiased risk estimation (URE) for the first step and on ``internal adaptation'', a concept borrowed from deep learning theory, for the second, we show that our approach makes it possible to reinterpret and reconcile previous state-of-the-art non-local methods. Within this framework, we propose a novel denoiser called NL-Ridge that exploits linear combinations of patches. We show that NL-Ridge, while conceptually simpler, can outperform well-established state-of-the-art single-image denoisers.
Estimands using the treatment policy strategy for addressing intercurrent events are common in Phase III clinical trials. One estimation approach for this strategy is retrieved dropout whereby observed data following an intercurrent event are used to multiply impute missing data. However, such methods have had issues with variance inflation and model fitting due to data sparsity. This paper introduces likelihood-based versions of these approaches, investigating and comparing their statistical properties to the existing retrieved dropout approaches, simpler analysis models and reference-based multiple imputation. We use a simulation based upon the data from the PIONEER 1 Phase III clinical trial in Type II diabetics to present complex and relevant estimation challenges. The likelihood-based methods display similar statistical properties to their multiple imputation equivalents, but all retrieved dropout approaches suffer from high variance. Retrieved dropout approaches appear less biased than reference-based approaches, resulting in a bias-variance trade-off, but we conclude that the large degree of variance inflation is often more problematic than the bias. Therefore, only the simpler retrieved dropout models appear appropriate as a primary analysis in a clinical trial, and only where it is believed most data following intercurrent events will be observed. The jump-to-reference approach may represent a more promising estimation approach for symptomatic treatments due to its relatively high power and ability to fit in the presence of much missing data, despite its strong assumptions and tendency towards conservative bias. More research is needed to further develop how to estimate the treatment effect for a treatment policy strategy.
Time-series models typically assume untainted and legitimate streams of data. However, a self-interested adversary may have incentive to corrupt this data, thereby altering a decision maker's inference. Within the broader field of adversarial machine learning, this research provides a novel, probabilistic perspective toward the manipulation of hidden Markov model inferences via corrupted data. In particular, we provision a suite of corruption problems for filtering, smoothing, and decoding inferences leveraging an adversarial risk analysis approach. Multiple stochastic programming models are set forth that incorporate realistic uncertainties and varied attacker objectives. Three general solution methods are developed by alternatively viewing the problem from frequentist and Bayesian perspectives. The efficacy of each method is illustrated via extensive, empirical testing. The developed methods are characterized by their solution quality and computational effort, resulting in a stratification of techniques across varying problem-instance architectures. This research highlights the weaknesses of hidden Markov models under adversarial activity, thereby motivating the need for robustification techniques to ensure their security.
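As a small illustration of why corrupted observations matter for filtering, the sketch below implements the standard forward (filtering) recursion for a discrete hidden Markov model in plain Python and compares the filtered belief on a clean observation sequence with the belief after one observation has been altered. The two-state model, its probabilities, and the hand-picked corruption are assumptions made for illustration; they do not reproduce the stochastic programming formulations or solution methods described above.

```python
def forward_filter(init, trans, emit, observations):
    """Return P(state_t | obs_1..obs_t) after the last observation.

    init[s]     : prior probability of state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n = len(init)
    # Condition the prior on the first observation and normalise.
    belief = [init[s] * emit[s][observations[0]] for s in range(n)]
    norm = sum(belief)
    belief = [b / norm for b in belief]
    for obs in observations[1:]:
        # Predict step: propagate the belief through the transition model.
        predicted = [sum(belief[r] * trans[r][s] for r in range(n))
                     for s in range(n)]
        # Update step: weight by the emission likelihood and normalise.
        belief = [predicted[s] * emit[s][obs] for s in range(n)]
        norm = sum(belief)
        belief = [b / norm for b in belief]
    return belief

# Hypothetical two-state, two-symbol model.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]

clean = forward_filter(init, trans, emit, [0, 0, 0])
corrupted = forward_filter(init, trans, emit, [0, 0, 1])  # last symbol altered
```

With these numbers, changing a single observation lowers the filtered probability of state 0 from roughly 0.95 to roughly 0.56, which is the kind of shift in a decision maker's inference that an attacker would try to induce.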
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system that shares tasks across many agents. However, this also increases the resource cost of communication and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. Combined, these algorithms enable each agent to improve its task allocation strategy through reinforcement learning, while adjusting how much it explores the system according to how close to optimal it believes its current strategy to be, given its past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, which restrict agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, which then carry out those subtasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to within 6.7% of the theoretical optimum for the system configurations considered. It provides 5x better performance recovery than approaches that retain no knowledge when system connectivity is impacted, and is tested on systems of up to 100 agents with less than a 9% impact on the algorithms' performance.
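The interplay between learning and adaptive exploration can be sketched with a deliberately simplified, bandit-style example in plain Python. This is not one of the four algorithms: the update rule, the mapping from estimated quality to exploration rate, the agent names, and the fixed task qualities are all assumptions made for illustration. An agent keeps a value estimate for each neighbour it can allocate to, and explores less as its best estimate rises.

```python
import random

def update_estimate(old, reward, learning_rate=0.1):
    """Move the value estimate a fraction of the way toward the reward."""
    return old + learning_rate * (reward - old)

def exploration_rate(best_estimate, min_rate=0.05, max_rate=0.5):
    """Explore less as the estimated quality of the current best rises.

    Assumes estimates lie in [0, 1]; values outside are clamped.
    """
    confidence = min(1.0, max(0.0, best_estimate))
    return max_rate - (max_rate - min_rate) * confidence

def allocate(estimates, rng):
    """Pick a neighbour: usually the best known one, sometimes a random one."""
    best = max(estimates, key=estimates.get)
    if rng.random() < exploration_rate(estimates[best]):
        return rng.choice(sorted(estimates))
    return best

def simulate(quality, steps=500, seed=0):
    """Repeatedly allocate a subtask and learn from the observed quality."""
    rng = random.Random(seed)
    estimates = {agent: 0.0 for agent in quality}
    for _ in range(steps):
        agent = allocate(estimates, rng)
        estimates[agent] = update_estimate(estimates[agent], quality[agent])
    return estimates

# Hypothetical neighbours with fixed task-completion quality.
quality = {"a": 0.2, "b": 0.9, "c": 0.5}
estimates = simulate(quality)
best_agent = max(estimates, key=estimates.get)
```

Because each agent only tracks the neighbours it has interacted with, the state it stores stays local, which matches the resource-limited setting described above in spirit even though the details here are invented.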
Hashing has been widely used in approximate nearest neighbor search for large-scale database retrieval because of its computational and storage efficiency. Deep hashing, which uses a convolutional neural network architecture to extract the semantic information or features of images, has received increasing attention recently. In this survey, I evaluate several deep supervised hashing methods for image retrieval and identify three main directions for such methods. Several comments are made at the end. Moreover, to break through the bottleneck of existing hashing methods, I propose a Shadow Recurrent Hashing (SRH) method as a first attempt. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function that encourages similar images to be projected close together. To this end, I propose a concept: the shadow of the CNN output. During optimization, the CNN output and its shadow guide each other so as to approach the optimal solution as closely as possible. Several experiments on the CIFAR-10 dataset show the satisfactory performance of SRH.
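The retrieval step that makes hashing efficient can be sketched in a few lines of plain Python: real-valued feature vectors are binarized into short codes, and database items are ranked by Hamming distance to the query code. This is a generic illustration and not the SRH method; the sign-based binarization, the item names, and the feature values (which in deep hashing would come from a CNN) are assumptions.

```python
def binarize(features):
    """Turn a real-valued feature vector into a binary code by sign."""
    return tuple(1 if x > 0 else 0 for x in features)

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def search(query_code, database, k=2):
    """Return the names of the k database items closest in Hamming distance."""
    ranked = sorted(database, key=lambda name: hamming(query_code, database[name]))
    return ranked[:k]

# Hypothetical 4-bit codes derived from made-up feature vectors.
database = {
    "cat1": binarize((0.7, -0.2, 0.9, -0.5)),
    "cat2": binarize((0.6, -0.1, -0.3, -0.4)),
    "truck": binarize((-0.8, 0.4, -0.6, 0.9)),
}
query = binarize((0.5, -0.3, 0.8, -0.1))
results = search(query, database)
```

Comparing short binary codes needs only bit-level operations and a few bytes per item, which is the source of the computational and storage efficiency mentioned above.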