We propose a new diffusion-asymptotic analysis for sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $ n $ time steps, we let the mean reward gaps between actions scale to the order $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the behavior of a class of sequentially randomized Markov experiments converges to a diffusion limit, given as the solution of a stochastic differential equation. The diffusion limit thus enables us to derive refined, instance-specific characterization of the stochastic dynamics of adaptive experiments. As an application of this framework, we use the diffusion limit to obtain several new insights on the regret and belief evolution of Thompson sampling. We show that a version of Thompson sampling with an asymptotically uninformative prior variance achieves nearly-optimal instance-specific regret scaling when the reward gaps are relatively large. We also demonstrate that, in this regime, the posterior beliefs underlying Thompson sampling are highly unstable over time.
Mixture-of-Experts (MoE) models can achieve promising results with outrageous large amount of parameters but constant computation cost, and thus it has become a trend in model scaling. Still it is a mystery how MoE layers bring quality gains by leveraging the parameters with sparse activation. In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a significant problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts $k$ and expert capacity $C$ in top-$k$ routing can significantly make a difference in this context. Furthermore, we take a step forward to propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing. This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models. We push the model scale to over $1$ trillion parameters and implement it on solely $480$ NVIDIA V100-32GB GPUs, in comparison with the recent SOTAs on $2048$ TPU cores. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.
Traditionally Bayesian decision-theoretic design of experiments proceeds by choosing a design to minimise expectation of a given loss function over the space of all designs. The loss function encapsulates the aim of the experiment, and the expectation is taken with respect to the joint distribution of all unknown quantities implied by the statistical model that will be fitted to observed responses. In this paper, an extended framework is proposed whereby the expectation of the loss is taken with respect to a joint distribution implied by an alternative statistical model. Motivation for this includes promoting robustness, ensuring computational feasibility and for allowing realistic prior specification when deriving a design. To aid in exploring the new framework, an asymptotic approximation to the expected loss under an alternative model is derived, and the properties of different loss functions are established. The framework is then demonstrated on a linear regression versus full-treatment model scenario, on estimating parameters of a non-linear model under model discrepancy and a cubic spline model under an unknown number of basis functions.
In this paper, we construct the wavelet eigenvalue regression methodology in high dimensions. We assume that possibly non-Gaussian, finite-variance $p$-variate measurements are made of a low-dimensional $r$-variate ($r \ll p$) fractional stochastic process with non-canonical scaling coordinates and in the presence of additive high-dimensional noise. The measurements are correlated both time-wise and between rows. Building upon the asymptotic and large scale properties of wavelet random matrices in high dimensions, the wavelet eigenvalue regression is shown to be consistent and, under additional assumptions, asymptotically Gaussian in the estimation of the fractal structure of the system. We further construct a consistent estimator of the effective dimension $r$ of the system that significantly increases the robustness of the methodology. The estimation performance over finite samples is studied by means of simulations.
Stochastic gradient descent (SGD) is a promising method for solving large-scale inverse problems, due to its excellent scalability with respect to data size. The current mathematical theory in the lens of regularization theory predicts that SGD with a polynomially decaying stepsize schedule may suffer from an undesirable saturation phenomenon, i.e., the convergence rate does not further improve with the solution regularity index when it is beyond a certain range. In this work, we present a refined convergence rate analysis of SGD, and prove that saturation actually does not occur if the initial stepsize of the schedule is sufficiently small. Several numerical experiments are provided to complement the analysis.
Motivated by statistical inference problems in high-dimensional time series analysis, we derive non-asymptotic error bounds for Gaussian approximations of sums of high-dimensional dependent random vectors on hyper-rectangles, simple convex sets and sparsely convex sets. We investigate the quantitative effect of temporal dependence on the rates of convergence to normality over three different dependency frameworks ($\alpha$-mixing, $m$-dependent, and physical dependence measure). In particular, we establish new error bounds under the $\alpha$-mixing framework and derive faster rate over existing results under the physical dependence measure. To implement the proposed results in practical statistical inference problems, we also derive a data-driven parametric bootstrap procedure based on a kernel-type estimator for the long-run covariance matrices.
Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the spread of the influence from these seeds, and it has been widely investigated in the past two decades. In the canonical setting, the whole social network as well as its diffusion parameters is given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the set of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS), and present constant approximation algorithms for this task under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Comparing with prior solutions, our network inference algorithm requires weaker assumptions and does not rely on maximum-likelihood estimation and convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, and we do not need any assumption related to the network structure or diffusion parameters.
Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, such as quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a $ProbSparse$ Self-attention mechanism, which achieves $O(L \log L)$ in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.
Conventional neural autoregressive decoding commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decoding algorithm -- InDIGO -- which supports flexible sequence generation in arbitrary orders through insertion operations. We extend Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam-search. Experiments on four real-world tasks, including word order recovery, machine translation, image caption and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders, while achieving competitive or even better performance compared to the conventional left-to-right generation. The generated sequences show that InDIGO adopts adaptive generation orders based on input information.
This work presents a region-growing image segmentation approach based on superpixel decomposition. From an initial contour-constrained over-segmentation of the input image, the image segmentation is achieved by iteratively merging similar superpixels into regions. This approach raises two key issues: (1) how to compute the similarity between superpixels in order to perform accurate merging and (2) in which order those superpixels must be merged together. In this perspective, we firstly introduce a robust adaptive multi-scale superpixel similarity in which region comparisons are made both at content and common border level. Secondly, we propose a global merging strategy to efficiently guide the region merging process. Such strategy uses an adpative merging criterion to ensure that best region aggregations are given highest priorities. This allows to reach a final segmentation into consistent regions with strong boundary adherence. We perform experiments on the BSDS500 image dataset to highlight to which extent our method compares favorably against other well-known image segmentation algorithms. The obtained results demonstrate the promising potential of the proposed approach.
This project addresses the problem of sentiment analysis in twitter; that is classifying tweets according to the sentiment expressed in them: positive, negative or neutral. Twitter is an online micro-blogging and social-networking platform which allows users to write short status updates of maximum length 140 characters. It is a rapidly expanding service with over 200 million registered users - out of which 100 million are active users and half of them log on twitter on a daily basis - generating nearly 250 million tweets per day. Due to this large amount of usage we hope to achieve a reflection of public sentiment by analysing the sentiments expressed in the tweets. Analysing the public sentiment is important for many applications such as firms trying to find out the response of their products in the market, predicting political elections and predicting socioeconomic phenomena like stock exchange. The aim of this project is to develop a functional classifier for accurate and automatic sentiment classification of an unknown tweet stream.