Motivated by the computation of the non-parametric maximum likelihood estimator (NPMLE) and the Bayesian posterior in statistics, this paper explores the problem of convex optimization over the space of all probability distributions. We introduce an implicit scheme, called the implicit KL proximal descent (IKLPD) algorithm, for discretizing a continuous-time gradient flow relative to the Kullback-Leibler divergence for minimizing a convex target functional. We show that IKLPD converges to a global optimum at a polynomial rate from any initialization; moreover, if the objective functional is strongly convex relative to the KL divergence, for example, when the target functional itself is a KL divergence as in the context of Bayesian posterior computation, IKLPD exhibits global exponential convergence. Computationally, we propose a numerical method based on normalizing flows to realize IKLPD. Conversely, our numerical method can also be viewed as a new approach that sequentially trains a normalizing flow for minimizing a convex functional with a strong theoretical guarantee.
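As a toy illustration of the exponential-convergence claim, the sketch below assumes a finite support and a KL target functional F(rho) = KL(rho || pi). Under those assumptions, one implicit KL proximal step, argmin over r of KL(r || pi) + (1/tau) KL(r || rho), has a closed form: a normalized geometric mixture of the current iterate and the target. The function names, the step size `tau`, and the example distributions are hypothetical; the paper itself realizes the scheme with normalizing flows rather than this discrete shortcut.

```python
import math


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_proximal_step(rho, target, tau):
    """One implicit step: argmin_r KL(r || target) + (1/tau) * KL(r || rho).

    Setting the derivative to zero gives
    log r = w * log target + (1 - w) * log rho + const, with w = tau / (1 + tau).
    """
    w = tau / (1.0 + tau)
    unnorm = [t ** w * r ** (1.0 - w) for t, r in zip(target, rho)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical example data (not from the paper).
target = [0.5, 0.3, 0.2]
rho = [0.1, 0.1, 0.8]
tau = 1.0

gaps = [kl(rho, target)]
for _ in range(10):
    rho = kl_proximal_step(rho, target, tau)
    gaps.append(kl(rho, target))
```

In this toy setting the log-ratio between the iterate and the target shrinks by a factor of 1/(1 + tau) per step, so the recorded objective values in `gaps` decay geometrically.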
We present a method for estimating the maximal symmetry of a continuous regression function. Knowledge of such a symmetry can be used to significantly improve modelling by removing the modes of variation resulting from the symmetries. Symmetry estimation is carried out using hypothesis testing for invariance strategically over the subgroup lattice of a search group G acting on the feature space. We show that the estimation of the unique maximal invariant subgroup of G generalises useful tools from linear dimension reduction to a nonlinear context. We show that the estimation is consistent when the subgroup lattice chosen is finite, even when some of the subgroups themselves are infinite. We demonstrate the performance of this estimator in synthetic settings and apply the methods to two data sets: satellite measurements of the Earth's magnetic field intensity; and the distribution of sunspots.
The modifiable areal unit problem in geography or the change-of-support (COS) problem in statistics demonstrates that the interpretation of spatial (or spatio-temporal) data analysis is affected by the choice of resolutions or geographical units used in the study. The ecological fallacy is one famous example of this phenomenon. Here we investigate the ecological fallacy associated with the COS problem for multivariate spatial data with the goal of providing a data-driven discretization criterion for the domain of interest that minimizes aggregation errors. The discretization is based on a novel multiscale metric, called the Multivariate Criterion for Aggregation Error (MVCAGE). Such multi-scale representations of an underlying multivariate process are often formulated in terms of basis expansions. We show that a particularly useful basis expansion in this context is the multivariate Karhunen-Loève expansion (MKLE). We use the MKLE to build the MVCAGE loss function and use it within the framework of spatial clustering algorithms to perform optimal spatial aggregation. We demonstrate the effectiveness of our approach through simulation and through regionalization of county-level income and hospital quality data over the United States and prediction of ocean color in the coastal Gulf of Alaska.
This paper focuses on the algebraic theory underlying the study of the complexity and the algorithms for the Constraint Satisfaction Problem (CSP). We unify, simplify, and extend parts of the three approaches that have been developed to study the CSP over finite templates: absorption theory, which was used to characterize CSPs solvable by local consistency methods (JACM'14), and Bulatov's and Zhuk's theories, which were used for two independent proofs of the CSP Dichotomy Theorem (FOCS'17, JACM'20). As the first contribution we present an elementary theorem about primitive positive definability and use it to obtain the starting points of Bulatov's and Zhuk's proofs as corollaries. As the second contribution we propose and initiate a systematic study of minimal Taylor algebras. This class of algebras is broad enough that it suffices to verify the CSP Dichotomy Theorem on this class only, yet it is still unusually well behaved. In particular, many concepts from the three approaches coincide in the class, which is in striking contrast with the general setting. We believe that the theory initiated in this paper will eventually result in a simpler and more natural proof of the Dichotomy Theorem that employs a simpler and more efficient algorithm, and will help in attacking complexity questions in other CSP-related problems.
Click-Through Rate (CTR) prediction is a crucial task in online recommendation platforms as it involves estimating the probability of user engagement with advertisements or items by clicking on them. Given the availability of various services like online shopping, ride-sharing, food delivery, and professional services on commercial platforms, recommendation systems in these platforms are required to make CTR predictions across multiple domains rather than just a single domain. However, multi-domain click-through rate (MDCTR) prediction remains a challenging task in online recommendation due to the complex mutual influence between domains. Traditional MDCTR models typically encode domains as discrete identifiers, ignoring the rich semantic information underlying them. Consequently, they can hardly generalize to new domains. Besides, existing models can be easily dominated by some specific domains, which results in significant performance drops in the other domains (i.e., the "seesaw phenomenon"). In this paper, we propose a novel solution, Uni-CTR, to address the above challenges. Uni-CTR leverages a backbone Large Language Model (LLM) to learn layer-wise semantic representations that capture commonalities between domains. Uni-CTR also uses several domain-specific networks to capture the characteristics of each domain. Note that we design a masked loss strategy so that these domain-specific networks are decoupled from the backbone LLM. This allows domain-specific networks to remain unchanged when domains are added or removed, thereby enhancing the flexibility and scalability of the system significantly. Experimental results on three public datasets show that Uni-CTR outperforms the state-of-the-art (SOTA) MDCTR models significantly. Furthermore, Uni-CTR demonstrates remarkable effectiveness in zero-shot prediction. We have applied Uni-CTR in industrial scenarios, confirming its efficiency.
This paper advances theoretical understanding of infinite-dimensional geometrical properties associated with Bayesian inference. First, we introduce a novel class of infinite-dimensional Hamiltonian systems for saddle Hamiltonian functions whose domains are metric spaces. A flow of this system is generated by a Hamiltonian arc field, an analogue of Hamiltonian vector fields formulated based on (i) the first variation of Hamiltonian functions and (ii) the notion of arc fields that extends vector fields to metric spaces. We establish that this system obeys the conservation of energy. We derive a condition for the existence of the flow, which reduces to local Lipschitz continuity of the first variation under sufficient regularity. Second, we present a system of a Hamiltonian function, called the minimum free energy, whose domain is a metric space of negative log-likelihoods and probability measures. The difference of the posterior and the prior of Bayesian inference is characterised as the first variation of the minimum free energy. Our result shows that a transition from the prior to the posterior defines an arc field on a space of probability measures, which forms a Hamiltonian arc field together with another corresponding arc field on a space of negative log-likelihoods. This reveals the underlying invariance of the free energy behind the arc field.
This paper studies the problem of forecasting general stochastic processes using a path-dependent extension of the Neural Jump ODE (NJ-ODE) framework \citep{herrera2021neural}. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time series, these results were limited to data stemming from Itô diffusions with complete observations, in particular Markov processes, where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE (PD-NJ-ODE) outperforms the original NJ-ODE framework in the case of non-Markovian data. Moreover, we show that PD-NJ-ODE can be applied successfully to classical stochastic filtering problems and to limit order book (LOB) data.
We investigate two efficient time discretizations for the post-processing technique of discontinuous Galerkin (DG) methods to solve hyperbolic conservation laws. The post-processing technique, which is applied at the final time of the DG method, can enhance the accuracy of the original DG solution (spatial superconvergence). One main difficulty of the post-processing technique is that the spatial superconvergence after post-processing needs to be matched with sufficient temporal accuracy. If the semi-discretized system (the ODE system after spatial discretization) is under-resolved in time, then the spatial superconvergence will be concealed. In this paper, we focus our investigation on the recently designed SDG method and derive its explicit scheme from a correction process based on the DG weak formulation. We also introduce another similar technique, namely the spectral deferred correction (SDC) method. We compare both time discretization techniques with the standard third-order Runge-Kutta method through several numerical examples and conclude that both the SDG and SDC methods are efficient time discretization techniques for exploiting the spatial superconvergence of the DG methods.
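For reference, the third-order Runge-Kutta baseline can be sketched as follows. The abstract does not say which third-order variant is meant; this sketch assumes the strong-stability-preserving (Shu-Osher) form commonly paired with DG discretizations. The scalar test problem y' = -y and all function names are hypothetical stand-ins for the semi-discretized ODE system.

```python
import math


def ssp_rk3_step(f, t, y, dt):
    """Advance y' = f(t, y) by one step of the Shu-Osher SSP-RK3 scheme."""
    y1 = y + dt * f(t, y)
    y2 = 0.75 * y + 0.25 * (y1 + dt * f(t + dt, y1))
    return y / 3.0 + (2.0 / 3.0) * (y2 + dt * f(t + 0.5 * dt, y2))


def integrate(f, y0, t_end, n_steps):
    """Integrate from t = 0 to t_end with n_steps uniform steps."""
    dt = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = ssp_rk3_step(f, t, y, dt)
        t += dt
    return y


def decay(t, y):
    """Right-hand side of the toy problem y' = -y (exact solution exp(-t))."""
    return -y


# Halving the step size should cut the error by roughly 2**3.
exact = math.exp(-1.0)
err_coarse = abs(integrate(decay, 1.0, 1.0, 20) - exact)
err_fine = abs(integrate(decay, 1.0, 1.0, 40) - exact)
observed_order = math.log2(err_coarse / err_fine)
```

Because the temporal error of this baseline decays only at third order, a spatial error that converges faster after post-processing is hidden unless the time step is reduced accordingly. That is the mismatch the higher-order time discretizations aim to remove.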
Transformer-based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. This constraint restricts their applicability in scenarios involving long texts. We propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer, without incurring significant computational costs or requiring fine-tuning. Our proposed framework draws inspiration from source coding in information theory and employs a pre-trained model to reduce the semantic redundancy of long inputs before passing them to the LLMs for downstream tasks. Experimental results demonstrate that our method effectively extends the context window of LLMs across a range of tasks including question answering, summarization, few-shot learning, and information retrieval. Furthermore, the proposed semantic compression method exhibits consistent fluency in text generation while reducing the associated computational overhead.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or fewer holds the potential to reduce the memory footprint and latency by a factor of 16; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
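To make the quantization question concrete, here is a minimal sketch of uniform affine quantization, one of the simplest schemes of this kind. It is an illustration under stated assumptions, not a method taken from the survey: the function names, the 4-bit default, and the sample values are all hypothetical.

```python
def quantize(values, num_bits=4):
    """Map real values to unsigned integer codes in [0, 2**num_bits - 1].

    Uses a uniform grid spanning [min(values), max(values)].
    Returns the codes plus the (scale, offset) needed to decode them.
    """
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate real values from integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical weights, stored as 4-bit codes instead of floats.
weights = [-1.20, -0.35, 0.0, 0.42, 0.97, 1.80]
codes, scale, lo = quantize(weights, num_bits=4)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each value is replaced by a small integer code, and the reconstruction error per value is bounded by half the grid spacing `scale`. This is the accuracy-versus-bits trade-off that the quantization question asks about.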
We address the task of automatically scoring the competency of candidates based on textual features from the automatic speech recognition (ASR) transcriptions in the asynchronous video job interview (AVI). The key challenge is how to construct the dependency relation between questions and answers, and conduct the semantic-level interaction for each question-answer (QA) pair. However, most of the recent studies in AVI focus on how to represent questions and answers better, but ignore the dependency information and interaction between them, which is critical for QA evaluation. In this work, we propose a Hierarchical Reasoning Graph Neural Network (HRGNN) for the automatic assessment of question-answer pairs. Specifically, we construct a sentence-level relational graph neural network to capture the dependency information of sentences within or between the question and the answer. Based on these graphs, we employ a semantic-level reasoning graph attention network to model the interaction states of the current QA session. Finally, we propose a gated recurrent unit encoder to represent the temporal question-answer pairs for the final prediction. Experiments conducted on CHNAT (a real-world dataset) validate that our proposed model significantly outperforms text-matching based benchmark models. Ablation studies and experimental results with 10 random seeds also show the effectiveness and stability of our models.