This paper studies a basic notion of distributional shape known as orthounimodality (OU) and its use in shape-constrained distributionally robust optimization (DRO). As a key motivation, we argue that this type of DRO is well-suited to multivariate extreme event estimation, as it gives statistically valid confidence bounds on target extremal probabilities. In particular, we explain how DRO can be used as a nonparametric alternative to conventional extreme value theory, which extrapolates tails based on theoretical limiting distributions and can face challenges in bias-variance control and other technical complications. We also explain how OU resolves the challenges in interpretability and robustness faced by existing distributional shape notions used in the DRO literature. Methodologically, we characterize the extreme points of the OU distribution class in terms of what we call OU sets and build a corresponding Choquet representation, which subsequently allows us to reduce OU-DRO to moment problems over infinite-dimensional random variables. We then develop, in the bivariate setting, a geometric approach to reduce such moment problems to finite dimension via a specially constructed variational problem designed to eliminate suboptimal solutions. Numerical results illustrate how our approach gives rise to valid and competitive confidence bounds for extremal probabilities.
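To fix ideas, a shape-constrained DRO program of the kind described above can be written as follows; this is a hedged sketch, with the target set $A$, moment maps $g_j$, and bounds $\mu_j$ as illustrative placeholders rather than the paper's exact formulation:

```latex
\[
  \sup_{P}\; P(X \in A)
  \quad \text{s.t.} \quad
  P \ \text{is orthounimodal}, \qquad
  \mathbb{E}_P[g_j(X)] \le \mu_j, \quad j = 1, \dots, m,
\]
```

The optimal value upper-bounds the extremal probability over all OU distributions consistent with the moment information, yielding a valid confidence bound whenever the moment constraints themselves hold with the prescribed confidence.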
In Trefftz discontinuous Galerkin methods a partial differential equation is discretized using discontinuous shape functions that are chosen to lie elementwise in the kernel of the corresponding differential operator. We propose a new variant, the embedded Trefftz discontinuous Galerkin method, which is the Galerkin projection of an underlying discontinuous Galerkin method onto a subspace of Trefftz type. The subspace can be described in a very general way, and no Trefftz functions have to be calculated explicitly to obtain it; instead, the corresponding embedding operator is constructed. In the simplest cases the method recovers established Trefftz discontinuous Galerkin methods, but the approach extends conveniently to general cases, including inhomogeneous sources and differential operators with non-constant coefficients. We introduce the method, discuss implementation aspects, and explore its potential on a set of standard PDE problems. Compared to standard discontinuous Galerkin methods we observe a substantial reduction in the number of globally coupled unknowns in all considered cases, which significantly reduces the corresponding computing time. Moreover, for the Helmholtz problem we even observe improved accuracy, similar to Trefftz discontinuous Galerkin methods based on plane waves.
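The following minimal numpy sketch illustrates the embedding idea under strong assumptions: each element-local differential operator is given as a matrix L_K acting on that element's polynomial basis (a hypothetical input; the actual construction works on the underlying DG discretization), its nullspace supplies the local Trefftz basis, and the global system is Galerkin-projected onto the resulting subspace.

```python
import numpy as np
from scipy.linalg import null_space

def trefftz_embedding(local_ops):
    """Block-diagonal embedding T whose columns span, per element,
    the (approximate) kernel of the element-local operator matrix."""
    blocks = [null_space(L_K) for L_K in local_ops]
    n = sum(L.shape[1] for L in local_ops)   # full DG unknowns
    m = sum(B.shape[1] for B in blocks)      # Trefftz unknowns (far fewer)
    T = np.zeros((n, m))
    r = c = 0
    for B in blocks:
        T[r:r + B.shape[0], c:c + B.shape[1]] = B
        r += B.shape[0]
        c += B.shape[1]
    return T

def solve_embedded(A, f, local_ops):
    """Galerkin projection of the DG system A u = f onto the Trefftz subspace."""
    T = trefftz_embedding(local_ops)
    u_T = np.linalg.solve(T.T @ A @ T, T.T @ f)  # reduced, globally coupled system
    return T @ u_T                               # prolongate back to the DG space
```

The design point the sketch is meant to convey: only the small projected system is globally coupled, while the embedding is assembled from cheap, purely local nullspace computations.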
Dyadic data is often encountered when quantities of interest are associated with the edges of a network. As such it plays an important role in statistics, econometrics and many other data science disciplines. We consider the problem of uniformly estimating a dyadic Lebesgue density function, focusing on nonparametric kernel-based estimators which take the form of U-process-like dyadic empirical processes. We provide uniform point estimation and distributional results for the dyadic kernel density estimator, giving valid and feasible procedures for robust uniform inference. Our main contributions include the minimax-optimal uniform convergence rate of the dyadic kernel density estimator, along with strong approximation results for the associated standardized $t$-process. A consistent variance estimator is introduced in order to obtain analogous results for the Studentized $t$-process, enabling the construction of provably valid and feasible uniform confidence bands for the unknown density function. A crucial feature of U-process-like dyadic empirical processes is that they may be "degenerate" at some or possibly all points in the support of the data, a property making our uniform analysis somewhat delicate. Nonetheless we show formally that our proposed methods for uniform inference remain robust to the potential presence of such unknown degenerate points. For the purpose of implementation, we discuss uniform inference procedures based on positive semi-definite covariance estimators, mean squared error optimal bandwidth selectors and robust bias-correction methods. We illustrate the empirical finite-sample performance of our robust inference methods in a simulation study. Our technical results concerning strong approximations and maximal inequalities are of potential independent interest.
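As a concrete illustration, here is a minimal numpy sketch of a dyadic kernel density estimator of the form $\hat f(w) = \frac{2}{n(n-1)} \sum_{i<j} \frac{1}{h} K\big((W_{ij}-w)/h\big)$; the Epanechnikov kernel and the toy dyadic data are illustrative choices, not the paper's setup.

```python
import numpy as np

def dyadic_kde(W, w_grid, h):
    """Kernel density estimate over the n(n-1)/2 dyads of a symmetric
    n x n array W of edge observations (diagonal ignored)."""
    n = W.shape[0]
    iu = np.triu_indices(n, k=1)                       # dyads i < j
    u = (W[iu][None, :] - w_grid[:, None]) / h         # (grid, dyads)
    K = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)  # Epanechnikov
    return K.sum(axis=1) / (h * n * (n - 1) / 2)

# Toy example: W_ij = |X_i - X_j| for latent X_i, a simple dyadic structure
rng = np.random.default_rng(0)
X = rng.normal(size=50)
W = np.abs(X[:, None] - X[None, :])
print(dyadic_kde(W, np.linspace(0.1, 2.0, 5), h=0.3))
```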
We study nonparametric Bayesian models for reversible multi-dimensional diffusions with periodic drift. For continuous observation paths, reversibility is exploited to prove a general posterior contraction rate theorem for the drift gradient vector field under approximation-theoretic conditions on the induced prior for the invariant measure. The general theorem is applied to Gaussian priors and $p$-exponential priors, which are shown to converge to the truth at the minimax optimal rate over Sobolev smoothness classes in any dimension.
Conventional supervised learning methods, especially deep ones, are found to be sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address the problem, we propose a Causal Semantic Generative model (CSG) based on causal reasoning, so that the two factors are modeled separately, and develop methods for OOD prediction from a single training domain, a common and challenging setting. The methods are based on the causal invariance principle, with a novel design for both efficient learning and easy prediction. Theoretically, we prove that under certain conditions, CSG can identify the semantic factor by fitting the training data, and this semantic identification guarantees the boundedness of the OOD generalization error and the success of adaptation. Empirical studies show improved OOD performance over prevailing baselines.
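As a toy illustration of the causal structure just described (a hedged sketch; all distributions and mixing functions are chosen for illustration only): the semantic factor s alone causes the label y, both s and the variation factor v generate the input x, and OOD domains change only the correlation between s and v.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, rho):
    # rho controls the domain-specific correlation between the semantic
    # factor s and the variation factor v; shifting the domain changes
    # only the prior p(s, v), not p(x | s, v) or p(y | s).
    s = rng.normal(size=n)
    v = rho * s + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x = np.tanh(s) + 0.5 * v + 0.1 * rng.normal(size=n)  # x <- (s, v)
    y = (s > 0).astype(int)                               # y <- s only
    return x, y

x_train, y_train = sample_domain(1000, rho=0.9)    # training domain
x_ood, y_ood = sample_domain(1000, rho=-0.9)       # shifted test domain
```

A predictor that exploits v (useful in-domain because of the correlation) degrades under the shifted rho, while one that recovers s keeps its accuracy, which is the intuition behind the causal invariance principle.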
Discovering causal structure among a set of variables is a fundamental problem in many empirical sciences. Traditional score-based causal discovery methods rely on various local heuristics to search for a Directed Acyclic Graph (DAG) according to a predefined score function. While these methods, e.g., greedy equivalence search, may attain attractive results with infinite samples and certain model assumptions, they are usually less satisfactory in practice due to finite data and possible violation of assumptions. Motivated by recent advances in neural combinatorial optimization, we propose to use Reinforcement Learning (RL) to search for the DAG with the best score. Our encoder-decoder model takes observable data as input and generates graph adjacency matrices that are used to compute rewards. The reward incorporates both the predefined score function and two penalty terms for enforcing acyclicity. In contrast with typical RL applications where the goal is to learn a policy, we use RL as a search strategy, and our final output is the graph, among all graphs generated during training, that achieves the best reward. We conduct experiments on both synthetic and real datasets, and show that the proposed approach not only has improved search ability but also allows a flexible score function under the acyclicity constraint.
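For concreteness, the sketch below shows one standard way to score a candidate adjacency matrix with a predefined score plus two acyclicity penalties, using the NOTEARS characterization $h(A) = \operatorname{tr}(e^{A \circ A}) - d$; the penalty weights and the indicator-plus-smooth combination are illustrative of, not necessarily identical to, the paper's reward.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(A):
    """h(A) = tr(exp(A * A)) - d, which is zero iff the weighted
    adjacency matrix A encodes a DAG (NOTEARS characterization)."""
    d = A.shape[0]
    return np.trace(expm(A * A)) - d  # elementwise product A * A

def reward(A, score, lam1=1.0, lam2=1.0):
    """Negative of (predefined score + hard indicator penalty on any
    cycle + smooth acyclicity penalty); lam1, lam2 are illustrative."""
    h = acyclicity(A)
    return -(score + lam1 * float(h > 0) + lam2 * h)
```

The RL agent then simply maximizes this reward over generated adjacency matrices, with the indicator term ruling out cyclic graphs and the smooth term shaping the search toward DAGs.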
Graph neural networks (GNNs) are effective machine learning models for various graph learning problems. Despite their empirical successes, the theoretical limitations of GNNs have been revealed recently. Consequently, many GNN models have been proposed to overcome these limitations. In this survey, we provide a comprehensive overview of the expressive power of GNNs and provably powerful variants of GNNs.
User behavior data in recommender systems are driven by the complex interactions of many latent factors behind the users' decision-making processes. The factors are highly entangled, and may range from high-level ones that govern user intentions, to low-level ones that characterize a user's preference when executing an intention. Learning representations that uncover and disentangle these latent factors can bring enhanced robustness, interpretability, and controllability. However, learning such disentangled representations from user behavior is challenging, and remains largely neglected by the existing literature. In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. Our approach achieves macro disentanglement by inferring the high-level concepts associated with user intentions (e.g., to buy a shirt or a cellphone), while capturing the preference of a user regarding the different concepts separately. A micro-disentanglement regularizer, stemming from an information-theoretic interpretation of VAEs, then forces each dimension of the representations to independently reflect an isolated low-level factor (e.g., the size or the color of a shirt). Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. We further demonstrate that the learned representations are interpretable and controllable, which can potentially lead to a new paradigm for recommendation where users are given fine-grained control over targeted aspects of the recommendation lists.
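One plausible reading of the macro-disentanglement step, sketched below as a hedged illustration: items are softly assigned to K high-level concepts via cosine similarity to prototype embeddings, so that a user's preference can then be modeled per concept. All names, the temperature tau, and the assignment rule are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def concept_assignment(item_emb, prototypes, tau=0.1):
    """Soft-assign each item to one of K concepts by cosine similarity
    to prototype embeddings; tau sharpens the assignment."""
    items = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = items @ protos.T / tau                    # (num_items, K)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)            # concept probabilities
```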
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
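A hedged sketch of the two ingredients involved: sampling a graph from a learned Bernoulli distribution over edges, and propagating node features through a graph-convolution layer on the sampled graph. The bilevel optimization itself (inner fitting of GCN weights, outer updates of the edge probabilities) is only indicated in comments, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(theta):
    """Sample a symmetric adjacency matrix from independent Bernoulli
    edge probabilities theta (the learned discrete edge distribution)."""
    A = (rng.random(theta.shape) < theta).astype(float)
    A = np.triu(A, 1)
    return A + A.T

def gcn_layer(A, X, W):
    """One graph-convolution step on a sampled graph: add self-loops,
    symmetrically normalize, propagate, apply ReLU."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))
    return np.maximum(A_norm @ X @ W, 0.0)

# Bilevel sketch: the inner problem would fit W on graphs sampled from
# theta; the outer problem would update theta (e.g., via a score-function
# gradient) against a validation loss. Both loops are omitted here.
```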
Alternating Direction Method of Multipliers (ADMM) is a widely used tool for machine learning in distributed settings, where a machine learning model is trained over distributed data sources through an iterative process of local computation and message passing. Such an iterative process can raise privacy concerns for data owners. The goal of this paper is to provide differential privacy for ADMM-based distributed machine learning. Prior approaches to differentially private ADMM exhibit low utility under strong privacy guarantees and often assume the objective functions of the learning problems to be smooth and strongly convex. To address these concerns, we propose a novel differentially private ADMM-based distributed learning algorithm called DP-ADMM, which combines an approximate augmented Lagrangian function with time-varying Gaussian noise addition in the iterative process to achieve higher utility for general objective functions under the same differential privacy guarantee. We also apply the moments accountant method to bound the end-to-end privacy loss. The theoretical analysis shows that DP-ADMM can be applied to a wider class of distributed learning problems, is provably convergent, and offers an explicit utility-privacy tradeoff. To our knowledge, this is the first paper to provide explicit convergence and utility properties for differentially private ADMM-based distributed learning algorithms. The evaluation results demonstrate that our approach can achieve good convergence and model accuracy under a strong end-to-end differential privacy guarantee.
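To make the mechanism concrete, here is a hedged sketch of one noisy local step in a DP-ADMM-style scheme: a clipped gradient step on an approximate augmented Lagrangian, perturbed by time-varying Gaussian noise. The names, the decay schedule sigma0/t, and the first-order approximation are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_local_update(x, grad_fn, y, z, rho, eta, t, sigma0, clip=1.0):
    """One private local step at iteration t: gradient of the
    (approximate) augmented Lagrangian, clipped to bound sensitivity,
    plus Gaussian noise whose scale sigma0 / t decays over iterations."""
    g = grad_fn(x) + y + rho * (x - z)           # augmented-Lagrangian gradient
    g = g / max(1.0, np.linalg.norm(g) / clip)   # clip to sensitivity bound
    noise = rng.normal(scale=sigma0 / t, size=x.shape)
    return x - eta * (g + noise)
```

The clipping step is what makes the Gaussian mechanism applicable, and tracking the per-iteration noise through the moments accountant is what yields the end-to-end privacy bound.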
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
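The recipe described above, add a bonus to each reward and solve a (soft) Bellman equation, can be sketched in a few lines of tabular code; the log-sum-exp backup, the bonus term, and all names are illustrative assumptions rather than the paper's exact recursion.

```python
import numpy as np

def k_learning_values(P, r, bonus, tau, L):
    """Finite-horizon soft Bellman backup on bonus-augmented rewards.
    P has shape (S, A, S); r and bonus have shape (S, A); tau is the
    temperature (equal to the risk-seeking parameter); L is the horizon."""
    S, A = r.shape
    K = np.zeros((L + 1, S, A))
    for t in range(L - 1, -1, -1):
        V = tau * np.log(np.exp(K[t + 1] / tau).sum(axis=1))  # soft value per state
        K[t] = r + bonus + P @ V                              # Bellman backup
    return K

def boltzmann_policy(K_t, tau):
    """The exploration policy induced by the K-values at temperature tau."""
    z = np.exp((K_t - K_t.max(axis=1, keepdims=True)) / tau)
    return z / z.sum(axis=1, keepdims=True)
```

Because the backup uses a log-sum-exp rather than a max, the resulting Boltzmann policy is exactly the soft-greedy policy with respect to the K-values, which is the connection to soft-Q-learning mentioned above.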