In this paper, we study the following nonlinear matrix decomposition (NMD) problem: given a sparse nonnegative matrix $X$, find a low-rank matrix $\Theta$ such that $X \approx f(\Theta)$, where $f$ is an element-wise nonlinear function. We focus on the case where $f(\cdot) = \max(0, \cdot)$, the rectified linear unit (ReLU) activation. We refer to the corresponding problem as ReLU-NMD. We first provide a brief overview of the existing approaches developed to tackle ReLU-NMD. Then we introduce two new algorithms: (1) aggressive accelerated NMD (A-NMD), which uses an adaptive Nesterov extrapolation to accelerate an existing algorithm, and (2) three-block NMD (3B-NMD), which parametrizes $\Theta = WH$ and leads to a significant reduction in the computational cost. We also propose an effective initialization strategy based on the nuclear norm as a proxy for the rank function. We illustrate the effectiveness of the proposed algorithms (available on GitLab) on synthetic and real-world data sets.
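As a hedged illustration of the ReLU-NMD objective and the three-block parametrization $\Theta = WH$, the NumPy sketch below alternates between a latent target $Z$ (equal to $X$ on its support and clipped to be nonpositive elsewhere) and least-squares updates of the two factors. The update order and the pseudoinverse solves are illustrative assumptions, not the authors' released 3B-NMD implementation.

```python
import numpy as np

def relu_nmd_3b(X, r, iters=100, seed=0):
    """Toy three-block alternating scheme for X ~ max(0, W @ H)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, r))
    H = rng.standard_normal((r, n))
    support = X > 0
    for _ in range(iters):
        Theta = W @ H
        # latent target: match X on its support, stay nonpositive elsewhere
        Z = np.where(support, X, np.minimum(Theta, 0.0))
        # alternating least-squares updates of the two low-rank factors
        W = Z @ np.linalg.pinv(H)
        H = np.linalg.pinv(W) @ Z
    return W, H
```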
We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(\epsilon,\delta)$-differential privacy with \emph{strong (primal-dual) gap} rate of $\tilde O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\big)$, where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based on existing lower bounds in differentially private stochastic optimization. Specifically, we prove a tight upper bound on the strong gap via a novel implementation and analysis of the recursive regularization technique, repurposed for saddle point problems. We show that this rate can be attained with $O\big(\min\big\{\frac{n^2\epsilon^{1.5}}{\sqrt{d}}, n^{3/2}\big\}\big)$ gradient complexity, and $\tilde{O}(n)$ gradient complexity if the loss function is smooth. As a byproduct of our method, we develop a general algorithm that, given black-box access to a subroutine satisfying a certain $\alpha$ primal-dual accuracy guarantee with respect to the empirical objective, gives a solution to the stochastic saddle point problem with a strong gap of $\tilde{O}(\alpha+\frac{1}{\sqrt{n}})$. We show that this $\alpha$-accuracy condition is satisfied by standard algorithms for the empirical saddle point problem, such as the proximal point method and the stochastic gradient descent ascent algorithm. Further, we show that even for simple problems it is possible for an algorithm to have zero weak gap and suffer from $\Omega(1)$ strong gap. We also show that there exists a fundamental tradeoff between stability and accuracy. Specifically, we show that any $\Delta$-stable algorithm has empirical gap $\Omega\big(\frac{1}{\Delta n}\big)$, and that this bound is tight. This result also holds for empirical risk minimization problems and may be of independent interest.
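The abstract names stochastic gradient descent ascent as one empirical saddle-point subroutine satisfying the $\alpha$-accuracy condition. Below is a hedged sketch of a differentially private variant (noisy SGDA): the noise scale is loosely calibrated in the spirit of standard DP-SGD composition bounds, a real implementation would use a privacy accountant, and this is not the recursive-regularization algorithm itself. The `grad_x`/`grad_y` per-sample oracle interface is an assumption.

```python
import numpy as np

def dp_sgda(grad_x, grad_y, x0, y0, n, steps, lr, clip, eps, delta, seed=0):
    """Noisy stochastic gradient descent ascent on an empirical saddle point."""
    rng = np.random.default_rng(seed)
    # rough Gaussian-mechanism scale over `steps` compositions (illustrative)
    sigma = clip * np.sqrt(steps * np.log(1.0 / delta)) / (n * eps)
    x, y = x0.astype(float).copy(), y0.astype(float).copy()
    for _ in range(steps):
        i = rng.integers(n)
        gx, gy = grad_x(x, y, i), grad_y(x, y, i)
        gx = gx * min(1.0, clip / (np.linalg.norm(gx) + 1e-12))  # per-sample clipping
        gy = gy * min(1.0, clip / (np.linalg.norm(gy) + 1e-12))
        x = x - lr * (gx + sigma * rng.standard_normal(x.shape))  # descent in x
        y = y + lr * (gy + sigma * rng.standard_normal(y.shape))  # ascent in y
    return x, y
```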
We study cut finite element discretizations of a Darcy interface problem based on the mixed finite element pairs $\textbf{RT}_0\times Q_0$, $\textbf{BDM}_1\times Q_0$, and $\textbf{RT}_1\times Q_1$. Here $Q_k$ is the space of discontinuous polynomial functions of degree $k$, $\textbf{RT}_{k}$ is the Raviart-Thomas space, and $\textbf{BDM}_k$ is the Brezzi-Douglas-Marini space. We show that the standard ghost penalty stabilization, often added in the weak forms of cut finite element methods for stability and control of the condition number of the resulting linear system matrix, destroys the divergence-free property of the considered element pairs. Therefore, we propose two corrections to the standard stabilization strategy: using macro-elements and new stabilization terms for the pressure. By decomposing the computational mesh into macro-elements and applying ghost penalty terms only on interior edges of macro-elements, stabilization is active only where needed. By modifying the standard stabilization terms for the pressure we recover the optimal approximation of the divergence without losing control of the condition number of the linear system matrix. We derive a priori error estimates for the proposed unfitted finite element discretization based on $\textbf{RT}_k\times Q_k$, $k\geq 0$. Numerical experiments indicate that with the new method we have 1) optimal rates of convergence of the approximate velocity and pressure; 2) well-posed linear systems where the condition number of the system matrix scales as it does for fitted finite element discretizations; 3) optimal rates of convergence of the approximate divergence with pointwise divergence-free approximations of solenoidal velocity fields. All three properties hold independently of how the interface is positioned relative to the computational mesh.
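To make the macro-element construction concrete, here is a hedged library-agnostic sketch: each badly cut cell is attached to a nearby cell with a large intersection with the physical domain, and ghost-penalty terms are assembled only on edges interior to the resulting macro-elements. The `cells`, `neighbors`, and `has_large_intersection` interfaces are illustrative assumptions, not any FEM library's API.

```python
def macro_element_ghost_edges(cells, neighbors, has_large_intersection):
    """Return the edges (cell pairs) on which ghost penalty should act."""
    # root each cell with a large intersection at itself
    root = {c: c for c in cells if has_large_intersection(c)}
    changed = True
    while changed:  # greedily attach unrooted (badly cut) cells to a rooted neighbor
        changed = False
        for c in cells:
            if c in root:
                continue
            for nb in neighbors(c):
                if nb in root:
                    root[c] = root[nb]
                    changed = True
                    break
    # an edge is interior to a macro-element iff both cells share the same root
    return [(c, nb) for c in cells for nb in neighbors(c)
            if c < nb and c in root and nb in root and root[c] == root[nb]]
```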
Singularly perturbed problems present inherent difficulties due to the presence of a thin boundary layer in their solutions. To overcome these difficulties, we propose using deep operator networks (DeepONets), a method previously shown to be effective in approximating nonlinear operators between infinite-dimensional Banach spaces. In this paper, we demonstrate for the first time the application of DeepONets to one-dimensional singularly perturbed problems, achieving promising results that suggest their potential as a robust tool for solving this class of problems. We consider the convergence rate of the approximation error incurred by the operator networks in approximating the solution operator, and examine the generalization gap and empirical risk, all of which are shown to converge uniformly with respect to the perturbation parameter. By using Shishkin mesh points as the locations at which the loss function is evaluated, we conduct several numerical experiments that provide further support for the effectiveness of operator networks in capturing the singular boundary layer behavior.
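The following sketch shows the two ingredients the abstract combines: a piecewise-uniform Shishkin mesh (here for a layer at $x=1$, with the usual textbook transition point) and a minimal branch/trunk DeepONet. Widths, depths, activations, and the constant `sigma` are illustrative assumptions, not the paper's exact architecture or mesh parameters.

```python
import numpy as np
import torch
import torch.nn as nn

def shishkin_points(N, eps, sigma=2.0):
    """Piecewise-uniform Shishkin mesh on [0, 1] with a boundary layer at x = 1."""
    tau = min(0.5, sigma * eps * np.log(N))  # transition point (standard choice)
    coarse = np.linspace(0.0, 1.0 - tau, N // 2 + 1)
    fine = np.linspace(1.0 - tau, 1.0, N // 2 + 1)
    return np.unique(np.concatenate([coarse, fine]))

class DeepONet(nn.Module):
    """Minimal DeepONet: branch net encodes the input function at m sensors,
    trunk net encodes the evaluation point y; G(u)(y) is their inner product."""
    def __init__(self, m, p=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m, 128), nn.Tanh(), nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, p))

    def forward(self, u_sensors, y):
        return (self.branch(u_sensors) * self.trunk(y)).sum(dim=-1, keepdim=True)
```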
The majority of machine learning methods can be regarded as the minimization of an unavailable risk function. To optimize the latter, given samples provided in a streaming fashion, we define a general stochastic Newton algorithm and its weighted average version. In several use cases, both implementations are shown not to require the inversion of a Hessian estimate at each iteration; instead, the estimate of the inverse Hessian is updated directly. This generalizes a trick introduced in [2] for the specific case of logistic regression. Under mild assumptions such as local strong convexity at the optimum, we establish almost sure convergence and rates of convergence of the algorithms, as well as central limit theorems for the constructed parameter estimates. The unified framework considered in this paper covers the cases of linear, logistic, and softmax regression, to name a few. Numerical experiments on simulated data provide empirical evidence of the pertinence of the proposed methods, which outperform popular competitors particularly in the case of bad initializations.
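A hedged sketch of the kind of direct inverse-Hessian recursion the abstract alludes to: in logistic regression the per-sample Hessian contribution is rank one, so a Sherman-Morrison update keeps each Newton step at $O(d^2)$ with no matrix inversion. Step sizes, averaging, and regularization are simplified assumptions relative to the paper.

```python
import numpy as np

def sm_inverse_update(Hinv, u, c):
    """Sherman-Morrison: if H_new = H + c * u u^T, then
    H_new^{-1} = Hinv - c (Hinv u)(Hinv u)^T / (1 + c u^T Hinv u)."""
    v = Hinv @ u
    return Hinv - c * np.outer(v, v) / (1.0 + c * (u @ v))

def stochastic_newton_logistic(stream, d):
    """Toy streaming stochastic Newton iteration for logistic regression."""
    theta = np.zeros(d)
    Hinv = np.eye(d)  # inverse of an identity prior plus the running Hessian estimate
    for x, y in stream:  # (features, 0/1 label) pairs arriving online
        p = 1.0 / (1.0 + np.exp(-x @ theta))
        Hinv = sm_inverse_update(Hinv, x, p * (1.0 - p))  # rank-one curvature update
        theta = theta - Hinv @ ((p - y) * x)  # Newton step, no inversion needed
    return theta
```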
The extremal theory of forbidden 0--1 matrices studies the asymptotic growth of the function $\mathrm{Ex}(P,n)$, which is the maximum weight of a matrix $A\in\{0,1\}^{n\times n}$ whose submatrices avoid a fixed pattern $P\in\{0,1\}^{k\times l}$. This theory has been wildly successful at resolving problems in combinatorics, discrete and computational geometry, structural graph theory, and the analysis of data structures, particularly corollaries of the dynamic optimality conjecture. All these applications use acyclic patterns, meaning that when $P$ is regarded as the adjacency matrix of a bipartite graph, the graph is acyclic. The biggest open problem in this area is to bound $\mathrm{Ex}(P,n)$ for acyclic $P$. Prior results have only ruled out the strict $O(n\log n)$ bound conjectured by Füredi and Hajnal. It is consistent with prior results that $\forall P.\, \mathrm{Ex}(P,n)\leq n\log^{1+o(1)} n$, and also consistent that $\forall \epsilon>0.\,\exists P.\, \mathrm{Ex}(P,n) \geq n^{2-\epsilon}$. In this paper we establish a stronger lower bound on the extremal functions of acyclic $P$. Specifically, we give a new construction of relatively dense 0--1 matrices with $\Theta(n(\log n/\log\log n)^t)$ 1s that avoid an acyclic pattern $X_t$. Pach and Tardos have conjectured that this type of result is the best possible, i.e., that no acyclic $P$ exists for which $\mathrm{Ex}(P,n)\geq n(\log n)^{\omega(1)}$.
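To make the definition of $\mathrm{Ex}(P,n)$ concrete: $A$ contains $P$ if some $k$ rows and $l$ columns of $A$, kept in order, dominate $P$ entrywise (every 1 of $P$ must meet a 1 of $A$; 0s of $P$ are unconstrained), and $\mathrm{Ex}(P,n)$ is the maximum number of 1s in an $n\times n$ matrix containing no copy of $P$. The brute-force checker below is exponential and meant only to illustrate the definition on tiny matrices.

```python
import numpy as np
from itertools import combinations

def contains(A, P):
    """Brute-force 0-1 pattern containment check (illustrative only)."""
    (n, m), (k, l) = A.shape, P.shape
    for rows in combinations(range(n), k):
        R = A[list(rows), :]
        for cols in combinations(range(m), l):
            # entrywise domination: every 1 of P sits on a 1 of A
            if np.all(R[:, list(cols)] >= P):
                return True
    return False
```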
We prove the first polynomial separation between the randomized and deterministic time-space tradeoffs of multi-output functions. In particular, we present a total function that, on an input of $n$ elements in $[n]$, outputs $O(n)$ elements, such that: (1) there exists a randomized oblivious algorithm with space $O(\log n)$, time $O(n\log n)$, and one-way access to randomness, that computes the function with probability $1-O(1/n)$; (2) any deterministic oblivious branching program with space $S$ and time $T$ that computes the function must satisfy $T^2S\geq\Omega(n^{2.5}/\log n)$. This implies that logspace randomized algorithms for multi-output functions cannot be black-box derandomized without an $\widetilde{\Omega}(n^{1/4})$ overhead in time. Since previously all polynomial time-space tradeoffs for multi-output functions were proved via the Borodin-Cook method, a probabilistic method that inherently gives the same lower bound for randomized and deterministic branching programs, our lower bound proof is intrinsically different from previous works. We also examine other natural candidates for proving such separations, and show that any polynomial separation for these problems would resolve the long-standing open problem of proving an $n^{1+\Omega(1)}$ time lower bound for decision problems with $\mathrm{polylog}(n)$ space.
Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that both types of variance must be mitigated to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization than existing methods.
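A hedged sketch of the gradient-information idea: importance-sampling probabilities proportional to (approximate) per-node gradient norms minimize the variance of the unbiased mean-gradient estimator below. The `grads`/`scores` interface (per-node gradients and stale norm estimates) is illustrative, not the paper's implementation.

```python
import numpy as np

def importance_sampled_gradient(grads, scores, batch_size, seed=0):
    """Unbiased mean-gradient estimate with norm-proportional sampling."""
    rng = np.random.default_rng(seed)
    p = scores / scores.sum()  # variance-minimizing probabilities ~ gradient norms
    idx = rng.choice(len(p), size=batch_size, p=p)
    # reweight each draw by 1 / (n * p_i) so the estimator stays unbiased
    return (grads[idx] / (len(p) * p[idx])[:, None]).mean(axis=0)
```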
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings.
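A hedged sketch of the two-stage paradigm the overview reviews, instantiated for real-valued phase retrieval $y_i = (a_i^\top x)^2$: a spectral initialization from the top eigenvector of $\frac{1}{m}\sum_i y_i a_i a_i^\top$, followed by plain gradient descent on the empirical least-squares loss. The step size and iteration count are illustrative constants, not the tuned choices of any specific paper.

```python
import numpy as np

def two_stage_phase_retrieval(A, y, iters=500, lr=0.1):
    """Spectral initialization + gradient descent for (a_i^T x)^2 = y_i."""
    m, n = A.shape
    # stage 1: spectral initialization
    Y = (A * y[:, None]).T @ A / m
    _, V = np.linalg.eigh(Y)
    x = V[:, -1] * np.sqrt(y.mean())  # eigh sorts ascending; sqrt(mean y) estimates ||x||
    # stage 2: gradient descent on f(x) = (1/4m) sum_i ((a_i^T x)^2 - y_i)^2
    for _ in range(iters):
        Ax = A @ x
        x = x - lr * (A.T @ ((Ax**2 - y) * Ax)) / m
    return x
```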
Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. A natural thought, therefore, is to perform model compression and acceleration in deep networks without significantly decreasing model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey the recent advanced techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, after which the other techniques are introduced. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then go through a few very recent, additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude this paper and discuss remaining challenges and possible directions on this topic.
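As a minimal illustration of one of the four surveyed schemes, low-rank factorization, the sketch below replaces a dense layer weight $W \in \mathbb{R}^{m\times n}$ by the product of two thin factors, cutting the parameter count from $mn$ to $r(m+n)$. Truncated SVD is one standard choice of factorization; the rank $r$ trades accuracy for compression.

```python
import numpy as np

def low_rank_compress(W, r):
    """Approximate W (m x n) by two rank-r factors via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]  # W ~ (U_r diag(s_r)) @ V_r^T
```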
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
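The re-weighting scheme follows directly from the formula in the abstract: class $c$ with $n_c$ samples gets weight inversely proportional to its effective number $(1-\beta^{n_c})/(1-\beta)$. The final normalization (weights averaging to 1 across classes) is one common convention, assumed here rather than taken from the paper.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Per-class loss weights from effective numbers of samples."""
    eff_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff_num  # inverse effective number
    return w * len(counts) / w.sum()  # normalize so weights average to 1
```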