Regularized $m$-estimators are widely used due to their ability to recover a low-dimensional model in high-dimensional scenarios. Some recent efforts on this subject have focused on creating a unified framework for establishing oracle bounds and deriving conditions for support recovery. Under this same framework, we propose a new Generalized Information Criterion (GIC) that takes into consideration the sparsity pattern one wishes to recover. We obtain non-asymptotic model selection bounds and sufficient conditions for model selection consistency of the GIC. Furthermore, we show that the GIC can also be used for selecting the regularization parameter within a regularized $m$-estimation framework, which allows practical use of the GIC for model selection in high-dimensional scenarios. We provide examples of group LASSO in the context of generalized linear regression and low-rank matrix regression.
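For intuition, criteria of this kind typically trade off fit against a penalty on the complexity of the recovered sparsity pattern; schematically (our notation, not the paper's exact definition),
\[
\operatorname{GIC}(\lambda) \;=\; \mathcal{L}_n\!\left(\hat{\theta}_\lambda\right) \;+\; a_n \,\Omega\!\left(\operatorname{supp}\big(\hat{\theta}_\lambda\big)\right),
\]
where $\mathcal{L}_n$ is the empirical loss, $\hat{\theta}_\lambda$ is the regularized $m$-estimator at regularization level $\lambda$, $a_n$ is a penalty scaling, and $\Omega$ measures the complexity of the selected sparsity pattern; the criterion is then minimized over $\lambda$.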
To promote the generalization ability of breast tumor segmentation models, and to improve segmentation performance for breast tumors of small size, low contrast, and irregular shape, we propose a progressive dual priori network (PDPNet) to segment breast tumors from dynamic contrast-enhanced magnetic resonance images (DCE-MRI) acquired at different sites. PDPNet first crops tumor regions with a coarse-segmentation-based localization module, then progressively refines the breast tumor mask using weak semantic priors and cross-scale correlation prior knowledge. To validate the effectiveness of PDPNet, we compared it with several state-of-the-art methods on multi-center datasets. The results show that, compared with the second-best method, the DSC, SEN, KAPPA, and HD95 of PDPNet improved by 3.63\%, 8.19\%, 5.52\%, and 3.66\%, respectively. In addition, ablation studies demonstrate that the proposed localization module reduces the influence of normal tissues and therefore improves the generalization ability of the model. The weak semantic priors allow the model to focus on tumor regions and avoid missing small and low-contrast tumors, while the cross-scale correlation priors improve shape awareness for irregular tumors. Integrating them in a unified framework thus improves multi-center breast tumor segmentation performance.
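A minimal sketch of the coarse-to-fine idea described above (hypothetical module and function names, not the authors' implementation): a coarse segmentation localizes the tumor, and the image is cropped to its bounding box before refinement, so later stages see fewer normal tissues.

```python
import torch
import torch.nn as nn

class CoarseLocalizer(nn.Module):
    """Tiny stand-in for a coarse-segmentation-based localization module."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 1, 1))

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # coarse tumor probability map

def crop_to_coarse_mask(image, coarse_prob, thresh=0.5, pad=16):
    """Crop the image to the bounding box of the coarse mask, plus padding."""
    ys, xs = torch.nonzero(coarse_prob[0, 0] > thresh, as_tuple=True)
    if ys.numel() == 0:      # fall back to the full image if nothing is found
        return image
    y0 = max(int(ys.min()) - pad, 0)
    y1 = min(int(ys.max()) + pad, image.shape[-2])
    x0 = max(int(xs.min()) - pad, 0)
    x1 = min(int(xs.max()) + pad, image.shape[-1])
    return image[..., y0:y1, x0:x1]
```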
We introduce two new stochastic conjugate gradient frameworks for a class of nonconvex and possibly nonsmooth optimization problems. These frameworks are built upon the Stochastic Recursive Gradient Algorithm (SARAH), and we thus refer to them as Acc-Prox-CG-SARAH and Acc-Prox-CG-SARAH-RS, respectively. They are efficiently accelerated, easy to implement, tuning-free, and can be smoothly extended and modified. We devise a deterministic restart scheme for stochastic optimization and apply it in our second stochastic conjugate framework; this scheme constitutes the key difference between the two approaches. In addition, we apply the ProbAbilistic Gradient Estimator (PAGE) and further develop a practical variant, denoted Acc-Prox-CG-SARAH-ST, in order to reduce potential computational overhead. We provide comprehensive and rigorous convergence analyses for all three approaches and establish linear convergence rates for unconstrained minimization problems with nonconvex and nonsmooth objective functions. Experiments demonstrate that Acc-Prox-CG-SARAH and Acc-Prox-CG-SARAH-RS consistently outperform state-of-the-art methods, and that Acc-Prox-CG-SARAH-ST achieves comparable convergence speed. Both in theory and in experiments, we verify the strong computational efficiency of the deterministic restart scheme in stochastic optimization methods.
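For context, a minimal sketch of the underlying SARAH recursive gradient estimator that these frameworks build on (illustration only; the accelerated conjugate-gradient, proximal, and restart machinery of the proposed methods is omitted):

```python
import numpy as np

def sarah_epoch(w, grad_full, grad_i, n, step, n_inner, rng):
    """One SARAH epoch: full gradient at the snapshot, then recursive
    stochastic updates v_t = grad_i(w_t) - grad_i(w_{t-1}) + v_{t-1}."""
    v = grad_full(w)                 # v_0: exact gradient at the snapshot
    w_prev, w = w, w - step * v
    for _ in range(n_inner):
        i = rng.integers(n)          # pick one component uniformly
        v = grad_i(w, i) - grad_i(w_prev, i) + v
        w_prev, w = w, w - step * v
    return w
```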
We propose a differentiable vertex fitting algorithm that can be used for secondary vertex fitting and that can be seamlessly integrated into neural networks for jet flavour tagging. Vertex fitting is formulated as an optimization problem in which gradients of the optimized vertex solution are defined through implicit differentiation and can be passed to upstream or downstream neural network components for network training. More broadly, this is an application of differentiable programming to integrate physics knowledge into neural network models in high energy physics. We demonstrate how differentiable secondary vertex fitting can be integrated into larger transformer-based models for flavour tagging and can improve heavy-flavour jet classification.
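Concretely, implicit differentiation rests on the implicit function theorem: if the fitted vertex $x^\ast(\theta) = \arg\min_x f(x, \theta)$ is a stationary point of a smooth objective $f$ (e.g., a track-based $\chi^2$) with inputs $\theta$, then
\[
\frac{\partial x^\ast}{\partial \theta} \;=\; -\left(\nabla^2_{xx} f\big(x^\ast, \theta\big)\right)^{-1} \nabla^2_{x\theta} f\big(x^\ast, \theta\big),
\]
so gradients can be propagated through the fit without unrolling the optimizer (notation ours; the paper's exact objective may differ).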
Sequential transfer optimization (STO), which aims to improve optimization performance on a task of interest by exploiting the knowledge captured from several previously solved optimization tasks stored in a database, has been gaining increasing research attention over the years. However, despite the remarkable advances in algorithm design, the development of a systematic benchmark suite for comprehensive comparisons of STO algorithms has received far less attention. Existing test problems are either simply generated by assembling other benchmark functions or extended from specific practical problems with limited scalability. The relationships between the optimal solutions of the source and target tasks in these problems are also often manually configured, limiting their ability to model the different similarity relationships present in real-world problems. Consequently, the good performance achieved by an algorithm on these problems might be biased and hard to generalize to other problems. In light of the above, in this study we first introduce four concepts for characterizing STO problems and present an important problem feature, namely the similarity distribution, which quantitatively delineates the relationship between the optima of the source and target tasks. We then present general design guidelines for STO problems and a particular STO problem generator with good scalability. Specifically, the similarity distribution of a problem can be easily customized, enabling a continuous spectrum of representations of the diverse similarity relationships found in real-world problems. Lastly, a benchmark suite of 12 STO problems featuring a variety of customized similarity relationships is developed using the proposed generator. The source code of the problem generator is available at //github.com/XmingHsueh/STOP-G.
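A toy illustration of the similarity-distribution idea (our reading, not the proposed generator): place each source task's optimum at a distance from the target optimum governed by a user-specified similarity sampler, so the suite's similarity profile is controlled by a single distribution.

```python
import numpy as np

def make_source_optima(target_opt, n_sources, similarity_sampler, rng):
    """similarity_sampler() -> s in [0, 1], where s = 1 means identical optima."""
    d = len(target_opt)
    optima = []
    for _ in range(n_sources):
        s = similarity_sampler()
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)
        optima.append(np.asarray(target_opt) + (1.0 - s) * direction)
    return np.array(optima)

rng = np.random.default_rng(0)
# e.g., a suite where most sources are highly similar to the target:
sources = make_source_optima(np.zeros(5), 10, lambda: rng.beta(5, 1), rng)
```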
We provide a variety of lower bounds for the well-known shortcut set problem: how much can one decrease the diameter of a directed graph on $n$ vertices and $m$ edges by adding $O(n)$ or $O(m)$ shortcuts from the transitive closure of the graph? Our results are based on a vast simplification of the recent construction of Bodwin and Hoppenworth [FOCS 2023], which was used to show an $\widetilde{\Omega}(n^{1/4})$ lower bound for the $O(n)$-sized shortcut set problem. We highlight that our simplification completely removes the use of the convex sets of B\'ar\'any and Larman [Math. Ann. 1998] used in all previous lower bound constructions. Our simplification also removes the need for randomness and further shaves some log factors. This allows us to generalize the construction to higher dimensions, which in turn can be used to show the following results. For $O(m)$-sized shortcut sets, we show an $\Omega(n^{1/5})$ lower bound, improving on the previous best $\Omega(n^{1/8})$ lower bound. For all $\varepsilon > 0$, we show that there exists a $\delta > 0$ such that there are $n$-vertex, $O(n)$-edge graphs $G$ for which adding any shortcut set of size $O(n^{2-\varepsilon})$ keeps the diameter of $G$ at $\Omega(n^\delta)$. This improves the sparsity of the constructed graph compared to a known similar result by Hesse [SODA 2003]. We also consider the sourcewise setting for shortcut sets: given a graph $G=(V,E)$ and a set $S\subseteq V$, how much can we decrease the sourcewise diameter of $G$, $\max_{(s, v) \in S \times V,\, \text{dist}(s, v) < \infty} \text{dist}(s,v)$, by adding a set of edges $H$ from the transitive closure of $G$? We show that for any integer $d \ge 2$, there exists a graph $G=(V, E)$ on $n$ vertices and a set $S \subseteq V$ with $|S| = \widetilde{\Theta}(n^{3/(d+3)})$ such that, when adding $O(n)$ or $O(m)$ shortcuts, the sourcewise diameter remains $\widetilde{\Omega}(|S|^{1/3})$.
The ability to measure the satisfaction of (groups of) voters is a crucial prerequisite for formulating proportionality axioms in approval-based participatory budgeting elections. Two common but very different ways to measure the satisfaction of a voter consider (i) the number of approved projects and (ii) the total cost of approved projects, respectively. In general, it is difficult to decide which measure of satisfaction best reflects the voters' true utilities. In this paper, we study proportionality axioms with respect to large classes of approval-based satisfaction functions. We establish logical implications among our axioms and related notions from the literature, and we ask whether outcomes can be achieved that are proportional with respect to more than one satisfaction function. We show that this is impossible for the two commonly used satisfaction functions when considering proportionality notions based on extended justified representation, but achievable for a notion based on proportional justified representation. For the latter result, we introduce a strengthening of priceability and show that it is satisfied by several polynomial-time computable rules, including the Method of Equal Shares and Phragm\`en's sequential rule.
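The two standard satisfaction measures mentioned above, written out for concreteness (function names are ours, not the paper's notation):

```python
def cardinality_satisfaction(approved, outcome):
    """(i) the number of funded projects the voter approves of."""
    return len(approved & outcome)

def cost_satisfaction(approved, outcome, cost):
    """(ii) the total cost of the funded projects the voter approves of."""
    return sum(cost[p] for p in approved & outcome)

approved = {"a", "b", "d"}          # projects this voter approves
outcome = {"b", "c", "d"}           # funded projects
cost = {"a": 3, "b": 1, "c": 5, "d": 2}
print(cardinality_satisfaction(approved, outcome))   # 2
print(cost_satisfaction(approved, outcome, cost))    # 3
```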
Generating mathematical equations from natural language requires an accurate understanding of the relations among math expressions. Existing approaches can be broadly categorized into token-level and expression-level generation. The former treats equations as a mathematical language, sequentially generating math tokens; expression-level methods generate each expression one by one. However, each expression represents a solving step, and there naturally exist parallel or dependent relations between these steps, which are ignored by current sequential methods. Therefore, we integrate tree structure into expression-level generation and advocate an expression tree decoding strategy. To generate a tree whose nodes are expressions, we employ a layer-wise parallel decoding strategy: we decode multiple independent expressions (leaf nodes) in parallel at each layer and repeat this parallel decoding layer by layer to sequentially generate the parent-node expressions that depend on others. In addition, a bipartite matching algorithm is adopted to align the multiple predictions with the annotations at each layer. Experiments show that our method outperforms other baselines, especially on equations with complex structures.
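A minimal sketch of the alignment step: match a layer's predicted expressions to gold expressions by minimizing total assignment cost (the Hungarian algorithm). The cost matrix here is a placeholder; the paper's cost is model-specific.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(cost):
    """cost[i, j]: cost of assigning prediction i to gold expression j."""
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9]])
print(align(cost))   # [(0, 0), (1, 1)]
```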
Variational flows allow practitioners to learn complex continuous distributions, but approximating discrete distributions remains a challenge. Current methodologies typically embed the discrete target in a continuous space, usually via continuous relaxation or dequantization, and then apply a continuous flow. These approaches involve a surrogate target that may not capture the original discrete target, might have biased or unstable gradients, and can create a difficult optimization problem. In this work, we develop a variational flow family for discrete distributions without any continuous embedding. First, we develop a measure-preserving and discrete (MAD) invertible map that leaves the discrete target invariant, and then create a mixed variational flow (MAD Mix) based on that map. Our family provides access to i.i.d. sampling and density evaluation with virtually no tuning effort. We also develop an extension to MAD Mix that handles joint discrete and continuous models. Our experiments suggest that MAD Mix produces more reliable approximations than continuous-embedding flows while being significantly faster to train.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks; hence, late-stage fusion of final representations or predictions from each modality (`late fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and share only what is necessary. We find that such a strategy improves fusion performance while reducing computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple audio-visual classification benchmarks, including Audioset, Epic-Kitchens, and VGGSound. All code and models will be released.
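A sketch of bottleneck-token fusion (illustrative; not the released code): each modality attends only over its own tokens plus a few shared bottleneck tokens, so cross-modal information must pass through the bottleneck.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        n = bottleneck.size(1)
        xa = torch.cat([audio, bottleneck], dim=1)
        xv = torch.cat([video, bottleneck], dim=1)
        za, _ = self.attn_a(xa, xa, xa)  # audio attends to audio + bottleneck
        zv, _ = self.attn_v(xv, xv, xv)  # video attends to video + bottleneck
        # Average the two streams' updated bottleneck tokens to share info.
        new_bottleneck = (za[:, -n:] + zv[:, -n:]) / 2
        return za[:, :-n], zv[:, :-n], new_bottleneck
```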
Embedding entities and relations into a continuous multi-dimensional vector space has become the dominant method for knowledge graph embedding in representation learning. However, most existing models fail to represent hierarchical knowledge, such as the similarities and dissimilarities of entities within one domain. We propose to learn domain representations on top of existing knowledge graph embedding models, such that entities that have similar attributes are organized into the same domain. Such hierarchical knowledge of domains can provide further evidence in link prediction. Experimental results show that domain embeddings yield a significant improvement over the most recent state-of-the-art baseline knowledge graph embedding models.
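One plausible reading of learning domain representations (our illustration, not necessarily the paper's model): encourage entities of the same domain to stay close to a shared domain centroid in embedding space, added as a regularizer on top of any base embedding model.

```python
import torch

def domain_regularizer(entity_emb, domain_of):
    """entity_emb: (n, d) tensor; domain_of: length-n list of domain labels."""
    loss = entity_emb.new_zeros(())
    for dom in set(domain_of):
        idx = torch.tensor([i for i, d in enumerate(domain_of) if d == dom])
        members = entity_emb[idx]
        centroid = members.mean(dim=0, keepdim=True)  # domain representation
        loss = loss + ((members - centroid) ** 2).sum()
    return loss
```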