We study the recursion-theoretic complexity of Positive Almost-Sure Termination ($\mathsf{PAST}$) in an imperative programming language with rational variables, bounded nondeterministic choice, and discrete probabilistic choice. A program terminates positively almost-surely if, for every scheduler, the program terminates almost-surely and the expected runtime to termination is finite. We show that $\mathsf{PAST}$ for our language is complete for the (lightface) co-analytic sets ($\Pi^1_1$-complete). This is in contrast to the related notions of Almost-Sure Termination ($\mathsf{AST}$) and Bounded Termination ($\mathsf{BAST}$), both of which are arithmetical ($\Pi^0_2$-complete and $\Sigma^0_2$-complete, respectively). Our upper bound implies an effective procedure to reduce reasoning about probabilistic termination to non-probabilistic fair termination in a model with bounded nondeterminism, and to simple program termination in models with unbounded nondeterminism. Our lower bound shows the converse: for every program with unbounded nondeterministic choice, there is an effectively computable probabilistic program with bounded choice such that the original program is terminating if and only if the transformed program is $\mathsf{PAST}$. We show that every program has an effectively computable normal form, in which each probabilistic choice either continues or terminates execution immediately, each with probability $1/2$. For normal-form programs, we provide a sound and complete proof rule for $\mathsf{PAST}$. Our proof rule uses transfinite ordinals. We show that reasoning about $\mathsf{PAST}$ requires transfinite ordinals up to $\omega_1^{CK}$; thus, existing techniques for probabilistic termination based on ranking supermartingales that map program states to reals do not suffice to reason about $\mathsf{PAST}$.
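To make the normal form concrete, here is a minimal Python sketch (illustrative only; the paper's normal form is defined over an imperative language with nondeterminism, and all names here are hypothetical) of a program in which every probabilistic choice either halts immediately or continues, each with probability $1/2$, together with a Monte Carlo estimate of its expected runtime:

```python
import random

def run_normal_form(max_steps=10_000):
    """Simulate a toy normal-form probabilistic program: at each step, a fair
    coin flip either terminates execution or continues. Returns the number of
    steps taken, or None if the step budget runs out (essentially never here)."""
    steps = 0
    while steps < max_steps:
        steps += 1
        if random.random() < 0.5:   # probabilistic choice: halt with probability 1/2
            return steps
    return None

# Monte Carlo estimate of the expected runtime.
runs = [run_normal_form() for _ in range(100_000)]
print(sum(r for r in runs if r is not None) / len(runs))
```

For this toy program the runtime is geometrically distributed, so the expected runtime is exactly $2$; a proof rule for $\mathsf{PAST}$ must certify this kind of finiteness for every scheduler.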
The prevalence of powerful multilingual models, such as Whisper, has significantly advanced research on speech recognition. However, these models often struggle to handle code-switching, which is essential in multilingual speech recognition. Recent studies have attempted to address this setting by separating the modules for different languages so as to ensure distinct latent representations per language. Other methods implement a switching mechanism based on language identification. In this study, we propose a new attention-guided adaptation for parameter-efficient learning in bilingual ASR. The method selects the attention heads of a model that most closely express language identity and then guides those heads to attend correctly to their corresponding languages. Experiments on a Mandarin-English code-switching speech corpus show that the proposed approach achieves a 14.2% mixed error rate, surpassing state-of-the-art methods, while training only 5.6% additional parameters over Whisper.
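As a hedged sketch of what guiding selected heads toward language identity could look like (assuming frame-level language labels and access to per-head attention maps; `attention_guidance_loss` and all shapes here are hypothetical, not the paper's exact objective):

```python
import torch

def attention_guidance_loss(attn, lang_ids, guided_heads):
    """attn: [B, H, T, T] softmax attention maps from one layer.
    lang_ids: [B, T] integer language label per frame (e.g., 0=Mandarin, 1=English).
    guided_heads: indices of heads selected as language-sensitive.
    Encourages guided heads to place attention mass on same-language frames."""
    same_lang = (lang_ids.unsqueeze(2) == lang_ids.unsqueeze(1)).float()      # [B, T, T]
    mass_on_same = (attn[:, guided_heads] * same_lang.unsqueeze(1)).sum(-1)   # [B, |G|, T]
    return -(mass_on_same.clamp_min(1e-8).log()).mean()

# Toy usage with random tensors.
B, H, T = 2, 8, 16
attn = torch.softmax(torch.randn(B, H, T, T), dim=-1)
lang = torch.randint(0, 2, (B, T))
print(attention_guidance_loss(attn, lang, guided_heads=[1, 4]))
```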
We study a truthful two-facility location problem in which a set of agents have private positions on the line of real numbers and known approval preferences over two facilities. Given the locations of the two facilities, the cost of an agent is the total distance from the facilities she approves. The goal is to decide where to place the facilities from a given finite set of candidate locations so as to (a) approximately optimize desired social objectives, and (b) incentivize the agents to truthfully report their private positions. We focus on the class of deterministic strategyproof mechanisms and pinpoint the ones with the best possible approximation ratio in terms of the social cost (i.e., the total cost of the agents) and the max cost. In particular, for the social cost, we show a tight bound of $1+\sqrt{2}$ when the preferences of the agents are homogeneous (i.e., all agents approve both facilities), and a tight bound of $3$ when the preferences might be heterogeneous. For the max cost, we show tight bounds of $2$ and $3$ for homogeneous and heterogeneous preferences, respectively.
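The cost model is easy to state in code. A small Python sketch (names hypothetical) computing the per-agent cost, the social cost, and the max cost for given facility locations:

```python
def agent_cost(pos, approved, locations):
    """Cost of one agent: total distance to the facilities she approves.
    pos: agent position; approved: subset of {0, 1}; locations: (x0, x1)."""
    return sum(abs(pos - locations[j]) for j in approved)

def social_and_max_cost(positions, approvals, locations):
    costs = [agent_cost(p, a, locations) for p, a in zip(positions, approvals)]
    return sum(costs), max(costs)

# Three agents on the line; agent 1 approves only facility 0 (heterogeneous case).
positions = [0.0, 2.0, 5.0]
approvals = [{0, 1}, {0}, {0, 1}]
print(social_and_max_cost(positions, approvals, locations=(1.0, 4.0)))  # (11.0, 5.0)
```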
We propose DropTop, a novel framework that suppresses shortcut bias in online continual learning (OCL) while adapting to the varying degree of shortcut bias incurred by a continuously changing environment. Based on the observation that shortcut features tend to attract high attention, highly activated features are treated as candidates for debiasing. More importantly, to overcome the limitation of the online environment, where prior knowledge and auxiliary data are not available, two novel techniques -- feature map fusion and adaptive intensity shifting -- enable us to automatically determine the appropriate level and proportion of the candidate shortcut features to be dropped. Extensive experiments on five benchmark datasets demonstrate that, when combined with various OCL algorithms, DropTop increases average accuracy by up to 10.4% and decreases forgetting by up to 63.2%.
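A minimal sketch of the high-activation heuristic, assuming a CNN-style feature map: zero out the most highly activated spatial positions as shortcut candidates. The fixed `drop_ratio` here is a placeholder; in the paper, the level and proportion are determined adaptively via feature map fusion and adaptive intensity shifting:

```python
import torch

def drop_top_features(feature_map, drop_ratio=0.1):
    """Zero out the most highly activated spatial positions of a feature map,
    treating them as shortcut candidates. feature_map: [B, C, H, W]."""
    B, C, H, W = feature_map.shape
    activation = feature_map.abs().mean(dim=1).flatten(1)   # [B, H*W] channel-avg activation
    k = max(1, int(drop_ratio * activation.shape[1]))
    top_idx = activation.topk(k, dim=1).indices             # [B, k] most-activated positions
    mask = torch.ones_like(activation)
    mask.scatter_(1, top_idx, 0.0)                          # drop those positions
    return feature_map * mask.view(B, 1, H, W)

x = torch.randn(4, 16, 8, 8)
print(drop_top_features(x, drop_ratio=0.2).shape)           # torch.Size([4, 16, 8, 8])
```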
Leveraging machine-learning methods to predict outcomes on unlabeled datasets and then using these pseudo-outcomes in subsequent statistical inference is common in modern data analysis; inference in this setting is often called post-prediction inference. We propose a novel, assumption-lean framework for inference in the post-prediction setting, called \emph{Prediction De-Correlated inference} (PDC). Our approach automatically adapts to any black-box machine-learning model and consistently outperforms supervised methods. The PDC framework also extends easily to accommodate multiple predictive models. Both numerical results and real-world data analyses support our theoretical findings.
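To fix ideas, a toy sketch of the post-prediction setting and one simple bias-corrected mean estimator from this literature (shown only to illustrate the setting; PDC's actual construction differs and is given in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_lab, n_unlab = 500, 20_000

# True model: Y = 1.5 * X + noise, with E[X] = 1, so E[Y] = 1.5.
x_lab = rng.normal(1.0, 1.0, size=n_lab)
y_lab = 1.5 * x_lab + rng.normal(size=n_lab)
x_unlab = rng.normal(1.0, 1.0, size=n_unlab)       # outcomes unobserved here

f = lambda x: 1.3 * x                              # a miscalibrated black-box predictor

naive = f(x_unlab).mean()                          # plugs in pseudo-outcomes; biased (~1.3)
corrected = naive + (y_lab - f(x_lab)).mean()      # recenter with labeled residuals (~1.5)
print(naive, corrected)
```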
We describe two algorithms for multiplying $n \times n$ matrices using time and energy $n^2 \operatorname{polylog}(n)$ under basic models of classical physics. The first algorithm is for multiplying integer-valued matrices; the second, quite different, algorithm is for Boolean matrix multiplication. We hope this work inspires a deeper consideration of physically plausible/realizable models of computing that might allow for algorithms that improve upon the runtimes and energy usages suggested by the parallel RAM model, in which each operation requires one unit of time and one unit of energy.
Recent work by Dhulipala, Liu, Raskhodnikova, Shi, Shun, and Yu~\cite{DLRSSY22} initiated the study of the $k$-core decomposition problem under differential privacy. They show that approximate $k$-core numbers can be released under differential privacy while incurring only a multiplicative error of $(2+\eta)$ (for any constant $\eta > 0$) and an additive error of $\mathrm{poly}(\log n)/\varepsilon$. In this paper, we revisit this problem. Our main result is an $\varepsilon$-edge differentially private algorithm for $k$-core decomposition which outputs the core numbers with no multiplicative error and $O(\log(n)/\varepsilon)$ additive error. This improves upon previous work by a factor of $2$ in the multiplicative error, while giving near-optimal additive error. With a little additional work, this implies improved algorithms for densest subgraph and low out-degree ordering under differential privacy. For low out-degree ordering, we give an $\varepsilon$-edge differentially private algorithm which outputs an implicit orientation such that the out-degree of each vertex is at most $d + O(\log(n)/\varepsilon)$, where $d$ is the degeneracy of the graph. This improves upon the best known guarantees for the problem by a factor of $4$ and gives near-optimal additive error. For densest subgraph, we give an $\varepsilon$-edge differentially private algorithm outputting a subset of nodes that induces a subgraph of density at least $D^*/2 - O(\log(n)/\varepsilon)$, where $D^*$ is the density of the optimal subgraph.
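For reference, a short Python sketch of exact $k$-core decomposition by min-degree peeling, followed by per-vertex Laplace noise. The noise line is purely illustrative: naively perturbing each core number does not by itself certify $\varepsilon$-edge differential privacy for the whole output, and the paper's mechanism is more involved:

```python
import heapq, random

def core_numbers(adj):
    """Exact k-core decomposition by min-degree peeling.
    adj: dict mapping each vertex to a set of neighbor vertices."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core = set(), {}
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                                 # stale heap entry
        core[v] = d
        removed.add(v)
        for u in adj[v]:
            if u not in removed and deg[u] > d:      # degrees never drop below d
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}   # a triangle plus a pendant vertex
core = core_numbers(adj)                             # {0: 2, 1: 2, 2: 2, 3: 1}
eps = 1.0                                            # illustrative noise only, not a DP proof
noisy = {v: c + random.expovariate(eps) - random.expovariate(eps)  # Laplace(1/eps)
         for v, c in core.items()}
print(core, noisy)
```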
Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and about what would have happened in hindsight (i.e., counterfactuals). We categorize work in CausalML into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, and (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.
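As a one-screen illustration of the SCM formalism underlying CausalML (a toy model, not from the survey): a structural causal model $Z \to X \to Y$ with confounding $Z \to Y$, where the intervention $do(X = 1)$ severs $Z$'s influence on $X$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM: Z -> X -> Y and Z -> Y (Z confounds X and Y).
def sample(n, do_x=None):
    z = rng.normal(size=n)
    x = 0.8 * z + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 1.5 * x + 0.5 * z + rng.normal(size=n)
    return x, y

# Under do(X=1), Z no longer influences X, so E[Y | do(X=1)] = 1.5;
# naive conditioning on X=1 in observational data would mix in Z's effect.
_, y_int = sample(100_000, do_x=1.0)
print(y_int.mean())   # ~1.5
```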
The inductive biases of graph representation learning algorithms are often encoded in the background geometry of their embedding space. In this paper, we show that general directed graphs can be effectively represented by an embedding model that combines three components: a pseudo-Riemannian metric structure, a non-trivial global topology, and a unique likelihood function that explicitly incorporates a preferred direction in embedding space. We demonstrate the representational capabilities of this method by applying it to the task of link prediction on a series of synthetic and real directed graphs from natural language applications and biology. In particular, we show that low-dimensional cylindrical Minkowski and anti-de Sitter spacetimes can produce equal or better graph representations than curved Riemannian manifolds of higher dimensions.
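A minimal numerical sketch of the Minkowski ingredient (hypothetical scoring; the paper uses a smooth likelihood and a cylindrical topology, both omitted here): the indefinite metric yields a squared interval whose sign, together with the time coordinate, encodes a preferred direction for edges:

```python
import numpy as np

def minkowski_interval(u, v):
    """Squared interval in (1+d)-dimensional Minkowski space with signature
    (-, +, ..., +); coordinate 0 is the time axis, the preferred direction."""
    d = u - v
    return -d[0] ** 2 + np.dot(d[1:], d[1:])

def edge_indicator(u, v):
    """Toy directed-edge score: 1 if the pair is timelike-separated (negative
    interval) with v in u's future (positive time difference), else 0."""
    return float(minkowski_interval(u, v) < 0 and v[0] > u[0])

u = np.array([0.0, 0.1, 0.0])
v = np.array([1.0, 0.3, 0.2])
print(minkowski_interval(u, v), edge_indicator(u, v))   # -0.92 (timelike), 1.0
```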
Federated Learning (FL) is a decentralized machine-learning paradigm in which a global server iteratively averages the model parameters of local users without accessing their data. User heterogeneity has imposed significant challenges to FL, as it can incur drifted global models that are slow to converge. Knowledge distillation has recently emerged to tackle this issue by refining the server model using aggregated knowledge from heterogeneous users, rather than directly averaging their model parameters. This approach, however, depends on a proxy dataset, making it impractical unless such a prerequisite is satisfied. Moreover, the ensemble knowledge is not fully utilized to guide local model learning, which may in turn affect the quality of the aggregated model. Inspired by the prior art, we propose a data-free knowledge distillation approach to address heterogeneous FL: the server learns a lightweight generator to ensemble user information in a data-free manner, which is then broadcast to users to regulate local training, using the learned knowledge as an inductive bias. Empirical studies supported by theoretical implications show that our approach facilitates FL with better generalization performance using fewer communication rounds, compared with state-of-the-art methods.
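A hedged sketch of how a server-learned generator could regularize local training (all shapes, module names, and the label-conditioning scheme here are assumptions for illustration, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: the broadcast generator maps (label, noise) -> latent
# features; each user's model splits into a feature extractor and a classifier.
latent_dim, n_classes, noise_dim = 32, 10, 16
generator = torch.nn.Sequential(torch.nn.Linear(n_classes + noise_dim, 64),
                                torch.nn.ReLU(), torch.nn.Linear(64, latent_dim))
classifier = torch.nn.Linear(latent_dim, n_classes)   # user-side predictor head

def generator_regularizer(batch=64):
    """Local-training regularizer: the classifier should predict the label the
    (server-learned, broadcast) generator was conditioned on -- the generator's
    ensembled knowledge acts as an inductive bias."""
    y = torch.randint(0, n_classes, (batch,))
    z = torch.cat([F.one_hot(y, n_classes).float(),
                   torch.randn(batch, noise_dim)], dim=1)
    feats = generator(z)
    return F.cross_entropy(classifier(feats), y)

# During each user's local step: loss = CE(private batch) + lam * generator_regularizer()
print(generator_regularizer())
```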
Cold-start problems are long-standing challenges for practical recommendation. Most existing recommendation algorithms rely on extensive observed data and are brittle in recommendation scenarios with few interactions. This paper addresses such problems using few-shot learning and meta-learning. Our approach is based on the insight that generalizing well from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. To accomplish this, we combine scenario-specific learning with model-agnostic sequential meta-learning and unify them into an integrated end-to-end framework, namely the Scenario-specific Sequential Meta learner (or $s^2$Meta). In this way, our meta-learner produces a generic initial model by aggregating contextual information from a variety of prediction tasks, while effectively adapting to specific tasks by leveraging learning-to-learn knowledge. Extensive experiments on various real-world datasets demonstrate that our proposed model achieves significant gains over state-of-the-art methods for cold-start problems in online recommendation. The model has been deployed in the Guess You Like section on the front page of Mobile Taobao.
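As a generic illustration of the initialize-then-adapt recipe (a plain gradient inner loop stands in for the learned adaptation strategy of $s^2$Meta; all names and shapes are hypothetical):

```python
import torch

torch.manual_seed(0)
theta = torch.nn.Linear(8, 1)                       # meta-learned generic initialization

def adapt(model, x_support, y_support, steps=3, lr=0.1):
    """Few-shot adaptation of the generic init to a new scenario's examples."""
    fast = torch.nn.Linear(8, 1)
    fast.load_state_dict(model.state_dict())        # start from the shared init
    opt = torch.optim.SGD(fast.parameters(), lr=lr)
    for _ in range(steps):                          # a few inner-loop updates
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(fast(x_support), y_support)
        loss.backward()
        opt.step()
    return fast

x, y = torch.randn(5, 8), torch.randn(5, 1)         # a new scenario's few interactions
scenario_model = adapt(theta, x, y)
print(torch.nn.functional.mse_loss(scenario_model(x), y).item())
```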