Graph exploration is one of the fundamental tasks performed by a mobile agent in a graph. An $n$-node graph has unlabeled nodes, and all ports at any node of degree $d$ are arbitrarily numbered $0,\dots, d-1$. A mobile agent, initially situated at some starting node $v$, has to visit all nodes of the graph and stop. In the absence of any initial knowledge of the graph the task of deterministic exploration is often impossible. On the other hand, for some families of graphs it is possible to design deterministic exploration algorithms working for any graph of the family. We call such families of graphs {\em explorable}. Examples of explorable families are all finite families of graphs, as well as the family of all trees. In this paper we study the problem of which families of graphs are explorable. We characterize all such families, and then ask the question whether there exists a universal deterministic algorithm that, given an explorable family of graphs, explores any graph of this family, without knowing which graph of the family is being explored. The answer to this question turns out to depend on how the explorable family is given to the hypothetical universal algorithm. If the algorithm can get the answer to any yes/no question about the family, then such a universal algorithm can be constructed. If, on the other hand, the algorithm can be only given an algorithmic description of the input explorable family, then such a universal deterministic algorithm does not exist.
We consider the problem of learning functions in the $\mathcal{F}_{p,\pi}$ and Barron spaces, which are natural function spaces that arise in the high-dimensional analysis of random feature models (RFMs) and two-layer neural networks. Through a duality analysis, we reveal that the approximation and estimation of these spaces can be considered equivalent in a certain sense. This enables us to focus on the easier problem of approximation and estimation when studying the generalization of both models. The dual equivalence is established by defining an information-based complexity that can effectively control estimation errors. Additionally, we demonstrate the flexibility of our duality framework through comprehensive analyses of two concrete applications. The first application is to study learning functions in $\mathcal{F}_{p,\pi}$ with RFMs. We prove that the learning does not suffer from the curse of dimensionality as long as $p>1$, implying RFMs can work beyond the kernel regime. Our analysis extends existing results [CMM21] to the noisy case and removes the requirement of overparameterization. The second application is to investigate the learnability of reproducing kernel Hilbert space (RKHS) under the $L^\infty$ metric. We derive both lower and upper bounds of the minimax estimation error by using the spectrum of the associated kernel. We then apply these bounds to dot-product kernels and analyze how they scale with the input dimension. Our results suggest that learning with ReLU (random) features is generally intractable in terms of reaching high uniform accuracy.
The study of partial differential equations (PDE) through the framework of deep learning emerged a few years ago leading to the impressive approximations of simple dynamics. Graph neural networks (GNN) turned out to be very useful in those tasks by allowing the treatment of unstructured data often encountered in the field of numerical resolutions of PDE. However, the resolutions of harder PDE such as Navier-Stokes equations are still a challenging task and most of the work done on the latter concentrate either on simulating the flow around simple geometries or on qualitative results that looks physical for design purpose. In this study, we try to leverage the work done on deep learning for PDE and GNN by proposing an adaptation of a known architecture in order to tackle the task of approximating the solution of the two-dimensional steady-state incompressible Navier-Stokes equations over different airfoil geometries. In addition to that, we test our model not only on its performance over the volume but also on its performance to approximate surface quantities such as the wall shear stress or the isostatic pressure leading to the inference of global coefficients such as the lift and the drag of our airfoil in order to allow design exploration. This work takes place in a longer project that aims to approximate three dimensional steady-state solutions over industrial geometries.
Minimum Spanning Trees are a well-studied subset of graph problems. While classical algorithms have existed to solve these problems for decades, new variations and application areas are constantly being discovered. When dealing with large graph problems, however, memory constraints can often be limiting, especially when using these classical methods in memory restricted environments. In this work, we propose an augmentation of Prim's algorithm that can be empirically shown to solve MST problems with a reduction in auxiliary memory usage of over 90%, and a margin of error of less than 0.3%.
Given a set of probability measures $\mathcal{P}$ representing an agent's knowledge on the elements of a sigma-algebra $\mathcal{F}$, we can compute upper and lower bounds for the probability of any event $A\in\mathcal{F}$ of interest. A procedure generating a new assessment of beliefs is said to constrict $A$ if the bounds on the probability of $A$ after the procedure are contained in those before the procedure. It is well documented that (generalized) Bayes' updating does not allow for constriction, for all $A\in\mathcal{F}$. In this work, we show that constriction can take place with and without evidence being observed, and we characterize these possibilities.
Unit testing plays an essential role in detecting bugs in functionally-discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although the traditional techniques are able to generate tests with reasonable coverage, they are shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the large potential of large language models (LLMs) in unit test generation. By being pre-trained on a massive developer-written code corpus, the models are capable of generating more human-like and meaningful test code. \chatgpt{}, the latest LLM that further incorporates instruction tuning and reinforcement learning, has exhibited outstanding performance in various domains. To date, it still remains unclear how effective ChatGPT is in unit test generation. In this work, we perform the first empirical study to evaluate ChatGPT 's capability of unit test generation. In particular, we conduct both a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. We find that the tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures (mostly caused by incorrect assertions); but the passing tests generated by ChatGPT almost resemble manually-written tests by achieving comparable coverage, readability, and even sometimes developers' preference. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved.
A minimum storage regenerating (MSR) subspace family of $\mathbb{F}_q^{2m}$ is a set $\mathcal{S}$ of $m$-spaces in $\mathbb{F}_q^{2m}$ such that for any $m$-space $S$ in $\mathcal{S}$ there exists an element in $\mathrm{PGL}(2m, q)$ which maps $S$ to itself and fixes $\mathcal{S} \setminus \{ S \}$ pointwise. We show that an MSR subspace family of $2$-spaces in $\mathbb{F}_q^4$ has at most size $6$ with equality if and only if it is a particular subset of a Segre variety. This implies that an $(n, n-2, 4)$-MSR code has $n \leq 9$.
We study the basic statistical problem of testing whether normally distributed $n$-dimensional data has been truncated, i.e. altered by only retaining points that lie in some unknown truncation set $S \subseteq \mathbb{R}^n$. As our main algorithmic results, (1) We give a computationally efficient $O(n)$-sample algorithm that can distinguish the standard normal distribution $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary convex set $S$. (2) We give a different computationally efficient $O(n)$-sample algorithm that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary mixture of symmetric convex sets. These results stand in sharp contrast with known results for learning or testing convex bodies with respect to the normal distribution or learning convex-truncated normal distributions, where state-of-the-art algorithms require essentially $n^{\sqrt{n}}$ samples. An easy argument shows that no finite number of samples suffices to distinguish $N(0,I_n)$ from an unknown and arbitrary mixture of general (not necessarily symmetric) convex sets, so no common generalization of results (1) and (2) above is possible. We also prove that any algorithm (computationally efficient or otherwise) that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown symmetric convex set must use $\Omega(n)$ samples. This shows that the sample complexity of each of our algorithms is optimal up to a constant factor.
We consider the problem of explaining the predictions of graph neural networks (GNNs), which otherwise are considered as black boxes. Existing methods invariably focus on explaining the importance of graph nodes or edges but ignore the substructures of graphs, which are more intuitive and human-intelligible. In this work, we propose a novel method, known as SubgraphX, to explain GNNs by identifying important subgraphs. Given a trained GNN model and an input graph, our SubgraphX explains its predictions by efficiently exploring different subgraphs with Monte Carlo tree search. To make the tree search more effective, we propose to use Shapley values as a measure of subgraph importance, which can also capture the interactions among different subgraphs. To expedite computations, we propose efficient approximation schemes to compute Shapley values for graph data. Our work represents the first attempt to explain GNNs via identifying subgraphs explicitly and directly. Experimental results show that our SubgraphX achieves significantly improved explanations, while keeping computations at a reasonable level.
Exploration-exploitation is a powerful and practical tool in multi-agent learning (MAL), however, its effects are far from understood. To make progress in this direction, we study a smooth analogue of Q-learning. We start by showing that our learning model has strong theoretical justification as an optimal model for studying exploration-exploitation. Specifically, we prove that smooth Q-learning has bounded regret in arbitrary games for a cost model that explicitly captures the balance between game and exploration costs and that it always converges to the set of quantal-response equilibria (QRE), the standard solution concept for games under bounded rationality, in weighted potential games with heterogeneous learning agents. In our main task, we then turn to measure the effect of exploration in collective system performance. We characterize the geometry of the QRE surface in low-dimensional MAL systems and link our findings with catastrophe (bifurcation) theory. In particular, as the exploration hyperparameter evolves over-time, the system undergoes phase transitions where the number and stability of equilibria can change radically given an infinitesimal change to the exploration parameter. Based on this, we provide a formal theoretical treatment of how tuning the exploration parameter can provably lead to equilibrium selection with both positive as well as negative (and potentially unbounded) effects to system performance.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.