Many major works in social science employ matching to draw causal conclusions, but different matches on the same data may produce different treatment effect estimates, even when they achieve similar balance or minimize the same loss function. We discuss the reasons for and consequences of this problem. We present evidence of it by replicating ten papers that use matching, finding that popular matching algorithms produce inconsistent results. We introduce Matching Bounds: a finite-sample, nonstochastic method that lets analysts determine whether their data admit a matched sample that attains the same levels of balance and overall match quality yet produces different results. We apply Matching Bounds to replications of two studies and show that in one case the results are robust to this issue and in the other they are not.
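To make the phenomenon concrete, the sketch below is an illustration only (not the authors' Matching Bounds procedure): on the same synthetic data, 1:1 nearest-neighbour matching under two different distance metrics can produce noticeably different ATT estimates while both achieve comparable covariate balance. The data-generating process, sample size, and use of the true propensity score are all assumptions made for the example.

```python
# Illustration only: two 1:1 nearest-neighbour matches (with replacement) on the
# same synthetic data, under different distance metrics, can yield different ATT
# estimates with similar covariate balance.
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))                          # covariates
ps = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))    # true propensity score (known here for simplicity)
T = rng.binomial(1, ps)                              # treatment indicator
Y = X @ np.array([1.0, -1.0, 0.5]) + 2.0 * T + rng.normal(size=n)

treated, control = np.where(T == 1)[0], np.where(T == 0)[0]

def att_after_matching(dist):
    """Match each treated unit to its nearest control under `dist`, with replacement."""
    matches = np.array([control[np.argmin([dist(t, c) for c in control])] for t in treated])
    att = np.mean(Y[treated] - Y[matches])
    # standardized mean differences of covariates after matching (rough balance check)
    smd = np.abs(X[treated].mean(0) - X[matches].mean(0)) / X.std(0)
    return att, smd

mahal_inv = np.linalg.inv(np.cov(X.T))
att_m, smd_m = att_after_matching(lambda t, c: (X[t] - X[c]) @ mahal_inv @ (X[t] - X[c]))
att_p, smd_p = att_after_matching(lambda t, c: abs(ps[t] - ps[c]))

print(f"Mahalanobis matching:      ATT = {att_m:.3f}, max SMD = {smd_m.max():.3f}")
print(f"Propensity-score matching: ATT = {att_p:.3f}, max SMD = {smd_p.max():.3f}")
```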
We propose an online learning algorithm for a class of machine learning models under a separable stochastic approximation framework. The essence of our idea lies in the observation that certain parameters in these models are easier to optimize than others. In this paper, we focus on models in which some parameters enter linearly, a structure that is common in machine learning. In one routine of the proposed algorithm, the linear parameters are updated by the recursive least squares (RLS) algorithm, which is equivalent to a stochastic Newton method; then, based on the updated linear parameters, the nonlinear parameters are updated by stochastic gradient descent (SGD). The proposed algorithm can be understood as a stochastic approximation version of block coordinate gradient descent in which one block of parameters is updated by a second-order stochastic method while the other block is updated by a first-order method. Global convergence of the proposed online algorithm for non-convex cases is established in terms of the expected violation of a first-order optimality condition. Numerical experiments show that the proposed method accelerates convergence significantly and produces more robust training and test performance than other popular learning algorithms. Moreover, our algorithm is less sensitive to the learning rate and outperforms the recently proposed slimTrain algorithm. The code has been uploaded to GitHub for validation.
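The sketch below is one reading of the described scheme (not the authors' released code): a model y ~ w^T tanh(Theta x) whose linear weights w are updated by RLS and whose nonlinear weights Theta are then updated by an SGD step using the refreshed w. The architecture, dimensions, learning rate, and data stream are all assumptions made for illustration.

```python
# Hedged sketch: alternate an RLS update of the linear block with an SGD update
# of the nonlinear block, per streaming sample.
import numpy as np

rng = np.random.default_rng(1)
d, h = 5, 16                                 # input and hidden dimensions (illustrative)
Theta = rng.normal(scale=0.5, size=(h, d))   # nonlinear parameters
w = np.zeros(h)                              # linear parameters
P = np.eye(h) * 100.0                        # RLS inverse-Gram estimate
lr = 1e-2                                    # SGD step size for the nonlinear block

def stream(n):                               # hypothetical data stream
    for _ in range(n):
        x = rng.normal(size=d)
        yield x, np.sin(x[0]) + 0.1 * rng.normal()

for x, y in stream(5000):
    phi = np.tanh(Theta @ x)                 # hidden features from the nonlinear block
    # --- RLS step for the linear parameters (a stochastic Newton step) ---
    k = P @ phi / (1.0 + phi @ P @ phi)      # gain vector
    w = w + k * (y - w @ phi)
    P = P - np.outer(k, phi @ P)
    # --- SGD step for the nonlinear parameters, using the updated w ---
    e = y - w @ phi                          # residual with the refreshed linear weights
    grad_Theta = -e * np.outer(w * (1.0 - phi ** 2), x)
    Theta = Theta - lr * grad_Theta

x_test, y_test = next(stream(1))
print("squared error on a fresh sample:", (y_test - w @ np.tanh(Theta @ x_test)) ** 2)
```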
Users online tend to join polarized groups of like-minded peers around shared narratives, forming echo chambers. The echo chamber effect and opinion polarization may be driven by several factors, including human biases in information consumption and the personalized recommendations produced by feed algorithms. Until now, studies have mainly used opinion dynamics models to explore the mechanisms behind the emergence of polarization and echo chambers, with the objective of determining the key factors contributing to these phenomena and identifying their interplay. However, the validation of model predictions against empirical data still suffers from two main drawbacks: a lack of systematicity and a reliance on qualitative analysis. In our work, we bridge this gap by providing a method to numerically compare the opinion distributions obtained from simulations with those measured on social media. To validate this procedure, we develop an opinion dynamics model that takes into account the interplay between human and algorithmic factors. We test our model against empirical data from diverse social media platforms and benchmark it against two state-of-the-art models. To further enhance our understanding of social media platforms, we provide a synthetic description of their characteristics in terms of the model's parameter space. This representation has the potential to facilitate the refinement of feed algorithms and thus mitigate the detrimental effects of extreme polarization on online discourse.
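As a rough sketch of what a numerical comparison between simulated and empirical opinion distributions might look like, the code below simulates a generic bounded-confidence (Deffuant-style) model, which is not the authors' model, and compares its opinion distribution to a placeholder "empirical" sample via the 1-D Wasserstein distance. The model, parameters, and distance metric are all assumptions.

```python
# Hedged sketch: simulate a generic bounded-confidence opinion model and compare
# its opinion distribution to an empirical sample numerically.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)

def simulate_opinions(n=2000, steps=50000, eps=0.3, mu=0.5):
    """Deffuant-style dynamics: random pairs move closer if their opinions differ by less than eps."""
    x = rng.uniform(-1, 1, size=n)
    for _ in range(steps):
        i, j = rng.integers(n, size=2)
        if abs(x[i] - x[j]) < eps:
            x[i], x[j] = x[i] + mu * (x[j] - x[i]), x[j] + mu * (x[i] - x[j])
    return x

simulated = simulate_opinions()
# placeholder for opinions estimated from social-media data (e.g. stance scores in [-1, 1])
empirical = np.clip(rng.normal(loc=0.6, scale=0.2, size=1500), -1, 1)

print("Wasserstein distance between simulated and empirical opinions:",
      wasserstein_distance(simulated, empirical))
```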
We study various novel complexity measures for two-sided matching mechanisms, applied to the two canonical strategyproof matching mechanisms, Deferred Acceptance (DA) and Top Trading Cycles (TTC). Our metrics are designed to capture the complexity of various structural (rather than computational) concerns, in particular ones of recent interest from economics. We consider a canonical, flexible approach to formalizing our questions: define a protocol or data structure performing some task, and bound the number of bits that it requires. Our results apply this approach to four questions of general interest; for matching applicants to institutions, we ask: (1) How can one applicant affect the outcome matching? (2) How can one applicant affect another applicant's set of options? (3) How can the outcome matching be represented / communicated? (4) How can the outcome matching be verified? We prove that DA and TTC are comparable in complexity under questions (1) and (4), giving new tight lower-bound constructions and new verification protocols. Under questions (2) and (3), we prove that TTC is more complex than DA. For question (2), we prove this by giving a new characterization of which institutions are removed from each applicant's set of options when a new applicant is added in DA; this characterization may be of independent interest. For question (3), our result gives lower bounds proving the tightness of existing constructions for TTC. This shows that the relationship between the matching and the priorities is more complex in TTC than in DA, formalizing previous intuitions from the economics literature. Together, our results complement recent work that models the complexity of observing strategyproofness and shows that DA is more complex than TTC. This emphasizes that diverse considerations must factor into gauging the complexity of matching mechanisms.
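For readers less familiar with the mechanisms being compared, the sketch below is a standard applicant-proposing Deferred Acceptance routine (Gale-Shapley) run on a toy instance; it is background only. The paper's contribution concerns protocols and bit-complexity bounds about such outcomes, not this computation, and all names in the sketch are illustrative.

```python
# Background sketch: applicant-proposing Deferred Acceptance on a toy instance.
from collections import deque

def deferred_acceptance(applicant_prefs, institution_prefs, capacities):
    """applicant_prefs[a]: list of institutions, most preferred first.
    institution_prefs[i]: dict applicant -> rank (lower is better)."""
    next_choice = {a: 0 for a in applicant_prefs}          # next institution to propose to
    held = {i: [] for i in institution_prefs}              # tentatively accepted applicants
    free = deque(applicant_prefs)                          # applicants who may still propose
    while free:
        a = free.popleft()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                       # a has exhausted their list
        i = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        held[i].append(a)
        held[i].sort(key=lambda x: institution_prefs[i][x])
        if len(held[i]) > capacities[i]:
            free.append(held[i].pop())                     # reject the lowest-ranked applicant
    return {i: set(held[i]) for i in held}

# toy instance
print(deferred_acceptance(
    {"a1": ["X", "Y"], "a2": ["X", "Y"], "a3": ["Y", "X"]},
    {"X": {"a1": 0, "a2": 1, "a3": 2}, "Y": {"a2": 0, "a3": 1, "a1": 2}},
    {"X": 1, "Y": 1},
))
```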
This paper considers the problem of inference in cluster randomized trials where treatment status is determined according to a "matched pairs" design. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the level of the cluster; by a "matched pairs" design we mean that a sample of clusters is paired according to baseline, cluster-level covariates and, within each pair, one cluster is selected at random for treatment. We study the large-sample behavior of a weighted difference-in-means estimator and derive two distinct sets of results depending on whether the matching procedure matches on cluster size. We then propose a single variance estimator that is consistent in either regime. Combining these results establishes the asymptotic exactness of tests based on these estimators. Next, we consider the properties of two common testing procedures based on t-tests constructed from linear regressions and argue that both are generally conservative in our framework. We additionally study the behavior of a randomization test that permutes the treatment status for clusters within pairs, and establish its finite-sample and asymptotic validity for testing specific null hypotheses. Finally, we propose a covariate-adjusted estimator that adjusts for additional baseline covariates not used for treatment assignment, and establish conditions under which such an estimator leads to improvements in precision. A simulation study confirms the practical relevance of our theoretical results.
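The sketch below illustrates the design and one natural size-weighted difference-in-means estimator; the paper's exact weighting and variance estimator may differ, and the pairing rule, data-generating process, and cluster sizes here are assumptions made for illustration.

```python
# Hedged sketch: pair clusters on a baseline covariate, randomize within pairs,
# then compute a size-weighted difference in means of cluster averages.
import numpy as np

rng = np.random.default_rng(3)
G = 40                                             # number of clusters (even)
sizes = rng.integers(20, 80, size=G)               # cluster sizes
baseline = rng.normal(size=G)                      # cluster-level matching covariate

# "matched pairs": sort clusters by the baseline covariate, pair adjacent ones,
# and randomize treatment within each pair
order = np.argsort(baseline)
D = np.zeros(G, dtype=int)
for p in range(0, G, 2):
    D[order[p + rng.integers(2)]] = 1

# cluster-average outcomes with a true treatment effect of 1.0
cluster_means = np.array([
    baseline[g] + 1.0 * D[g] + rng.normal(scale=0.5) for g in range(G)
])

# size-weighted difference in means of cluster averages
w = sizes / sizes.sum()
theta_hat = (np.sum(w * D * cluster_means) / np.sum(w * D)
             - np.sum(w * (1 - D) * cluster_means) / np.sum(w * (1 - D)))
print("estimated treatment effect:", round(theta_hat, 3))
```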
In the recent literature on estimating heterogeneous treatment effects, each proposed method makes its own set of restrictive assumptions about the intervention's effects and about which subpopulations to explicitly estimate. Moreover, the majority of the literature provides no mechanism, beyond manual inspection, for identifying which subpopulations are the most affected, and offers little guarantee on the correctness of the identified subpopulations. We therefore propose Treatment Effect Subset Scan (TESS), a new method for discovering which subpopulation in a randomized experiment is most significantly affected by a treatment. We frame this challenge as a pattern detection problem in which we efficiently maximize a nonparametric scan statistic (a measure of the conditional quantile treatment effect) over subpopulations, thereby identifying the subpopulation that experiences the largest distributional change as a result of the intervention, while making minimal assumptions about the intervention's effects or the underlying data-generating process. In addition to the algorithm, we show that under the sharp null hypothesis of no treatment effect the asymptotic Type I and Type II errors can be controlled, and we provide sufficient conditions for detection consistency, i.e., exact identification of the affected subpopulation. Finally, we validate the efficacy of the method by discovering heterogeneous treatment effects in simulations and in real-world data from a well-known program evaluation study.
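As a rough illustration of scanning subpopulations for a distributional treatment effect (this is not the TESS scan statistic or its efficient maximization), the sketch below exhaustively scans subgroups defined by the values of one categorical covariate and scores each by a normalized rank-based shift between treated and control outcomes. All data and the scoring rule are assumptions.

```python
# Rough illustration: exhaustive scan over subgroups of one categorical covariate,
# scored by how strongly treated outcomes are shifted above control outcomes.
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(4)
n = 2000
group = rng.integers(0, 4, size=n)        # categorical covariate with 4 levels
T = rng.binomial(1, 0.5, size=n)          # randomized treatment
# treatment shifts outcomes only in subpopulation {0, 1}
Y = rng.normal(size=n) + 0.8 * T * np.isin(group, [0, 1])

best = None
for r in range(1, 5):
    for subset in combinations(range(4), r):
        mask = np.isin(group, subset)
        yt, yc = Y[mask & (T == 1)], Y[mask & (T == 0)]
        if len(yt) == 0 or len(yc) == 0:
            continue
        stat, _ = mannwhitneyu(yt, yc, alternative="greater")
        score = stat / (len(yt) * len(yc))     # normalized rank-based shift
        if best is None or score > best[0]:
            best = (score, subset)

print("highest-scoring subpopulation:", best[1], "score:", round(best[0], 3))
```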
Letter-to-letter transducers are a standard formalism for modeling reactive systems. Often, two transducers that model similar systems differ only locally from one another, behaving similarly up to permutations of the input and output letters within "rounds". In this work, we introduce and study notions of simulation by rounds and equivalence by rounds of transducers. In our setting, words are partitioned into consecutive subwords of a fixed length $k$, called rounds. Then, a transducer $\mathcal{T}_1$ is $k$-round simulated by transducer $\mathcal{T}_2$ if, intuitively, for every input word $x$, we can permute the letters within each round in $x$ such that the output of $\mathcal{T}_2$ on the permuted word is itself a permutation of the output of $\mathcal{T}_1$ on $x$. Finally, two transducers are $k$-round equivalent if they $k$-round simulate each other. We solve two main decision problems, namely whether $\mathcal{T}_2$ $k$-round simulates $\mathcal{T}_1$ (1) when $k$ is given as input, and (2) for an existentially quantified $k$. We demonstrate the usefulness of the definitions by applying them to process symmetry: a setting in which a permutation of the identities of processes in a multi-process system naturally gives rise to two transducers, whose $k$-round equivalence corresponds to stability against such permutations.
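To make the definition concrete, the sketch below brute-forces a bounded check of $k$-round simulation for tiny deterministic transducers, under my reading that outputs are compared as multisets round by round; it only examines inputs of up to a few rounds and is in no way the paper's decision procedure. The transducer encoding and the toy example are assumptions.

```python
# Bounded brute-force check of k-round simulation for tiny deterministic
# letter-to-letter transducers (illustration only, inputs of up to max_rounds rounds).
from itertools import product, permutations
from collections import Counter

def run(transducer, word, q0):
    """transducer: dict (state, letter) -> (next_state, output_letter)."""
    q, out = q0, []
    for a in word:
        q, b = transducer[(q, a)]
        out.append(b)
    return out

def k_round_simulates(T2, T1, q0, alphabet, k, max_rounds=2):
    for rounds in range(1, max_rounds + 1):
        for x in product(alphabet, repeat=k * rounds):
            out1 = run(T1, x, q0)
            # try every per-round permutation of the input word x
            round_perms = [permutations(x[r * k:(r + 1) * k]) for r in range(rounds)]
            found = False
            for choice in product(*round_perms):
                y = [a for block in choice for a in block]
                out2 = run(T2, y, q0)
                if all(Counter(out2[r * k:(r + 1) * k]) == Counter(out1[r * k:(r + 1) * k])
                       for r in range(rounds)):
                    found = True
                    break
            if not found:
                return False
    return True

# toy example: T1 outputs "abab..." and T2 outputs "baba..." regardless of input;
# they are not equivalent, but each 2-round simulates the other.
T1 = {(0, "a"): (1, "a"), (0, "b"): (1, "a"), (1, "a"): (0, "b"), (1, "b"): (0, "b")}
T2 = {(0, "a"): (1, "b"), (0, "b"): (1, "b"), (1, "a"): (0, "a"), (1, "b"): (0, "a")}
print(k_round_simulates(T2, T1, 0, "ab", k=2))   # True
print(k_round_simulates(T2, T1, 0, "ab", k=1))   # False: rounds of length 1 allow no permutation
```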
We observe that existing powerful goodness-of-fit tests are all based on sorted uniforms and, consequently, can suffer from the confounded effects of the differing locations and varying signal frequencies of the deviations of the distributions under the alternative hypothesis from those under the null. This paper proposes circularly symmetric tests obtained by circularizing reweighted Anderson-Darling tests, with a focus on the circularized versions of the Anderson-Darling and Zhang test statistics. Two specific types of circularization are considered: one takes the average of the corresponding so-called scan test statistics, the other their maximum. To a certain extent, this circularization technique effectively eliminates the location effect and allows the weights to focus on the various signal frequencies. A limited but arguably convincing simulation study of finite-sample performance demonstrates that the circularized Zhang method outperforms the circularized Anderson-Darling method and that the circularized tests outperform their parent methods. Large-sample theoretical results are also obtained for the average type of circularization. The results show that both the circularized Anderson-Darling and the circularized Zhang statistics have asymptotic distributions that are weighted sums of infinitely many independent squared standard normal random variables. In addition, the associated kernel matrices and functions are circulant, so the asymptotic approximations can be computed efficiently via the fast Fourier transform.
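The sketch below shows one plausible way to make a goodness-of-fit statistic circularly symmetric; the paper's exact construction of the scan statistics and weights may differ. It circularly shifts the uniformized sample, recomputes the Anderson-Darling statistic at each shift, and aggregates by the average or the maximum; the shift grid, sample sizes, and alternative are assumptions.

```python
# Hedged sketch: circularize the Anderson-Darling statistic by averaging (or
# maximizing) over circular shifts of the uniformized sample.
import numpy as np

def anderson_darling(u):
    """Anderson-Darling statistic for a sample u assumed Uniform(0,1) under the null."""
    u = np.sort(np.clip(u, 1e-12, 1 - 1e-12))
    n = len(u)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1])))

def circularized_ad(u, n_shifts=64, agg=np.mean):
    shifts = np.arange(n_shifts) / n_shifts
    stats = [anderson_darling((u + s) % 1.0) for s in shifts]   # shifting preserves uniformity under the null
    return agg(stats)

rng = np.random.default_rng(5)
u_null = rng.uniform(size=200)
# alternative: a localized bump in the interior of the unit interval
u_alt = np.concatenate([rng.uniform(size=170), rng.normal(0.5, 0.01, size=30) % 1.0])

for name, u in [("null", u_null), ("bump alternative", u_alt)]:
    print(name, "  avg-circularized AD:", round(circularized_ad(u), 2),
          "  max-circularized AD:", round(circularized_ad(u, agg=np.max), 2))
```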
Chatterjee, Gmyr, and Pandurangan [PODC 2020] recently introduced the notion of awake complexity for distributed algorithms, which measures the number of rounds in which a node is awake; in all other rounds the node is sleeping and performs no computation or communication. Measuring the number of awake rounds can be significant in many settings of distributed computing, e.g., in sensor networks where energy consumption is a concern. In that paper, Chatterjee et al. provide an elegant randomized algorithm for the Maximal Independent Set (MIS) problem that achieves an $O(1)$ node-averaged awake complexity; that is, the average awake time among the nodes is $O(1)$ rounds. However, to achieve this, the algorithm sacrifices the more standard worst-case round complexity, increasing it from the well-known $O(\log n)$ bound for MIS, due to Luby [STOC'85], to $O(\log^{3.41} n)$ rounds. Our first contribution is a simple randomized distributed MIS algorithm that, with high probability, has $O(1)$ node-averaged awake complexity and $O(\log n)$ worst-case round complexity. Our second, and more technical, contribution is to give algorithms with the same $O(1)$ node-averaged awake complexity and $O(\log n)$ worst-case round complexity for $(1+\varepsilon)$-approximation of maximum matching and $(2+\varepsilon)$-approximation of minimum vertex cover, where $\varepsilon$ denotes an arbitrarily small positive constant.
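For context, the sketch below is a centralized simulation of Luby's classical randomized MIS algorithm, the $O(\log n)$-round baseline referenced above; it does not attempt to model sleeping nodes or the $O(1)$ node-averaged awake complexity that is the paper's contribution. The random-graph parameters are assumptions.

```python
# Background sketch: simulate Luby's randomized MIS algorithm [Luby, STOC'85]
# on a random graph; in each round, local minima of fresh random values join
# the MIS and retire together with their neighbours.
import numpy as np

rng = np.random.default_rng(6)
n, p = 300, 0.03
adj = np.triu(rng.random((n, n)) < p, k=1)
adj = adj | adj.T                                  # symmetric, simple random graph

in_mis = np.zeros(n, dtype=bool)
active = np.ones(n, dtype=bool)
rounds = 0
while active.any():
    rounds += 1
    r = np.where(active, rng.random(n), np.inf)    # inactive nodes never compete
    # minimum random value among each node's active neighbours
    neighbour_min = np.where(adj & active[None, :], r[None, :], np.inf).min(axis=1)
    joins = active & (r < neighbour_min)           # local minima join the MIS
    in_mis |= joins
    dominated = joins | (adj & joins[None, :]).any(axis=1)
    active &= ~dominated                           # MIS nodes and their neighbours retire

# sanity checks: independence and maximality
assert not (adj & np.outer(in_mis, in_mis)).any()
assert ((adj & in_mis[None, :]).any(axis=1) | in_mis).all()
print(f"MIS of size {in_mis.sum()} found in {rounds} rounds")
```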
Evaluating treatment effects has become an important topic in many applications. However, most existing literature focuses mainly on average treatment effects. When individual effects are heavy-tailed or contain outliers, not only can the average effect be a poor summary of the treatment effects, but conventional inference for it can also be sensitive and possibly invalid due to poor large-sample approximations. In this paper we focus on quantiles of individual effects, which can be more robust measures of treatment effects in the presence of extreme individual effects. Moreover, our inference for quantiles of individual effects is purely randomization-based, avoiding any distributional assumption on the units. We first consider inference for stratified randomized experiments, extending the recent work of Caughey et al. (2021). The calculation of valid $p$-values for testing null hypotheses on quantiles of individual effects involves linear integer programming, which is generally NP-hard. To overcome this issue, we propose a greedy algorithm with a certain optimal transformation, which has much lower computational cost, still yields valid $p$-values, and is less conservative than the usual relaxation obtained by dropping the integer constraint. We then extend our approach to matched observational studies and propose a sensitivity analysis to investigate the extent to which our inference on quantiles of individual effects is robust to unmeasured confounding. Both the randomization inference and the sensitivity analysis are simultaneously valid for all quantiles of individual effects; these guarantees are essentially free lunches added to the conventional analysis assuming constant effects. Furthermore, the inference results can be easily visualized and interpreted.
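As a point of reference for the randomization-based approach, the sketch below runs a standard Fisher randomization test of a constant-effect sharp null in a completely randomized experiment; all names and data are illustrative. The paper's method instead tests nulls on quantiles of individual effects, whose $p$-values require optimizing over imputations compatible with the null (their greedy algorithm), which this sketch does not implement.

```python
# Background sketch: Fisher randomization test of the sharp null
# Y_i(1) - Y_i(0) = tau0 for all i, in a completely randomized experiment.
import numpy as np

rng = np.random.default_rng(7)
n = 200
Z = np.zeros(n, dtype=int)
Z[rng.choice(n, size=n // 2, replace=False)] = 1        # completely randomized assignment
Y = rng.standard_t(df=2, size=n) + 1.0 * Z              # heavy-tailed outcomes, true effect 1

def frt_pvalue(Y, Z, tau0, n_draws=2000):
    """Randomization p-value for the constant-effect sharp null tau0."""
    Y0 = Y - Z * tau0                                   # control potential outcomes imputed under the null
    stat = lambda z: np.mean(Y0[z == 1]) - np.mean(Y0[z == 0])
    obs = stat(Z)
    draws = [stat(rng.permutation(Z)) for _ in range(n_draws)]
    return np.mean(np.abs(draws) >= np.abs(obs))

print("p-value for constant effect 0:", frt_pvalue(Y, Z, 0.0))
print("p-value for constant effect 1:", frt_pvalue(Y, Z, 1.0))
```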
In the domain generalization literature, a common objective is to learn representations that are independent of the domain after conditioning on the class label. We show that this objective is not sufficient: there exist counterexamples in which a model fails to generalize to unseen domains even after satisfying class-conditional domain invariance. We formalize this observation through a structural causal model and show the importance of modeling within-class variations for generalization. Specifically, classes contain objects that characterize specific causal features, and domains can be interpreted as interventions on these objects that change non-causal features. We highlight an alternative condition: inputs across domains should have the same representation if they are derived from the same object. Based on this objective, we propose matching-based algorithms when base objects are observed (e.g., through data augmentation), and we approximate the objective when objects are not observed (MatchDG). Our simple matching-based algorithms are competitive with prior work in out-of-domain accuracy on the rotated MNIST, Fashion-MNIST, PACS, and Chest-Xray datasets. Our method MatchDG also recovers ground-truth object matches: on MNIST and Fashion-MNIST, the top-10 matches from MatchDG have over 50% overlap with ground-truth matches.
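The sketch below is a simplified take on the matching objective when object matches are observed via augmentation (it is not the authors' MatchDG implementation): each input and its rotated copy come from the same object in different "domains", so their representations are pulled together while a classifier is trained on both. The architecture, rotation-based domain shift, and penalty weight are assumptions.

```python
# Hedged sketch: classification loss plus a penalty matching the representations
# of same-object pairs across domains (here, an image and its 90-degree rotation).
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 1, 28, 28)                     # stand-in for image data
y = torch.randint(0, 10, (256,))                    # stand-in labels
x_rot = torch.rot90(x, k=1, dims=(2, 3))            # "domain shift": 90-degree rotation

featurizer = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())
classifier = nn.Linear(64, 10)
opt = torch.optim.Adam(list(featurizer.parameters()) + list(classifier.parameters()), lr=1e-3)
ce, lam = nn.CrossEntropyLoss(), 1.0

for step in range(100):
    phi, phi_rot = featurizer(x), featurizer(x_rot)
    loss = (ce(classifier(phi), y) + ce(classifier(phi_rot), y)
            + lam * ((phi - phi_rot) ** 2).sum(dim=1).mean())   # match same-object pairs
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", float(loss))
```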