The outbreak of the COVID-19 pandemic has had an unprecedented impact on China's labour market and has substantially changed the structure of labour supply and demand across regions. It has become critical for policy makers to understand the emerging dynamics of the post-pandemic labour market and to provide the right policies for supporting the sustainable development of regional economies. To this end, in this paper, we provide a data-driven approach to assessing and understanding the evolving dynamics of regional labour markets using large-scale online job search queries and job postings. In particular, we model the spatial-temporal patterns of labour flow and labour demand, which reflect the attractiveness of regional labour markets. Our analysis shows that regional labour markets underwent dramatic changes and displayed unusual signs of recovery during the pandemic. Specifically, labour flow intentions recovered quickly, with a trend of migration from large to small cities and from northern to southern regions. Meanwhile, due to the pandemic, the demand for blue-collar workers was substantially reduced compared to that for white-collar workers. In addition, the demand structure of blue-collar jobs shifted from manufacturing to service industries. Our findings reveal that the pandemic can have varied impacts on regions with different labour demand structures and control policies. This analysis provides timely information for both individuals and organizations confronting dynamic changes in job markets during extreme events such as pandemics. It can also assist governments in designing the right job-market policies to facilitate the sustainable development of regional economies.
Identifying latent variables and causal structures from observational data is essential to many real-world applications involving biological data, medical data, and unstructured data such as images and languages. However, this task can be highly challenging, especially when observed variables are generated by causally related latent variables and the relationships are nonlinear. In this work, we investigate the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables, and some latent variables may not have observed children. We show that the identifiability of both causal structure and latent variables can be achieved under mild assumptions: on causal structures, we allow for the existence of multiple paths between any pair of variables in the graph, which relaxes latent tree assumptions in prior work; on structural functions, we do not make parametric assumptions, thus permitting general nonlinearity and multi-dimensional continuous variables. Specifically, we first develop a basic identification criterion in the form of novel identifiability guarantees for an elementary latent variable model. Leveraging this criterion, we show that both causal structures and latent variables of the hierarchical model can be identified asymptotically by explicitly constructing an estimation procedure. To the best of our knowledge, our work is the first to establish identifiability guarantees for both causal structures and latent variables in nonlinear latent hierarchical models.
ChatGPT can improve Software Engineering (SE) research practices by offering efficient, accessible information analysis and synthesis based on natural language interactions. However, ChatGPT could bring ethical challenges, encompassing plagiarism, privacy, data security, and the risk of generating biased or potentially detrimental data. This research aims to fill this gap by elaborating on the key elements: motivators, demotivators, and ethical principles of using ChatGPT in SE research. To achieve this objective, we conducted a literature survey, identified the mentioned elements, and presented their relationships by developing a taxonomy. Further, the identified literature-based elements (motivators, demotivators, and ethical principles) were empirically evaluated through a comprehensive questionnaire-based survey involving SE researchers. Additionally, we employed the Interpretive Structural Modeling (ISM) approach to analyze the relationships between the ethical principles of using ChatGPT in SE research and to develop a level-based decision model. We further conducted a Cross-Impact Matrix Multiplication Applied to Classification (MICMAC) analysis to create a cluster-based decision model. These models aim to help SE researchers devise effective strategies for ethically integrating ChatGPT into SE research by following the identified principles, adopting the motivators, and addressing the demotivators. The findings of this study will establish a benchmark for incorporating ChatGPT services in SE research with an emphasis on ethical considerations.
Attention capitalism has generated design processes and product development decisions that prioritize platform growth over all other considerations. To the extent that limits have been placed on these incentives, interventions have primarily taken the form of content moderation. While moderation is important for what we call "acute harms," societal-scale harms -- such as negative effects on mental health and social trust -- require new forms of institutional transparency and scientific investigation, which we group under the term accountability infrastructure. This is not a new problem. In fact, there are many conceptual lessons and implementation approaches for accountability infrastructure within the history of public health. After reviewing these insights, we reinterpret the societal harms generated by technology platforms through the lens of public health. To that end, we present a novel mechanism design framework and practical measurement methods for that framework. The proposed approach is iterative and built into the product design process, and is applicable to both internally motivated (i.e., self-regulation by companies) and externally motivated (i.e., government regulation) interventions for a range of societal problems, including mental health. We aim to help shape a research agenda of principles for the design of mechanisms around problem areas on which there is broad consensus and a firm base of support. We offer constructive examples and discussion of potential implementation methods related to these topics, as well as several new data illustrations of the potential effects of exposure to online content.
Recurrent Neural Networks (RNNs) have shown great success in modeling time-dependent patterns, but there is limited research on their learned representations of latent temporal features and the emergence of these representations during training. To address this gap, we use timed automata (TA) to introduce a family of supervised learning tasks modeling behavior dependent on hidden temporal variables whose complexity is directly controllable. Building upon past studies from the perspective of dynamical systems, we train RNNs to emulate temporal flipflops, a new collection of TA that emphasizes the need for time-awareness over long-term memory. We find that these RNNs learn in phases: they quickly perfect any time-independent behavior, but they initially struggle to discover the hidden time-dependent features. In the case of periodic "time-of-day" aware automata, we show that the RNNs learn to switch between periodic orbits that encode time modulo the period of the transition rules. We subsequently apply fixed point stability analysis to monitor changes in the RNN dynamics during training, and we observe that the learning phases are separated by a bifurcation from which the periodic behavior emerges. In this way, we demonstrate how dynamical systems theory can provide insights into not only the learned representations of these models, but also the dynamics of the learning process itself. We argue that this style of analysis may provide insights into the training pathologies of recurrent architectures in contexts outside of time-awareness.
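To make the kind of fixed-point stability analysis mentioned above concrete, the sketch below locates an attracting fixed point of a vanilla RNN update and inspects the Jacobian eigenvalues there. This is only an illustration under assumptions: the cell, the random weights, and the simple iterate-to-convergence search are hypothetical and not the authors' procedure (which tracks fixed points of trained networks across training to detect bifurcations).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
# Hypothetical vanilla RNN cell with random weights (stand-in for a trained network).
W, U, b = rng.normal(0, 0.5, (n, n)), rng.normal(0, 0.5, (n, n)), rng.normal(0, 0.1, n)
x = rng.normal(0, 1.0, n)  # one input symbol held fixed, so the update is an autonomous map

def step(h):
    """One update h_{t+1} = tanh(W h_t + U x + b) with the input frozen."""
    return np.tanh(W @ h + U @ x + b)

# Iterating the map finds attracting fixed points; repelling ones would instead be
# located by minimizing the residual ||step(h) - h||^2 from many initial states.
h = rng.normal(0, 1.0, n)
for _ in range(5000):
    h = step(h)

# Numerical Jacobian of the update at the candidate fixed point.
eps = 1e-6
J = np.stack([(step(h + eps * e) - step(h - eps * e)) / (2 * eps)
              for e in np.eye(n)], axis=1)
eigs = np.linalg.eigvals(J)
print("residual ||step(h) - h|| =", np.linalg.norm(step(h) - h))
# Eigenvalues crossing the unit circle during training would signal the kind of
# bifurcation from which periodic behavior can emerge.
print("max |eigenvalue| =", np.abs(eigs).max())
```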
Leveraging an established exercise in negotiation education, we build a novel dataset for studying how the use of language shapes bilateral bargaining. Our dataset extends existing work in two ways: 1) we recruit participants via behavioral labs instead of crowdsourcing platforms and allow participants to negotiate through audio, enabling more naturalistic interactions; 2) we add a control setting where participants negotiate only through alternating, written numeric offers. Despite the two contrasting forms of communication, we find that the average agreed prices of the two treatments are identical. But when subjects can talk, fewer offers are exchanged, negotiations finish faster, the likelihood of reaching agreement rises, and the variance of prices at which subjects agree drops substantially. We further propose a taxonomy of speech acts in negotiation and enrich the dataset with annotated speech acts. We set up tasks to predict negotiation success and find that being reactive to the arguments of the other party is more advantageous than driving the negotiation.
Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain -- where the model will ultimately be used -- is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as `combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and -- notably -- can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
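As a rough illustration of the barycentric-projection interpolation scheme mentioned above (not the paper's implementation, which interpolates labeled datasets along generalized geodesics under an OT dataset distance), the sketch below couples two small synthetic point clouds with the POT library and forms an intermediate dataset at time t. The datasets, sizes, and the choice to keep source labels unchanged are illustrative assumptions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
Xs, ys = rng.normal(0.0, 1.0, (200, 2)), rng.integers(0, 2, 200)  # "source" features and labels
Xt = rng.normal(3.0, 1.0, (300, 2))                               # "target" features

a = np.full(len(Xs), 1.0 / len(Xs))   # uniform weights on source points
b = np.full(len(Xt), 1.0 / len(Xt))   # uniform weights on target points
M = ot.dist(Xs, Xt)                   # squared Euclidean cost matrix
G = ot.emd(a, b, M)                   # optimal coupling between the two point clouds

# Barycentric projection: send each source point to the G-weighted mean of its targets.
Xs_mapped = (G @ Xt) / G.sum(axis=1, keepdims=True)

def interpolate(t):
    """A dataset 'between' source and target at t in [0, 1]; labels kept from the source."""
    return (1.0 - t) * Xs + t * Xs_mapped, ys

X_half, y_half = interpolate(0.5)     # a synthetic dataset halfway along the interpolation
```

A feature-only Euclidean cost is used here for simplicity; the paper's distance between labeled datasets also accounts for label information, and its neural OT variant replaces the discrete coupling with a learned transport map.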
Background: Cannabis use disorder (CUD) is a growing public health problem. Early identification of adolescents and young adults at risk of developing CUD in the future may help stem this trend. A logistic regression model fitted using a Bayesian learning approach was developed recently to predict the risk of future CUD based on seven risk factors in adolescence and youth. A nationally representative longitudinal dataset, Add Health, was used to train the model (henceforth referred to as the Add Health model). Methods: We validated the Add Health model on two cohorts, namely, the Michigan Longitudinal Study (MLS) and the Christchurch Health and Development Study (CHDS), using longitudinal data from participants until they were approximately 30 years old (to be consistent with the training data from Add Health). If a participant was diagnosed with CUD at any age during this period, they were considered a case. We calculated the area under the curve (AUC) and the ratio of expected to observed numbers of cases (E/O). We also explored re-calibrating the model to account for differences in population prevalence. Results: The cohort sizes used for validation were 424 (53 cases) for MLS and 637 (105 cases) for CHDS. AUCs for the two cohorts were 0.66 (MLS) and 0.73 (CHDS), and the corresponding E/O ratios (after recalibration) were 0.995 and 0.999. Conclusion: The external validation of the Add Health model on two different cohorts lends confidence to the model's ability to identify adolescent or young adult cannabis users at high risk of developing CUD in later life.
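For readers unfamiliar with the two reported metrics, the minimal sketch below shows how AUC (discrimination) and the expected-to-observed case ratio E/O (calibration-in-the-large) are typically computed from predicted probabilities and binary outcomes. The toy arrays and variable names are assumptions for illustration, not the study's validation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# p_hat: model-predicted probabilities of future CUD; y: observed diagnoses (0/1).
p_hat = np.array([0.10, 0.35, 0.80, 0.22, 0.55, 0.05])
y     = np.array([0,    0,    1,    0,    1,    0])

auc = roc_auc_score(y, p_hat)      # probability a random case is ranked above a random non-case
e_over_o = p_hat.sum() / y.sum()   # expected cases / observed cases; values near 1 indicate good calibration

print(f"AUC = {auc:.2f}, E/O = {e_over_o:.2f}")
```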
The notion of curvature on graphs has recently gained traction in the networks community, with the Ollivier-Ricci curvature (ORC) in particular being used for several tasks in network analysis, such as community detection. In this work, we choose a different approach and study augmentations of the discretization of the Ricci curvature proposed by Forman (AFRC). We empirically and theoretically investigate its relation to the ORC and the un-augmented Forman-Ricci curvature. In particular, we provide evidence that the AFRC frequently gives sufficient insight into the structure of a network to be used for community detection, and therefore provides a computationally cheaper alternative to previous ORC-based methods. Our novel AFRC-based community detection algorithm is competitive with an ORC-based approach. The codebase for fast and efficient computations of AFRC and the experiments in this article will be made publicly available upon publication.
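A commonly used unweighted form of the triangle-augmented Forman-Ricci curvature of an edge is 4 - deg(u) - deg(v) + 3 * #triangles(u, v); the sketch below computes it with networkx. This exact augmentation and the example graph are assumptions for illustration, not necessarily the variant or codebase used in the article, but it conveys why the AFRC is cheap to compute and how negatively curved edges tend to bridge communities.

```python
import networkx as nx

def augmented_forman_curvature(G, u, v):
    """Triangle-augmented Forman-Ricci curvature of edge (u, v) in an unweighted graph.

    Assumed form: 4 - deg(u) - deg(v) + 3 * #triangles containing (u, v).
    """
    triangles = len(list(nx.common_neighbors(G, u, v)))
    return 4 - G.degree(u) - G.degree(v) + 3 * triangles

G = nx.karate_club_graph()
curvatures = {e: augmented_forman_curvature(G, *e) for e in G.edges()}

# Edges with strongly negative curvature tend to sit between communities, which is
# what curvature-based community detection (removing or down-weighting them) exploits.
print(sorted(curvatures.items(), key=lambda kv: kv[1])[:5])
```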
Encompassing a diverse population of developers, non-technical users, organizations, and many other stakeholders, open source software (OSS) development has expanded from its initial product-development aims into a broader social movement. Ideology, as a coherent system of ideas, offers value commitments and normative implications for any social movement, and OSS ideology does the same for the open source movement. However, the literature on open source ideology is often fragmented or lacking in empirical evidence. In this paper, we sought to develop a comprehensive empirical theory of ideologies in the open source software movement. Following a grounded theory procedure, we collected and analyzed data from 22 semi-structured interviews and 41 video recordings of Open Source Initiative (OSI) board members' public speeches. An empirical theory of OSS ideology emerged from our analysis, with six key categories: membership, norms/values, goals, activities, resources, and positions/group relations, each consisting of a number of themes and subthemes. We discussed a subset of carefully selected themes and subthemes in detail based on their theoretical significance. Through this ideological lens, we examined the implications for and insights into open source development, and shed light on future research into open source as a socio-cultural construction.
This paper focuses on the expected difference in borrowers' repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. As such, we propose another approach to constructing the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the classical and the proposed estimators in terms of their power to estimate the causal quantities. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.
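The abstract does not specify how the proposed estimators are constructed, so the sketch below is only a generic, hedged illustration of the underlying point: when the lending decision is confounded (e.g., by credit score), a naive difference in mean repayment between approved and rejected applicants is badly biased, while a confounding-adjusted estimator such as inverse-propensity weighting recovers the causal effect. All names, the data-generating process, and the IPW choice are assumptions, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
credit_score = rng.normal(0, 1, n)                        # confounder: drives both approval and repayment
p_approve = 1 / (1 + np.exp(-2 * credit_score))           # lender approves safer borrowers more often
approved = rng.binomial(1, p_approve)
repayment = 0.5 * approved + 1.0 * credit_score + rng.normal(0, 1, n)  # true effect of approval is 0.5

# Naive estimator: ignores confounding and compares raw group means.
naive = repayment[approved == 1].mean() - repayment[approved == 0].mean()

# IPW estimator: reweight by the estimated propensity of approval given the confounder.
ps = LogisticRegression().fit(credit_score[:, None], approved).predict_proba(credit_score[:, None])[:, 1]
ipw = np.mean(approved * repayment / ps) - np.mean((1 - approved) * repayment / (1 - ps))

print(f"naive: {naive:.2f}, IPW: {ipw:.2f}, truth: 0.50")  # the naive estimate is biased upward
```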