Data mining has many real-world applications in fields such as finance, telecommunications, biology, and government, and classification is one of its primary tasks. With the rise of cloud computing, users can outsource their data and access it from anywhere, offloading both storage and processing to the cloud. However, in public cloud environments, even when data is encrypted, the cloud service provider typically controls the encryption keys and can therefore access the data at any time. This makes traditional privacy-preserving classification systems inadequate. Because encrypted data on the cloud cannot be mined directly, we focus on a secure k-nearest-neighbor classification algorithm over encrypted, outsourced data. The proposed protocol preserves the confidentiality of the data, protects user queries, and conceals access patterns, allowing data mining operations to be conducted securely in the cloud and addressing the challenges posed by traditional systems in which cloud providers control the encryption keys.
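As a point of reference, the plaintext computation that such a protocol must emulate is ordinary k-nearest-neighbor classification. The sketch below shows that plaintext version only; the secure protocol described in the abstract would carry out the same distance comparisons and majority vote over encrypted records, so that the cloud learns neither the data, the query, nor which records were touched. The function name and data layout here are illustrative assumptions.

```python
import heapq

def knn_classify(points, labels, query, k=3):
    """Plaintext k-nearest-neighbor classification (illustrative only).

    A secure variant would perform the distance comparisons and the
    majority vote obliviously over encrypted data, hiding the records,
    the query, and the access pattern from the cloud.
    """
    # squared Euclidean distance from the query to every record
    dists = [(sum((a - b) ** 2 for a, b in zip(p, query)), y)
             for p, y in zip(points, labels)]
    nearest = heapq.nsmallest(k, dists)   # k closest (distance, label) pairs
    votes = {}
    for _, y in nearest:
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get)      # majority label among the k neighbors
```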
The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only partial information is shared. In this paper, we propose a novel Bayesian transfer learning method named ``CONCERT'' to allow robust partial information transfer for high-dimensional data analysis. A conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize partial similarities and integrate source information collaboratively to improve the performance on the target. In contrast to existing work, CONCERT is a one-step procedure, which achieves variable selection and information transfer simultaneously. We establish variable selection consistency, as well as estimation and prediction error bounds for CONCERT. Our theory demonstrates the covariate-specific benefit of transfer learning. To ensure that our algorithm is scalable, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and two real data applications showcase the validity and advantage of CONCERT over existing cutting-edge transfer learning methods.
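The covariate-specific prior idea can be sketched with a toy generative draw: each covariate carries its own indicator for whether the source coefficient is informative about the target one. This is a minimal illustration of the conditional spike-and-slab structure, not CONCERT itself; all variances, the inclusion probability, and the function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint_coefficients(p, q=0.3, slab_sd=1.0, spike_sd=0.01):
    """Toy conditional spike-and-slab draw (illustrative, not CONCERT itself).

    For each covariate j, an indicator gamma_j ~ Bernoulli(q) says whether
    the source coefficient is 'similar' to the target one.  Conditional on
    the target coefficient, a similar source coefficient concentrates near
    it (spike centered at the target value); otherwise it is drawn from an
    independent diffuse slab.
    """
    beta_target = rng.normal(0.0, slab_sd, size=p)
    gamma = rng.random(p) < q
    beta_source = np.where(
        gamma,
        beta_target + rng.normal(0.0, spike_sd, size=p),  # informative: close to target
        rng.normal(0.0, slab_sd, size=p),                 # non-informative: independent
    )
    return beta_target, beta_source, gamma
```

Because similarity is decided covariate by covariate, the source can help on the shared coordinates without contaminating the rest, which is the "partial information transfer" the abstract refers to.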
Automatic Program Repair (APR) endeavors to autonomously rectify issues within specific projects, issues that generally fall into three categories: bug resolution, new feature development, and feature enhancement. Despite extensive research proposing various methodologies, their efficacy in addressing real issues remains unsatisfactory. It is worth noting that engineers typically have design rationales (DR), that is, planned solutions and the underlying reasons for them, before they start patching code. In open-source projects, these DRs are frequently captured in issue logs through project management tools such as Jira. This raises a compelling question: how can we leverage the DR scattered across issue logs to efficiently enhance APR? To investigate this premise, we introduce DRCodePilot, an approach designed to augment GPT-4-Turbo's APR capabilities by incorporating DR into the prompt instruction. Furthermore, given GPT-4's constraints in fully grasping the broader project context and its occasional shortcomings in generating precise identifiers, we have devised a feedback-based self-reflective framework, in which we prompt GPT-4 to reconsider and refine its outputs by referencing a provided patch and suggested identifiers. We have established a benchmark comprising 938 issue-patch pairs sourced from two open-source repositories hosted on GitHub and Jira. Our experimental results are impressive: DRCodePilot achieves a full-match ratio 4.7x higher than that of GPT-4 used directly, and its CodeBLEU scores also exhibit promising improvements. Moreover, our findings reveal that the standalone application of DR can yield a promising increase in the full-match ratio across CodeLlama, GPT-3.5, and GPT-4 within our benchmark suite. We believe that DRCodePilot heralds a novel human-in-the-loop avenue for advancing the field of APR.
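The DR-augmented, feedback-based loop described above can be sketched schematically as follows. The prompt wording, the `llm` callable, and the stopping rule are assumptions for illustration, not the paper's exact pipeline.

```python
def self_reflective_repair(llm, issue, design_rationale, identifiers,
                           max_rounds=3, is_full_match=None):
    """Schematic of a DR-augmented, feedback-based repair loop in the spirit
    of DRCodePilot.  `llm` is any callable mapping a prompt string to a patch
    string; the template text below is an assumption, not the actual prompt.
    """
    # initial round: include the design rationale alongside the issue
    prompt = (f"Issue: {issue}\n"
              f"Design rationale: {design_rationale}\n"
              "Generate a patch resolving the issue.")
    patch = llm(prompt)
    for _ in range(max_rounds):
        if is_full_match and is_full_match(patch):
            break
        # feedback round: ask the model to reconsider its own output,
        # referencing the previous patch and suggested identifiers
        prompt = (f"Previous patch:\n{patch}\n"
                  f"Suggested identifiers: {', '.join(identifiers)}\n"
                  "Reconsider and refine the patch.")
        patch = llm(prompt)
    return patch
```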
Content moderation is a widely used strategy to prevent the dissemination of inappropriate or rule-violating content on social media platforms. Despite extensive research on developing automated models to support decision-making in content moderation, there remains a notable scarcity of studies that integrate the rules of online communities into content moderation. This study addresses this gap by proposing a community rule-based content moderation framework that directly integrates community rules into the moderation of user-generated content. Our experimental results on datasets collected from two domains demonstrate that models based on the framework outperform baseline models across all evaluation metrics. In particular, incorporating community rules substantially enhances model performance in content moderation. These findings have significant research and practical implications for improving the effectiveness and generalizability of content moderation models in online communities.
The rapid advancement of big data technologies has underscored the need for robust and efficient data processing solutions. Traditional Spark-based Platform-as-a-Service (PaaS) solutions, such as Databricks (DBR) and Amazon Web Services Elastic MapReduce (EMR), provide powerful analytics capabilities but often result in high operational costs and vendor lock-in issues. These platforms, while user-friendly, can lead to significant inefficiencies due to their cost structures and lack of transparent pricing. This paper introduces a cost-effective and flexible orchestration framework using Dagster. Our solution aims to reduce dependency on any single PaaS provider by integrating various Spark execution environments. We demonstrate how Dagster's orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs. In our implementation, we achieved a 12% performance improvement over EMR and a 40% cost reduction compared to DBR, translating to over 300 euros saved per pipeline run. Our goal is to provide a flexible, developer-controlled computing environment that maintains or improves performance and scalability while mitigating the risks associated with vendor lock-in. The proposed framework supports rapid prototyping and testing, which is essential for continuous development and operational efficiency, contributing to a more sustainable model of large-scale data processing.
High-dimensional variable selection has emerged as one of the prevailing statistical challenges in the big data revolution. Many variable selection methods have been adapted for identifying single nucleotide polymorphisms (SNPs) linked to phenotypic variation in genome-wide association studies. We develop a Bayesian variable selection regression model for identifying SNPs linked to phenotypic variation. We modify our Bayesian variable selection regression models to control the false discovery rate of SNPs using a knockoff variable approach. We reduce spurious associations by regressing the phenotype of interest against a set of basis functions that account for the relatedness of individuals. Using a restricted regression approach, we simultaneously estimate the SNP-level effects while removing variation in the phenotype that can be explained by population structure. We also accommodate the spatial structure among causal SNPs by modeling their inclusion probabilities jointly with a reduced rank Gaussian process. In a simulation study, we demonstrate that our spatial Bayesian variable selection regression model controls the false discovery rate and increases power when the relevant SNPs are clustered. We conclude with an analysis of Arabidopsis thaliana flowering time, a polygenic trait that is confounded with population structure, and find that the discoveries of our method cluster near previously described flowering time genes.
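The knockoff device used for FDR control works on any vector of importance statistics W, where W[j] > 0 means covariate j beats its knockoff copy. The generic knockoff+ selection filter is sketched below; this is the standard filter only, not the paper's full spatial Bayesian model, and the function names are our own.

```python
import numpy as np

def knockoff_threshold(W, fdr=0.1):
    """Knockoff+ selection threshold: the smallest t such that the estimated
    false discovery proportion (1 + #{W <= -t}) / #{W >= t} is at most `fdr`.
    Negative statistics act as an internal estimate of false discoveries.
    """
    ts = np.sort(np.abs(W[W != 0]))
    for t in ts:
        fdp = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp <= fdr:
            return t
    return np.inf  # nothing selectable at this FDR level

def knockoff_select(W, fdr=0.1):
    """Indices of covariates whose statistic clears the knockoff+ threshold."""
    tau = knockoff_threshold(W, fdr)
    return np.where(W >= tau)[0]
```

In the abstract's setting, W[j] would be derived from the posterior evidence for SNP j versus its knockoff, and the selected set inherits the finite-sample FDR guarantee of the filter.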
Geolocation is now a vital aspect of modern life, offering numerous benefits but also presenting serious privacy concerns. The advent of large vision-language models (LVLMs) with advanced image-processing capabilities introduces new risks, as these models can inadvertently reveal sensitive geolocation information. This paper presents the first in-depth study analyzing the challenges posed by traditional deep learning and LVLM-based geolocation methods. Our findings reveal that LVLMs can accurately determine geolocations from images, even without explicit geographic training. To address these challenges, we introduce \tool{}, an innovative framework that significantly enhances image-based geolocation accuracy. \tool{} employs a systematic chain-of-thought (CoT) approach, mimicking human geoguessing strategies by carefully analyzing visual and contextual cues such as vehicle types, architectural styles, natural landscapes, and cultural elements. Extensive testing on a dataset of 50,000 ground-truth data points shows that \tool{} outperforms both traditional models and human benchmarks in accuracy. It achieves an impressive average score of 4550.5 in the GeoGuessr game, with an 85.37\% win rate, and delivers highly precise geolocation predictions, with its closest predictions within 0.3 km of the true location. Furthermore, our study highlights issues related to dataset integrity, leading to the creation of a more robust dataset and a refined framework that leverages LVLMs' cognitive capabilities to improve geolocation precision. These findings underscore \tool{}'s superior ability to interpret complex visual data, the urgent need to address emerging security vulnerabilities posed by LVLMs, and the importance of responsible AI development to ensure user privacy protection.
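A chain-of-thought geolocation prompt over the cue categories named above could be assembled along the following lines. The template text is an assumption for illustration, not the framework's actual prompt.

```python
def build_geolocation_cot_prompt(cues):
    """Assemble an illustrative chain-of-thought geolocation prompt that walks
    an LVLM through one cue category per reasoning step (template text is an
    assumption, not the actual prompt used by the framework)."""
    steps = "\n".join(f"{i}. Analyze the {c} visible in the image."
                      for i, c in enumerate(cues, 1))
    return ("You are playing GeoGuessr. Reason step by step:\n"
            + steps +
            f"\n{len(cues) + 1}. Combine the cues and output your best "
            "latitude/longitude estimate.")
```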
Graph generation addresses the problem of generating new graphs that have a data distribution similar to real-world graphs. While previous diffusion-based graph generation methods have shown promising results, they often struggle to scale to large graphs. In this work, we propose ARROW-Diff (AutoRegressive RandOm Walk Diffusion), a novel random walk-based diffusion approach for efficient large-scale graph generation. Our method encompasses two components in an iterative process of random walk sampling and graph pruning. We demonstrate that ARROW-Diff can scale to large graphs efficiently, surpassing other baseline methods in terms of both generation time and multiple graph statistics, reflecting the high quality of the generated graphs.
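The iterative interplay of random walk sampling and graph pruning can be sketched schematically. Here `propose_edges` stands in for the learned random-walk diffusion denoiser and `score_edge` for its edge-confidence model; both, along with all loop parameters, are placeholder assumptions rather than ARROW-Diff's actual components.

```python
import random

def random_walk(adj, start, length, rng):
    """Sample a simple random walk over an adjacency dict {node: [neighbors]}."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = adj.get(walk[-1])
        if not nbrs:
            break
        walk.append(rng.choice(nbrs))
    return walk

def generate_graph(nodes, propose_edges, score_edge,
                   steps=3, walk_len=8, keep_ratio=0.8, seed=0):
    """Schematic iterate-sample-prune loop in the spirit of ARROW-Diff.

    `propose_edges(walks, rng)` is a placeholder for a learned denoiser that
    proposes edges conditioned on sampled walks; `score_edge` is a placeholder
    edge-confidence score used for pruning.
    """
    rng = random.Random(seed)
    edges = set()
    for _ in range(steps):
        # 1) random-walk sampling on the current partial graph
        adj = {u: [] for u in nodes}
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        walks = [random_walk(adj, rng.choice(nodes), walk_len, rng) for _ in nodes]
        # 2) propose new edges conditioned on the sampled walks
        edges |= {tuple(sorted(e)) for e in propose_edges(walks, rng)}
        # 3) prune the lowest-scoring edges
        ranked = sorted(edges, key=score_edge, reverse=True)
        edges = set(ranked[: max(1, int(keep_ratio * len(ranked)))])
    return edges
```

Because each iteration only ever materializes walks and a pruned edge set, the loop avoids operating on a dense n-by-n representation, which is the property that lets walk-based generation scale to large graphs.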
Intelligent reflecting surfaces (IRSs) have been regarded as a promising enabler for future wireless communication systems. In the literature, IRSs have been considered power-free or assumed to have constant power consumption. However, recent experimental results have shown that for positive-intrinsic-negative (PIN) diode-based IRSs, the power consumption dynamically changes with the phase shift configuration. This phase shift-dependent power consumption (PS-DPC) introduces a challenging power allocation problem between base station (BS) and IRS. To tackle this issue, in this paper, we investigate a rate maximization problem for IRS-assisted systems under a practical PS-DPC model. For the single-user case, we propose a generalized Benders decomposition-based beamforming method to maximize the achievable rate while satisfying a total system power consumption constraint. Moreover, we propose a low-complexity beamforming design, where the powers allocated to BS and IRS are optimized offline based on statistical channel state information. Furthermore, for the multi-user case, we solve an equivalent weighted mean square error minimization problem with two different joint power allocation and phase shift optimization methods. Simulation results indicate that compared to baseline schemes, our proposed methods can flexibly optimize the power allocation between BS and IRS, thus achieving better performance. The optimized power allocation strategy strongly depends on the system power budget. When the system power budget is high, the PS-DPC is not the dominant factor in the system power consumption, allowing the IRS to turn on as many PIN diodes as needed to achieve high beamforming quality. When the system power budget is limited, however, more power tends to be allocated to the BS to enhance the transmit power, resulting in a lower beamforming quality at the IRS due to the reduced PS-DPC budget.
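The qualitative trade-off described at the end of the abstract (high budget: turn on many PIN diodes; tight budget: favor BS transmit power) can be reproduced with a toy brute-force model. The quadratic array-gain model and all constants below are assumptions for illustration; this is not the paper's generalized Benders decomposition design.

```python
import math

def best_power_split(p_total, n_diodes, p_pin=0.33e-3, g0=1e-9, noise=1e-12):
    """Toy illustration of the BS/IRS power trade-off under phase
    shift-dependent power consumption (PS-DPC).

    Each active PIN diode costs `p_pin` watts; whatever remains of the
    budget goes to BS transmit power.  The gain model g0*(1+n)^2 and all
    constants are assumptions, not the paper's model.  Brute force over n.
    """
    best = (-1.0, 0)
    for n in range(n_diodes + 1):
        p_bs = p_total - n * p_pin                 # power left for the BS
        if p_bs <= 0:
            break                                  # PS-DPC exhausts the budget
        snr = p_bs * g0 * (1 + n) ** 2 / noise     # more diodes -> higher array gain
        rate = math.log2(1 + snr)
        if rate > best[0]:
            best = (rate, n)
    return best  # (achievable rate, optimal number of active diodes)
```

Even in this toy model, a generous budget makes it optimal to activate every diode, while a tight budget pushes the optimum to fewer active diodes so the BS keeps enough transmit power, mirroring the behavior reported in the abstract.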
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
Aspect-level sentiment classification aims to identify the sentiment expressed towards an aspect given a context sentence. Previous neural-network-based methods largely ignore the syntactic structure of the sentence. In this paper, we propose a novel target-dependent graph attention network (TD-GAT) for aspect-level sentiment classification, which explicitly utilizes the dependency relationships among words. Using the dependency graph, it propagates sentiment features directly from the syntactic context of an aspect target. In our experiments, we show that our method outperforms multiple baselines with GloVe embeddings. We also demonstrate that using BERT representations further boosts the performance substantially.
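Propagating features over a dependency graph can be sketched with a single generic GAT-style attention layer; TD-GAT's exact target-dependent design differs, and the shapes and names below are illustrative assumptions.

```python
import numpy as np

def graph_attention_layer(H, A, W, a, leaky=0.2):
    """One graph-attention layer over a dependency graph (generic GAT-style
    layer for illustration; TD-GAT's target-dependent design differs).

    H: (n, d) word features; A: (n, n) adjacency of the dependency graph
    (A[i, j] = 1 if word j is a syntactic neighbor of word i, self-loops
    included); W: (d, d') projection; a: (2*d',) attention vector.
    """
    Z = H @ W                                     # project word features
    n = Z.shape[0]
    # pairwise attention logits e_ij = LeakyReLU(a^T [z_i ; z_j])
    e = np.array([[np.dot(a, np.concatenate([Z[i], Z[j]])) for j in range(n)]
                  for i in range(n)])
    e = np.where(e > 0, e, leaky * e)             # LeakyReLU
    e = np.where(A > 0, e, -1e9)                  # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
    return alpha @ Z                              # aggregate syntactic context
```

Because the softmax is masked by the dependency adjacency, each word aggregates features only from its syntactic neighbors rather than from all words in the sentence, which is the mechanism that lets sentiment features flow from the syntactic context of the aspect target.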