Recommendation systems have witnessed significant advancements and have been widely used over the past decades. However, most traditional recommendation methods are task-specific and therefore lack efficient generalization ability. Recently, the emergence of ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models. Nonetheless, the application of ChatGPT in the recommendation domain has not been thoroughly investigated. In this paper, we employ ChatGPT as a general-purpose recommendation model to explore its potential for transferring extensive linguistic and world knowledge acquired from large-scale corpora to recommendation scenarios. Specifically, we design a set of prompts and evaluate ChatGPT's performance on five recommendation scenarios. Unlike traditional recommendation methods, we do not fine-tune ChatGPT during the entire evaluation process, relying only on the prompts themselves to convert recommendation tasks into natural language tasks. Further, we explore the use of few-shot prompting to inject interaction information that contains users' potential interests, helping ChatGPT better understand user needs and interests. Comprehensive experimental results on the Amazon Beauty dataset show that ChatGPT achieves promising results on certain tasks and is capable of reaching the baseline level on others. We conduct human evaluations on two explainability-oriented tasks to more accurately evaluate the quality of the content generated by different models. The human evaluations show that ChatGPT can truly understand the provided information and generate clearer and more reasonable results. We hope that our study can inspire researchers to further explore the potential of language models like ChatGPT to improve recommendation performance and contribute to the advancement of the recommendation systems field.
The multi-criteria (MC) recommender system, which leverages MC rating information in a wide range of e-commerce areas, is ubiquitous nowadays. Surprisingly, although graph neural networks (GNNs) have been widely applied to develop various recommender systems due to their high expressive capability in learning graph representations, it remains unexplored how to design MC recommender systems with GNNs. In light of this, we make the first attempt at designing a GNN-aided MC recommender system. Specifically, rather than straightforwardly adopting existing GNN-based recommendation methods, we devise a novel criteria preference-aware light graph convolution (CPA-LGC) method, which is capable of precisely capturing the criteria preferences of users as well as the collaborative signal in complex high-order connectivities. To this end, we first construct an MC expansion graph that transforms user--item MC ratings into an expanded bipartite graph in order to learn from the collaborative signal in MC ratings. Next, to strengthen the capability of criteria preference awareness, CPA-LGC incorporates newly characterized embeddings, including user-specific criteria-preference embeddings and item-specific criterion embeddings, into our graph convolution model. Through comprehensive evaluations using four real-world datasets, we demonstrate (a) the superiority of CPA-LGC over benchmark MC recommendation methods and benchmark GNN-based recommendation methods, with tremendous gains, (b) the effectiveness of the core components in CPA-LGC, and (c) its computational efficiency.
The task of session-based recommendation is to predict the user's next interaction based on an anonymized user's behavior pattern. The personalized version of this task is a promising research field because it can exploit user information. However, typical session-based recommendation does not consider the user's preferences and historical sessions, since it concentrates only on user-item interactions. In addition, existing personalized session-based recommendation models are limited in that they consider only the preferences of the current user and ignore those of similar users, which means information contained in the hierarchical user-session-item structure can be lost. To tackle this problem, we propose USP-SBR (User Similarity Powered Session-Based Recommender). To model users' global historical sessions, we propose UserGraph, which has two types of nodes, ItemNode and UserNode, connected by three types of edges: the first connects ItemNodes in chronological order, the second connects ItemNodes to UserNodes, and the last connects UserNodes to ItemNodes. We apply a graph neural network on this UserGraph to update the node representations, and with the resulting user embeddings we introduce an additional contrastive loss that pulls users with similar intentions close to each other in the vector space. Experimental results on two real-world datasets demonstrate that our method outperforms several state-of-the-art approaches.
Narrative-driven recommendation (NDR) presents an information access problem where users solicit recommendations with verbose descriptions of their preferences and context, for example, travelers soliciting recommendations for points of interest while describing their likes/dislikes and travel circumstances. These requests are increasingly important with the rise of natural language-based conversational interfaces for search and recommendation systems. However, NDR lacks abundant training data for models, and current platforms commonly do not support these requests. Fortunately, classical user-item interaction datasets contain rich textual data, e.g., reviews, which often describe user preferences and context - this may be used to bootstrap training for NDR models. In this work, we explore using large language models (LLMs) for data augmentation to train NDR models. We use LLMs for authoring synthetic narrative queries from user-item interactions with few-shot prompting and train retrieval models for NDR on synthetic queries and user-item interaction data. Our experiments demonstrate that this is an effective strategy for training small-parameter retrieval models that outperform other retrieval and LLM baselines for narrative-driven recommendation.
Search engines play a crucial role in satisfying users' diverse information needs. Recently, Pretrained Language Model (PLM)-based text ranking models have achieved huge success in web search. However, many state-of-the-art text ranking approaches only focus on core relevance while ignoring other dimensions that contribute to user satisfaction, e.g., document quality, recency, authority, etc. In this work, we focus on ranking user satisfaction rather than relevance in web search, and propose a PLM-based framework, namely SAT-Ranker, which comprehensively models different dimensions of user satisfaction in a unified manner. In particular, we leverage the capacities of PLMs on both textual and numerical inputs, and apply a multi-field input that modularizes each dimension of user satisfaction as an input field. Overall, SAT-Ranker is an effective, extensible, and data-centric framework with huge potential for industrial applications. In rigorous offline and online experiments, SAT-Ranker obtains remarkable gains on various evaluation sets targeting different dimensions of user satisfaction. It is now fully deployed online to improve the usability of our search engine.
Existing user simulators (USs) for task-oriented dialogue systems only model user behaviour on semantic and natural language levels without considering the user persona and emotions. Optimising dialogue systems with generic user policies, which cannot model diverse user behaviour driven by different emotional states, may result in a high drop-off rate when deployed in the real world. Thus, we present EmoUS, a user simulator that learns to simulate user emotions alongside user behaviour. EmoUS generates user emotions, semantic actions, and natural language responses based on the user goal, the dialogue history, and the user persona. By analysing what kind of system behaviour elicits what kind of user emotions, we show that EmoUS can be used as a probe to evaluate a variety of dialogue systems and in particular their effect on the user's emotional state. Developing such methods is important in the age of large language model chat-bots and rising ethical concerns.
Recent investigations show that large language models (LLMs), specifically GPT-4, not only have remarkable capabilities in common Natural Language Processing (NLP) tasks but also exhibit human-level performance on various professional and academic benchmarks. However, whether GPT-4 can be directly used in practical applications and replace traditional artificial intelligence (AI) tools in specialized domains requires further experimental validation. In this paper, we explore the potential of LLMs such as GPT-4 to outperform traditional AI tools in dementia diagnosis. Comprehensive comparisons between GPT-4 and traditional AI tools are conducted to examine their diagnostic accuracy in a clinical setting. Experimental results on two real clinical datasets show that, although LLMs like GPT-4 demonstrate potential for future advancements in dementia diagnosis, they currently do not surpass the performance of traditional AI tools. The interpretability and faithfulness of GPT-4 are also evaluated by comparison with real doctors. We discuss the limitations of GPT-4 in its current state and propose future research directions to enhance GPT-4 in dementia diagnosis.
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach to human-model interaction in artificial intelligence. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. The other tasks require more objective reasoning, like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of these results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss on semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. This especially applies to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
Alibaba's latest evaluation of ChatGPT's performance on recommendation tasks is worth paying attention to!
Recommendation systems have made great progress and been widely applied over the past decades. However, most traditional recommendation methods are task-specific and lack efficient generalization ability. Recently, the emergence of ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models. Nonetheless, the application of ChatGPT in the recommendation domain has not been thoroughly investigated. This paper employs ChatGPT as a general-purpose recommendation model and explores its potential for transferring the extensive linguistic and world knowledge acquired from large-scale corpora to recommendation scenarios. Specifically, a set of prompts is designed and ChatGPT's performance is evaluated on five recommendation scenarios: rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Unlike traditional recommendation methods, ChatGPT is not fine-tuned during the entire evaluation; only the prompts themselves are used to convert recommendation tasks into natural language tasks. Further, the paper explores few-shot prompting to inject interaction information containing users' potential interests, to help ChatGPT better understand user needs and interests. Comprehensive experimental results on the Amazon Beauty dataset show that ChatGPT achieves promising results on certain tasks and reaches the baseline level on others. Human evaluations are conducted on two explainability-oriented tasks to more accurately assess the quality of the content generated by different models, and they show that ChatGPT can truly understand the provided information and generate clearer and more reasonable results. We hope this study can inspire researchers to further explore the potential of language models like ChatGPT to improve recommendation performance and contribute to the advancement of the recommendation systems field.
//www.zhuanzhi.ai/paper/c1a5e954689aace228e596f676da195e
1. Introduction
As a key technology for alleviating information overload and enhancing user experience, recommendation systems have made great progress over the past decade and have been widely applied in various web applications, such as product recommendation [32,49,51,59], video recommendation [39,54,66], news recommendation [55-57], and music recommendation [27,47]. Meanwhile, with the development of deep learning, recommendation systems have gone through several stages. In the early days, collaborative filtering-based methods [5,6,44,62] were mainly used to model users' behavior patterns from user-item interactions. Later, as side information about users and items was introduced into recommendation systems, content-based recommendation [36,37,40,53,58] and knowledge-based recommendation [2,8,16,18] attracted attention for their ability to provide personalized recommendations.
However, most traditional recommendation methods are task-specific: different tasks or application scenarios require specific data to train specific models, so these methods lack efficient generalization ability. To address this problem, researchers have shifted their focus to applying pre-trained language models (PLMs) in recommendation scenarios, since PLMs exhibit impressive adaptability and can significantly improve the performance of downstream NLP tasks. To effectively convert user interaction data into text sequences, various prompts [64] have been designed. P5 [19] and M6-Rec [11] focus on building foundation models that support a wide range of recommendation tasks.
Recently, the emergence of ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models, making it a valuable tool for enterprises and organizations. Chataug [12] leverages ChatGPT to rephrase sentences for text data augmentation. Jiao et al. [23] find that ChatGPT's translation ability is competitive with commercial translation products for both high-resource and low-resource languages. Bang et al. [3] find that ChatGPT outperforms the previous state-of-the-art zero-shot models by a large margin on sentiment analysis tasks. However, the application of ChatGPT in the recommendation domain has not been studied in depth, and whether ChatGPT can perform well on classic recommendation tasks remains an open question. It is therefore necessary to establish a benchmark for a preliminary evaluation and comparison of ChatGPT against traditional recommendation models, providing valuable insights for further exploring the potential of large language models in recommendation systems. To fill this research gap, we directly use ChatGPT as a general-purpose recommendation model that can handle various recommendation tasks, and explore whether the extensive linguistic and world knowledge acquired from large-scale corpora can be effectively transferred to recommendation scenarios. Our main contribution is a benchmark that tracks ChatGPT's performance in recommendation scenarios, together with a comprehensive analysis and discussion of its strengths and limitations. Specifically, we design a set of prompts and evaluate ChatGPT's performance on five recommendation tasks: rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Unlike traditional recommendation methods, ChatGPT is not fine-tuned during the entire evaluation; we rely only on the prompts themselves to convert recommendation tasks into natural language tasks. In addition, we explore using few-shot prompting to inject interaction information containing users' potential interests, to help ChatGPT better understand user needs and preferences.
Comprehensive experimental results on the Amazon Beauty dataset show that, from the perspective of accuracy, ChatGPT performs well on rating prediction but poorly on sequential recommendation and direct recommendation, where it only reaches performance levels close to early baseline methods on some metrics. On the other hand, although ChatGPT performs poorly on objective evaluation metrics for explainable recommendation tasks such as explanation generation and review summarization, additional human evaluations show that it outperforms state-of-the-art methods, highlighting the limitation of objective evaluation in accurately reflecting ChatGPT's true ability in explainable recommendation. Moreover, although ChatGPT's performance on accuracy-based recommendation tasks is unsatisfactory, it is worth noting that ChatGPT has not been specifically trained on any recommendation data, so there remains great potential for improvement in future research by incorporating more relevant training data and techniques. We believe this benchmark not only reveals ChatGPT's recommendation capability, but also provides a valuable starting point for researchers to better understand its strengths and weaknesses on recommendation tasks. Furthermore, we hope this study can inspire researchers to design new methods that leverage the strengths of language models such as ChatGPT to improve recommendation performance and contribute to the advancement of the recommendation systems field.
2. Recommendation with ChatGPT
The workflow of performing recommendation tasks with ChatGPT is shown in Figure 1 and consists of three steps. First, different prompts are constructed according to the specific characteristics of each recommendation task (Section 2.1). Second, these prompts are fed to ChatGPT as input, and ChatGPT generates recommendation results according to the requirements specified in the prompts. Finally, the output of ChatGPT is checked and refined by a refinement module, and the refined results are returned to the user as the final recommendations (Section 2.2).
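To make the three-step loop concrete, here is a minimal Python sketch of the workflow under our own assumptions: `query_chatgpt` is a hypothetical stand-in for the actual API call, and the format check shown is the one a rating-prediction task would use; the paper does not publish reference code.

```python
import re

def query_chatgpt(prompt: str) -> str:
    """Hypothetical stand-in for the actual ChatGPT API call."""
    raise NotImplementedError("plug in an LLM client here")

def check_rating_format(output: str):
    """Rating prediction expects a single score between 1 and 5 and nothing else."""
    match = re.fullmatch(r"\s*([1-5](?:\.\d+)?)\s*", output)
    return (True, float(match.group(1))) if match else (False, None)

def recommend(prompt: str, check_format, max_retries: int = 3):
    """Steps 2 and 3 of the workflow: query ChatGPT with a task-specific prompt,
    then check its output format, re-querying until the requirement is met."""
    for _ in range(max_retries):
        raw = query_chatgpt(prompt)
        ok, result = check_format(raw)
        if ok:
            return result
    raise RuntimeError("ChatGPT output never satisfied the format requirement")
```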
2.1 Task-Specific Prompt Construction
To study ChatGPT's recommendation ability, we design prompts tailored to different tasks. Each prompt consists of three parts: a task description, behavior injection, and a format indicator. The task description adapts the recommendation task to a natural language processing task. Behavior injection is designed to assess the impact of few-shot prompting; it incorporates user-item interactions to help ChatGPT capture user preferences and needs more effectively. The format indicator constrains the output format so that the recommendation results are easier to understand and evaluate.
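A minimal sketch of how such a prompt could be assembled from the three parts (the helper name `build_prompt` and the joining scheme are our own assumptions, not code from the paper):

```python
def build_prompt(task_description: str,
                 format_indicator: str,
                 behavior_injection: str = "") -> str:
    """Concatenate the three prompt parts. behavior_injection stays empty for
    zero-shot prompts and carries user-item interactions for few-shot prompts."""
    parts = [task_description]
    if behavior_injection:
        parts.append(behavior_injection)
    parts.append(format_indicator)
    return "\n".join(parts)
```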
2.1.1 Rating Prediction. Rating prediction is a key task in recommendation systems that aims to predict the rating a user will give to a specific item. It is essential for personalizing recommendations and improving the overall user experience. In recent years, deep learning models [20] and matrix factorization techniques [26] have been used to effectively address the sparsity problem in recommendation systems. In line with the novel recommendation paradigm of LLMs, we conduct experiments on the rating task by formulating two distinct prompt types to elicit results. Some example prompts are provided in Figure 2.
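For illustration, a zero-shot and a few-shot rating-prediction prompt might look like the following; the wording and item titles are invented for this sketch and are not the paper's verbatim prompts.

```python
# Zero-shot: task description plus format indicator only.
zero_shot_prompt = (
    "How will the user rate the item 'SHANY Nail Art Set'? "
    "(1 being lowest and 5 being highest)\n"
    "Answer with a single number only, no other words."
)

# Few-shot: the user's historical ratings are injected before the question.
few_shot_prompt = (
    "Here is the user's rating history:\n"
    "'Urban Decay Eyeshadow Palette' -> 5\n"
    "'OPI Nail Lacquer, Big Apple Red' -> 4\n"
    "Based on this history, how will the user rate the item "
    "'SHANY Nail Art Set'? (1 being lowest and 5 being highest)\n"
    "Answer with a single number only, no other words."
)
```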
2.1.2 Sequential Recommendation. Sequential recommendation is a subfield of recommendation systems that aims to predict a user's next item or action based on their past sequential behavior. It has received increasing attention in recent years due to its potential applications in e-commerce, online advertising, music recommendation, and other areas. For sequential recommendation, researchers have proposed various methods, including recurrent neural networks [31], contrastive learning [68], and attention-based models [52], to capture the temporal dependencies and patterns in user-item interactions. We design three prompt formats for the sequential recommendation task family (sketched below): 1) directly predicting the user's next item from the interaction history, 2) selecting a possible next item from a candidate list containing exactly one positive item, based on the interaction history, and 3) predicting whether a specific item will be the user's next interacted item, using the user's previous interaction history as the basis. These prompt formats are designed to improve the accuracy and effectiveness of sequential recommendation. Examples of these prompts can be seen in Figure 2.
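The three formats can be sketched as follows (again with invented wording and item titles, purely to illustrate the structure):

```python
history = "'Nail Polish A', 'Eyeshadow Palette B', 'Lip Gloss C'"

# 1) Open-ended next-item prediction from the interaction history.
next_item_prompt = (
    f"The user has interacted with the following items in order: {history}.\n"
    "Predict the title of the next item the user will interact with."
)

# 2) Choose the next item from a candidate list with exactly one positive item.
candidates = "'Mascara D', 'Face Cream E', 'Hair Serum F'"
candidate_prompt = (
    f"The user has interacted with the following items in order: {history}.\n"
    f"Which of these candidates will the user interact with next: {candidates}?\n"
    "Answer with exactly one title from the candidate list."
)

# 3) Yes/no judgment on whether a specific item is the next interaction.
yes_no_prompt = (
    f"The user has interacted with the following items in order: {history}.\n"
    "Will the next item the user interacts with be 'Mascara D'? Answer yes or no."
)
```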
Figure 2: Example prompts for the accuracy-based tasks on the Beauty dataset. Black text denotes the task description, red text the format requirement, blue text user historical information or few-shot information, and gray text the current input.
2.1.3 Direct Recommendation. Direct recommendation, also known as explicit feedback recommendation or rating-based recommendation, is a type of recommendation system that relies on explicit feedback in the form of user ratings or reviews. Unlike recommendation systems that rely on implicit feedback (such as user behavior or purchase history), direct recommendation systems can provide more personalized and accurate recommendations by taking users' explicit preferences into account. For this task, we develop an item-selection prompt that chooses the most suitable item from a list of potential candidates. These prompt formats are designed to optimize the accuracy and relevance of the recommendations. Examples of these prompts can be seen in Figure 2.
2.1.4 Explanation Generation. Explanation generation refers to providing users or system designers with explanations that clarify why certain items are recommended, thereby improving the transparency, persuasiveness, effectiveness, trustworthiness, and user satisfaction of the recommendation system. It also helps system designers diagnose, debug, and refine the recommendation algorithm. Large language models such as ChatGPT can leverage the vast knowledge they contain to understand a user's interests from their historical interaction records and provide reasonable explanations for the user's behavior. Specifically, we ask the ChatGPT model to generate textual explanations that justify a user's preference for a selected item, as shown in Figure 3. For each category, additional auxiliary information, such as hint words and star ratings, can be included.
2.1.5 Review Summarization. With the growing demand for concise and easily digestible content, automatic summarization has become increasingly important in natural language processing. Similar to the explanation generation task, we create two types of prompts, zero-shot and few-shot, and provide some example prompts in Figure 3.
2.2 Output Refinement
To ensure the diversity of generated results, ChatGPT introduces a certain degree of randomness into its response generation, which may lead to different responses for the same input. When using ChatGPT for recommendation, however, this randomness sometimes makes it difficult to evaluate the recommended items. Although the format indicator in the prompt construction can alleviate this problem to some extent, it still cannot guarantee the expected output format in practice. We therefore design an output refinement module to check the format of ChatGPT's output. If the output passes the format check, it is used directly as the final output. If not, it is modified according to predefined rules; if the format correction succeeds, the corrected result is used as the final output, and otherwise the corresponding prompt is fed into ChatGPT again for re-recommendation until the format requirement is met. It is worth noting that different tasks impose different output format requirements when evaluating ChatGPT. For example, rating prediction only requires a specific score, while sequential or direct recommendation requires a list of recommended items. For sequential recommendation in particular, it is challenging to provide all items in the dataset to ChatGPT at once, so ChatGPT's output may not match the item set of the dataset. To address this issue, we propose a similarity-based text matching method that maps ChatGPT's predictions back to the original dataset. Although this method may not perfectly reflect ChatGPT's ability, it can still indirectly demonstrate its potential in sequential recommendation.
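A minimal sketch of the similarity-based mapping step, assuming a simple string-similarity measure (the paper does not specify which measure it uses; `difflib` from the Python standard library is used here purely for illustration, and the item titles are invented):

```python
import difflib

def map_to_catalog(predicted_titles, catalog_titles):
    """Map each title generated by ChatGPT to the most similar title that
    actually exists in the dataset's item set."""
    mapped = []
    for title in predicted_titles:
        matches = difflib.get_close_matches(title, catalog_titles, n=1, cutoff=0.0)
        mapped.append(matches[0] if matches else None)
    return mapped

# Example: a generated title absent from the catalog is mapped back to a real item.
catalog = ["SHANY Nail Art Set", "Urban Decay Eyeshadow Palette"]
print(map_to_catalog(["Shany nail art kit"], catalog))  # -> ['SHANY Nail Art Set']
```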
3. Evaluation
To evaluate ChatGPT, we conduct extensive experiments on the real-world Amazon dataset. By comparing its performance with various representative methods on different tasks and through ablation studies, we aim to answer the following research questions: RQ1: How does ChatGPT perform compared with state-of-the-art baseline models?
RQ2: What is the impact of few-shot prompting on performance?
RQ3: How should human evaluation be designed to assess the explanation generation and summarization tasks?
3.3.1 Rating Prediction. To evaluate ChatGPT's rating prediction performance, we adopt zero-shot and few-shot prompting; the results obtained on the Beauty dataset are summarized in Table 1. The results show that, on the Beauty dataset, few-shot prompting outperforms both MF and MLP in terms of MAE and RMSE. These results provide evidence for the feasibility of performing rating prediction within a conditional text generation framework.
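For reference, MAE and RMSE over predicted and ground-truth ratings follow the standard definitions (this is not code from the paper):

```python
import math

def mae_rmse(predictions, targets):
    """Mean absolute error and root mean squared error of rating predictions."""
    errors = [p - t for p, t in zip(predictions, targets)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

print(mae_rmse([4.0, 3.5, 5.0], [5.0, 3.0, 5.0]))  # -> (0.5, 0.6454...)
```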
3.3.2 Sequential Recommendation. To evaluate ChatGPT's sequential recommendation ability, we conduct zero-shot and few-shot experiments; the results are shown in Table 2. We find that ChatGPT performs much worse than the baselines under the zero-shot prompting setting, with all metrics significantly below the baselines. Under the few-shot prompting setting, although ChatGPT's performance improves relatively, e.g., NDCG@5 surpasses GRU4Rec, it is still generally inferior to classic sequential recommendation methods in most cases. Two main reasons may explain this result. First, during prompt design, all items are represented by their titles. Although this can alleviate the cold-start problem to some extent, it may lead ChatGPT to focus more on semantic similarity than on the transition relationships between items, which are essential for effective recommendation. Second, due to the length limit of the prompt, it is impossible to feed all items in the item set into ChatGPT. This leaves ChatGPT unconstrained when predicting the title of the next item, so the generated item titles may not exist in the dataset. Although these predicted titles can be mapped to existing titles in the dataset via semantic similarity matching, experiments show that this mapping does not bring significant gains. Therefore, using ChatGPT alone is not a suitable choice for the sequential recommendation task. Further exploration is needed to introduce more guidance and constraints that help ChatGPT accurately capture historical interests and make reasonable recommendations within a limited scope.
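NDCG@k, mentioned above, takes a simple form when there is a single ground-truth next item per user; the following is the standard definition, not code from the paper:

```python
import math

def ndcg_at_k(ranked_items, ground_truth, k=5):
    """With one relevant item per user, DCG is 1/log2(rank+1) when the
    ground-truth item appears in the top-k list, otherwise 0; IDCG is 1."""
    top_k = ranked_items[:k]
    if ground_truth in top_k:
        rank = top_k.index(ground_truth) + 1  # 1-based rank
        return 1.0 / math.log2(rank + 1)
    return 0.0
```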
3.3.3 Direct Recommendation. Table 3 shows ChatGPT's performance on the direct recommendation task. Unlike sequential recommendation, direct recommendation requires the recommendation model to select the items most relevant to the user from a limited item pool. We observe that, with zero-shot prompting, the recommendation performance is significantly lower than that of supervised recommendation models. This can be attributed to the insufficient information provided to ChatGPT, which prevents it from capturing user interests and leads to more random recommendations. Although few-shot prompting can improve ChatGPT's recommendation performance by providing some of the user's historical preferences, it still fails to surpass the baselines.
Conclusion
This paper constructs a benchmark to evaluate ChatGPT's performance on recommendation tasks and compares it with traditional recommendation models. The experimental results show that ChatGPT performs well on rating prediction but poorly on sequential recommendation and direct recommendation, indicating that further exploration and improvement are needed. Despite these limitations, ChatGPT outperforms state-of-the-art methods in human evaluation of explainable recommendation tasks, highlighting its potential for generating explanations and summaries. The study provides valuable insights into the strengths and limitations of ChatGPT in recommendation systems, and we hope it can inspire future research on using large language models to improve recommendation performance. Looking ahead, we plan to investigate better ways to incorporate user interaction data into large language models and bridge the semantic gap between language and user interests.
OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is demonstrated to be one small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI). Since its official release in November 2022, ChatGPT has quickly attracted numerous users with extensive media coverage. Such unprecedented attention has also motivated numerous researchers to investigate ChatGPT from various aspects. According to Google Scholar, there are more than 500 articles with ChatGPT in their titles or mentioning it in their abstracts. Considering this, a review is urgently needed, and our work fills this gap. Overall, this work is the first to survey ChatGPT with a comprehensive review of its underlying technology, applications, and challenges. Moreover, we present an outlook on how ChatGPT might evolve to realize general-purpose AIGC (a.k.a. AI-generated content), which will be a significant milestone for the development of AGI.
State-of-the-art recommendation algorithms -- especially the collaborative filtering (CF) based approaches with shallow or deep models -- usually work with various unstructured information sources for recommendation, such as textual reviews, visual images, and various implicit or explicit feedback. Though structured knowledge bases were considered in content-based approaches, they have been largely neglected recently due to the availability of vast amounts of data and the learning power of many complex models. However, structured knowledge bases exhibit unique advantages in personalized recommendation systems. When explicit knowledge about users and items is considered for recommendation, the system can provide highly customized recommendations based on users' historical behaviors. A great challenge in using knowledge bases for recommendation is how to integrate large-scale structured and unstructured data while taking advantage of collaborative filtering for highly accurate performance. Recent achievements in knowledge base embedding shed light on this problem, making it possible to learn user and item representations while preserving the structure of their relationships with external knowledge. In this work, we propose to reason over knowledge base embeddings for personalized recommendation. Specifically, we propose a knowledge base representation learning approach to embed heterogeneous entities for recommendation. Experimental results on a real-world dataset verify the superior performance of our approach compared with state-of-the-art baselines.