Topic: Locally Differentially Private (Contextual) Bandits Learning
Abstract:
We first propose a simple black-box reduction framework that can solve a large family of context-free bandit learning problems with LDP guarantees. Based on this framework, we improve the best-known results for private bandit learning with one-point feedback (e.g., private bandit convex optimization) and obtain the first result for bandit convex optimization (BCO) with multi-point feedback under LDP. The LDP guarantee and the black-box nature make our framework more attractive in real applications than previous specially designed algorithms that only satisfy the relatively weaker notion of differential privacy (DP) for context-free bandits. Furthermore, we extend our algorithm to generalized linear bandits with a regret bound of $\tilde{O}(T^{3/4}/\varepsilon)$ under $(\varepsilon,\delta)$-LDP, which is conjectured to be optimal. Note that, given the existing $\Omega(T)$ lower bound for DP contextual linear bandits, our result reveals a fundamental difference between LDP and DP contextual bandits learning.
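The black-box reduction itself is not spelled out in the abstract; the sketch below illustrates only its basic flavor, assuming one-point feedback with losses bounded in [0, 1] and an ε-LDP Laplace randomizer on the user side. The `base_bandit` interface (`select_action` / `update`) is a hypothetical stand-in for any non-private bandit algorithm, not the paper's API.

```python
import numpy as np

def ldp_laplace_randomizer(loss, epsilon, loss_bound=1.0):
    """Perturb a single bounded loss value on the user side.

    Laplace noise calibrated to the sensitivity (loss_bound) and the privacy
    parameter epsilon makes the reported value epsilon-LDP.
    """
    noise = np.random.laplace(loc=0.0, scale=loss_bound / epsilon)
    return loss + noise

def run_private_bandit(base_bandit, loss_functions, epsilon):
    """Black-box reduction sketch: feed privatized one-point feedback to any
    non-private bandit algorithm (hypothetical select_action/update interface)."""
    for loss_fn in loss_functions:
        action = base_bandit.select_action()
        true_loss = loss_fn(action)               # observed only by the user
        reported = ldp_laplace_randomizer(true_loss, epsilon)
        base_bandit.update(action, reported)      # the learner sees only noisy losses
    return base_bandit
```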
Zero-shot learning relies on semantic class representations such as attributes or pretrained embeddings to predict classes without any labeled examples. We propose learning class representations from common-sense knowledge graphs. Common-sense knowledge graphs provide explicit high-level knowledge that requires little human effort to apply to a range of tasks. To capture the knowledge in the graph, we introduce ZSL-KG, a framework based on graph neural networks with non-linear aggregators to generate class representations. Whereas most prior work on graph neural networks uses linear functions to aggregate information from neighboring nodes, we find that non-linear aggregators such as LSTMs or transformers lead to significant improvements on zero-shot tasks. On two natural language tasks across three datasets, ZSL-KG shows an average improvement of 9.2 accuracy points over state-of-the-art methods. In addition, on an object classification task, ZSL-KG improves accuracy by 2.2 points over the best method that does not require hand-engineered class representations. Finally, we find that ZSL-KG outperforms the best graph neural network with linear aggregators by an average of 3.8 accuracy points across these four datasets.
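A minimal sketch of the non-linear aggregation idea, an LSTM run over (permuted) neighbour embeddings in place of a mean or sum, assuming PyTorch; the class and layer names are illustrative and do not reproduce the ZSL-KG implementation.

```python
import torch
import torch.nn as nn

class LSTMAggregator(nn.Module):
    """Aggregate neighbour embeddings with an LSTM instead of a linear mean/sum."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim + in_dim, hidden_dim)

    def forward(self, node_emb, neighbour_embs):
        # neighbour_embs: (batch, num_neighbours, in_dim)
        # Randomly permute neighbours so the aggregation is not tied to one ordering.
        perm = torch.randperm(neighbour_embs.size(1))
        _, (h_n, _) = self.lstm(neighbour_embs[:, perm, :])
        agg = h_n[-1]                               # (batch, hidden_dim)
        out = self.proj(torch.cat([node_emb, agg], dim=-1))
        return torch.relu(out)

# Example: aggregate 5 neighbours of dimension 300 into 128-d class representations.
agg = LSTMAggregator(in_dim=300, hidden_dim=128)
node = torch.randn(4, 300)
neigh = torch.randn(4, 5, 300)
print(agg(node, neigh).shape)  # torch.Size([4, 128])
```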
Topic: Explainable Reinforcement Learning: A Survey
Abstract: Explainable Artificial Intelligence (XAI), i.e., the development of more transparent and interpretable AI models, has gained increasing attention over the past few years. This is because AI models, as they have grown into powerful and ubiquitous tools, exhibit one detrimental characteristic: a trade-off between performance and transparency. The more complex a model's inner workings are, the harder it becomes to understand how its predictions or decisions are reached. This is especially problematic for machine learning (ML) methods such as reinforcement learning (RL), where the system learns autonomously, so there is a clear need to understand the underlying reasons for its decisions. Since, to the best of our knowledge, no existing work provides an overview of explainable reinforcement learning (XRL) methods, this survey attempts to address this gap. We give a short summary of the problem, define important terms, and propose a classification and assessment of current XRL methods. We find that a) most XRL methods work by mimicking and simplifying a complex model rather than designing an inherently simple one, and b) XRL (and XAI) methods often neglect the human side of the equation, ignoring research from related fields such as psychology or philosophy. Thus, an interdisciplinary effort is needed to adapt the generated explanations to (non-expert) human users in order to make effective progress in the fields of XRL and XAI.
Title: Bayesian Neural Networks With Maximum Mean Discrepancy Regularization
Abstract: Bayesian Neural Networks (BNNs), trained to optimize over an entire distribution of weights instead of a single set, have significant advantages in terms of interpretability, multi-task learning, and calibration. Because of the intractability of the resulting optimization problem, most BNNs are either sampled through Monte Carlo methods or trained by optimizing a suitable evidence lower bound (ELBO) on a variational approximation. In this paper, we propose a variant of the latter, in which we replace the Kullback-Leibler divergence in the ELBO term with a Maximum Mean Discrepancy (MMD) estimator, inspired by recent work in variational inference. After motivating our proposal through the properties of the MMD term, we show a number of empirical advantages of the formulation with respect to the state of the art. In particular, our BNNs achieve higher accuracy on multiple benchmarks, including several image classification tasks. In addition, they are more robust to the choice of prior over the weights, and they are better calibrated. As a second contribution, we provide a new formulation for estimating the uncertainty of a given prediction, showing that it behaves more robustly under adversarial attacks and input noise than more classical criteria such as differential entropy.
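A minimal sketch of how the KL term of the ELBO could be swapped for an empirical MMD estimate between samples from the variational posterior and the prior, assuming PyTorch and an RBF kernel; this is an illustration of the general technique, not the authors' exact estimator or bandwidth choice.

```python
import torch

def rbf_kernel(x, y, bandwidth=1.0):
    # x: (n, d), y: (m, d) -> (n, m) Gram matrix
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

def mmd_regularized_loss(nll, posterior_samples, prior_samples, lam=1.0):
    """Replace the KL(q || p) term of the ELBO with an MMD penalty:
    negative log-likelihood + lam * MMD^2(samples from q, samples from p)."""
    return nll + lam * mmd2(posterior_samples, prior_samples)

# Usage sketch: draw weight samples from the (reparameterized) posterior q and the prior p.
q_samples = torch.randn(64, 10) * 0.5 + 0.1   # stand-in for posterior weight samples
p_samples = torch.randn(64, 10)               # stand-in for prior weight samples
print(mmd2(q_samples, p_samples))
```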
Continual lifelong learning requires an agent or model to learn many sequentially ordered tasks, building on previous knowledge without catastrophically forgetting it. Much work has gone into preventing the default tendency of machine learning models to catastrophically forget, yet virtually all of it involves hand-designed solutions to the problem. We instead advocate meta-learning a solution to catastrophic forgetting, thereby allowing AI to learn to continually learn. Inspired by neuromodulatory processes in the brain, we propose A Neuromodulated Meta-Learning algorithm (ANML). It differentiates through a sequential learning process to meta-learn an activation-gating function that enables context-dependent selective activation within a deep neural network. Specifically, a neuromodulatory (NM) neural network gates the forward pass of another (otherwise normal) neural network called the prediction learning network (PLN). The NM network thereby also indirectly controls selective plasticity (i.e., the backward pass) of the PLN. ANML enables continual learning without catastrophic forgetting at scale: it delivers state-of-the-art continual learning performance, sequentially learning as many as 600 classes (over 9,000 SGD updates).
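A minimal sketch of the gating mechanism described above, assuming PyTorch: the NM network produces a sigmoid mask that multiplies the PLN's features, so units gated toward zero receive (approximately) no gradient, which is the indirect control of plasticity. This is an illustrative reconstruction, not the ANML code.

```python
import torch
import torch.nn as nn

class GatedPLN(nn.Module):
    """A prediction learning network (PLN) whose forward pass is gated
    elementwise by a neuromodulatory (NM) network."""

    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.pln_features = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.pln_head = nn.Linear(hidden_dim, num_classes)
        # NM network: maps the same input to a multiplicative gate in (0, 1).
        self.nm = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())

    def forward(self, x):
        features = self.pln_features(x)
        gate = self.nm(x)              # context-dependent selective activation
        gated = features * gate        # gated-off units get ~0 gradient,
        return self.pln_head(gated)    # giving indirect selective plasticity

model = GatedPLN(in_dim=784, hidden_dim=256, num_classes=600)
logits = model(torch.randn(8, 784))
print(logits.shape)  # torch.Size([8, 600])
```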
Paper title: Privacy-Preserving Gradient Boosting Decision Trees
Paper authors: Qinbin Li, Zhaomin Wu, Zeyi Wen, Bingsheng He
Paper abstract: Gradient Boosting Decision Trees (GBDTs) are a popular machine learning model that has been widely used in various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDTs while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDTs with differential privacy suffer from severe accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocation (especially across the different trees in the GBDT model). Loose sensitivity bounds lead to more noise for a fixed privacy level. Ineffective privacy budget allocation degrades accuracy, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocation. Specifically, by investigating the properties of the gradients and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of the training data in each iteration and to apply leaf clipping in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework that allocates the privacy budget between trees so that the accuracy loss can be reduced. Our experiments show that our approach can achieve better model accuracy compared with other baselines.
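A minimal sketch of the two ingredients named in the abstract, gradient clipping to bound sensitivity and Laplace noise on leaf values under a per-tree budget. This is only an illustration of the general recipe; the paper's actual clipping thresholds, adaptive schedule, and budget allocation differ.

```python
import numpy as np

def clip_gradients(gradients, clip_bound):
    """Clip per-example gradients so each leaf's gradient sum has bounded sensitivity."""
    return np.clip(gradients, -clip_bound, clip_bound)

def noisy_leaf_value(leaf_gradients, clip_bound, epsilon_tree, reg_lambda=1.0):
    """Compute a differentially private leaf value.

    With clipped gradients, one example changes the gradient sum by at most
    clip_bound, so Laplace(clip_bound / epsilon_tree) noise makes the leaf
    value epsilon_tree-DP for this tree (simplified accounting).
    """
    g_sum = np.sum(leaf_gradients)
    noise = np.random.laplace(scale=clip_bound / epsilon_tree)
    return -(g_sum + noise) / (len(leaf_gradients) + reg_lambda)

# Usage sketch: privatize one leaf of one tree with a per-tree budget of 0.1.
grads = clip_gradients(np.random.randn(50), clip_bound=1.0)
print(noisy_leaf_value(grads, clip_bound=1.0, epsilon_tree=0.1))
```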
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
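A minimal, self-contained sketch of the core idea above: a neural network parameterizes dh/dt, and the "layer" output is obtained by integrating it. For brevity this uses a fixed-step Euler solver rather than the adaptive black-box solvers and adjoint backpropagation used in the paper (assumes PyTorch; all names are illustrative).

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parameterizes the derivative of the hidden state: dh/dt = f(h, t)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h, t):
        t_col = t.expand(h.size(0), 1)
        return self.net(torch.cat([h, t_col], dim=-1))

def euler_odeint(func, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step Euler integration (a stand-in for a black-box ODE solver)."""
    h = h0
    dt = (t1 - t0) / steps
    for i in range(steps):
        t = torch.tensor([[t0 + i * dt]])
        h = h + dt * func(h, t)   # differentiable, so autograd can backprop through it
    return h

func = ODEFunc(dim=2)
h0 = torch.randn(5, 2)
h1 = euler_odeint(func, h0)       # continuous-depth "layer" output
print(h1.shape)  # torch.Size([5, 2])
```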
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
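The recipe stated at the end of the abstract, add a bonus to the reward, solve a Bellman equation for the K-values, then act with a Boltzmann policy whose temperature equals the risk-seeking parameter, could be sketched for the tabular, known-dynamics case roughly as follows. This is a simplified illustration; the paper's exact bonus and update rule differ.

```python
import numpy as np

def k_learning_sketch(P, R, bonus, tau, gamma=0.99, iters=500):
    """Tabular sketch: K-values from a soft (log-sum-exp) Bellman backup on
    bonus-augmented rewards, plus the induced Boltzmann exploration policy.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    bonus: (S, A) optimism bonus, tau: risk-seeking / temperature parameter.
    """
    S, A = R.shape
    K = np.zeros((S, A))
    for _ in range(iters):
        V = tau * np.log(np.sum(np.exp(K / tau), axis=1))   # soft state value
        K = R + bonus + gamma * P.dot(V)                    # (S, A) backup
    policy = np.exp(K / tau)
    policy /= policy.sum(axis=1, keepdims=True)             # Boltzmann policy
    return K, policy

# Tiny random MDP for illustration: 3 states, 2 actions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.random((3, 2))
K, pi = k_learning_sketch(P, R, bonus=0.1 * np.ones((3, 2)), tau=1.0)
print(pi)
```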
Deep hierarchical reinforcement learning has gained a lot of attention in recent years due to its ability to produce state-of-the-art results in challenging environments where non-hierarchical frameworks fail to learn useful policies. However, as problem domains become more complex, deep hierarchical reinforcement learning can become inefficient, leading to longer convergence times and poor performance. We introduce the Deep Nested Agent framework, which is a variant of deep hierarchical reinforcement learning where information from the main agent is propagated to the low level $nested$ agent by incorporating this information into the nested agent's state. We demonstrate the effectiveness and performance of the Deep Nested Agent framework by applying it to three scenarios in Minecraft with comparisons to a deep non-hierarchical single agent framework as well as a deep hierarchical framework.
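A minimal sketch of the information-propagation idea: the main agent's latest output is appended to the nested agent's observation before the nested agent acts. The wrapper and its `act` interfaces are hypothetical illustration, not the authors' Minecraft implementation.

```python
import numpy as np

class NestedAgentWrapper:
    """Augments the nested agent's state with information from the main agent."""

    def __init__(self, main_agent, nested_agent):
        self.main_agent = main_agent
        self.nested_agent = nested_agent

    def step(self, observation):
        # The main agent produces high-level information (e.g., a subtask id
        # or an embedding); the exact content is a design choice.
        main_info = self.main_agent.act(observation)
        # Propagate it by concatenating it into the nested agent's state.
        nested_state = np.concatenate([observation, np.atleast_1d(main_info)])
        return self.nested_agent.act(nested_state)
```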
We consider the multi-agent reinforcement learning setting with imperfect information in which each agent is trying to maximize its own utility. The reward function depends on the hidden state (or goal) of both agents, so the agents must infer the other players' hidden goals from their observed behavior in order to solve the tasks. We propose a new approach for learning in these domains: Self Other-Modeling (SOM), in which an agent uses its own policy to predict the other agent's actions and update its belief of their hidden state in an online manner. We evaluate this approach on three different tasks and show that the agents are able to learn better policies using their estimate of the other players' hidden states, in both cooperative and adversarial settings.
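A minimal sketch of the Self Other-Modeling step: the agent reuses its own policy network and optimizes an embedding of the other agent's hidden goal online, so that the policy best explains the other agent's observed action (assumes PyTorch; the function and variable names are illustrative, not the paper's code).

```python
import torch
import torch.nn as nn

def update_other_goal_belief(policy_net, other_obs, other_action, z_other,
                             steps=5, lr=0.1):
    """One online SOM-style update: adjust the inferred goal z_other by gradient
    ascent on the log-probability of the other agent's observed action under
    our *own* policy network."""
    z = z_other.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        logits = policy_net(torch.cat([other_obs, z], dim=-1))
        log_prob = torch.log_softmax(logits, dim=-1)[other_action]
        loss = -log_prob                    # maximize likelihood of observed action
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# Usage sketch with a toy policy: observation dim 4, goal dim 2, 3 actions.
policy_net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 3))
z_belief = torch.zeros(2)
z_belief = update_other_goal_belief(policy_net, torch.randn(4), other_action=1,
                                    z_other=z_belief)
print(z_belief)
```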