November 2018

# Artificially Intelligent: A Closer Look at Reinforcement Learning

## The Multi-Armed Bandit Problem

One of the classic problems in RL is the tension between exploration and exploitation. The slot machine, often called a "one-armed bandit," is the inspiration for this problem; a bank of slot machines, then, makes a "multi-armed bandit." Each of these slot machines has some probability of paying out a jackpot and some probability of paying out nothing. The probability of a jackpot on any given pull can be expressed as p, and the probability of no payout is 1 − p. If a machine has a jackpot probability (JP) of 0.5, each pull has an even chance of winning or losing. By contrast, a machine with a JP of 0.1 will produce a losing result 90 percent of the time.

```python
import numpy as np
import matplotlib.pyplot as plt

number_of_slot_machines = 5
np.random.seed(100)
JPs = np.random.uniform(0, 1, number_of_slot_machines)
print(JPs)
plt.bar(np.arange(len(JPs)), JPs)
plt.show()
```

```
[0.54340494 0.27836939 0.42451759 0.84477613 0.00471886]
```

```python
known_JPs = np.zeros(number_of_slot_machines)
```

```python
def play_machine(slot_machine):
    x = np.random.uniform(0, 1)
    if (x <= JPs[slot_machine]):
        return(10)
    else:
        return(-1)
```

```python
# Test Slot Machine 4
for machines in range(10):
    print(play_machine(3))
print("------")
# Test Slot Machine 5
for machines in range(10):
    print(play_machine(4))
```
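Since play_machine returns +10 for a jackpot and −1 otherwise, a machine with jackpot probability p has an expected payout per pull of 10p − (1 − p). As a quick sanity check — a self-contained sketch that redefines the payout logic locally with a hypothetical p of 0.25 — the empirical average payout converges toward that value:

```python
import numpy as np

np.random.seed(100)

def play_machine_p(p):
    # Same payout scheme as play_machine: +10 on a jackpot, -1 otherwise
    return 10 if np.random.uniform(0, 1) <= p else -1

p = 0.25
expected = 10 * p - (1 - p)  # theoretical average payout per pull: 1.75
plays = [play_machine_p(p) for _ in range(100_000)]
print("expected:", expected, "empirical:", np.mean(plays))
```

Over 100,000 simulated pulls, the empirical mean lands close to the theoretical 1.75, which is the quantity the learning code below is effectively trying to estimate for each machine.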

## Testing the Epsilon-Greedy Hypothesis

```python
def multi_armed_bandit(arms, iterations, epsilon):
    total_reward, optimal_action = [], []
    estimated_payout_odds = np.zeros(arms)
    count = np.zeros(arms)
    for i in range(0, iterations):
        epsilon_random = np.random.uniform(0, 1)
        if epsilon_random > epsilon:
            # exploit
            action = np.argmax(estimated_payout_odds)
        else:
            # explore
            action = np.random.choice(np.arange(arms))
        reward = play_machine(action)
        estimated_payout_odds[action] = estimated_payout_odds[action] + \
            (1/(count[action]+1)) * (reward - estimated_payout_odds[action])
        total_reward.append(reward)
        optimal_action.append(action == np.argmax(estimated_payout_odds))
        count[action] += 1
    return(estimated_payout_odds, total_reward)
```
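The update to estimated_payout_odds is an incremental average: each new reward nudges the estimate by 1/n of the error, which is algebraically identical to recomputing the plain mean of all rewards seen for that arm. A minimal sketch with hypothetical rewards confirms the equivalence:

```python
import numpy as np

rewards = [10, -1, -1, 10, -1]  # hypothetical payouts observed on one arm
estimate = 0.0
for n, reward in enumerate(rewards, start=1):
    # The same incremental update used in multi_armed_bandit
    estimate = estimate + (1 / n) * (reward - estimate)
print(estimate, np.mean(rewards))  # both equal the plain average, 3.4
```

The advantage of the incremental form is that the algorithm never needs to store the full reward history per arm — only the current estimate and the pull count.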

With the RL code now in place, it's time to test the epsilon-greedy algorithm. Enter the code in Figure 3 into an empty cell and execute it. For easy reference, the results display the chart from Figure 1 first, followed by the odds observed by the RL code.

```python
print ("Actual Odds")
plt.bar(np.arange(len(JPs)),JPs)
plt.show()
print (JPs)
print("----------------------------------")
iterations = 1000
print("\n----------------------------------")
print ("Learned Odds with epsilon of .1")
print("----------------------------------")
learned_payout_odds, reward = multi_armed_bandit(number_of_slot_machines, iterations, .1)
plt.bar(np.arange(len(learned_payout_odds)),learned_payout_odds)
plt.show()
print (learned_payout_odds)
print ("Reward: ", sum(reward))
```

```python
print("\n----------------------------------")
print ("Learned Odds with epsilon of 0")
print("----------------------------------")
learned_payout_odds, reward = multi_armed_bandit(number_of_slot_machines, iterations, 0)
plt.bar(np.arange(len(learned_payout_odds)),learned_payout_odds)
plt.show()
print (learned_payout_odds)
print ("Reward: ", sum(reward))
```

```python
print("\n----------------------------------")
print ("Learned Odds with epsilon of 1")
print("----------------------------------")
learned_payout_odds, reward = multi_armed_bandit(number_of_slot_machines, iterations, 1)
plt.bar(np.arange(len(learned_payout_odds)),learned_payout_odds)
plt.show()
print (learned_payout_odds)
print ("Reward: ", sum(reward))
```
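Because any single run is noisy, the three epsilon settings are easier to compare when each is repeated several times and the total rewards averaged. The sketch below is self-contained — it re-seeds the generator and redefines the bandit pieces from earlier, so its numbers won't exactly match the runs above — and averages total reward over 20 trials per epsilon:

```python
import numpy as np

np.random.seed(100)
number_of_slot_machines = 5
JPs = np.random.uniform(0, 1, number_of_slot_machines)

def play_machine(slot_machine):
    # +10 on a jackpot, -1 otherwise
    return 10 if np.random.uniform(0, 1) <= JPs[slot_machine] else -1

def multi_armed_bandit(arms, iterations, epsilon):
    estimated_payout_odds = np.zeros(arms)
    count = np.zeros(arms)
    total_reward = []
    for _ in range(iterations):
        if np.random.uniform(0, 1) > epsilon:
            action = np.argmax(estimated_payout_odds)   # exploit
        else:
            action = np.random.choice(np.arange(arms))  # explore
        reward = play_machine(action)
        # Incremental average of observed payouts for this arm
        estimated_payout_odds[action] += (1 / (count[action] + 1)) * \
            (reward - estimated_payout_odds[action])
        total_reward.append(reward)
        count[action] += 1
    return estimated_payout_odds, total_reward

# Average total reward over 20 trials for each epsilon setting
results = {}
for epsilon in (0.0, 0.1, 1.0):
    totals = [sum(multi_armed_bandit(number_of_slot_machines, 1000, epsilon)[1])
              for _ in range(20)]
    results[epsilon] = np.mean(totals)
print(results)
```

A small epsilon such as 0.1 should reliably outperform pure exploration (epsilon of 1), because the agent spends most pulls on the machine it currently believes is best while still sampling the others often enough to find the true winner.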

## Summary

RL remains one of the most exciting spaces in artificial intelligence. In this article, I explored the epsilon-greedy algorithm and the classic multi-armed bandit problem, drilling down in particular into the explore-or-exploit dilemma that agents face. I encourage you to explore further by experimenting with different values of epsilon and with larger banks of slot machines to see how the trade-offs play out.

Frank La Vigne works at Microsoft as an AI Technology Solutions Professional, where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, "Frank's World TV" (FranksWorld.TV).