13-1. Reinforcement Learning (Bandit)

Author

이상민

Published

May 28, 2025

1. imports

import numpy as np
import collections

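2. Reinforcement Learning Intro

- Reinforcement learning (rough explanation): given some “(game) environment”, the task of learning “what to do” in it

Figure 1: A figure excerpted from Sutton's textbook (Sutton, Barto, et al. (1998)); it is a very famous figure

- DeepMind: breakout \(\to\) AlphaGo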

https://www.youtube.com/watch?v=TmPfTpjtdgg

- What exactly does the “reinforcement” in reinforcement learning reinforce?

https://k9connoisseur.com/blogs/news/positive-reinforcement-dog-training

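- The future of reinforcement learning? (Can you make a living if you get good at this?)

3. Bandit game description

- Problem description: there are two buttons. Assume that pressing button 0 gives a reward of 1 and pressing button 1 gives a reward of 10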

Agent: the entity that presses button 0 or button 1
Env: the entity that gives a Reward based on the Agent's Action

Note: there is no state in this problem setting
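
Because there is no state, the environment in this game reduces to a mapping from action to reward. A minimal sketch of that idea, assuming a hypothetical helper name `bandit_reward` that does not appear in the original notes:

```python
def bandit_reward(action: int) -> int:
    """Hypothetical stateless environment: the reward depends only on the action."""
    rewards = {0: 1, 1: 10}  # button 0 -> 1, button 1 -> 10 (from the problem statement)
    if action not in rewards:
        raise ValueError(f"unknown action: {action}")
    return rewards[action]

print(bandit_reward(0), bandit_reward(1))
```

Raising on an unknown action is a design choice made here for illustration; the notes only define the two buttons.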

- The situation above, illustrated with generative AI

Generated with Claude: https://claude.ai/public/artifacts/1f52fcb2-ef08-4af1-8cf8-4a497d7bcc5f

- How the game proceeds

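At first we know nothing. For now, just press “anything”. (We say “the agent takes random actions”.)
Press about 20 times and observe the results. (We say “the agent accumulates experience”.)
Realize that button 0 gives 1 point and button 1 gives 10 points. (We say “the agent has understood the environment”.)
Realize that pressing button 1 is the better choice. (We say “the agent has learned the optimal policy”.)
From now on, press only button 1 \(\to\) game cleared. (This can be called “reinforcement learning success”.)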

- How does the agent know that pressing button 1 is the better choice? \(\to\) Build a table like the one below. (q_table)

	Action0	Action1
State0	mean(Reward | State0, Action0)	mean(Reward | State0, Action1)
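
Written out, each entry of the table is a sample mean over the experiences in which that action was taken. With a single state, the table reduces to one number per action. The symbols \(N_a\), \(a_t\), and \(r_t\) are notation introduced here for illustration and are not taken from the original notes:

```latex
\[
q(a) \;=\; \frac{1}{N_a} \sum_{t \,:\, a_t = a} r_t ,
\qquad
N_a \;=\; \#\{\, t : a_t = a \,\}
\]
```

In this game the rewards are deterministic, so \(q(0) = 1\) and \(q(1) = 10\) as soon as each button has been pressed at least once.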

4. Designing and solving the Bandit environment

A. A rough hands-on pass over the concept

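action_space = [0, 1]
# Bounded history buffers: only the most recent 200 experiences are kept
actions_deque = collections.deque(maxlen=200)
rewards_deque = collections.deque(maxlen=200)
#---#
for _ in range(10):
    # Explore: choose a button uniformly at random
    action = np.random.choice(action_space)
    # Environment rule: button 0 pays 1, button 1 pays 10
    if action == 0:
        reward = 1
    else:
        reward = 10
    actions_deque.append(action)
    rewards_deque.append(reward)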

actions_deque

deque([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], maxlen=200)

rewards_deque

deque([10, 10, 1, 1, 10, 10, 10, 1, 1, 1], maxlen=200)

actions_numpy = np.array(actions_deque)
rewards_numpy = np.array(rewards_deque)

q0 = rewards_numpy[actions_numpy==0].mean()
q1 = rewards_numpy[actions_numpy==1].mean()
q_table = np.array([q0,q1])
q_table

array([ 1., 10.])
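
The per-action means above assume each button was pressed at least once. If an action never appears in the history, `.mean()` on the empty selection yields `nan` (with a NumPy warning). A minimal sketch of a guard, assuming a hypothetical helper `estimate_q_table` and an arbitrary default of 0.0 for untried actions (neither is in the original notes):

```python
import numpy as np

def estimate_q_table(actions, rewards, n_actions=2, default=0.0):
    """Mean reward per action; untried actions get `default` (an assumption)."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    q_table = np.full(n_actions, default)
    for a in range(n_actions):
        mask = actions == a
        if mask.any():  # only average when the action was actually tried
            q_table[a] = rewards[mask].mean()
    return q_table

# History taken from the run shown above
actions = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
rewards = [10, 10, 1, 1, 10, 10, 10, 1, 1, 1]
print(estimate_q_table(actions, rewards))   # [ 1. 10.]
print(estimate_q_table([1, 1], [10, 10]))   # [ 0. 10.]
```

The default value matters only before an action has been tried; any value below the smallest possible reward keeps the greedy choice on a button that has actually been observed.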

action = q_table.argmax()

#---#
for _ in range(5):
    #action = np.random.choice(action_space)
    # Exploit: pick the action with the highest estimated mean reward
    action = q_table.argmax()
    if action == 0:
        reward = 1
    else:
        reward = 10
    actions_deque.append(action)
    rewards_deque.append(reward)
    # Re-estimate the q_table from the (bounded) history
    actions_numpy = np.array(actions_deque)
    rewards_numpy = np.array(rewards_deque)
    q0 = rewards_numpy[actions_numpy == 0].mean()
    q1 = rewards_numpy[actions_numpy == 1].mean()
    q_table = np.array([q0, q1])

if rewards_numpy[-5:].mean() > 9:
    print("GameClear")

GameClear
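
The clear condition works because a single press of button 0 within the last five steps already pulls the window mean below 9. A quick check of that arithmetic, using a hypothetical helper `window_mean` introduced only for this illustration:

```python
import numpy as np

def window_mean(n_low, window=5):
    """Mean reward of a window holding `n_low` rewards of 1 and the rest 10."""
    rewards = np.array([1] * n_low + [10] * (window - n_low))
    return rewards.mean()

for n_low in range(3):
    # Prints the number of low rewards, the window mean, and whether it passes
    print(n_low, window_mean(n_low), window_mean(n_low) > 9)
```

Only the window with no low rewards passes (mean 10.0); one low reward gives (4·10 + 1)/5 = 8.2. With `window=20`, the size used in the class-based version, up to two presses of button 0 still pass, since (18·10 + 2)/20 = 9.1.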

B. Design and solution using classes

class Bandit():
    def __init__(self):
        self.reward = None
    def step(self, action):
        # Button 0 pays 1, button 1 pays 10
        if action == 0:
            self.reward = 1
        elif action == 1:
            self.reward = 10
        return self.reward

class Agent():
    def __init__(self):
        self.n_experiences = 0
        self.action_space = [0, 1]
        self.action = None
        self.actions_deque = collections.deque(maxlen=500)
        self.actions_numpy = np.array(self.actions_deque)
        self.reward = None
        self.rewards_deque = collections.deque(maxlen=500)
        self.rewards_numpy = np.array(self.rewards_deque)
        self.q_table = None
    def act(self):
        # Explore randomly for the first 20 experiences, then act greedily
        if self.n_experiences < 20:
            self.action = np.random.choice(self.action_space)
        else:
            self.action = self.q_table.argmax()
        print(f"버튼{self.action}누름")  # prints "pressed button {action}"
    def save_experience(self):
        # Append the latest (action, reward) pair and refresh the array views
        self.n_experiences = self.n_experiences + 1
        self.actions_deque.append(self.action)
        self.rewards_deque.append(self.reward)
        self.actions_numpy = np.array(self.actions_deque)
        self.rewards_numpy = np.array(self.rewards_deque)
    def learn(self):
        # The q_table is estimated only once 20 experiences have been collected
        if self.n_experiences < 20:
            pass
        else:
            q0 = self.rewards_numpy[self.actions_numpy == 0].mean()
            q1 = self.rewards_numpy[self.actions_numpy == 1].mean()
            self.q_table = np.array([q0, q1])

env = Bandit()
agent = Agent()

agent.act()

버튼1누름

for _ in range(100):
    #1. Act
    agent.act()
    #2. Reward
    agent.reward = env.step(agent.action)
    #3. Save & learn
    agent.save_experience()
    agent.learn()
    #---#
    # Clear once the mean of the last 20 rewards exceeds 9
    if (agent.n_experiences > 20) and (agent.rewards_numpy[-20:].mean() > 9):
        print("게임클리어")  # prints "game clear"
        break

버튼0누름
버튼1누름
버튼0누름
버튼1누름
버튼0누름
버튼1누름
버튼0누름
버튼1누름
버튼1누름
버튼0누름
버튼0누름
버튼0누름
버튼0누름
버튼0누름
버튼1누름
버튼0누름
버튼1누름
버튼1누름
버튼0누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
버튼1누름
게임클리어
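
Recomputing the means from the full history on every step is simple but does more work than needed. As a sketch of an alternative (not the method used in these notes), the agent below keeps a running mean per action and uses a seeded generator so that runs are repeatable. The names `RunningMeanAgent` and `run_until_clear`, the seed, and the choice to keep exploring while any button is still untried are all assumptions made for illustration.

```python
import collections
import numpy as np

class RunningMeanAgent:
    """Hypothetical agent: explore for `n_explore` steps, then act greedily."""
    def __init__(self, n_explore=20, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_explore = n_explore
        self.n_experiences = 0
        self.counts = np.zeros(2, dtype=int)  # times each button was pressed
        self.q_table = np.zeros(2)            # running mean reward per button

    def act(self):
        # Keep exploring until the budget is spent and both buttons were tried
        if self.n_experiences < self.n_explore or (self.counts == 0).any():
            return int(self.rng.integers(2))
        return int(self.q_table.argmax())

    def learn(self, action, reward):
        # Incremental mean: q <- q + (r - q) / n
        self.n_experiences += 1
        self.counts[action] += 1
        self.q_table[action] += (reward - self.q_table[action]) / self.counts[action]

def run_until_clear(seed=0, max_steps=100, window=20):
    agent = RunningMeanAgent(seed=seed)
    recent = collections.deque(maxlen=window)  # last `window` rewards
    for step in range(1, max_steps + 1):
        action = agent.act()
        reward = 1 if action == 0 else 10      # same rule as the Bandit environment
        agent.learn(action, reward)
        recent.append(reward)
        # Same clear condition as above: more than `window` steps, window mean > 9
        if step > window and np.mean(recent) > 9:
            return step, agent.q_table
    return None, agent.q_table

steps, q_table = run_until_clear(seed=0)
print(steps, q_table)
```

The running mean gives the same estimates as averaging the stored rewards, as long as no experience has been dropped from the bounded deque. The number of steps to clear depends on the seed, because it is determined by how many presses of button 0 fall near the end of the exploration phase.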