[Deep Reinforcement Learning Hands On] Chapter.01

dlrpskdi 2023. 8. 4. 08:58

Chapter 1. What is Reinforcement Learning?

 💡 Reinforcement Learning (RL) : an approach that natively incorporates an extra dimension (usually time, but not necessarily) into learning equations

Reinforcement learning : an agent defined within some environment perceives its current state and learns by choosing, among the available actions, the action (or sequence of actions) that maximizes the reward.

 

Learning - supervised, unsupervised, and reinforcement

We define the characteristics of reinforcement learning by comparing it with supervised and unsupervised learning.

  1. supervised learning
  • main objective : we have many examples of the input and desired output, and we want to learn how to generate the output for some future, currently unseen inputs
  • ex) text classification, image classification, regression problems, etc.
  2. unsupervised learning
  • main objective : to learn some hidden structure of the dataset at hand
  • assumes no supervision, i.e. no known labels assigned to our data
  • ex) clustering, GANs
  3. reinforcement learning
  • lies somewhere in between full supervision and a complete lack of predefined labels
  • uses many well-established methods of supervised learning, but in a different way

 📝 Reinforcement learning is an area of machine learning : a distinctive approach that borrows the methodology of supervised learning while, like unsupervised learning, not requiring labeled data.

Figure 1 : Robot mouse maze world

  1. environment : a maze with food at some points and electricity at others
  2. robot mouse (the agent) : can take actions such as turn left/right and move forward
  3. the mouse can observe the full state of the maze to make a decision about its actions
  4. it tries to find as much food as possible, while avoiding an electric shock whenever possible

The final goal of the agent is to get as much total reward as possible.

  • RL doesn't work with predefined labels, so there is no label saying which action is good or bad, or which direction is best
  • three kinds of reward : positive, negative, or neutral
  • what makes RL tricky?
    1. having non-i.i.d. data (i.i.d. : independent and identically distributed) : the individual random variables do not share one fixed distribution; the data is sequential, so its time-series nature matters
      1. observations in RL depend on the agent's behavior and, to some extent, are the result of that behavior
      2. even if the agent decides to do inefficient things, the observations tell it nothing about what it should have done instead
      3. if the agent is stubborn and keeps making mistakes, the observations can give the false impression that there is no better way, which is totally wrong
    2. exploration/exploitation dilemma : the balance between exploiting what is already known and exploring
      1. the agent needs to not only exploit the policy it has learned, but also actively explore the environment
      2. by doing things differently we can significantly improve the outcome, but too much exploration may also seriously decrease the reward
      3. we need to find a balance between these two activities (see the sketch after this list)
    3. reward can be seriously delayed from actions : the feedback for an action may be delayed rather than immediate
      1. we need to discover such causalities, which can be tricky to do over the flow of time and our actions
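One common way to strike that balance is an epsilon-greedy rule: act greedily most of the time and pick a random action with a small probability epsilon. The snippet below is only an illustrative sketch (the function and its inputs are hypothetical, and this technique is not covered until later in the book):

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """Pick an action index from a list of per-action value estimates."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))        # explore: random action
    return max(range(len(value_estimates)),
               key=lambda a: value_estimates[a])              # exploit: best-valued action

# Example with three hypothetical action-value estimates
print(epsilon_greedy([0.2, 0.5, 0.1]))
```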

 

RL formalisms and relations

- Figure 2 : RL entities and their communications

 

  • Reward
    • a scalar value we obtain periodically from the environment
    • purpose : to tell our agent how well it has behaved
    • we don't define how frequently the agent receives this reward
    • local : it reflects the success of the agent's recent activity
    • it reinforces the agent's behavior in a positive or negative way; the agent's goal is to achieve the largest accumulated reward over its sequence of actions
  • The agent
    • somebody or something who/which interacts with the environment by executing certain actions, taking observations, and receiving eventual rewards for this
    • supposed to solve some problem in a more-or-less efficient way
  • The environment
    • everything outside of the agent
    • the agent's communication with the environment is limited to rewards, actions, and observations (a minimal interaction-loop sketch follows this list)
  • Actions
    • things that an agent can do in the environment
    • two types of actions : discrete or continuous
  • Observations
    • pieces of information that the environment provides the agent
    • may or may not be relevant to the upcoming reward, and may even include reward information
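To make these communications concrete, here is a minimal sketch of the agent-environment loop, in the spirit of the agent anatomy example later in the chapter (the classes and methods below are simplified placeholders, not the book's exact code):

```python
import random

class Environment:
    """A toy environment: it hands out random rewards for 10 steps, then ends."""
    def __init__(self):
        self.steps_left = 10

    def get_observation(self):
        return [0.0, 0.0, 0.0]            # dummy observation vector

    def get_actions(self):
        return [0, 1]                     # two discrete actions

    def is_done(self):
        return self.steps_left == 0

    def action(self, action):
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()            # reward for the executed action

class Agent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env):
        obs = env.get_observation()                 # take an observation
        action = random.choice(env.get_actions())   # choose an action (randomly here)
        self.total_reward += env.action(action)     # receive the reward

env = Environment()
agent = Agent()
while not env.is_done():
    agent.step(env)
print("Total reward collected: %.4f" % agent.total_reward)
```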

 

Markov decision processes

The Markov process part builds the theoretical foundation of RL that lets us move on to the methods used to solve RL problems.

 

First, we introduce the mathematical representation and notation of the formalisms just discussed (reward, agent, actions, observations, and environment). Building on this, we introduce the second-order concepts of the RL language, including state, episode, history, value, and gain, which are used repeatedly later in the book to describe other methods. Our description of the Markov decision process is like a Russian matryoshka doll.

 

<Russian matryoshka doll>

We start with the simplest case, the Markov process (MP, also known as a Markov chain), then extend it with rewards, turning it into a Markov reward process.

 

Then we add actions, putting this idea into yet another envelope, which leads to Markov decision processes (MDPs).

 

Markov processes and Markov decision processes are widely used in computer science and other engineering fields.

 

 

 

Markov process

The Markov process is also known as a Markov chain.

 

Imagine a system in front of you that you can only observe. What you observe are called states, and the system can switch between states according to some laws of dynamics.

 

You cannot influence the system; you can only watch its states change. All the possible states of the system form a set called the state space, and for Markov processes we require this set of states to be finite (although it can be very large).

 

The observations form a sequence of states, or a chain (which is why Markov processes are also called Markov chains).

 

For example, in the simplest weather model for some city, we can observe the current day as either sunny or rainy, and that is our state space.

 

A sequence of observations over time forms a chain such as [sunny, sunny, rainy, sunny, ...], and this is called the history.

 

For such a system to be called a Markov process, it must satisfy the Markov property, which means that from any state, the future dynamics of the system depend on that state alone.

 

The point of the Markov property is to make every observable state self-contained, so that it is enough to describe the future of the system.

 

In other words, the Markov property requires the states of the system to be distinguishable from one another and unique; then only one state is needed to model the future dynamics of the system, not the whole history.

 

In our weather example, the Markov property limits the model to cases where a sunny day can follow a rainy day with the same probability, regardless of how many sunny days we saw in the past.

 

This is not a very realistic model, since common sense tells us that the chance of rain tomorrow depends not only on the current state but also on many other factors, such as the season, the latitude, and the presence of mountains and seas nearby.

 

It has even been shown recently that solar activity has a major influence on the weather. So our example is really naive, but what matters is understanding the model's limitations and making conscious decisions about them.

 

Of course, if we want to make the model more complex, we can always extend the state space, which lets us capture more dependencies.

 

For example, to capture the probability of rain separately for summer and winter, we can include the season in the state.

 

In this case, the state space becomes [sunny+summer, sunny+winter, rainy+summer, rainy+winter].

 

Since the system model satisfies the Markov property, its transition probabilities can be captured with a transition matrix, a square matrix of size N×N.

 

Here N is the number of states in the model, and the cell at row i and column j of the matrix contains the probability of the system transitioning from state i to state j.

 

For example, in the sunny/rainy case, the transition matrix could be as follows:

 

        sunny   rainy
sunny   80%     20%
rainy   10%     90%

In this case, if we observe a sunny day, the probability that the next day will be sunny is 80% and that it will be rainy is 20%; if we observe a rainy day, the probability that the weather will improve is only 10%, and the probability that the next day will also be rainy is 90%.
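As a small illustrative sketch (not from the book), this chain can be written down and sampled directly; the probabilities are the ones from the table above:

```python
import random

# Transition matrix of the sunny/rainy Markov chain, taken from the table above.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.1, "rainy": 0.9},
}

def sample_chain(start, steps):
    """Sample a history of states from the chain, starting at `start`."""
    state, history = start, [start]
    for _ in range(steps):
        next_states, probs = zip(*transitions[state].items())
        state = random.choices(next_states, weights=probs, k=1)[0]
        history.append(state)
    return history

print(sample_chain("sunny", 10))   # e.g. ['sunny', 'sunny', 'rainy', ...]
```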

 

The formal definition of a Markov process is as follows:

  • a set of states (S) that the system can be in
  • a transition matrix (T), with transition probabilities, which defines the system dynamics

A useful visual representation of an MP is a graph with nodes corresponding to the system's states and edges labeled with probabilities, representing the possible transitions from state to state.

 

If the probability of a transition is 0, we do not draw an edge (there is no way to get from one state to the other). This kind of representation is also widely used in finite state machines, which are studied in automata theory.

 

For our sunny/rainy weather model, the graph is as follows:

<Sunny/Rainy weather model>

 

๋” ๋ณต์žกํ•œ ์˜ˆ๋ฅผ ๋“ค์–ด๋ณด๋ฉด, ์ง์žฅ์ธ์˜ ๋˜ ๋‹ค๋ฅธ ๋ชจ๋ธ(Scott Adams์˜ ์œ ๋ช…ํ•œ ๋งŒํ™”์— ๋‚˜์˜ค๋Š” ์ฃผ์ธ๊ณต Dilbert๊ฐ€ ์ข‹์€ ์˜ˆ)์„ ์†Œ๊ฐœํ•ด๋ด„

 

The state space in this example is as follows:

  • Home : he is not at the office
  • Computer : he is working on his computer at the office
  • Coffee : he is drinking coffee at the office
  • Chatting : he is discussing something with colleagues at the office

The state transition graph is as follows:

<State transition graph>

We expect his workday to usually start from the Home state, and to always begin with Coffee, without exception (no Home → Computer edge and no Home → Chatting edge).

 

The diagram above also shows that the workday always ends in the Computer state. The transition matrix for the preceding diagram is as follows:

          Home   Coffee   Chat   Computer
Home      60%    40%      0%     0%
Coffee    0%     10%      70%    20%
Chat      0%     20%      50%    30%
Computer  20%    20%      10%    50%

 

The transition probabilities can also be placed directly on the state transition graph, as follows:

<State transition graph with transition probabilities>

In practice, we rarely have the luxury of knowing the exact transition matrix.

 

A much more realistic situation is when we only have observations of the system's states, which are called episodes:

  • home -> coffee -> coffee -> chat -> chat -> coffee -> computer -> computer -> home
  • computer -> computer -> chat -> chat -> coffee -> computer -> computer -> computer
  • home -> home -> coffee -> chat -> computer -> coffee -> coffee

Estimating the transition matrix from our observations is not complicated; we just count all the transitions from every state and normalize each row so that it sums to 1.


The more observation data we have, the closer our estimate will be to the true underlying model.
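A minimal sketch of that estimation, using the three episodes listed above (the helper code is illustrative, not the book's):

```python
from collections import Counter, defaultdict

episodes = [
    ["home", "coffee", "coffee", "chat", "chat", "coffee", "computer", "computer", "home"],
    ["computer", "computer", "chat", "chat", "coffee", "computer", "computer", "computer"],
    ["home", "home", "coffee", "chat", "computer", "coffee", "coffee"],
]

# Count every observed transition from each state ...
counts = defaultdict(Counter)
for episode in episodes:
    for src, dst in zip(episode, episode[1:]):
        counts[src][dst] += 1

# ... then normalize each row so its probabilities sum to 1.
transition_matrix = {
    src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
    for src, dsts in counts.items()
}
print(transition_matrix["coffee"])
```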


It is also worth noting that the Markov property implies stationarity (that is, the underlying transition distribution for any state does not change over time).

 

Nonstationarity means that there is some hidden factor that influences the system dynamics, and this factor is not included in the observations.

 

However, this contradicts the Markov property, which requires the underlying probability distribution for the same state to be identical regardless of the transition history.

 

It is important to understand the difference between the actual transitions observed in an episode and the underlying distribution given in the transition matrix.

 

The concrete episodes we observe are randomly sampled from the model's distribution, so they can differ from episode to episode; however, the concrete transition probabilities that the samples are drawn from stay the same. Otherwise, the Markov chain formalism does not apply.


Now we can go further and extend the Markov process model to bring it closer to our RL problems :)

 

Markov reward process (MRP; a probabilistic model for expressing reinforcement learning)

  • ์˜์‚ฌ๊ฒฐ์ • ๊ณผ์ •์„ ํ™•๋ฅ ๊ณผ ๊ทธ๋ž˜ํ”„๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ชจ๋ธ๋งํ•œ ๊ฒƒ์œผ๋กœ, ๊ธฐ์กด์˜ MP์— R ๊ณผ ๊ฐ๋งˆ(Discount factor; ํ• ์ธ์š”์†Œ)๊ฐ€ ์ถ”๊ฐ€๋œ ๋ชจ๋ธ์ด๋‹ค.

- MP is short for Markov process: the state changes happen under the assumption of the Markov property below (the next state depends only on the current state and changes probabilistically).

cf. Markov property : the assumption that, when predicting the future state of a probabilistic system, only the 'current state' is considered, not the past.

The return in an MRP is defined as:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Taking the conditional expectation of this quantity for a state gives that state's value:

V(s) = \mathbb{E}[G_t \mid S_t = s]
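As a quick numeric illustration of the return formula above (a sketch with made-up reward numbers, not taken from the book):

```python
def discounted_return(rewards, gamma):
    """Compute G_t for a list of rewards [R_{t+1}, R_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0]                      # hypothetical rewards after time t
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9*2 + 0.81*3 = 5.23
```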

 

2. Additional components

- Reward function (R) : a function expressing the expected reward for the current state.

- Gamma (discount factor) : expresses uncertainty; it takes a value between 0 and 1 and discounts (shrinks) rewards that lie further in the future.

- This makes rewards that arrive later count for less, so the agent is pushed toward collecting reward sooner (think of preferring the shortest path).

2-1. Textbook example (A diagram with rewards); figure 7

*Interpretation: v(s) is an expected value (the average value we expect to obtain from the MRP).

The numbers in the figure above denote the rewards attached to the transitions.

a. Estimation using the Dilbert Reward Process (DRP)

Q. When gamma (the discount factor) = 0, think about the value of the Chat state.

A. It would depend on chance if gamma were unknown. But here we know gamma = 0, so let's compute it.

Why ?

  • According to the Dilbert process, three transitions are possible from the Chat state:
    • 50% (0.5) - Chat
    • 30% (0.3) - Computer
    • 20% (0.2) - Coffee
  • Since we assumed gamma = 0, the value of the Chat state depends only on these one-step transitions and their immediate rewards.

The calculation is done exactly the way we normally compute an expected value (see the sketch below):

: the most valuable state turns out to be Computer.
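A small sketch of that computation: the transition probabilities are the ones listed above, while the per-transition rewards are placeholders (the exact numbers come from the book's figure 7, which is not reproduced in this post):

```python
# With gamma = 0, the value of a state is just the expected immediate reward:
#   V(s) = sum over next states s' of p(s'|s) * r(s, s')
chat_transitions = {"Chat": 0.5, "Computer": 0.3, "Coffee": 0.2}   # from the Chat state
chat_rewards     = {"Chat": -1.0, "Computer": 2.0, "Coffee": 1.0}  # placeholder rewards

v_chat = sum(p * chat_rewards[s] for s, p in chat_transitions.items())
print(v_chat)   # -1.0*0.5 + 2.0*0.3 + 1.0*0.2 = 0.3
```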

What if we measure the values in the same way with gamma = 1?

- In that case we would be reasoning over an infinitely long chain of future states, so the conclusion is that the value is infinite for all states.

 

3. Markov decision process (MDP)

: an MRP combined with decisions; it is a process that considers reward values and decisions together.

: the biggest difference from an MRP is whether the agent is given a choice, i.e. an action to select.

In an MRP, no learning is possible: transitions happen with fixed probabilities and a fixed amount of reward is received.

In an MDP, values are assigned based on the rewards obtained from actions, and these values are consulted the next time a choice has to be made in the same state (see the sketch below).
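One minimal way to picture the difference in code (a sketch with hypothetical action names and numbers, not the book's example): in an MRP the next-state distribution depends on the state alone, while in an MDP it is selected by the (state, action) pair the agent chooses:

```python
# MRP: the next-state distribution depends only on the current state.
mrp_transitions = {
    "Chat": {"Chat": 0.5, "Computer": 0.3, "Coffee": 0.2},
}

# MDP: the agent's chosen action selects which distribution applies
# (the action names and probabilities below are made up for illustration).
mdp_transitions = {
    ("Chat", "keep_chatting"): {"Chat": 0.9, "Computer": 0.1},
    ("Chat", "go_work"):       {"Computer": 0.8, "Coffee": 0.2},
}

print(mdp_transitions[("Chat", "go_work")])
```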

3-1. Actions (the agent's actions)

: value is evaluated from the agent's perspective; we measure value not only for states but also for the actions the agent takes.

3-2. Policy : the function that determines the action (a mapping function)

As a formula, a policy is the probability distribution over actions for every possible state:

\pi(a \mid s) = P[A_t = a \mid S_t = s]
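As a concrete (hypothetical) illustration, a stochastic policy can be stored as a mapping from each state to a probability distribution over actions, and sampled whenever the agent has to act:

```python
import random

# Hypothetical stochastic policy pi(a|s): state -> {action: probability}
policy = {
    "Coffee":   {"work": 0.3, "chat": 0.7},
    "Computer": {"work": 0.9, "chat": 0.1},
}

def sample_action(policy, state):
    """Draw an action for `state` according to the policy's probabilities."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "Coffee"))
```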