[Deep Reinforcement Learning Hands On] Chapter.02

dlrpskdi 2023. 8. 4. 09:16

2.1 The Anatomy of the Agent

2.1.1 A Simplistic Situation

  • Define an environment that gives the agent random rewards for a limited number of steps, regardless of the agent’s actions
import random

class Environment:
	def __init__(self):
		self.steps_left = 10 # initialize its internal state

	def get_observation(self):
		return [0.0, 0.0, 0.0]

	def get_actions(self):
		return [0, 1] # the set of actions the agent can execute

	def is_done(self):
		return self.steps_left == 0 # signals the end of the episode to the agent

	def action(self, action):
		if self.is_done():
			raise Exception("Game is over")
		self.steps_left -= 1
		return random.random()
class Agent:
	def __init__(self):
		self.total_reward = 0.0

	def step(self, env):
		current_obs = env.get_observation()
		actions = env.get_actions()
		reward = env.action(random.choice(actions))
		self.total_reward += reward
if __name__ == "__main__":
	env = Environment()
	agent = Agent()

	while not env.is_done():
		agent.step(env)

	print("Total reward got : %.4f"%agent.total_reward)

2.2 Hardware and Software Requirements

  • Python version used for the code in the book: 3.6
  • Python packages used
    • Numpy
    • OpenCV Python bindings
    • Gym
    • PyTorch
    • Ptan
  • To run the code faster, access to a GPU is needed.
    • Buy a CUDA-compatible GPU
    • Use a GPU-backed cloud instance such as Amazon AWS or Google Cloud
  • Operating system: Windows is not supported, so use Linux or macOS
# For reference: package versions used with the book's code (a quick version-check sketch follows this list)
numpy==1.14.2
atari-py==0.1.1
gym==0.10.4
ptan==0.3
opencv-python==3.4.0.12
scipy==1.0.1
torch==0.4.0
torchvision==0.2.1
tensorboard-pytorch==0.7.1
tensorflow==1.7.0
tensorboard==1.7.0
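
To confirm that your setup matches the list above, one quick sanity check (a minimal sketch, not from the book) is to print the installed versions of a few key packages:

# print installed versions to compare against the list above
import numpy
import gym
import torch

print(numpy.__version__, gym.__version__, torch.__version__)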

2.3 OpenAI Gym API

  • Environment에 포함된 정보들
    • A set of actions that are allowed to be executed in an environment. Gym supports both discrete and continuous actions, as well as their combination
    • The shape and boundaries of the observations that an environment provides the agent with
    • A method called step to execute an action, which returns the current observation, reward, and indication that the episode is over
    • A method called reset to return the environment to its initial state and to obtain the first observation

2.3.1 Action Space

  • The action space can be discrete, continuous, or a combination of the two.
  • Generalizing the definitions given in the book:
    • Discrete: a fixed, finite set of mutually exclusive actions (for example, push left or push right).
    • Continuous: an action takes a real value within some range (for example, a steering-wheel angle).
  • In practice, several actions may need to be taken at the same time; Gym handles this with a special container class that lets you nest multiple action spaces into a single action (a minimal sketch follows).
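
As an illustration of such nesting (a hypothetical car-control sketch, not from the book's code), gym.spaces.Tuple can bundle a continuous steering angle with discrete button presses so that a single "action" carries all of them:

from gym import spaces

# hypothetical combined action: a steering angle in [-1, 1] plus two on/off buttons
action_space = spaces.Tuple((
    spaces.Box(low=-1.0, high=1.0, shape=(1,)),  # continuous steering
    spaces.Discrete(2),                          # e.g. brake pressed or not
    spaces.Discrete(2),                          # e.g. horn pressed or not
))
print(action_space.sample())  # one random nested action, e.g. (array([0.3], dtype=float32), 1, 0)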

2.3.2 Observation Space

💡 Observations : pieces of information that an environment provides the agent with at every timestep, in addition to the reward

  • Observations can be as simple as a bunch of numbers or as complex as several multidimensional tensors containing color images from several cameras
  • Actions and observations are similar enough that Gym represents both with the same Space classes, listed below

  • Abstract Class Space
    • sample() : returns a random sample from the space
    • contains(x) : checks if the argument x belongs to the space's domain
  • Class Discrete
    • represents a mutually-exclusive set of items, numbered from 0 to n−1
    • For example, when used as an action space, Discrete(n) means that n actions are available
  • Class Box
    • represents an n-dimensional tensor of rational numbers with intervals [low, high]
    • for example, passing shape=(1,) (a tuple of length 1 containing the single value 1) gives a one-dimensional tensor with a single value
  • Class Tuple
    • combine several Space class instances together
    • enables us to create action and observation spaces of any complexity that we want
  • There are other Space subclasses defined in Gym, but the preceding three are the most useful ones we'll deal with (a short sketch of them follows this list)
  • Every environment has two members of type Space, called action_space and observation_space. This allows you to create generic code, which could work with any environment.
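
A minimal sketch of these three classes and the shared sample()/contains() interface (the concrete numbers are only illustrative):

import numpy as np
from gym import spaces

d = spaces.Discrete(4)                         # items 0, 1, 2, 3
b = spaces.Box(low=0.0, high=1.0, shape=(3,))  # three floats, each in [0.0, 1.0]
t = spaces.Tuple((d, b))                       # both spaces combined

print(d.sample(), d.contains(2))               # e.g. 1 True
print(b.sample())                              # e.g. array([0.1, 0.7, 0.4], dtype=float32)
print(b.contains(np.array([0.5, 0.5, 0.5], dtype=np.float32)))  # True
print(t.sample())                              # e.g. (3, array([...], dtype=float32))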

2.3.3 The Environment

  • The environment is represented in Gym by the **Env** class, which has the following members:
    • action_space : This is the field of the Space class, providing a specification for allowed actions in the environment.
    • observation_space : This field has the same Space class, but specifies the observations provided by the environment.
    • reset() : This resets the environment to its initial state, returning the initial observation vector
      • you have to call reset after the creation of the environment
      • after the end of the episode, an agent needs to start over. The value returned by this method is the first observation of the environment
    • step() : This method allows the agent to give the action and returns the information about the outcome of the action: the next observation, local reward, and end-of-episode flag.
      • central piece in the environment's functionality
        • Telling the environment which action we'll execute on the next step
        • Getting the new observation from the environment after this action
        • Getting the reward the agent gained with this step
        • Getting the indication that the episode is over
    • render() : allows you to obtain the observation in a human-friendly form, but we won't use it here
  • The first item (the action) is passed as the only argument to this method, and the rest is returned by the function.
  • It's a (Python) tuple of four elements (observation, reward, done, and extra_info), unpacked in the short sketch after this list
    • observation : This is a NumPy vector or a matrix with observation data.
    • reward : This is the float value of the reward.
    • done : This is a Boolean indicator, which is True when the episode is over.
    • extra_info : This could be anything environment-specific with extra information about the environment. The usual practice is to ignore this value in general RL methods (not taking into account the specific details of the particular environment).
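
A minimal sketch of the reset/step loop built from these members, using the four-element tuple of the book's Gym version (as the CartPole session below shows, recent Gym releases return five values instead: observation, reward, terminated, truncated, info):

import gym

env = gym.make("CartPole-v0")        # any environment name works here
obs = env.reset()                    # first observation of the episode
done = False
while not done:
    action = env.action_space.sample()                 # a random action, just for illustration
    obs, reward, done, extra_info = env.step(action)   # the four-element tuple described above
    # extra_info is usually ignored by generic RL methods
env.close()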

2.3.4 Creation of the Environment

  • How do we create **Env** objects in the first place?
  • make(env_name) : creates an environment, where env_name is the environment's name in string form
  • Gym v0.9.3 contains 777 environments with different names, but they are not all unique; the same environment can appear several times because of
    • different versions
    • different settings
    • different observation spaces
  • Example: the Atari game Breakout (a creation sketch follows this list)
    • Breakout-v0, Breakout-v4 : The original breakout with a random initial position and direction of the ball
    • BreakoutDeterministic-v0, BreakoutDeterministic-v4 : Breakout with the same initial placement and speed vector of the ball
    • BreakoutNoFrameskip-v0, BreakoutNoFrameskip-v4 : Breakout with every frame displayed to the agent
    • Breakout-ram-v0, Breakout-ram-v4 : Breakout with observation of full Atari emulation memory (128 bytes) instead of screen pixels.
    • Breakout-ramDeterministic-v0, Breakout-ramDeterministic-v4
    • Breakout-ramNoFrameskip-v0, Breakout-ramNoFrameskip-v4
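
A minimal sketch of creating one of these variants and inspecting its spaces (this assumes the atari-py package from the requirements list is installed; the printed shapes are what Atari environments typically report):

import gym

env = gym.make("BreakoutNoFrameskip-v4")  # one of the variants listed above
print(env.action_space)                   # Discrete joystick actions
print(env.observation_space)              # Box of raw screen pixels
obs = env.reset()
print(obs.shape)                          # e.g. (210, 160, 3) for the Atari screen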

  • Even after the removal of such duplicates, Gym 0.9.3 comes with an impressive list of 116 unique environments, which can be divided into several groups:
    • Classic control problems: These are toy tasks that are used in optimal control theory and RL papers as benchmarks or demonstrations. They are usually simple, with a low-dimension observation and action spaces, but they are useful as quick checks when implementing algorithms. Think about them as the "MNIST for RL" (in case you haven't heard about MNIST, it is a handwriting digit recognition dataset from Yann LeCun).
    • Atari 2600: These are games from the classic game platform from the 1970s. There are 63 unique games.
    • Algorithmic: These are problems that aim to perform small computation tasks, such as copying the observed sequence or adding numbers.
    • Board games: These are the games of Go and Hex.
    • Box2D: These are environments that use the Box2D physics simulator to learn walking or car control.
    • MuJoCo: This is another physics simulator used for several continuous control problems.
    • Parameter tuning: This is RL being used to optimize neural network parameters.
    • Toy text: These are simple grid-world text environments.
    • PyGame: These are several environments implemented using the PyGame engine.
    • Doom: These are nine mini-games implemented on top of ViZdoom.
  • An even larger set of environments is available in the OpenAI Universe, which provides general connectors to virtual machines, while running Flash and native games, web browsers, and other real-world applications.
  • OpenAI Universe extends the Gym API, but follows the same design principles and paradigm. You can check it out at https://github.com/openai/universe.

The CartPole session

  • An example where the agent learns to move the cart left and right so the pole does not fall over
  • The pole in CartPole can swing through a full 360 degrees, and unless it is balanced by moving the cart left or right, gravity makes it fall → the point is to keep the pole upright for as long as possible through reward-based learning (reinforcement learning)

The CartPole observation involves quantities such as the pole's mass, acceleration, and angle, so turning them directly into a vector for choosing actions would require mathematical and physical knowledge. → Instead, a reward of 1 is given for every step the pole has not fallen, so reinforcement learning can be set up without that prior knowledge.

import gym
# create the environment
e = gym.make('CartPole-v0') # with the current Gym, use 'CartPole-v1'
# also, gym.make('CartPole-v0') becomes gym.make('CartPole-v1', render_mode="rgb_array"):
# the render_mode argument must be added
# get the initial observation of the CartPole environment
obs = e.reset()
obs
>>(array([0.04792098, 0.02994441, 0.03173691, 0.04775174], dtype=float32), {})
# with the current Gym, reset() returns (obs, info), so look at obs[0] rather than obs

obs[0]
>> array([0.04792098, 0.02994441, 0.03173691, 0.04775174], dtype=float32)

→ action_space : choosing left or right, so Discrete with n = 2

→ observation_space : the four-value state (cart position, cart velocity, pole angle, pole angular velocity) observed after each left/right choice

# action 0: the state change when the cart is pushed to the left
# obs, reward, terminated, truncated, info in the current Gym
# (originally this was obs, reward, done, info; one more value has been added)
e.step(0)
>>(array([-0.03162944, -0.14868328,  0.03413786,  0.2544517 ], dtype=float32),
 1.0,
 False,
 False,
 {})

Setting up the CartPole environment

import gym

if __name__ == "__main__":
	env = gym.make('CartPole-v1', render_mode="rgb_array")
	total_reward = 0.0
	total_steps = 0
	obs = env.reset()[0] # reset() now returns (obs, {}), so take [0] to get the observation array

Counting steps and accumulating the reward for each sampled action

while True:
	action = env.action_space.sample()
# step() now also returns the extra truncated flag and an info dict, so unpack five values
	obs, reward, done, _ ,_ = env.step(action)
	total_reward += reward
	total_steps += 1
	if done:
		break
	print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

>> Episode done in 1 steps, total reward 1.00
Episode done in 2 steps, total reward 2.00
Episode done in 3 steps, total reward 3.00
Episode done in 4 steps, total reward 4.00
Episode done in 5 steps, total reward 5.00
Episode done in 6 steps, total reward 6.00...

→ The agent moves left or right at random, observes the resulting state (obs), and receives the reward for that action and the end-of-episode flag (whether to stop moving left and right).

→ When the episode ends (i.e. done = True), the loop stops, printing the accumulated reward per step (i.e. how long the pole was kept upright while moving left and right).

 

Results:

  • With our random-action setting, the pole survives about 12-15 steps on average before falling.
  • However, most Gym environments define a "reward boundary": the average reward over 100 consecutive episodes needed to consider the environment solved. For CartPole it is 195, so iterative learning can lead to far better performance than random actions (a sketch of how to query this boundary follows).
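
The boundary is exposed through the environment's spec; a minimal sketch of reading it (note that the value depends on the environment id: CartPole-v1 uses 475 rather than the 195 of CartPole-v0):

import gym

env = gym.make('CartPole-v1', render_mode="rgb_array")
# reward_threshold is the average reward at which the environment counts as solved
print(env.spec.reward_threshold)   # 475.0 for CartPole-v1 (195.0 for CartPole-v0)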

The extra Gym functionality - wrappers and monitors

Extra API features that are not strictly required, but are good to know!

Wrappers : a way to add extra logic around an existing environment, for example when a single frame does not carry enough information

[1] Accumulating the current observations in a buffer and handing the last N observations to the agent → used when one frame of a computer game does not convey the full game state (for example, a single Atari frame does not show the ball's velocity in Pong)

[2] When a condensed or preprocessed value is needed rather than the raw details → cropping or preprocessing images, or normalizing the reward (reward normalization/clipping is typically used in Atari training, where raw scores differ wildly between games)

 

Scenario:

During training, epsilon-sized randomness is added to the actions so the policy is chosen through an exploration/exploitation trade-off.

For example, instead of always taking the action that maximizes the reward, the agent acts randomly 10% of the time and learns from the resulting state and reward.

→ In practice, adding this randomness yields better performance than pure greedy RL that always maximizes the per-step reward.

(* Monitor : a tool that records the RL agent's behavior so you can watch it; a short sketch is at the end of this post)

import gym
import random
# a wrapper that replaces the agent's action with a random one with probability epsilon
class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env, epsilon = 0.1):
        super(RandomActionWrapper, self).__init__(env)
        self.epsilon = epsilon

    def action(self, action):
        if random.random() < self.epsilon:
            print("Random!")
            return self.env.action_space.sample()
        return action

# apply the wrapper to the CartPole example
if __name__ == '__main__':
    env = RandomActionWrapper(gym.make('CartPole-v1', render_mode="rgb_array"))
env
>><RandomActionWrapper<TimeLimit<OrderEnforcing<PassiveEnvChecker<CartPoleEnv<CartPole-v1>>>>>>

# for reference: the unwrapped environment for comparison
env = gym.make('CartPole-v1',render_mode = 'rgb_array')
>> <TimeLimit<OrderEnforcing<PassiveEnvChecker<CartPoleEnv<CartPole-v1>>>>>
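
As for the Monitor mentioned above, a minimal sketch of recording episodes (in recent Gym releases the old gym.wrappers.Monitor was replaced by gym.wrappers.RecordVideo, so treat the wrapper name as version-dependent; the "recording" folder name is just an example):

import gym

env = gym.make('CartPole-v1', render_mode="rgb_array")
# wrap the environment so episodes are saved as videos into the "recording" directory
env = gym.wrappers.RecordVideo(env, "recording")  # older Gym: gym.wrappers.Monitor(env, "recording")

obs = env.reset()[0]
while True:
    obs, reward, done, _, _ = env.step(env.action_space.sample())
    if done:
        break
env.close()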