[Deep Reinforcement Learning Hands On] Chapter.04
dlrpskdi 2023. 8. 4. 09:39
- This Deep Reinforcement Learning summary is material I originally organized in Notion while studying as a new member of 파알 (...a whole year ago), now reviewed and moved over to Tistory. The structure and content are therefore optimized for the study group's Notion.
- For some reason I didn't study Chapter 3 back during the study group activities, and I'm not quite sure why.
- The Cross-Entropy Method
- Much less famous than other tools such as DQN or Advantage Actor-Critic, but it has strengths of its own
- Simplicity: the cross-entropy method is very simple → an intuitive method
- Good convergence
- No need to learn and discover complex, multistep policies
- In simple environments with short episodes and frequent rewards, cross-entropy usually works very well
- 4-1. Taxonomy of RL methods
- model-free
- The method doesn't build a model of the environment or reward; the agent just takes current observations, does some computations on them, and the result is the action it should take.
- Model-free methods are usually easier to train, as it's hard to build good models of complex environments with rich observations. All of the methods described in this book are from the model-free category, as those methods have been the most active area of research for the past few years.
- model-based methods
- try to predict what the next observation and/or reward will be.
- Pure model-based methods are used in deterministic environments, such as board games with strict rules.
- policy-based methods
- A policy is usually represented as a probability distribution over the available actions. In contrast, the method could be value-based: in that case, instead of the probability of actions, the agent calculates the value of every possible action and chooses the action with the best value.
- on-policy versus off-policy
- For now, it will be enough to explain off-policy as the ability of the method to learn on old historical data.
- So, our cross-entropy method is model-free, policy-based, and on-policy, which means the following
- It doesn't build any model of the environment; it just says to the agent what to do at every step
- It approximates the policy of the agent
- It requires fresh data obtained from the environment
- The cross-entropy method description is split into two unequal parts
- practical → intuitive in its nature
- theoretical → the explanation of why the cross-entropy method works, and what's happening there, is more sophisticated
- In practice
- We follow a common ML approach and replace all of the complications of the agent with some kind of nonlinear trainable function, which maps the agent's input (the observation) to some output. The output depends on the particular method or family of methods (such as value-based versus policy-based methods). As our cross-entropy method is policy-based, our nonlinear function (neural network) produces the policy, which basically says, for every observation, which action the agent should take.
- During the agent's lifetime
- Its experience is presented as episodes. Every episode is a sequence of observations that the agent got from the environment, actions it has issued, and rewards for these actions. → For every episode, we can calculate the total reward that the agent has claimed.
- example
- Let's assume a discount factor of gamma = 1, which means the total reward is just the sum of all local rewards for every episode. This total reward shows how good the episode was for the agent. (The book shows a diagram of four sample episodes, each with its observations, actions, and rewards.)
- Every cell represents the agent's step in the episode. Due to randomness in the environment and the way that the agent selects actions to take, some episodes will be better than others. The core of the cross-entropy method is to throw away bad episodes and train on the better ones. So, the steps of the method are as follows (see the sketch after this list):
- Play N number of episodes using our current model and environment.
- Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as 50th or 70th.
- Throw away all episodes with a reward below the boundary.
- Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
- Repeat from step 1 until we become satisfied with the result.
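As a quick illustration of steps 2 and 3 (not the book's code; the four episode rewards below are made up for this example), here is how a percentile-based boundary selects the elite episodes:

import numpy as np

# Hypothetical total rewards of four played episodes (step 1)
episode_rewards = [5.0, 12.0, 30.0, 47.0]

# Step 2: the reward boundary is a percentile of all rewards (70th here)
reward_bound = np.percentile(episode_rewards, 70)   # ≈ 31.7 for these rewards

# Step 3: keep only the "elite" episodes at or above the boundary
elite = [r for r in episode_rewards if r >= reward_bound]
print(reward_bound, elite)                          # ≈ 31.7 [47.0]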
4-3 Cross-entropy on CartPole
- Our model's core is a one-hidden-layer neural network with ReLU and 128 hidden neurons. Other hyperparameters are set almost randomly and aren't tuned, as the method is robust and converges very quickly.
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70
- We define constants at the top of the file and they include the count of neurons in the hidden layer, the count of episodes we play on every iteration (16), and the percentile of episodes' total rewards that we use for elite episode filtering.
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)
- It takes a single observation from the environment as an input vector and outputs a number for every action we can perform. The output from the network is a probability distribution over actions, so a straightforward way to proceed would be to include a softmax nonlinearity after the last layer. However, rather than adding softmax to the network, we'll use the PyTorch class nn.CrossEntropyLoss, which combines softmax and cross-entropy loss in a single, more numerically stable expression.
- CrossEntropyLoss requires raw, unnormalized values from the network (also called logits), and the downside of this is that we need to remember to apply softmax every time we need to get probabilities from our network's output.
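A minimal sketch of the two paths (the logits and the target action here are made-up values, not from the book):

import torch
import torch.nn as nn

logits = torch.tensor([[1.5, -0.3]])            # raw network output for one observation
probs = nn.Softmax(dim=1)(logits)               # explicit softmax when we need probabilities
# ≈ tensor([[0.8581, 0.1419]])

target = torch.tensor([0])                      # the action that was actually taken
loss = nn.CrossEntropyLoss()(logits, target)    # for training, the loss consumes the raw logits directly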
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
- Episode / EpisodeStep
- EpisodeStep: This will be used to represent one single step that our agent made in the episode, and it stores the observation from the environment and what action the agent completed. We'll use episode steps from elite episodes as training data.
- Episode: This is a single episode stored as total undiscounted reward and a collection of EpisodeStep.
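For instance, a single finished one-step episode could be assembled like this, using the namedtuples defined above (the observation values are made up just to show the structure):

step = EpisodeStep(observation=[0.01, 0.0, -0.02, 0.05], action=1)
episode = Episode(reward=1.0, steps=[step])
print(episode.reward, len(episode.steps))   # 1.0 1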
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
- The preceding function accepts the environment, our neural network, and the count of episodes it should generate on every iteration. The batch variable will be used to accumulate our batch. We also declare a reward counter for the current episode and its list of steps. Then we reset our environment to obtain the first observation and create a softmax layer, which will be used to convert the network's output to a probability distribution of actions.
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
- The network's output is a probability distribution over actions, so we obtain the actual action for the current step by sampling that distribution with NumPy's np.random.choice() function. After this, we pass the action to the environment to get our next observation, our reward, and the indication of the episode ending.
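As a standalone aside (illustrative probabilities, not the real network output), the sampling step works like this:

import numpy as np

act_probs = np.array([0.7, 0.3])                        # probabilities for actions 0 and 1
action = np.random.choice(len(act_probs), p=act_probs)  # returns 0 about 70% of the time, 1 otherwise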
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
- We append the finalized episode to the batch, saving the total reward and steps we've taken. Then we reset our total reward accumulator and clean the list of steps.
- In case our batch has reached the desired count of episodes, we return it to the caller for processing, using yield.
- After processing, we will clean up the batch
        obs = next_obs
- Everything repeats infinitely
- We pass the observation to the net, sample the action to perform, ask the environment to process the action, and remember the result of this processing
- One very important fact to understand
- The training of our network and the generation of our episodes are performed at the same time. They are not completely in parallel, but every time our loop accumulates enough episodes (16), it passes control to the caller of this function, which is supposed to train the network using gradient descent. So, after yield returns, the network will have different, slightly better (we hope) behavior.
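A minimal, self-contained sketch of this generator-plus-trainer pattern (batch_generator and train_on are made-up names for illustration; the real code uses iterate_batches and the PyTorch training loop below):

def batch_generator(batch_size):
    batch = []
    episode = 0
    while True:
        episode += 1                 # pretend we just played one more episode
        batch.append(episode)
        if len(batch) == batch_size:
            yield batch              # hand control back to the caller...
            batch = []               # ...and resume from here on the next iteration

def train_on(batch):
    print("training on", batch)      # stand-in for the gradient descent step

for i, batch in enumerate(batch_generator(4)):
    train_on(batch)                  # the generator is paused while we train
    if i == 2:
        break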
- The filter_batch function
- This function is at the core of the cross-entropy method: from the given batch of episodes and a percentile value, it calculates the boundary reward, which is used to filter the "elite" episodes to train on. To obtain the boundary reward, we use NumPy's percentile function, which, from a list of values and the desired percentile, calculates the percentile's value. Then we calculate the mean reward, which is used only for monitoring.
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))
    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))
- Filtering the episodes: we convert observations and actions from the elite episodes into tensors and return a tuple of four: observations, actions, the boundary of reward, and the mean reward. The last two values will be used only to write them into TensorBoard to check the performance of our agent.
    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean
- The main block glues everything together and mostly consists of the training loop
if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter()
In the beginning, we will create all the required objects: the environment, our neural network, the objective function, the optimizer, and the summary writer for TensorBoard.
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
- In the training loop, we will iterate our batches, then we perform filtering of the elite episodes using the filter_batch function. The result is variables of observations and taken actions, the reward boundary used for filtering and the mean reward. After that, we zero gradients of our network and pass observations to the network, obtaining its action scores.
- These scores are passed to the objective function, which calculates cross-entropy between the network output and the actions that the agent took. The idea of this is to reinforce our network to carry out those "elite" actions which have led to good rewards. Then, we will calculate gradients on the loss and ask the optimizer to adjust our network.
The rest of the loop is mostly the monitoring of progress
print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
iter_no, loss_v.item(), reward_m, reward_b))
writer.add_scalar("loss", loss_v.item(), iter_no)
writer.add_scalar("reward_bound", reward_b, iter_no)
writer.add_scalar("reward_mean", reward_m, iter_no)
The last check in the loop is the comparison of the mean rewards of our batch episodes. When this becomes greater than 199, we stop our training.
- Why 199?
- Our method converges so quickly that 100 episodes are usually what we need.
- The properly trained agent can balance the stick infinitely long, but the length of the episode in CartPole is limited to 200 steps.
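If you want to confirm this limit yourself, the registered environment spec exposes it (checked against classic Gym; note that in newer Gym/Gymnasium releases CartPole-v1 uses a limit of 500 instead):

import gym

env = gym.make("CartPole-v0")
print(env.spec.max_episode_steps)   # 200 for CartPole-v0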
        if reward_m > 199:
            print("Solved!")
            break
    writer.close()
That’s it. So let’s start our first RL training:
It usually doesn't take the agent more than 50 batches to solve the environment (usually 25~45 episodes). TensorBoard shows our agent consistently making progress, pushing the upper boundary at almost every batch.
To check our agent in action, you can enable Monitor by uncommenting the next line after the environment creation. After restarting (possibly with xvfb-run to provide a virtual X11 display), our program will create a mon directory with videos recorded at different training steps:
As you can see from the output, it periodically records the agent's activity into separate video files.
Cross-entropy on FrozenLake
- The FrozenLake environment
- Its world is from the "grid world" category
- The agent lives in a grid of size 4x4
- It can move in four directions: up, down, left, and right
- It always starts at the top-left position and tries to reach the bottom-right cell of the grid
- Holes: if the agent gets into one of them, the episode ends and the reward is 0
- If the agent reaches the destination cell, the reward is 1.0 and the episode ends
- The world is slippery (like a frozen lake): the agent's actions do not always turn out as expected
- There is a 33% chance that it will slip to the right or to the left → this makes progress difficult
- Let's look at how this environment is represented in Gym:
- e.observation_space: discrete, which means it's a number from 0 to 15 inclusive
- This number is our current position in the grid
- e.action_space: also discrete, with values from 0 to 3
e = gym.make("FrozenLake-v0")
e.observation_space
e.action_space
e.reset()
e.render()
- But our network from the CartPole example expects a vector of numbers, so we need to apply “one-hot encoding” of discrete inputs. To minimize changes in our code, we can use the ObservationWrapper class from Gym and implement our DiscreteOneHotWrapper class:
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(
            0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res
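A quick sanity check of the wrapper might look like this (a sketch reusing the class above; the printed values assume the agent starts in cell 0, as FrozenLake always does):

import gym
import numpy as np

env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
obs = env.reset()
print(obs.shape)   # (16,) — one float per grid cell
print(obs)         # 1.0 at the starting cell (index 0), 0.0 everywhere else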
With that wrapper applied to the environment, both the observation space and action space are 100% compatible with our CartPole solution. However, by launching it, we can see that this doesn’t improve the score over time.
Why?
- First, we need to look deeper at the reward structure of both environments.
- In CartPole, the longer our agent balanced the pole, the more reward it obtained
- Different episodes were of different lengths (due to randomness in our agent's behavior), which gave us a pretty normal distribution of the episodes' rewards
- After choosing a reward boundary, we rejected less successful episodes and learned how to repeat the better ones (by training on data from the successful episodes)
- But, in the FrozenLake environment:
- Episodes and their reward look different
- We get the reward of 1.0 only when we reach the goal, and this reward says nothing about how good each episode was
- The distribution of rewards for our episodes is also problematic:
- There are only two kinds of episodes possible (with a reward of 0 or a reward of 1)
- Failed episodes will obviously dominate at the beginning of the training
- ⇒ Our percentile selection of "elite" episodes is totally wrong
- This example shows us the limitations of the cross-entropy method:
- For training, our episodes have to be finite and preferably short
- The total reward for the episodes should have enough variability to separate good episodes from bad ones
- There is no intermediate indication about whether the agent has succeeded or failed
- A list of tweaks of the code needed for the successful FrozenLake model using cross-entropy:
- Larger batches of played episodes
- FrozenLake requires at least 100 just to get some successful episodes
- A discount factor (0.9 or 0.95) applied to the reward
- To make the total reward for the episode depend on episode length, and to add variety among episodes (see the small example after this list)
- Keeping “elite” episodes for a longer time
- As successful episodes are much rarer in FrozenLake, we need to keep them for several iterations to train on them
- A decreased learning rate
- Much longer training time
- 50% successful episodes require 5k training iterations
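A tiny illustration of that discounting tweak (the step counts are made up; the real computation is the GAMMA ** len(s.steps) expression in the filter_batch code below):

GAMMA = 0.9

# Two hypothetical successful FrozenLake episodes, each with a final reward of 1.0
short_episode_steps = 8
long_episode_steps = 24

print(1.0 * GAMMA ** short_episode_steps)   # ~0.430: shorter path, higher discounted reward
print(1.0 * GAMMA ** long_episode_steps)    # ~0.080: longer path, lower discounted reward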
def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)
    return elite_batch, train_obs, train_act, reward_bound
Then, in the training loop, we will store previous “elite” episodes to pass them to the preceding function on the next training iteration.
full_batch = []
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
    if not full_batch:
        continue
    obs_v = torch.FloatTensor(obs)
    acts_v = torch.LongTensor(acts)
    full_batch = full_batch[-500:]
The rest of the code is the same, except that the learning rate was decreased by a factor of 10 and BATCH_SIZE was set to 100.
After waiting for a while (about 1.5 hours), we can see that the model stops improving at around 55% of solved episodes. There are ways to address this (for example, applying entropy loss regularization).
- The effect of “slipperiness” in the FrozenLake environment
- We can see how the "slipperiness" makes progress difficult in the FrozenLake environment by creating a nonslippery version:
env = gym.envs.toy_text.frozen_lake.FrozenLakeEnv(is_slippery=False)
env = gym.wrappers.TimeLimit(env, max_episode_steps=100)
env = DiscreteOneHotWrapper(env)
The effect is dramatic: only 120-140 batch iterations are required to solve the nonslippery version, which is much faster than in the noisy environment.
Theoretical background of the cross-entropy method
(This section is optional; you can also refer to the original paper on the cross-entropy method.)
- The basis of the cross-entropy method lies in the importance sampling theorem, which states this: $\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x) \, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} H(x)\right]$
- H(x) is a reward value obtained by some policy x, and p(x) is a distribution of all possible policies.
- We want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by the Kullback-Leibler (KL) divergence, which is as follows: $KL(p_1(x) \| p_2(x)) = \mathbb{E}_{x \sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right] = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$
- Combining both formulas, we can get an iterative algorithm, which starts with $q_0(x) = p(x)$ and on every step improves the approximation of $p(x)H(x)$: $q_{i+1}(x) = \underset{q_{i+1}(x)}{\operatorname{argmin}} \, -\mathbb{E}_{x \sim q_i(x)}\left[\frac{p(x) H(x)}{q_i(x)} \log q_{i+1}(x)\right]$
This is a generic cross-entropy method, which can be significantly simplified in our RL case. Firstly, we replace our H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 if the reward is below it. Our policy update will then look like this: $\pi_{i+1}(a|s) = \underset{\pi_{i+1}}{\operatorname{argmin}} \, -\mathbb{E}_{z \sim \pi_i(a|s)}\left[\mathbb{1}[R(z) \ge \psi_i] \log \pi_{i+1}(a|s)\right]$
>> The method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log-likelihood of the most successful samples under our policy.
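To connect this back to the code: minimizing CrossEntropyLoss on the elite actions is exactly minimizing the negative log-likelihood of those actions under the current policy. A tiny numerical check (illustrative values only):

import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5]])   # network output for one elite observation
action = torch.tensor([0])            # the elite action that was actually taken

ce = nn.CrossEntropyLoss()(logits, action)
nll = -torch.log_softmax(logits, dim=1)[0, action]
print(ce.item(), nll.item())          # both print the same value (~0.2014)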