Reading log: Reinforcement Learning by Python

Table of Contents
Introduction
My thoughts
Memo
- Day 1
- Day 2
- Day 3
- Day 4
- Day 5
- Day 6
- Day 7

Introduction

I read the following book to study the fundamentals of reinforcement learning.

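Machine Learning Startup Series: Pythonで学ぶ強化学習 入門から実践まで (Reinforcement Learning with Python: From Introduction to Practice) (KS Information Science Specialized Books)

Author: 久保隆宏 (Takahiro Kubo)
Publisher: Kodansha
Release date: 2019/02/22
Format: Kindle edition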

This article covers my thoughts on the book and my notes on its contents.

My thoughts

The target reader of this book is an engineer who has developed software in some programming language. It is a good fit for anyone who wants to use reinforcement learning techniques in a service or an application. The book contains a lot of sample code written in Python, but it does not explain the basic grammar of the language, so some Python programming experience is required to follow it.
The book covers the basic algorithms of reinforcement learning, how to solve problems with them, their advantages and disadvantages, and their applications. Reading it teaches a wide range of topics in reinforcement learning, but I think it is difficult to understand all of the contents in a single read.

Memo

Day 1

The "Environment" is a space in which "Actions" and "States" are defined. In this space, a "Reward" is given for reaching a state.
The period from the start of the environment to its end is called "1 episode". The purpose of reinforcement learning is to maximize the total reward obtained during one episode.
A reinforcement learning model learns two things: how to evaluate an action, and how to select the next action (the policy).
The sum of rewards cannot be calculated until the episode is over, so it has to be estimated. Because estimates of future rewards are not accurate, they are discounted with a "discount factor".
The estimated sum of rewards is called the "Expected reward" or "Value".
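
As a minimal sketch of the discounted sum of rewards, the function below adds up the rewards of one episode while weighting later rewards less. The function name, the default discount factor of 0.99, and the sample rewards are my own assumptions for illustration, not code from the book.

```python
def discounted_return(rewards, gamma=0.99):
    """Return the sum of gamma**t * rewards[t] over one episode."""
    total = 0.0
    # Accumulate backwards so that each step computes G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: the only reward arrives at the third step.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 0.25
```

With a discount factor of 0.5, the reward of 1.0 received two steps in the future counts as only 0.25 now.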

Day 2

In dynamic programming, the calculation that involves future immediate rewards can be done efficiently by using a cache of already computed values. This is called "Memoization".
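
The following is a rough sketch of memoization, assuming a hypothetical corridor with states 0 to 3 where entering state 3 gives a reward of 1. The names `value`, `GAMMA`, and `GOAL`, and the step limit, are assumptions of mine. `functools.lru_cache` from the standard library stores each computed value so that it is not recalculated.

```python
from functools import lru_cache

GAMMA = 0.9  # discount factor (assumed value)
GOAL = 3     # hypothetical corridor with states 0..3; state 3 is the goal


@lru_cache(maxsize=None)
def value(state, steps_left):
    """Best discounted reward obtainable from `state` within `steps_left` moves."""
    if state == GOAL or steps_left == 0:
        return 0.0
    candidates = []
    for move in (-1, +1):  # move left or right, staying inside the corridor
        next_state = min(max(state + move, 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0
        # The recursive call is answered from the cache when it was computed before.
        candidates.append(reward + GAMMA * value(next_state, steps_left - 1))
    return max(candidates)


print(value(0, 10))
print(value.cache_info())  # shows how many calls were served from the cache
```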
There are two approaches: "Policy based" and "Value based".
The value is calculated based on the policy, and the policy is updated so as to maximize the value.
Action selection in value iteration is not probabilistic because the action that maximizes the value is always selected.
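
To make the value-based idea concrete, here is a small sketch of value iteration. The environment is a hypothetical 1-D corridor that I made up for illustration (the last state is the goal and entering it gives a reward of 1), and the function name and parameters are also assumptions. Note that the resulting policy is deterministic: it always takes the action with the highest value.

```python
def value_iteration(n_states=4, gamma=0.9, threshold=1e-6):
    """Value iteration on a hypothetical corridor; the last state is the goal."""
    actions = (-1, +1)  # move left, move right
    goal = n_states - 1

    def step(state, action):
        next_state = min(max(state + action, 0), goal)
        reward = 1.0 if next_state == goal else 0.0
        return next_state, reward

    def action_value(state, action, values):
        next_state, reward = step(state, action)
        return reward + gamma * values[next_state]

    values = [0.0] * n_states
    while True:
        delta = 0.0
        for state in range(goal):  # the goal is terminal, so its value stays 0
            best = max(action_value(state, a, values) for a in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < threshold:
            break

    # Greedy policy: always pick the action that maximizes the value.
    policy = [
        max(actions, key=lambda a: action_value(s, a, values))
        for s in range(goal)
    ]
    return values, policy


values, policy = value_iteration()
print(values)
print(policy)  # +1 (move right) in every non-terminal state
```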

Day 3

In a "Model Free method", the agent acts by itself and accumulates experience. The agent then learns from the accumulated experience.
Should the agent act to investigate the environment, or act to obtain a reward? This is called the "Exploration-exploitation trade-off".
The value of an action in a state is called the "Q value". A method for learning the Q value is called "Q-learning".
In Q-learning, the action that transitions to the state with the maximum value is selected.
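
The notes above can be sketched as tabular Q-learning with an epsilon-greedy rule for the exploration-exploitation trade-off. This is only an illustration under my own assumptions: the corridor environment, the function name, and all hyperparameters (learning rate, discount factor, epsilon, number of episodes) are hypothetical and not taken from the book.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor; the last state is the goal."""
    rng = random.Random(seed)
    actions = (-1, +1)  # index 0: move left, index 1: move right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]

    def greedy(state):
        best = max(q[state])
        # Break ties at random so the untrained agent does not get stuck.
        return rng.choice([i for i, v in enumerate(q[state]) if v == best])

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Exploration with probability epsilon, exploitation otherwise.
            if rng.random() < epsilon:
                a = rng.randrange(len(actions))
            else:
                a = greedy(state)
            next_state = min(max(state + actions[a], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # The target uses the maximum Q value of the next state.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = q_learning()
for state, (left, right) in enumerate(q_table):
    print(state, round(left, 3), round(right, 3))
```

After training, the Q value of moving right should be higher than that of moving left in every non-terminal state of this corridor.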

Day 4

An advantage of using a neural network is that the agent can learn from data that is close to the "state" a human actually observes.
To stabilize learning, the "Experience Replay" method is used. By temporarily pooling the action history and sampling from it, data from different time steps and various situations can be used as training data.
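
A minimal sketch of the pooling and sampling idea is shown below. The class name `ReplayBuffer`, its methods, and the tuple layout of a transition are assumptions of mine for illustration; the book's own implementation may differ.

```python
import random
from collections import deque


class ReplayBuffer:
    """Pool recent transitions and sample random mini-batches from them."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition when it is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling mixes transitions from different time steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):  # hypothetical transitions; only the last 3 are kept
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
print(len(buffer))
print(buffer.sample(2))
```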

Day 5

Deep reinforcement learning needs a lot of samples for training, and it is difficult to prepare that many samples.
"Local optimum" means that the agent settles on actions that earn some reward but are not optimal.
Repeating training runs is not efficient, so it is important not to waste any training run.
We should record as much data as possible during training, as follows.
- Average, maximum and minimum value of the reward.
- Length of the episode.
- Value of the objective function and output of the network.
- Entropy of the action distribution output by the policy.
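
Two of the items above can be computed with a few lines of standard-library Python. The function names and the sample numbers below are hypothetical, chosen only to illustrate the kind of summary worth logging.

```python
import math
from statistics import mean


def summarize_rewards(episode_rewards):
    """Average, maximum and minimum of the total reward per episode."""
    return {
        "mean": mean(episode_rewards),
        "max": max(episode_rewards),
        "min": min(episode_rewards),
    }


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of an action distribution."""
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in action_probs if p > 0)


print(summarize_rewards([1.0, 2.0, 3.0]))
print(policy_entropy([0.5, 0.5]))    # high entropy: the policy is still exploring
print(policy_entropy([0.99, 0.01]))  # low entropy: the policy is nearly deterministic
```

An entropy that drops toward zero early in training can be a sign that the policy stopped exploring too soon.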

Day 6

The data is affected not only by the environment but also by the actions of the agent.
By improving "environment perception", it becomes easier for the agent to learn information from the environment.
By improving "exploring action", it becomes easier for the agent to obtain samples that advance the learning.
Active learning is a method of labeling that selects the data most effective for learning.
Transfer learning is a method of reusing a trained model for another task.

Day 7

There are two ways to apply reinforcement learning: "Optimization of Learning" and "Optimization of Action".
"Optimization of Action" uses the reward-earning actions acquired through reinforcement learning.
"Optimization of Learning" uses the learning process of reinforcement learning itself, that is, "optimizing based on rewards".

EurekaMoments

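A blog for notes on what I have studied about knowledge and technology for the autonomous navigation of robots and cars, programming, and software development.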