Fundamentals of Classification by Supervised Learning ~1 dimensional input~

Table of Contents
Introduction
Author
GitHub
Input sample data: Weight of insect
Target sample data: Sex of insect
Sample data generation
Policy to solve problem
Classification with probability
Maximum likelihood estimation
Logistic Regression
Cross entropy error
Calculating parameter by Gradient method
Summary of classification sequence

Introduction

This is my study log on machine learning, specifically supervised classification. I referred to the following book.

Pythonで動かして学ぶ！あたらしい機械学習の教科書第2版 (AI & TECHNOLOGY)

Author: 伊藤真 (Makoto Ito)
Publisher: 翔泳社 (Shoeisha)
Release date: 2019/07/18
Format: Book (softcover)

I extracted some important points and related sample Python code and wrote them up as notes. In particular, this article focuses on two-class classification with one-dimensional input.

Author

researchmap.jp

GitHub

Sample code and other related files are available in the following GitHub repository.
github.com

Input sample data: Weight of insect

f:id:sy4310:20190810174930p:plain

Target sample data: Sex of insect

f:id:sy4310:20190810175050p:plain

Sample data generation

The number of samples $N$ is 50. The maximum value of $x_n$ is 2.5 and the minimum value is 0. The target label $t$ is either male (1) or female (0). The values of $X$ and $t$ are printed as follows.
f:id:sy4310:20190810230602p:plain
The following figure is a scatter plot of the sample data. The gray points are male and the blue points are female.
f:id:sy4310:20190810230510p:plain
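As a rough illustration of how data like this could be generated, here is a minimal sketch. It is not the code from the book or the repository: the function name `generate_insect_data`, the seed, and the per-class weight ranges (female 0.4 to 1.2 g, male 0.8 to 2.4 g) are my assumptions, chosen only so that the two classes overlap between 0.8 g and 1.2 g.

```python
import numpy as np


def generate_insect_data(n=50, x_min=0.0, x_max=2.5, seed=0):
    """Return (x, t): weights in grams and labels (1 = male, 0 = female).

    The class ranges below are illustrative assumptions, not the
    values used in the original sample code.
    """
    rng = np.random.default_rng(seed)
    # Each insect is male or female with equal probability (assumption).
    t = (rng.random(n) < 0.5).astype(int)
    # Uniform weight within a class-dependent range (assumption):
    # female: 0.4 <= x < 1.2, male: 0.8 <= x < 2.4
    start = np.where(t == 1, 0.8, 0.4)
    width = np.where(t == 1, 1.6, 0.8)
    x = start + width * rng.random(n)
    # Keep every weight inside the stated range [x_min, x_max].
    x = np.clip(x, x_min, x_max)
    return x, t


x, t = generate_insect_data()
print(np.round(x[:5], 2), t[:5])
```

Uniform ranges are used here only because they make the overlapping region easy to see; any distribution that produces heavier males would serve the same purpose.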

Policy to solve problem

The approach to solving this classification problem is to determine a boundary between male and female. This boundary is called the "decision boundary".

Classification with probability

In the scatter plot above, when the weight is between 0.8 g and 1.2 g, the sex cannot be determined with certainty and can only be predicted as a probability. For example, "the probability that the sex is male is 1/3". This probability depends on the weight $x$. The probability that the sex is male given $x$ is called a "conditional probability" and is written as follows.
f:id:sy4310:20190812153124p:plain
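To make the idea concrete, the conditional probability in the ambiguous region can be estimated by simply counting. The sketch below uses a small hand-made data set (my assumption, not the article's data) that was built so the answer matches the "1/3" example.

```python
import numpy as np

# Hypothetical weights (g) and labels (1 = male, 0 = female).
x = np.array([0.45, 0.60, 0.85, 0.90, 1.00, 1.05, 1.10, 1.15, 1.40, 1.90])
t = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1])

# Select the insects whose weight falls in the ambiguous range.
in_overlap = (x >= 0.8) & (x <= 1.2)

# Fraction of males among them: an estimate of P(t=1 | 0.8 <= x <= 1.2).
p_male = t[in_overlap].mean()
print(p_male)  # 2 males out of 6 insects
```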

Maximum likelihood estimation

For example, suppose the target label is $t=0$ for the first three observations and $t=1$ for the fourth. Based on this information, the following simple model is defined.
f:id:sy4310:20190812161423p:plain
The question is then: what is the probability that the target labels $t = 0, 0, 0, 1$ are generated by the above model? This probability is called the "likelihood". Maximum likelihood estimation finds the probability $w$ at which the likelihood is highest. The likelihood is expressed as follows.
f:id:sy4310:20190812163710p:plain
The following plot shows the calculated likelihood. The maximum likelihood is about 0.1055, and it is reached when the probability $w$ that the sex is male is 0.25.
f:id:sy4310:20190812173113p:plain
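The numbers above can be reproduced with a simple grid search. This sketch assumes the model is $P(t=1|x)=w$, so that the likelihood of observing $t = 0, 0, 0, 1$ is $(1-w)^3 w$; the function name `likelihood` is mine.

```python
import numpy as np


def likelihood(w):
    # Probability of observing t = 0, 0, 0, 1 when P(t=1) = w.
    return (1.0 - w) ** 3 * w


# Evaluate the likelihood on a fine grid of candidate values for w.
w_grid = np.linspace(0.0, 1.0, 1001)
values = likelihood(w_grid)

# Pick the candidate with the highest likelihood.
best = np.argmax(values)
w_best, l_best = w_grid[best], values[best]
print(f"w = {w_best:.2f}, likelihood = {l_best:.4f}")  # w = 0.25, likelihood = 0.1055
```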
The male probability $w$ at which the likelihood is maximized can be calculated as follows.
f:id:sy4310:20190812180446p:plain
Taking the logarithm of the likelihood makes the calculation easier. The result is called the "log likelihood", and it serves as the objective function for probabilistic classification in place of the mean squared error. In this case, the parameter that maximizes the log likelihood needs to be found.
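As a worked example, here is the log likelihood for the four labels $t = 0, 0, 0, 1$, again assuming the simple model $P(t=1|x)=w$. The notation is my own reconstruction and may differ from the figures.

$$
\log P(T = 0,0,0,1 \mid x) = \log\left\{(1-w)^3 w\right\} = 3\log(1-w) + \log w
$$

Setting the derivative with respect to $w$ to zero gives the maximum:

$$
\frac{\partial}{\partial w}\log P = -\frac{3}{1-w} + \frac{1}{w} = 0
\quad\Rightarrow\quad w = \frac{1}{4}
$$

Substituting back, the likelihood at this point is $(3/4)^3 \cdot (1/4) = 27/256 \approx 0.1055$, which agrees with the plot.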

Logistic Regression

In most cases, the data do not follow a uniform distribution. By assuming that the data are generated from a Gaussian distribution, the conditional probability $P(t=1|x)$ can be expressed with the "logistic regression model".
The following is the logistic regression model. It is built by combining a linear model with the sigmoid function. Passing the output of the linear model through the sigmoid function limits the output to the range from 0 to 1.
f:id:sy4310:20190812203501p:plain
This plot is an example of the logistic regression model.
f:id:sy4310:20190812222432p:plain
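A minimal sketch of such a model is shown below. It assumes the form $y = \sigma(w_0 x + w_1)$ with $w_0$ as the slope and $w_1$ as the intercept, which is my reading based on the signs of the parameter values reported later in this article; the function name `logistic` is also mine.

```python
import numpy as np


def logistic(x, w):
    """Linear model w[0]*x + w[1] passed through the sigmoid function."""
    a = w[0] * x + w[1]              # linear model
    return 1.0 / (1.0 + np.exp(-a))  # sigmoid squashes a into (0, 1)


x = np.linspace(0.0, 2.5, 6)
print(np.round(logistic(x, [8.0, -10.0]), 3))
```

The output rises monotonically with $x$ when $w_0 > 0$ and equals 0.5 exactly where $w_0 x + w_1 = 0$.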

Cross entropy error

Using the logistic regression model, the probability that the label $t$ is 1 at weight $x$ is expressed as follows.
f:id:sy4310:20190813105932p:plain
The approach is to assume that the data were generated from the above model and to find the most probable parameters. When $t=1$, the probability is $y$. When $t=0$, the probability is $1-y$. Combining both cases, the model is expressed as follows for a single data point.
f:id:sy4310:20190813142454p:plain
For all $N$ data points, the "likelihood" is as follows.
f:id:sy4310:20190813151029p:plain
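For reference, these expressions can be written out as below. This is my reconstruction from the surrounding text; the index convention ($n = 1, \dots, N$) is an assumption and may differ from the figures.

$$
P(t \mid x) = y^{t}(1-y)^{1-t}
$$

$$
P(T \mid X) = \prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n}
$$

$$
\log P(T \mid X) = \sum_{n=1}^{N}\left\{ t_n \log y_n + (1-t_n)\log(1-y_n) \right\}
$$

The first expression reduces to $y$ when $t=1$ and to $1-y$ when $t=0$, so it covers both cases in a single formula.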
Multiplying the log likelihood by $-1$ gives the "cross entropy error", which, like the mean squared error, is a quantity to be minimized. The "mean cross entropy error" over the $N$ data points is then defined as follows. Dividing by $N$ makes the error value less sensitive to the number of data points.
f:id:sy4310:20190813153501p:plain
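A minimal sketch of the mean cross entropy error is shown below, assuming the model $y_n = \sigma(w_0 x_n + w_1)$ and the error $E(W) = -\frac{1}{N}\sum_n \{t_n \log y_n + (1-t_n)\log(1-y_n)\}$. The function names, the toy data, and the clipping constant are my own choices.

```python
import numpy as np


def logistic(x, w):
    return 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))


def mean_cross_entropy(w, x, t):
    y = logistic(x, w)
    # Clip to avoid log(0) when the model is extremely confident.
    y = np.clip(y, 1e-12, 1.0 - 1e-12)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))


# Hypothetical toy data: weights (g) and labels (1 = male, 0 = female).
x = np.array([0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.6, 2.0])
t = np.array([0, 0, 0, 1, 0, 1, 1, 1])

print(mean_cross_entropy([0.0, 0.0], x, t))   # every y_n is 0.5, so E = log 2
print(mean_cross_entropy([5.0, -5.25], x, t)) # a better fit gives a smaller error
```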
The following two figures show the mean cross entropy error calculated while varying the parameters of the linear model, $w_0$ and $w_1$. The left figure is a 3D plot of the mean cross entropy error and the right figure is a 2D contour map of the same error. According to these figures, the error is minimized at around $w_0=7$, $w_1=-7$.
f:id:sy4310:20190813175624p:plain
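The values behind such a map can be computed by evaluating the error on a grid of $(w_0, w_1)$ pairs. The sketch below uses hypothetical toy data and grid ranges, so its minimum will not match the one in the figures; it only shows the procedure.

```python
import numpy as np


def mean_cross_entropy(w0, w1, x, t):
    y = 1.0 / (1.0 + np.exp(-(w0 * x + w1)))
    y = np.clip(y, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))


# Hypothetical toy data: weights (g) and labels (1 = male, 0 = female).
x = np.array([0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.6, 2.0])
t = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Grid of candidate parameters (ranges are an assumption).
w0_grid = np.linspace(0.0, 15.0, 61)
w1_grid = np.linspace(-15.0, 0.0, 61)

# errors[i, j] is the error at w1_grid[i], w0_grid[j].
errors = np.array([[mean_cross_entropy(w0, w1, x, t) for w0 in w0_grid]
                   for w1 in w1_grid])

# Locate the grid point with the smallest error.
i1, i0 = np.unravel_index(np.argmin(errors), errors.shape)
best_w0, best_w1 = w0_grid[i0], w1_grid[i1]
print(best_w0, best_w1, errors[i1, i0])
```

The `errors` array can be passed directly to a contour or surface plotting routine to produce figures like the ones above.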

Calculating parameter by Gradient method

The parameters $w_0$ and $w_1$ can be calculated by the gradient method as follows.
The mean cross entropy error is:
f:id:sy4310:20190813212003p:plain
The partial derivative of $E(W)$ with respect to $w_0$ is:
f:id:sy4310:20190813212344p:plain
$y_n$ is defined as follows, where $a_n$ is called the "total input".
f:id:sy4310:20190813212653p:plain
The partial derivative can be calculated with the chain rule as follows.
f:id:sy4310:20190813212953p:plain
As a result, the partial derivative of $E(W)$ with respect to $w_0$ is as follows.
f:id:sy4310:20190813213420p:plain
In the same way, the partial derivative of $E(W)$ with respect to $w_1$ is as follows.
f:id:sy4310:20190813213836p:plain
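Written out in full, the derivation runs as below. This is my reconstruction, assuming the model $y_n = \sigma(a_n)$ with total input $a_n = w_0 x_n + w_1$; the notation in the figures may differ.

$$
E(W) = -\frac{1}{N}\sum_{n=1}^{N}\left\{ t_n \log y_n + (1-t_n)\log(1-y_n) \right\},
\qquad y_n = \sigma(a_n) = \frac{1}{1+\exp(-a_n)}
$$

Writing $E_n = -t_n \log y_n - (1-t_n)\log(1-y_n)$ for the error of a single data point, the chain rule gives:

$$
\frac{\partial E_n}{\partial w_0}
= \frac{\partial E_n}{\partial y_n}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial w_0}
= \left(-\frac{t_n}{y_n} + \frac{1-t_n}{1-y_n}\right)\cdot y_n(1-y_n)\cdot x_n
= (y_n - t_n)\,x_n
$$

Averaging over the data, and repeating the same steps for $w_1$ (where $\partial a_n / \partial w_1 = 1$):

$$
\frac{\partial E}{\partial w_0} = \frac{1}{N}\sum_{n=1}^{N}(y_n - t_n)\,x_n,
\qquad
\frac{\partial E}{\partial w_1} = \frac{1}{N}\sum_{n=1}^{N}(y_n - t_n)
$$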
As a result, the parameters were calculated as follows.
f:id:sy4310:20190814111746p:plain
f:id:sy4310:20190814111838p:plain
Reading from the contour map above, the parameters were estimated to be around $w_0=7$, $w_1=-7$. The parameters actually calculated were $w_0=6.13$, $w_1=-6.20$, which are close to those estimates.
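The procedure can be sketched with plain gradient descent as below. This is not the code used to obtain the values above: the toy data, the initial values, the learning rate, and the number of steps are all my assumptions, and the model is assumed to be $y_n = \sigma(w_0 x_n + w_1)$ with gradients $\frac{1}{N}\sum_n (y_n - t_n)x_n$ and $\frac{1}{N}\sum_n (y_n - t_n)$.

```python
import numpy as np


def logistic(x, w):
    return 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))


def mean_cross_entropy(w, x, t):
    y = np.clip(logistic(x, w), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))


def gradient(w, x, t):
    # Partial derivatives of the mean cross entropy error.
    y = logistic(x, w)
    return np.array([np.mean((y - t) * x), np.mean(y - t)])


def fit(x, t, w_init=(1.0, -1.0), learning_rate=0.5, n_steps=20000):
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        # Move the parameters a small step against the gradient.
        w -= learning_rate * gradient(w, x, t)
    return w


# Hypothetical toy data: weights (g) and labels (1 = male, 0 = female).
x = np.array([0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.6, 2.0])
t = np.array([0, 0, 0, 1, 0, 1, 1, 1])

w_fit = fit(x, t)
print(w_fit, mean_cross_entropy(w_fit, x, t))
```

A fixed learning rate is the simplest choice; a library optimizer that accepts the gradient function would typically reach the minimum in far fewer steps.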

Summary of classification sequence

1. Creating "Logistic regression model"

The model outputs the probability that the target label is $t=1$, as follows.
f:id:sy4310:20190814115106p:plain
f:id:sy4310:20190814115142p:plain
f:id:sy4310:20190814115432p:plain

2. Defining "Likelihood"

Based on the above regression model, the likelihood for a single data point is expressed as follows.
f:id:sy4310:20190814115509p:plain
Then the "log likelihood" over all $N$ data points is defined as follows.
f:id:sy4310:20190814130509p:plain
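Once the parameters have been found, the model from step 1 yields the decision boundary described at the start of this article: it is the weight at which the predicted probability equals 0.5. The sketch below uses the values reported in this article ($w_0=6.13$, $w_1=-6.20$) and assumes the model is $y = \sigma(w_0 x + w_1)$; the function names are mine.

```python
import numpy as np


def logistic(x, w0, w1):
    return 1.0 / (1.0 + np.exp(-(w0 * x + w1)))


# Parameter values reported in this article.
w0, w1 = 6.13, -6.20

# y = 0.5 exactly where w0 * x + w1 = 0.
boundary = -w1 / w0
print(f"decision boundary: {boundary:.3f} g")  # decision boundary: 1.011 g


def predict(x):
    # 1 (male) when the predicted probability is at least 0.5, else 0 (female).
    return (logistic(np.asarray(x, dtype=float), w0, w1) >= 0.5).astype(int)


print(predict([0.5, 2.0]))  # [0 1]
```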