When tackling a classification problem, especially a binary one, linear regression runs into trouble: as the sample size grows, its accuracy drops. Logistic regression is the better tool for this kind of problem. This post works through two hands-on logistic regression cases: exam pass prediction and chip test pass prediction. As before, the exercise is based on the scikit-learn library, using logistic regression for binary classification.
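As a quick refresher (my framing, not part of the original notebook): logistic regression passes a linear combination of the features through the sigmoid function to obtain a probability, then classifies by thresholding it at 0.5. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# a positive linear score maps to a probability above 0.5, i.e. class 1
z = 2.0
print(sigmoid(z))              # ~0.88
print(int(sigmoid(z) > 0.5))   # predicted label: 1
```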
First, the goals of case one: build a logistic regression model on the dataset and, given the scores of two exams, predict whether the student passes the third; then build a second-order decision boundary to improve the model's accuracy.
What follows is essentially the content of the Jupyter Notebook:
1. Based on examdata.csv, build a logistic regression model and evaluate its performance.
2. Predict whether a student with Exam1 = 75 and Exam2 = 60 passes Exam3.
3. Build a second-order boundary function and repeat tasks 1 and 2.
```python
# load data from csv
import pandas as pd
import numpy as np

data = pd.read_csv('examdata.csv')
data.head()
```
|   | Exam1 | Exam2 | Pass |
|---|---|---|---|
| 0 | 34.623660 | 78.024693 | 0 |
| 1 | 30.286711 | 43.894998 | 0 |
| 2 | 35.847409 | 72.902198 | 0 |
| 3 | 60.182599 | 86.308552 | 1 |
| 4 | 79.032736 | 75.344376 | 1 |
```python
# visualize the data
%matplotlib inline
from matplotlib import pyplot as plt

fig1 = plt.figure()
plt.scatter(data.loc[:, 'Exam1'], data.loc[:, 'Exam2'])
plt.title('Exam1-Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.show()
```
```python
# add a label mask (True where Pass == 1)
mask = data.loc[:, 'Pass'] == 1
print(~mask)
```
```
0      True
1      True
2      True
3     False
4     False
      ...
95    False
96    False
97    False
98    False
99    False
Name: Pass, Length: 100, dtype: bool
```
```python
# visualize the data with the labels applied
fig2 = plt.figure()
passed = plt.scatter(data.loc[:, 'Exam1'][mask], data.loc[:, 'Exam2'][mask])
failed = plt.scatter(data.loc[:, 'Exam1'][~mask], data.loc[:, 'Exam2'][~mask])
plt.title('Exam1-Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.show()
```
```python
# define X, y
X = data.drop(['Pass'], axis=1)
X.head()
X1 = data.loc[:, 'Exam1']
X2 = data.loc[:, 'Exam2']
y = data.loc[:, 'Pass']
y.head()
```
```
0    0
1    0
2    0
3    1
4    1
Name: Pass, dtype: int64
```
```python
print(X.shape, y.shape)
```
```
(100, 2) (100,)
```
```python
# establish the model and train it
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X, y)
```
```
LogisticRegression()
```
```python
# show the prediction result
y_predict = LR.predict(X)
print(y_predict)
```
```
[0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1]
```
```python
# compute training accuracy
from sklearn.metrics import accuracy_score

accRet = accuracy_score(y, y_predict)
print(accRet)
```
```
0.89
```
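Note that 0.89 is accuracy measured on the training data itself. Not in the original notebook, but worth keeping in mind: a held-out split gives a less optimistic estimate. A minimal sketch, assuming scikit-learn's `train_test_split`:

```python
# hedged sketch: accuracy on a held-out 20% test split
from sklearn.model_selection import train_test_split

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)
LR_split = LogisticRegression()
LR_split.fit(X_train, y_train)
print(accuracy_score(y_holdout, LR_split.predict(X_holdout)))
```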
```python
# predict whether a student with Exam1 = 75, Exam2 = 60 passes Exam3
y_test = LR.predict(pd.DataFrame({'Exam1': [75], 'Exam2': [60]}))
print('passed' if y_test[0] == 1 else 'failed')
```
```
passed
```

Passing a bare list such as `[[75, 60]]` also works, but it triggers sklearn's "X does not have valid feature names" UserWarning because the model was fitted on a DataFrame; wrapping the query in a DataFrame with the training column names avoids it.
```python
LR.coef_
```
```
array([[0.20535491, 0.2005838 ]])
```
```python
LR.intercept_
```
```
array([-25.05219314])
```
```python
theta0 = LR.intercept_
theta1, theta2 = LR.coef_[0][0], LR.coef_[0][1]
print(theta0, theta1, theta2)
```
```
[-25.05219314] 0.20535491217790364 0.20058380395469022
```
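As a sanity check (my addition, not in the original notebook), the probability the model reports should equal the sigmoid of the linear score built from these coefficients:

```python
# hedged sanity check: manual sigmoid of the linear score vs. predict_proba
z = theta0 + theta1 * 75 + theta2 * 60
p_manual = 1 / (1 + np.exp(-z))
p_model = LR.predict_proba(pd.DataFrame({'Exam1': [75], 'Exam2': [60]}))[0, 1]
print(p_manual, p_model)  # the two probabilities should agree
```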
The decision boundary is where theta0 + theta1·X1 + theta2·X2 = 0, i.e. X2 = -(theta0 + theta1·X1) / theta2:

```python
# first-order decision boundary: theta0 + theta1*X1 + theta2*X2 = 0,
# solved for X2
X2_new = -(theta0 + theta1 * X1) / theta2
print(X2_new)

fig3 = plt.figure()
passed = plt.scatter(data.loc[:, 'Exam1'][mask], data.loc[:, 'Exam2'][mask])
failed = plt.scatter(data.loc[:, 'Exam1'][~mask], data.loc[:, 'Exam2'][~mask])
plt.title('Exam1-Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.plot(X1, X2_new)
plt.show()
```
```
0     89.449169
1     93.889277
2     88.196312
3     63.282281
4     43.983773
        ...
95    39.421346
96    81.629448
97    23.219064
98    68.240049
99    48.341870
Name: Exam1, Length: 100, dtype: float64
```
This gives the first-order boundary model.

Next, build the second-order boundary model. The second-order boundary function is

$$\theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_1^2 + \theta_4 X_2^2 + \theta_5 X_1 X_2 = 0$$

Viewed as a quadratic in $X_2$ with $a = \theta_4$, $b = \theta_5 X_1 + \theta_2$ and $c = \theta_0 + \theta_1 X_1 + \theta_3 X_1^2$, the boundary is $X_2 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.
```python
# create the second-order feature columns
X1_2 = X1 * X1
X2_2 = X2 * X2
X1_X2 = X1 * X2
print(X1_2, X2_2, X1_X2)
```
```
0     1198.797805
1      917.284849
2     1285.036716
3     3621.945269
4     6246.173368
         ...
95    6970.440295
96    1786.051355
97    9863.470975
98    3062.517544
99    5591.434174
Name: Exam1, Length: 100, dtype: float64
0     6087.852690
1     1926.770807
2     5314.730478
3     7449.166166
4     5676.775061
         ...
95    2340.652054
96    7587.080849
97    4730.056948
98    4216.156574
99    8015.587398
Name: Exam2, Length: 100, dtype: float64
0     2701.500406
1     1329.435094
2     2613.354893
3     5194.273015
4     5954.672216
         ...
95    4039.229555
96    3681.156888
97    6830.430397
98    3593.334590
99    6694.671710
Length: 100, dtype: float64
```
```python
X_new = {'X1': X1, 'X2': X2, 'X1_2': X1_2, 'X2_2': X2_2, 'X1_X2': X1_X2}
X_new = pd.DataFrame(X_new)
print(X_new)
```
```
           X1         X2         X1_2         X2_2        X1_X2
0   34.623660  78.024693  1198.797805  6087.852690  2701.500406
1   30.286711  43.894998   917.284849  1926.770807  1329.435094
2   35.847409  72.902198  1285.036716  5314.730478  2613.354893
3   60.182599  86.308552  3621.945269  7449.166166  5194.273015
4   79.032736  75.344376  6246.173368  5676.775061  5954.672216
..        ...        ...          ...          ...          ...
95  83.489163  48.380286  6970.440295  2340.652054  4039.229555
96  42.261701  87.103851  1786.051355  7587.080849  3681.156888
97  99.315009  68.775409  9863.470975  4730.056948  6830.430397
98  55.340018  64.931938  3062.517544  4216.156574  3593.334590
99  74.775893  89.529813  5591.434174  8015.587398  6694.671710

[100 rows x 5 columns]
```
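As an aside (not part of the original notebook), scikit-learn can generate these polynomial terms automatically with `PolynomialFeatures`, which scales better than building each column by hand; note that its column order differs from the hand-built `X_new`. A minimal sketch:

```python
# hedged alternative: let scikit-learn build the degree-2 terms
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(data[['Exam1', 'Exam2']])
print(X_poly.shape)  # (100, 5): Exam1, Exam2, Exam1^2, Exam1*Exam2, Exam2^2
```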
```python
# create the second-order model and train it
LR2 = LogisticRegression()
LR2.fit(X_new, y)
```
```
LogisticRegression()
```
```python
# show the prediction result
y2_predict = LR2.predict(X_new)
print(y2_predict)
```
```
[0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1]
```
```python
accRet2 = accuracy_score(y, y2_predict)
print(accRet2)
```
```
1.0
```
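Perfect accuracy on the training set is encouraging, but with extra polynomial features it can also signal overfitting. Not in the original notebook, but worth knowing: `LogisticRegression`'s `C` parameter is the inverse regularization strength, and a smaller `C` regularizes more aggressively. A minimal sketch:

```python
# hedged sketch: refit with stronger regularization (smaller C)
LR2_reg = LogisticRegression(C=0.1)
LR2_reg.fit(X_new, y)
print(accuracy_score(y, LR2_reg.predict(X_new)))
```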
```python
LR2.coef_
```
```
array([[-8.95942818e-01, -1.40029397e+00, -2.29434572e-04,  3.93039312e-03,
         3.61578676e-02]])
```
```python
# sort X1 first, otherwise the plotted boundary line crosses back over itself
X1_new = X1.sort_values()
print(X1_new)
```
```
63    30.058822
1     30.286711
57    32.577200
70    32.722833
36    33.915500
        ...
56    97.645634
47    97.771599
51    99.272527
97    99.315009
75    99.827858
Name: Exam1, Length: 100, dtype: float64
```
```python
theta0 = LR2.intercept_
theta1, theta2, theta3, theta4, theta5 = LR2.coef_[0]
a = theta4
b = theta5 * X1_new + theta2
c = theta0 + theta1 * X1_new + theta3 * X1_new * X1_new
# exam scores cannot be negative, so only the '+' root of the quadratic is kept
X2_new_bound = (-b + np.sqrt(b*b - 4*a*c)) / (2*a)
print(X2_new_bound)
```
```
63    132.124249
1     130.914667
57    119.415258
70    118.725082
36    113.258684
         ...
56     39.275712
47     39.251001
51     38.963585
97     38.955634
75     38.860426
Name: Exam1, Length: 100, dtype: float64
```
```python
fig4 = plt.figure()
plt.plot(X1_new, X2_new_bound)
plt.show()
```
```python
# visualize the data with the second-order boundary
fig5 = plt.figure()
plt.plot(X1_new, X2_new_bound)
passed = plt.scatter(data.loc[:, 'Exam1'][mask], data.loc[:, 'Exam2'][mask])
failed = plt.scatter(data.loc[:, 'Exam1'][~mask], data.loc[:, 'Exam2'][~mask])
plt.title('Exam1-Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.show()
```
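The quadratic-formula approach works here, but a more general trick is worth knowing (my addition, not from the original notebook): evaluate the model's probability on a dense grid and draw the 0.5 contour, which traces the decision boundary for a model of any order. A sketch:

```python
# hedged sketch: draw the boundary as the model's 0.5-probability contour
xx, yy = np.meshgrid(np.linspace(20, 110, 200), np.linspace(20, 110, 200))
grid = pd.DataFrame({'X1': xx.ravel(), 'X2': yy.ravel()})
grid['X1_2'] = grid['X1'] ** 2
grid['X2_2'] = grid['X2'] ** 2
grid['X1_X2'] = grid['X1'] * grid['X2']
proba = LR2.predict_proba(grid[['X1', 'X2', 'X1_2', 'X2_2', 'X1_X2']])[:, 1]

plt.contour(xx, yy, proba.reshape(xx.shape), levels=[0.5])
plt.scatter(data.loc[:, 'Exam1'], data.loc[:, 'Exam2'], c=data.loc[:, 'Pass'])
plt.show()
```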
Now the second case, chip test prediction. The tasks:

1. Based on chip_test.csv, build a logistic regression model (second-order boundary) and evaluate its performance.
2. Solve the boundary curve as a function.
3. Draw the complete decision boundary curve.
```python
import pandas as pd
import numpy as np

data = pd.read_csv('/root/chip_test.csv')
data.head()
```
|   | test1 | test2 | pass |
|---|---|---|---|
| 0 | 0.051267 | 0.69956 | 1 |
| 1 | -0.092742 | 0.68494 | 1 |
| 2 | -0.213710 | 0.69225 | 1 |
| 3 | -0.375000 | 0.50219 | 1 |
| 4 | 0.183760 | 0.93348 | 0 |
```python
mask = data.loc[:, 'pass'] == 1
print(~mask)
```
```
0      False
1      False
2      False
3      False
4       True
       ...
113     True
114     True
115     True
116     True
117     True
Name: pass, Length: 118, dtype: bool
```
```python
from matplotlib import pyplot as plt

fig1 = plt.figure()
passed = plt.scatter(data.loc[:, 'test1'][mask], data.loc[:, 'test2'][mask])
failed = plt.scatter(data.loc[:, 'test1'][~mask], data.loc[:, 'test2'][~mask])
plt.title('test1-test2')
plt.xlabel('test1')
plt.ylabel('test2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.show()
```
Go straight to a second-order fit; first generate the new feature columns:
```python
# define X, y
X = data.drop(['pass'], axis=1)
y = data.loc[:, 'pass']
X1 = data.loc[:, 'test1']
X2 = data.loc[:, 'test2']

# create the second-order features
X1_2 = X1 * X1
X2_2 = X2 * X2
X1_X2 = X1 * X2
X_new = {'X1': X1, 'X2': X2, 'X1_2': X1_2, 'X2_2': X2_2, 'X1_X2': X1_X2}
X_new = pd.DataFrame(X_new)
print(X_new)
```
```
           X1        X2      X1_2      X2_2     X1_X2
0    0.051267  0.699560  0.002628  0.489384  0.035864
1   -0.092742  0.684940  0.008601  0.469143 -0.063523
2   -0.213710  0.692250  0.045672  0.479210 -0.147941
3   -0.375000  0.502190  0.140625  0.252195 -0.188321
4    0.183760  0.933480  0.033768  0.871385  0.171536
..        ...       ...       ...       ...       ...
113 -0.720620  0.538740  0.519293  0.290241 -0.388227
114 -0.593890  0.494880  0.352705  0.244906 -0.293904
115 -0.484450  0.999270  0.234692  0.998541 -0.484096
116 -0.006336  0.999270  0.000040  0.998541 -0.006332
117  0.632650 -0.030612  0.400246  0.000937 -0.019367

[118 rows x 5 columns]
```
```python
# establish the model and train it
from sklearn.linear_model import LogisticRegression

LR2 = LogisticRegression()
LR2.fit(X_new, y)
```
```
LogisticRegression()
```
```python
# evaluate the model
from sklearn.metrics import accuracy_score

y2_predict = LR2.predict(X_new)
accuracy2 = accuracy_score(y, y2_predict)
print(accuracy2)
```
```
0.8135593220338984
```
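Again, this is training-set accuracy. As a side note (not in the original notebook), k-fold cross-validation gives a fairer estimate; a minimal sketch assuming scikit-learn's `cross_val_score`:

```python
# hedged sketch: 5-fold cross-validated accuracy instead of training accuracy
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), X_new, y, cv=5)
print(scores.mean())
```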
```python
X1_new = X1.sort_values()
theta0 = LR2.intercept_
theta1, theta2, theta3, theta4, theta5 = LR2.coef_[0]
a = theta4
b = theta5 * X1_new + theta2
c = theta0 + theta1 * X1_new + theta3 * X1_new * X1_new
# start with just the '+' root; the full boundary needs both (see below)
X2_new_boundary = (-b + np.sqrt(b*b - 4*a*c)) / (2*a)

fig2 = plt.figure()
passed = plt.scatter(data.loc[:, 'test1'][mask], data.loc[:, 'test2'][mask])
failed = plt.scatter(data.loc[:, 'test1'][~mask], data.loc[:, 'test2'][~mask])
plt.plot(X1_new, X2_new_boundary)
plt.title('test1-test2')
plt.xlabel('test1')
plt.ylabel('test2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.show()
```
In the exam example, scores cannot be negative, so the negative root of the quadratic was discarded; here the feature values can be negative, so it must not be dropped:
```python
# inspect the '+' root: where the discriminant b*b - 4*a*c is negative,
# the square root is NaN and the boundary does not exist for that X1
d = (-b + np.sqrt(b*b - 4*a*c)) / (2*a)
print(np.array(d))
```
```
[ nan nan nan nan nan nan nan nan 0.1212617 0.04679448 0.02697935 0.00872189 -0.00830576 -0.00830576 -0.11718731 -0.16040224 -0.18016521 -0.18965258 -0.21671004 -0.22530078 -0.23369452 -0.2499261 -0.28761583 -0.30846849 -0.32171919 -0.32816104 -0.33448523 -0.34069505 -0.35278365 -0.35866807 -0.37013014 -0.42186137 -0.42656594 -0.43119936 -0.43574686 -0.44021779 -0.45318277 -0.47336128 -0.47336128 -0.48465118 -0.48828317 -0.49185073 -0.49185073 -0.51193977 -0.51193977 -0.5181489 -0.52412149 -0.52986615 -0.52986615 -0.53264971 -0.54066023 -0.54572074 -0.55055962 -0.55055962 -0.55518131 -0.56170859 -0.56775529 -0.56775529 -0.58002265 -0.58002265 -0.58588805 -0.59313744 -0.59416537 -0.59514154 -0.59852963 -0.60052615 -0.60280757 -0.60310565 -0.60354216 -0.60379456 -0.60355793 -0.60282483 -0.60105947 -0.60105947 -0.60047489 -0.59135016 -0.59135016 -0.59009769 -0.5887809 -0.58285568 -0.58285568 -0.57390312 -0.56073894 -0.55572236 -0.53219481 -0.52181028 -0.51814159 -0.51814159 -0.50647589 -0.48925514 -0.47987024 -0.46991909 -0.45383618 -0.42349291 -0.40276684 -0.38769565 -0.3714656 -0.353902 -0.34523055 -0.33477608 -0.33477608 -0.32451646 -0.26416907 -0.26416907 nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
```
Solve the boundary curve as a function:
```python
# define f(x): both roots of the quadratic in X2 for a given x = X1
def f(x):
    a = theta4
    b = theta5 * x + theta2
    c = theta0 + theta1 * x + theta3 * x * x
    X2_new_boundary1 = (-b + np.sqrt(b*b - 4*a*c)) / (2*a)
    X2_new_boundary2 = (-b - np.sqrt(b*b - 4*a*c)) / (2*a)
    return X2_new_boundary1, X2_new_boundary2
```
```python
X2_new_boundary1 = []
X2_new_boundary2 = []
for x in X1_new:
    y1, y2 = f(x)  # call f once per x instead of twice
    X2_new_boundary1.append(y1)
    X2_new_boundary2.append(y2)
print(X2_new_boundary1)
```
```
[ nan nan nan nan nan nan nan nan 0.1212617 0.04679448 0.02697935 0.00872189 -0.00830576 -0.00830576 -0.11718731 -0.16040224 -0.18016521 -0.18965258 -0.21671004 -0.22530078 -0.23369452 -0.2499261 -0.28761583 -0.30846849 -0.32171919 -0.32816104 -0.33448523 -0.34069505 -0.35278365 -0.35866807 -0.37013014 -0.42186137 -0.42656594 -0.43119936 -0.43574686 -0.44021779 -0.45318277 -0.47336128 -0.47336128 -0.48465118 -0.48828317 -0.49185073 -0.49185073 -0.51193977 -0.51193977 -0.5181489 -0.52412149 -0.52986615 -0.52986615 -0.53264971 -0.54066023 -0.54572074 -0.55055962 -0.55055962 -0.55518131 -0.56170859 -0.56775529 -0.56775529 -0.58002265 -0.58002265 -0.58588805 -0.59313744 -0.59416537 -0.59514154 -0.59852963 -0.60052615 -0.60280757 -0.60310565 -0.60354216 -0.60379456 -0.60355793 -0.60282483 -0.60105947 -0.60105947 -0.60047489 -0.59135016 -0.59135016 -0.59009769 -0.5887809 -0.58285568 -0.58285568 -0.57390312 -0.56073894 -0.55572236 -0.53219481 -0.52181028 -0.51814159 -0.51814159 -0.50647589 -0.48925514 -0.47987024 -0.46991909 -0.45383618 -0.42349291 -0.40276684 -0.38769565 -0.3714656 -0.353902 -0.34523055 -0.33477608 -0.33477608 -0.32451646 -0.26416907 -0.26416907 nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
```
```python
fig3 = plt.figure()
passed = plt.scatter(data.loc[:, 'test1'][mask], data.loc[:, 'test2'][mask])
failed = plt.scatter(data.loc[:, 'test1'][~mask], data.loc[:, 'test2'][~mask])
plt.plot(X1_new, X2_new_boundary1)
plt.plot(X1_new, X2_new_boundary2)
plt.title('test1-test2')
plt.xlabel('test1')
plt.ylabel('test2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.show()
```
The gaps on the two sides of the figure appear because the sampled test1 values are spaced apart near the boundary's edges, leaving visible intervals between consecutive boundary points. To close the gaps, generate a dense X1 grid of our own: the data runs from about -0.9 to a little over 1.0, so start at -0.9 and step by 1/10000 for 19000 points; this dense X1 completes the figure.
```python
# create a dense x range to close the gaps
X1_range = [-0.9 + x / 10000 for x in range(0, 19000)]
X1_range = np.array(X1_range)
X2_new_boundary1 = []
X2_new_boundary2 = []
for x in X1_range:
    y1, y2 = f(x)
    X2_new_boundary1.append(y1)
    X2_new_boundary2.append(y2)
```
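Since `f` only uses NumPy operations, it also accepts the whole array at once, so the loop above can be replaced by a single vectorized call (my observation; the result is equivalent):

```python
# vectorized alternative: evaluate both roots for every grid point in one call
X2_new_boundary1, X2_new_boundary2 = f(X1_range)
```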
```python
fig4 = plt.figure()
passed = plt.scatter(data.loc[:, 'test1'][mask], data.loc[:, 'test2'][mask])
failed = plt.scatter(data.loc[:, 'test1'][~mask], data.loc[:, 'test2'][~mask])
plt.plot(X1_range, X2_new_boundary1, 'r')
plt.plot(X1_range, X2_new_boundary2, 'r')
plt.title('Chip quality prediction')
plt.xlabel('test1')
plt.ylabel('test2')
plt.legend((passed, failed), ('passed', 'failed'))
plt.show()
```

(The stray `import matplotlib as mlp` was unused, and the first `plt.title('test1-test2')` call was immediately overridden by the second title, so both are dropped.)
This yields the complete, closed second-order decision boundary curve.