The weight question in logistic regression
First, the scikit-learn documentation says:
coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
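According to this description, there is one weight per feature: x has n_features dimensions, so each weight vector w has n_features components w$_i$. For a binary problem, w is a single column vector (stored in coef_ as one row of shape (1, n_features)).
For a multiclass problem, the weights form a matrix of shape n_classes × n_features.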
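The two shapes can be checked directly. The sketch below is my own illustration, not part of the original test: it fits LogisticRegression with default settings (plus max_iter=1000, an assumption made only to avoid convergence warnings) once on a two-class subset of iris and once on all three classes.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Binary problem: keep only classes 0 and 1.
mask = y < 2
binary_clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

# Multiclass problem: all three classes.
multi_clf = LogisticRegression(max_iter=1000).fit(X, y)

print(binary_clf.coef_.shape)      # (1, 4): a single row of weights
print(multi_clf.coef_.shape)       # (3, 4): one row of weights per class
print(multi_clf.intercept_.shape)  # (3,): one intercept per class
```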
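What is easy to mix up:
In the linear model w$^T$x$_i$ + b, every x$_i$ is a vector for one sample, not the sample matrix (rows are samples, columns are features). x$_i$ is the i-th row of the sample matrix written as a column, i.e. the i-th column of the transposed sample matrix, and this is what confused me when I started learning.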
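A small NumPy sketch of that relationship, using made-up random numbers (the matrix X, the weights w and the intercept b here are illustrative assumptions, not values from the iris test):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # sample matrix: 5 samples (rows), 4 features (columns)
w = rng.normal(size=(4, 1))   # weight column vector
b = 0.5                       # intercept

i = 2
x_i = X[i].reshape(-1, 1)     # i-th sample as a column vector, shape (4, 1)

# x_i is the i-th column of the transposed sample matrix.
print(np.array_equal(x_i, X.T[:, [i]]))

# w^T x_i + b for one sample ...
z_single = (w.T @ x_i).item() + b
# ... equals the i-th entry of X w + b computed for all samples at once.
z_all = (X @ w).ravel() + b
print(np.isclose(z_single, z_all[i]))
```

Writing the model per sample uses w$^T$x$_i$, while code that works on the whole sample matrix uses X @ w; both give the same numbers.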
Test this with sklearn's LogisticRegression and the iris data:
Test code:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
iris
X = iris.data[:, [0,1,2,3]]
# type(iris.data)
y = iris.target
print('class labels ',np.unique(y))
print(X)
Output:
class labels [0 1 2]
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
...
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3,
random_state=1, stratify=y ,shuffle= True
)
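## stratify=y uses stratified sampling, so the class proportions in the training and test sets match those in y
# with stratify=None (no stratified sampling) the proportions can differ
print('label counts in y', np.bincount(y))
print('label counts in y_train', np.bincount(y_train))
print('label counts in y_test', np.bincount(y_test))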
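To see what stratify=y changes, the following sketch (my own addition, reusing the same test_size and random_state as above) compares a stratified split with a plain random one. Iris has 50 samples per class, so a stratified 70/30 split should give 35 and 15 samples per class.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Stratified split: class proportions are preserved.
_, _, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Plain random split: class proportions may drift.
_, _, y_tr_r, y_te_r = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=None)

print('stratified train/test:', np.bincount(y_tr_s), np.bincount(y_te_s))
print('random     train/test:',
      np.bincount(y_tr_r, minlength=3), np.bincount(y_te_r, minlength=3))
```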
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)  # estimate each feature's mean and standard deviation
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
print(X_train_std)
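What the scaler does can be reproduced by hand. This is a sketch under the assumption that no feature has zero variance (true for iris): fit stores the training mean and standard deviation, and transform applies (x - mean) / std to whatever data it is given, including the test set.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)

mu = X_train.mean(axis=0)     # per-feature mean of the training set
sigma = X_train.std(axis=0)   # per-feature population std (ddof=0)

# The test set is scaled with the training statistics, not its own.
X_test_manual = (X_test - mu) / sigma

print(np.allclose(sc.mean_, mu), np.allclose(sc.scale_, sigma))
print(np.allclose(sc.transform(X_test), X_test_manual))
```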
lr = LogisticRegression(C=100, random_state= 1,
solver= 'lbfgs', multi_class= 'ovr')
lr.fit(X_train_std, y_train)
lr.coef_
The resulting weight matrix is:
array([[-2.27467146, 2.04221619, -4.01937844, -3.46372044],
[-1.2777746 , -0.93618002, 3.53177076, -2.24098008],
[-0.31117851, -2.10556199, 9.51791722, 8.48747802]])
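Call this matrix coef_; then coef_$^T$ has 4 rows and 3 columns.
Clearly coef_ is a 3×4 matrix, because the data uses 4 features and y has 3 classes.
The three columns of coef_$^T$ (the three rows of coef_) are the weight column vectors w$_k$, one per class. w$_k$$^T$x$_i$ + b$_k$ is then the value of the linear model for class k, where x$_i$ is any one sample written as a column vector and b$_k$ is that class's intercept.
In other words, transpose each of the three column vectors w$_k$, multiply it by the sample's column vector, and add the intercept to get the linear value z. Passing z through the logistic function y = 1/(1 + e$^{-z}$) gives the corresponding y, which is the predicted value. With multi_class='ovr', predict_proba then rescales the three values so that they sum to 1.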
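The linear part can be verified against scikit-learn's own decision_function. The sketch below is an illustration with assumptions of mine: it leaves multi_class at its default and sets max_iter=1000, so the fitted numbers will differ from the matrix printed above, but the relation between coef_, intercept_ and the linear values holds either way.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

lr = LogisticRegression(C=100, max_iter=1000).fit(X_train_std, y_train)

# One sample as a column vector, one class at a time: w_k^T x + b_k.
x = X_test_std[0].reshape(-1, 1)
z_one = np.array([
    (lr.coef_[k].reshape(1, -1) @ x).item() + lr.intercept_[k]
    for k in range(lr.coef_.shape[0])
])

# The same computation for several samples at once: X @ coef_^T + b.
Z = X_test_std[:5] @ lr.coef_.T + lr.intercept_

print(np.allclose(z_one, Z[0]))
print(np.allclose(Z, lr.decision_function(X_test_std[:5])))
```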
lr.predict_proba(X_test_std[:5,:])
Taking the first 5 test samples, the predicted probabilities of belonging to each of the 3 classes are:
array([[1.37123418e-06, 8.02119410e-02, 9.19786688e-01],
[9.80794607e-01, 1.92053927e-02, 2.10785668e-16],
[8.83996901e-01, 1.16003099e-01, 1.98669981e-16],
[5.92717248e-05, 5.91544530e-01, 4.08396198e-01],
[2.12433327e-04, 9.93236935e-01, 6.55063129e-03]])
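Each row above sums to 1, which the sigmoid alone would not guarantee. A sketch of how one-vs-rest probabilities can be rebuilt by hand follows. Note the substitution: as far as I know the multi_class argument is deprecated in scikit-learn 1.5 and later, so this example uses OneVsRestClassifier around LogisticRegression as a stand-in for multi_class='ovr'. The names W, b and P_manual are mine, and the fitted numbers need not match the output above exactly.

```python
import numpy as np
from scipy.special import expit  # the logistic (sigmoid) function
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

# Stand-in for multi_class='ovr': one binary logistic regression per class.
ovr = OneVsRestClassifier(
    LogisticRegression(C=100, max_iter=1000)).fit(X_train_std, y_train)

# Stack the per-class weights and intercepts into a 3x4 matrix and a 3-vector.
W = np.vstack([est.coef_ for est in ovr.estimators_])
b = np.concatenate([est.intercept_ for est in ovr.estimators_])

Z = X_test_std[:5] @ W.T + b                       # linear values, shape (5, 3)
P = expit(Z)                                       # sigmoid per class
P_manual = P / P.sum(axis=1, keepdims=True)        # rescale rows to sum to 1

print(np.allclose(P_manual, ovr.predict_proba(X_test_std[:5])))
print(P_manual.sum(axis=1))
```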