Using sklearn: Dataset Splitting and Feature Preprocessing


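Iris prediction is a classic KNN example and can be treated as a classification problem. The iris dataset has 150 samples, and each sample contains:

Four features: the length and width of the sepal and of the petal
Three target classes: setosa, versicolor, virginica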

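1. Introduction to sklearn datasets

sklearn offers two ways to obtain a dataset:

load_xxx: loads a small dataset (bundled with the library)
fetch_xxx: downloads a larger dataset from the internet

Taking the iris dataset as an example: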

from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
print("Return value of load_iris:\n", iris)
# The return value is a Bunch, which inherits from dict,
# so both iris["data"] and iris.data work
print("Iris feature values:\n", iris["data"])
print("Iris target values:\n", iris.target)
print("Iris feature names:\n", iris.feature_names)
print("Iris target names:\n", iris.target_names)
print("Iris description:\n", iris.DESCR)

Result:

Return value of load_iris:
 {'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2], ...
Iris feature values:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2] ...
Iris target values:
 [0 0 0 0 0 ...
Iris feature names:
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Iris target names:
 ['setosa' 'versicolor' 'virginica']
Iris description:
 .. _iris_dataset:
 ...
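
As a quick check of the figures quoted in the introduction (150 samples, four features, three classes), the following is a minimal sketch that inspects the loaded data. The variable names are illustrative, and `return_X_y=True` is used only as a shortcut for getting the feature matrix and the target vector directly.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are samples, columns are features.
n_samples, n_features = iris.data.shape
print("samples:", n_samples, "features:", n_features)

# Count how many samples belong to each target class (0, 1, 2).
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, int(count))

# return_X_y=True skips the Bunch and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

The per-class counts also show that the dataset is balanced, which matters when choosing how to split it later.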

2. Visualizing the iris data

# Plotting and data-handling imports
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
# Convert the data to a DataFrame
iris_d = pd.DataFrame(iris['data'], columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_d['Species'] = iris.target

def plot_iris(df, col1, col2):
    # Scatter plot of col1 against col2, colored by species (no regression line)
    sns.lmplot(x=col1, y=col2, data=df, hue="Species", fit_reg=False)
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title('iris graph')
    plt.show()

plot_iris(iris_d, 'Petal_Width', 'Sepal_Length')

Result:

[Figure: scatter plot of Petal_Width against Sepal_Length, colored by Species]

3. Splitting the dataset

Method signature:

def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
):
    """Split arrays or matrices into random train and test subsets.

    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.

    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.

    random_state : int, RandomState instance or None, default=None
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    shuffle : bool, default=True
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.

    stratify : array-like, default=None
        If not None, data is split in a stratified fashion, using this as
        the class labels.
        Read more in the :ref:`User Guide <stratification>`.

    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.

        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    """

Test: by default the train:test ratio is 3:1, that is, train_size = 0.75 and test_size = 0.25.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 1. Load the iris dataset
iris = load_iris()
# 2. Split the iris dataset
# x_train / x_test: training / test feature values
# y_train / y_test: training / test target values
# random_state: random seed. Different seeds produce different random samples;
# the same seed always produces the same split.
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
print("x_train shape:\n", x_train.shape)
print("x_train:\n", x_train)
print("x_test:\n", x_test)
print("y_train:\n", y_train)
print("y_test:\n", y_test)
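
The default ratio and the effect of `random_state` can be checked directly. The sketch below also tries the `test_size` and `stratify` parameters from the docstring above; the 80/20 ratio and the variable names are illustrative choices, not something the original example uses.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Default split: neither test_size nor train_size is given, so test_size is 0.25.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=22
)
print("default split:", x_train.shape, x_test.shape)

# The same seed reproduces exactly the same split.
x_train_again, _, _, _ = train_test_split(iris.data, iris.target, random_state=22)
print("reproducible:", np.array_equal(x_train, x_train_again))

# Explicit 80/20 split, stratified so that each class keeps its share in both parts.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22, stratify=iris.target
)
test_class_counts = Counter(ys_test.tolist())
print("stratified test counts:", test_class_counts)
```

Because iris has 50 samples per class, stratifying gives every class the same number of test samples, whereas a plain random split may leave the classes slightly uneven.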

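4. Feature preprocessing

1. What is feature preprocessing?

In short, it is the process of using transformation functions to convert feature data into a form better suited to the algorithm or model. It includes normalization and standardization.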

[Image: sample data with feature 1 and feature 2]

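2. Why do we need normalization or standardization?

When features differ greatly in unit or magnitude, or when one feature's variance is several orders of magnitude larger than the others', that feature tends to dominate the result, and some algorithms cannot learn from the remaining features.

Taking the data above as an example again, feature 1 and feature 2 differ greatly in magnitude, yet we consider them equally important, so we need to normalize or standardize them.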
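
To make the point concrete, here is a minimal sketch with two made-up samples (the numbers are hypothetical, not taken from the article's data). Feature 1 is in the tens of thousands and feature 2 is a single digit, and the squared Euclidean distance, which distance-based algorithms such as KNN rely on, is split into its per-feature contributions.

```python
import numpy as np

# Two hypothetical samples: [feature 1, feature 2]
a = np.array([40000.0, 2.0])
b = np.array([41000.0, 9.0])

# Per-feature contribution to the squared Euclidean distance.
squared_diff = (a - b) ** 2
share = squared_diff / squared_diff.sum()

print("squared differences:", squared_diff)
print("share of the distance per feature:", share)
```

Feature 2 changes by a large amount relative to its own scale, but almost all of the distance comes from feature 1. Rescaling both features to a comparable range removes this imbalance.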

3. Normalization

Normalization transforms the original data so that it is mapped into a given range (by default [0, 1]).

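$$X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \cdot (mx - mi) + mi$$

The formula is applied to each column: max is the column's maximum, min is the column's minimum, X'' is the final result, and mx and mi are the upper and lower bounds of the target range (by default mx = 1 and mi = 0).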

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 1. Build the data
x = np.array([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]])
print(x)
transfer = MinMaxScaler(feature_range=(0, 1))
# 2. Call fit_transform
data = transfer.fit_transform(x)
print("Result of min-max normalization:\n", data)

Result:

[[90  2 10 40]
 [60  4 15 45]
 [75  3 13 46]]
Result of min-max normalization:
 [[1.         0.         0.         0.        ]
 [0.         1.         1.         0.83333333]
 [0.5        0.5        0.6        1.        ]]

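The maximum and minimum change with the data, and both are easily affected by outliers (values that are unusually large or small). This method is therefore not very robust and is only suitable for traditional, precise, small-data scenarios.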

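4. Standardization

Standardization transforms the original data so that each column has a mean of 0 and a standard deviation of 1. It is fairly stable when there are enough samples, which makes it suitable for modern, noisy, large-data scenarios.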

$$X' = \frac{x - \text{mean}}{\sigma}$$

The formula is applied to each column: mean is the column's average and σ is its standard deviation.

import numpy as np
from sklearn.preprocessing import StandardScaler

# 1. Build the data
x = np.array([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]])
print(x)
transfer = StandardScaler()
# 2. Call fit_transform
data = transfer.fit_transform(x)
print("Result of standardization:\n", data)
print("Mean of each feature column:\n", transfer.mean_)
print("Variance of each feature column:\n", transfer.var_)

Result:

[[90  2 10 40]
 [60  4 15 45]
 [75  3 13 46]]
Result of standardization:
 [[ 1.22474487 -1.22474487 -1.29777137 -1.3970014 ]
 [-1.22474487  1.22474487  1.13554995  0.50800051]
 [ 0.          0.          0.16222142  0.88900089]]
Mean of each feature column:
 [75.          3.         12.66666667 43.66666667]
Variance of each feature column:
 [150.           0.66666667   4.22222222   6.88888889]
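
The standardization formula can be reproduced by hand in the same way. This is a minimal sketch; note that the variance printed above (150 for the first column) is the population variance, so the manual version uses NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]], dtype=float)

# Column-wise formula: X' = (x - mean) / sigma, with the population standard deviation.
col_mean = x.mean(axis=0)
col_std = x.std(axis=0)  # ddof=0 by default
manual = (x - col_mean) / col_std

scaler = StandardScaler().fit(x)
sk_result = scaler.transform(x)

print("manual result matches StandardScaler:", np.allclose(manual, sk_result))
print("column means after scaling:", sk_result.mean(axis=0).round(6))
print("column std devs after scaling:", sk_result.std(axis=0).round(6))
```

After the transformation every column has mean 0 and standard deviation 1, which is the property the section describes.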


From: https://blog.51cto.com/u_12826294/5729537