
[Machine Learning] Decision Trees

Decision Tree

Entropy

Mathematical expression

\[H(p_1) = -p_1 \text{log}_2(p_1) - (1- p_1) \text{log}_2(1- p_1) \]
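For example, a 50/50 node is maximally impure, with \(H(0.5) = 1\) bit, while a pure node has \(H(0) = H(1) = 0\) (taking \(0 \log_2 0 = 0\)).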

Code

import numpy as np

# UNQ_C1
# GRADED FUNCTION: compute_entropy

def compute_entropy(y):
    """
    Computes the entropy at a node.
    
    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    
    ### START CODE HERE ###
    if len(y) != 0:
        # len(y[y == 1]) counts the elements of y equal to 1; equivalent to np.sum(y == 1)
        p1 = len(y[y == 1]) / len(y)
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
    ### END CODE HERE ###        
    
    return entropy
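A quick sanity check on illustrative values (not from the original lab):

y = np.array([1, 1, 0, 1, 0, 1])   # 4 edible, 2 poisonous
print(compute_entropy(y))          # -(2/3)*log2(2/3) - (1/3)*log2(1/3) ≈ 0.918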

Information gain

Mathematical expression

\[\text{Information Gain} = H(p_1^\text{node})- (w^{\text{left}}H(p_1^\text{left}) + w^{\text{right}}H(p_1^\text{right})) \]

Code
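The compute_information_gain function below calls a split_dataset helper (graded as UNQ_C2 in the lab) that the original post does not show. Here is a minimal sketch consistent with how it is called, assuming one-hot (0/1) features as in the lab, with feature value 1 going to the left branch:

def split_dataset(X, node_indices, feature):
    """Split node_indices into a left branch (feature value 1) and a right branch (feature value 0)."""
    left_indices, right_indices = [], []
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices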

# UNQ_C3
# GRADED FUNCTION: compute_information_gain

def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information gain of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e., the samples being considered in this step.
        feature (int):          Index of the feature to split on
   
    Returns:
        information_gain (float): Information gain from splitting on this feature
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    
    # Entropy at the node and at each child
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)
    # Weights: fraction of the node's samples sent to each child
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
    # Weighted entropy
    weighted_entropy = w_left * left_entropy + w_right * right_entropy
    # Information gain                                                   
    information_gain = node_entropy - weighted_entropy
    ### END CODE HERE ###  
    
    return information_gain
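With split_dataset in place, the gain of each candidate feature can be checked directly on toy one-hot data (illustrative values only):

X_toy = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0])
root = [0, 1, 2, 3]
print(compute_information_gain(X_toy, y_toy, root, feature=0))  # 1.0, a perfect split
print(compute_information_gain(X_toy, y_toy, root, feature=1))  # 0.0, an uninformative split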

# UNQ_C4
# GRADED FUNCTION: get_best_split

def get_best_split(X, y, node_indices):   
    """
    Returns the optimal feature to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e., the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    
    # You need to return the following variables correctly
    best_feature = -1
    max_info_gain = 0
    ### START CODE HERE ###
    for feature in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, feature)
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature
    ### END CODE HERE ###    
   
    return best_feature
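Continuing the toy example from above, get_best_split picks feature 0, as expected:

print(get_best_split(X_toy, y_toy, root))  # 0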

# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e., the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    tree.append((current_depth, branch_name, best_feature, node_indices))
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
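A minimal driver on the same toy data (the lab itself uses a mushroom dataset and its own root_indices):

root_indices = [0, 1, 2, 3]
build_tree_recursive(X_toy, y_toy, root_indices, "Root", max_depth=1, current_depth=0)
# Prints the root split on feature 0, then two pure leaf nodes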

A complete example: a decision tree on the iris dataset

import numpy as np
from sklearn.datasets import load_iris


class DecisionTree():
    def __init__(self, X_train, y_train, branch_name, max_depth) -> None:
        self.X_train = X_train
        self.y_train = y_train
        self.k = self.X_train.shape[1]  # number of features
        self.branch_name = branch_name
        self.max_depth = max_depth
        self.tree = []

    def split_dataset(self, indices, feature, value):
        '''Split the node's sample indices on one feature/threshold pair.'''
        left_indices, right_indices = [], []
        for i in indices:
            if self.X_train[i, feature] <= value:
                left_indices.append(i)
            else:
                right_indices.append(i)
        return left_indices, right_indices

    def compute_entropy(self, indices):
        '''Compute the entropy of the samples at a node.'''
        entropy = 0.
        if len(indices) == 0:
            return entropy
        count = np.zeros(3)  # per-class counts for the 3 iris classes
        for i in indices:
            count[self.y_train[i]] += 1
        for c in range(3):
            term = count[c] / len(indices)
            if term != 0:
                entropy += term * np.log2(term)
        return -entropy
    
    def compute_information_gain(self, indices, feature, value):
        '''Compute the information gain of a candidate split.'''
        # feature: one of 0, 1, 2, 3
        left_indices, right_indices = self.split_dataset(indices, feature, value)
        left_entropy = self.compute_entropy(left_indices)
        right_entropy = self.compute_entropy(right_indices)
        w_left = len(left_indices) / len(indices)
        w_right = len(right_indices) / len(indices)
        weighted_entropy = w_left * left_entropy + w_right * right_entropy
        information_gain = self.compute_entropy(indices) - weighted_entropy
        return information_gain

    def get_best_split(self, indices):
        '''Choose the split for this node.'''
        best_feature, best_value = -1, None
        max_info_gain = -1
        # Try every observed value of each of the k features as a threshold,
        # and keep the feature/value pair with the largest information gain
        for feature in range(self.k):
            values = [self.X_train[i, feature] for i in indices]
            for value in values:
                info_gain = self.compute_information_gain(indices, feature, value)
                if info_gain > max_info_gain:
                    max_info_gain = info_gain
                    best_feature = feature
                    best_value = value

        return best_feature, max_info_gain, best_value

    def build_tree_recursive(self, indices, branch_name, current_depth):
        if current_depth == self.max_depth or len(indices) == 0:
            formatting = " "*current_depth + "-"*current_depth
            print(f"{formatting} {branch_name} leaf node with indices {indices}")
            return
        best_feature, max_info_gain, best_value = self.get_best_split(indices)
        if max_info_gain <= 0:  # node is pure or no useful split exists
            formatting = " "*current_depth + "-"*current_depth
            print(f"{formatting} {branch_name} leaf node with indices {indices}")
            return
        self.tree.append((current_depth, branch_name, best_feature, best_value))
        
        formatting = "-"*current_depth
        print(f"{formatting} Depth {current_depth}, {branch_name}: Split on feature {best_feature} at value {best_value}")
        
        left_indices, right_indices = self.split_dataset(indices, best_feature, best_value)
        
        self.build_tree_recursive(left_indices, "Left", current_depth+1)
        self.build_tree_recursive(right_indices, "Right", current_depth+1)

def load_data():
    iris = load_iris()
    iris_feature = iris.data
    iris_target = iris.target
    # Optionally split into training and test sets, e.g.:
    # X_train, X_test = iris_feature[:120, :], iris_feature[120:, :]
    # y_train, y_test = iris_target[:120], iris_target[120:]
    return iris_feature, iris_target

if __name__ == '__main__':
    X_train, y_train = load_data()
    model = DecisionTree(X_train, y_train, "Root", max_depth=3)
    # Start the recursion from all sample indices at the root
    root_indices = list(range(len(X_train)))
    model.build_tree_recursive(root_indices, "Root", current_depth=0)
From: https://www.cnblogs.com/MrFeng2997/p/17592029.html
