Example of Implementing the C4.5 Decision Tree Algorithm in Python


What Is the C4.5 Decision Tree Algorithm?

The C4.5 decision tree algorithm is a widely used classification algorithm. Its basic idea is to recursively partition a dataset, building a tree structure that classifies the data. C4.5 is an improved version of ID3: it replaces ID3's information-gain criterion with the gain ratio, which fixes ID3's bias toward features with many distinct values.

Implementation Steps of the C4.5 Algorithm

The C4.5 algorithm proceeds as follows:

  1. Select the best feature: use the information gain ratio (defined after this list) to pick the best feature as the splitting feature for the current node.

  2. Split the dataset: partition the dataset into subsets according to the values of the selected feature.

  3. Build the tree recursively: repeat steps 1 and 2 on each subset until every subset belongs to a single class or can no longer be split.

  4. Prune the tree: prune the finished tree to improve its generalization ability.
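
For reference, these are the standard C4.5 quantities that step 1 refers to and that the code below computes. For a dataset $D$ with classes $k = 1, \dots, K$ (where $p_k$ is the proportion of class $k$ in $D$) and a feature $A$ taking $V$ distinct values (where $D^v$ is the subset of $D$ with the $v$-th value):

$$\mathrm{Ent}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

$$\mathrm{Gain}(D, A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

$$\mathrm{SplitInfo}(D, A) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}$$

ID3 maximizes $\mathrm{Gain}$ alone; C4.5 maximizes $\mathrm{GainRatio}$, whose $\mathrm{SplitInfo}$ denominator grows with the number of branches and thereby penalizes many-valued features.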

Example 1: Classifying a Numeric Dataset with C4.5

The following example demonstrates how to classify data with the C4.5 algorithm. Suppose we have a dataset of 5 samples, each described by two features, age and income, plus a class label indicating which class the sample belongs to:

Age Income Class
20 2000 0
25 2500 0
30 3000 1
35 3500 1
40 4000 1

We will use the C4.5 algorithm to classify this dataset.

import math

def calc_entropy(data_set):
    """Shannon entropy of the class labels (last column) in data_set."""
    num_entries = len(data_set)
    label_counts = {}
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        label_counts[current_label] = label_counts.get(current_label, 0) + 1
    entropy = 0.0
    for count in label_counts.values():
        prob = count / num_entries
        entropy -= prob * math.log(prob, 2)
    return entropy

def split_data_set(data_set, axis, value):
    """Return the samples whose feature `axis` equals `value`, with that feature column removed."""
    ret_data_set = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            reduced_feat_vec = feat_vec[:axis] + feat_vec[axis+1:]
            ret_data_set.append(reduced_feat_vec)
    return ret_data_set

def choose_best_feature_to_split(data_set):
    """Return the index of the feature with the highest information gain ratio."""
    num_features = len(data_set[0]) - 1  # the last column is the class label
    base_entropy = calc_entropy(data_set)
    best_info_gain_ratio = 0.0
    best_feature = -1
    for i in range(num_features):
        unique_vals = set(example[i] for example in data_set)
        new_entropy = 0.0   # conditional entropy after splitting on feature i
        split_info = 0.0    # intrinsic information of the split itself
        for value in unique_vals:
            sub_data_set = split_data_set(data_set, i, value)
            prob = len(sub_data_set) / len(data_set)
            new_entropy += prob * calc_entropy(sub_data_set)
            split_info -= prob * math.log(prob, 2)
        if split_info == 0.0:
            continue  # the feature has only one value: splitting on it is useless
        info_gain = base_entropy - new_entropy
        info_gain_ratio = info_gain / split_info
        if info_gain_ratio > best_info_gain_ratio:
            best_info_gain_ratio = info_gain_ratio
            best_feature = i
    return best_feature

def majority_cnt(class_list):
    """Most frequent class label, used when no features are left to split on."""
    class_count = {}
    for vote in class_list:
        class_count[vote] = class_count.get(vote, 0) + 1
    return max(class_count.items(), key=lambda x: x[1])[0]

def create_tree(data_set, labels):
    """Recursively build the decision tree as nested dicts; leaves are class labels."""
    class_list = [example[-1] for example in data_set]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]             # all samples share one class: pure leaf
    if len(data_set[0]) == 1:
        return majority_cnt(class_list)  # no features left: majority vote
    best_feat = choose_best_feature_to_split(data_set)
    best_feat_label = labels[best_feat]
    my_tree = {best_feat_label: {}}
    # copy the label list instead of deleting in place, so the caller's list is not mutated
    sub_labels = labels[:best_feat] + labels[best_feat+1:]
    for value in set(example[best_feat] for example in data_set):
        my_tree[best_feat_label][value] = create_tree(
            split_data_set(data_set, best_feat, value), sub_labels)
    return my_tree

data_set = [[20, 2000, 0], [25, 2500, 0], [30, 3000, 1], [35, 3500, 1], [40, 4000, 1]]
labels = ['age', 'income']
my_tree = create_tree(data_set, labels)
print(my_tree)

In this example we define five functions, calc_entropy, split_data_set, choose_best_feature_to_split, majority_cnt, and create_tree, which together implement the C4.5 algorithm, then call create_tree with data_set and labels to build the tree and print it. Because every age value in this toy dataset is unique and each single-sample subset is therefore pure, the resulting tree is a single split on age: {'age': {20: 0, 25: 0, 30: 1, 35: 1, 40: 1}} (branch order may vary). Note that this simple implementation treats every feature as categorical, creating one branch per distinct value; full C4.5 handles continuous attributes such as these by searching for a binary threshold split instead.
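
The code above only builds the tree; to classify a new sample you also need to walk the tree. The classify helper below is our own addition, a minimal sketch rather than part of the original listing; it assumes the nested-dict format produced by create_tree and returns None for feature values never seen during training (a fuller implementation would fall back to the majority class there).

def classify(tree, feature_labels, sample):
    """Walk the nested-dict tree until a leaf (a class label) is reached."""
    if not isinstance(tree, dict):
        return tree                                  # leaf node: the class label itself
    feat_label = next(iter(tree))                    # feature tested at this node
    feat_index = feature_labels.index(feat_label)    # its position in the sample
    subtree = tree[feat_label].get(sample[feat_index])
    if subtree is None:
        return None  # unseen feature value; could fall back to a majority vote instead
    return classify(subtree, feature_labels, sample)

print(classify(my_tree, labels, [30, 3000]))  # -> 1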

Example 2: Classifying a Categorical Dataset with C4.5

Here is a second example of classification with the C4.5 algorithm. This time the dataset contains 8 samples, each with three categorical features, 色泽 (color), 根蒂 (root), and 敲声 (knock sound), plus a class label indicating which class the sample belongs to:

色泽 (color) 根蒂 (root) 敲声 (knock sound) 类别 (class)
青绿 蜷缩 浊响 0
乌黑 蜷缩 沉闷 0
乌黑 稍蜷 浊响 1
青绿 稍蜷 浊响 1
浅白 蜷缩 浊响 0
乌黑 稍蜷 沉闷 1
青绿 硬挺 浊响 0
浅白 稍蜷 浊响 1

We will use the C4.5 algorithm to classify this dataset.

The five functions (calc_entropy, split_data_set, choose_best_feature_to_split, majority_cnt, and create_tree) are identical to those defined in Example 1, so we reuse them here; only the dataset and the feature labels change:

data_set = [['青绿', '蜷缩', '浊响', 0], ['乌黑', '蜷缩', '沉闷', 0], ['乌黑', '稍蜷', '浊响', 1], ['青绿', '稍蜷', '浊响', 1], ['浅白', '蜷缩', '浊响', 0], ['乌黑', '稍蜷', '沉闷', 1], ['青绿', '硬挺', '浊响', 0], ['浅白', '稍蜷', '浊响', 1]]
labels = ['色泽', '根蒂', '敲声']
my_tree = create_tree(data_set, labels)
print(my_tree)

In this example we reuse the same five functions and call create_tree with the new data_set and labels to build the decision tree and print it.
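
As a sanity check (our own worked computation, not part of the original article): grouping the 8 samples by 根蒂 yields three pure subsets, 蜷缩 → classes {0, 0, 0}, 稍蜷 → {1, 1, 1, 1}, and 硬挺 → {0}, so splitting on it removes all class uncertainty. With $\mathrm{Ent}(D) = 1.0$ (four samples of each class):

$$\mathrm{Gain} = \mathrm{Ent}(D) - 0 = 1.0, \qquad \mathrm{SplitInfo} = -\tfrac{3}{8}\log_2\tfrac{3}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{1}{8}\log_2\tfrac{1}{8} \approx 1.406, \qquad \mathrm{GainRatio} \approx 0.711$$

This beats 色泽 (gain ratio ≈ 0.039) and 敲声 (gain 0), so the printed tree should be a single split: {'根蒂': {'蜷缩': 0, '稍蜷': 1, '硬挺': 0}}.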

Conclusion

This article has shown how to implement the C4.5 decision tree algorithm in Python, with two worked examples. In practice, choose the implementation approach that fits the problem at hand, and keep in mind that the simplified version shown here omits pruning (step 4 above) and continuous-attribute handling, both of which full C4.5 provides.