Implementing the Apriori Association Rule Algorithm in Python: An Example

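This guide walks through an example implementation of the Apriori association rule algorithm in Python. It covers how the algorithm works, the Python code, and two usage examples.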

How the algorithm works

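Apriori is a widely used association rule mining algorithm. It scans the dataset to find frequent itemsets and then derives association rules from them. The support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A => B is the support of A and B together divided by the support of A. The steps are:

  1. Scan the dataset and compute the support of each individual item;
  2. Keep the items that meet the minimum support threshold; these are the frequent 1-itemsets;
  3. Generate candidate 2-itemsets from the frequent 1-itemsets;
  4. Scan the dataset and compute the support of each candidate k-itemset (starting with k = 2);
  5. Keep the candidates that meet the minimum support threshold; these are the frequent k-itemsets;
  6. Generate candidate (k+1)-itemsets from the frequent k-itemsets;
  7. Repeat steps 4 to 6 until no new frequent itemsets are found;
  8. Generate association rules from the frequent itemsets and compute each rule's confidence;
  9. Keep the rules that meet the minimum confidence threshold.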
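The two measures used in these steps can be checked by hand on a tiny dataset. The following is a minimal sketch rather than part of the implementation in this article; the helper names `support_of` and `confidence_of` and the toy baskets are assumptions made for illustration.

```python
# Toy data: each transaction is a set of items (hypothetical example data).
baskets = [
    {"milk", "bread"},
    {"milk", "diapers"},
    {"milk", "bread", "diapers"},
    {"bread"},
]

def support_of(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence_of(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support_of(both, transactions) / support_of(antecedent, transactions)

print(support_of({"milk"}, baskets))           # milk appears in 3 of 4 transactions
print(support_of({"milk", "bread"}, baskets))  # both appear together in 2 of 4
print(confidence_of({"milk"}, {"bread"}, baskets))
```

Here the rule milk => bread has a confidence of two thirds: of the three baskets that contain milk, two also contain bread.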

Python implementation

The following Python code implements the Apriori algorithm:

def apriori(transactions, min_support, min_confidence):
    # Mine the frequent itemsets first, then derive rules from them.
    itemsets, support = find_frequent_itemsets(transactions, min_support)
    rules = generate_rules(itemsets, support, min_confidence)
    return rules

def find_frequent_itemsets(transactions, min_support):
    # Normalise each transaction to a frozenset so that lists also work
    # and duplicate items inside one transaction are counted once.
    transactions = [frozenset(t) for t in transactions]
    support = {}
    # Candidate 1-itemsets: one frozenset per distinct item, counts start at 0.
    itemsets = {frozenset([item]): 0 for t in transactions for item in t}
    k = 2
    while itemsets:
        # Count the current candidates and keep only the frequent ones.
        itemsets, support = prune_itemsets(itemsets, support, min_support, transactions)
        # Build candidate k-itemsets from the frequent (k-1)-itemsets.
        itemsets = generate_candidate_itemsets(itemsets, k)
        k += 1
    # support now maps every frequent itemset, of any size, to its support.
    return list(support), support

def generate_candidate_itemsets(itemsets, k):
    # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset.
    candidates = {}
    for itemset1 in itemsets:
        for itemset2 in itemsets:
            union = itemset1.union(itemset2)
            if len(union) == k:
                candidates[union] = 0
    return candidates

def prune_itemsets(itemsets, support, min_support, transactions):
    n = len(transactions)
    for itemset in itemsets.copy():
        # Count the transactions that contain this candidate.
        for transaction in transactions:
            if itemset.issubset(transaction):
                itemsets[itemset] += 1
        # Drop infrequent candidates; record the support of frequent ones.
        if itemsets[itemset] / n < min_support:
            del itemsets[itemset]
        else:
            support[itemset] = itemsets[itemset] / n
    return itemsets, support

def generate_rules(itemsets, support, min_confidence):
    # Only rules with a single-item antecedent are generated here.
    rules = []
    for itemset in itemsets:
        if len(itemset) > 1:
            for item in itemset:
                antecedent = frozenset([item])
                consequent = itemset.difference(antecedent)
                # confidence(A => B) = support(A and B) / support(A)
                confidence = support[itemset] / support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules

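In the code above, the apriori function is the entry point. Its transactions parameter is the list of transactions, min_support is the minimum support threshold, and min_confidence is the minimum confidence threshold. It calls find_frequent_itemsets to find the frequent itemsets and generate_rules to derive the association rules.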
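The generate_candidate_itemsets function above only joins itemsets. The classic description of Apriori also prunes any candidate that has an infrequent subset before it is counted, since every subset of a frequent itemset must itself be frequent. The sketch below illustrates that optional refinement; `has_infrequent_subset` and the sample itemsets are hypothetical names and data, not part of the listing above.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Return True if some (k-1)-subset of `candidate` is not frequent."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_prev
        for subset in combinations(candidate, k - 1)
    )

# Hypothetical frequent 2-itemsets found in an earlier pass.
frequent_2 = {
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
    frozenset({"b", "d"}),
}

# Join step: unions of two frequent 2-itemsets that contain exactly 3 items.
candidates_3 = {x | y for x in frequent_2 for y in frequent_2 if len(x | y) == 3}

# Prune step: drop candidates that have an infrequent 2-subset.
kept = {c for c in candidates_3 if not has_infrequent_subset(c, frequent_2)}
print(sorted(sorted(c) for c in kept))
```

Only {a, b, c} survives in this sample: {a, b, d} is dropped because {a, d} is not frequent, and {b, c, d} is dropped because {c, d} is not frequent. Pruning this way avoids a counting pass over the transactions for candidates that cannot be frequent.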

Examples

The two examples below show how to use the apriori function.

Example 1

Mine association rules from shopping basket data with the apriori function.

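transactions = [
    {"milk", "bread", "diapers"},
    {"cola", "bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "eggs"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"}
]

rules = apriori(transactions, min_support=0.4, min_confidence=0.8)

for antecedent, consequent, confidence in rules:
    print(f"{antecedent} => {consequent} (confidence: {confidence:.2f})")

Output (the order of the rules, and of the items inside a multi-item frozenset, can vary between runs because set iteration order is not fixed):

frozenset({'milk'}) => frozenset({'diapers'}) (confidence: 1.00)
frozenset({'diapers'}) => frozenset({'milk'}) (confidence: 0.80)
frozenset({'bread'}) => frozenset({'diapers'}) (confidence: 1.00)
frozenset({'diapers'}) => frozenset({'bread'}) (confidence: 0.80)
frozenset({'cola'}) => frozenset({'bread'}) (confidence: 1.00)
frozenset({'cola'}) => frozenset({'diapers'}) (confidence: 1.00)
frozenset({'beer'}) => frozenset({'diapers'}) (confidence: 1.00)
frozenset({'cola'}) => frozenset({'bread', 'diapers'}) (confidence: 1.00)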
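The generate_rules function in the implementation above builds only rules whose antecedent is a single item. A fuller rule generator would try every non-empty proper subset of a frequent itemset as the antecedent. The following is a minimal sketch of that idea; `rules_from_itemset` and the support values are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_confidence):
    """Yield (antecedent, consequent, confidence) for every split of `itemset`."""
    items = sorted(itemset)
    # Antecedent sizes run from 1 up to len(itemset) - 1.
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            antecedent = frozenset(combo)
            consequent = itemset - antecedent
            confidence = support[itemset] / support[antecedent]
            if confidence >= min_confidence:
                yield antecedent, consequent, confidence

# Hypothetical support values for one frequent 3-itemset and its subsets.
support = {
    frozenset({"x"}): 0.8,
    frozenset({"y"}): 0.6,
    frozenset({"z"}): 0.5,
    frozenset({"x", "y"}): 0.5,
    frozenset({"x", "z"}): 0.4,
    frozenset({"y", "z"}): 0.4,
    frozenset({"x", "y", "z"}): 0.4,
}

for a, c, conf in rules_from_itemset(frozenset({"x", "y", "z"}), support, 0.75):
    print(sorted(a), "=>", sorted(c), f"{conf:.2f}")
```

With these sample values, four splits pass the 0.75 threshold, including three whose antecedent has two items. A single-item-antecedent generator would find only z => {x, y}.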

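Example 2

Mine association rules from movie rating data with the apriori function. The code assumes that ratings.csv has userId and movieId columns and that movies.csv has movieId and title columns. Each user's list of rated titles becomes one transaction.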

import pandas as pd

ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")

data = pd.merge(ratings, movies, on="movieId")
data = data[["userId", "title"]]
data = data.groupby("userId")["title"].apply(list).reset_index(name="movies")

transactions = data["movies"].tolist()

rules = apriori(transactions, min_support=0.1, min_confidence=0.5)

for antecedent, consequent, confidence in rules:
    print(f"{antecedent} => {consequent} (confidence: {confidence:.2f})")

The output has the following form (the exact rules and confidence values depend on the contents of ratings.csv and movies.csv):

frozenset({'Pulp Fiction (1994)'}) => frozenset({'Forrest Gump (1994)'}) (confidence: 0.50)
frozenset({'Forrest Gump (1994)'}) => frozenset({'Pulp Fiction (1994)'}) (confidence: 0.50)
frozenset({'Shawshank Redemption, The (1994)'}) => frozenset({'Forrest Gump (1994)'}) (confidence: 0.50)
frozenset({'Forrest Gump (1994)'}) => frozenset({'Shawshank Redemption, The (1994)'}) (confidence: 0.50)
frozenset({'Shawshank Redemption, The (1994)'}) => frozenset({'Pulp Fiction (1994)'}) (confidence: 0.50)
frozenset({'Pulp Fiction (1994)'}) => frozenset({'Shawshank Redemption, The (1994)'}) (confidence: 0.50)
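The step that turns rating rows into one transaction per user can also be sketched without pandas or the CSV files. The rows below are hypothetical stand-ins for the merged (userId, title) table, not real rating data.

```python
from collections import defaultdict

# Hypothetical (userId, title) pairs standing in for the merged ratings table.
rows = [
    (1, "Movie A"),
    (1, "Movie B"),
    (2, "Movie A"),
    (2, "Movie C"),
    (2, "Movie A"),  # duplicate row for the same user and title
    (3, "Movie B"),
]

# Group titles per user; a set removes duplicate titles within one user.
by_user = defaultdict(set)
for user_id, title in rows:
    by_user[user_id].add(title)

# One transaction per user, ordered by user id.
transactions = [by_user[user_id] for user_id in sorted(by_user)]
print(transactions)
```

Using a set per user means a title rated twice by the same user still counts once toward its support.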

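Summary

This article covered a Python implementation of the Apriori algorithm: how the algorithm works, the code, and two usage examples. Apriori finds frequent itemsets by scanning the dataset repeatedly and then derives association rules from those itemsets. In practice, the minimum support and minimum confidence thresholds need to be chosen with care: thresholds that are too low produce a flood of weak rules, while thresholds that are too high leave few or none.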
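The effect of the support threshold can be seen on a small dataset by counting how many itemsets stay frequent as the threshold rises. This is a brute-force sketch for illustration only: it enumerates every possible itemset, which is feasible only for a handful of items. The function name `count_frequent_itemsets` and the data are assumptions.

```python
from itertools import combinations

# Hypothetical toy transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a"},
]

def count_frequent_itemsets(transactions, min_support):
    """Brute-force count of itemsets whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    total = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits / n >= min_support:
                total += 1
    return total

for threshold in (0.2, 0.4, 0.6):
    print(threshold, count_frequent_itemsets(transactions, threshold))
```

On this data the number of frequent itemsets falls from 7 to 6 to 3 as the threshold rises from 0.2 to 0.4 to 0.6, so fewer itemsets are available for rule generation.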