How can I make the Apriori algorithm faster?


TL;DR question

This is a long question, so here is a TL;DR version: I am implementing the Apriori algorithm and it is slow. The slow part is when I generate Lk from Ck, where the algorithm has to scan the whole dataset (more or less) to find out whether each candidate is frequent. How can I make this part faster? Is there any data structure that speeds up this repeated searching of the dataset?

Long question

My assignment is to write the Apriori algorithm in Python. The constraints are that I may not use pandas, nor any module that already implements Apriori, such as apyori. Generating C1 and L1 is no problem at all. Generating Ck from Lk-1 is fine too. The bottleneck of the whole thing is generating Lk from Ck: this part compares every candidate against the entire dataset, which takes a long time for small minimum support counts.

I spent some time looking for improved versions of Apriori, and of the ones I could understand this one is the best (IMO): https://arxiv.org/pdf/1403.3948.pdf

The paper proposes keeping, for each itemset, a list of the transactions that itemset appears in (let's call it found_in). Armed with that, we can scan fewer elements when generating Lk from Ck, because we only need to check the transactions mentioned in that list (found_in).
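The idea can be sketched like this (a minimal toy illustration of the TID-list technique, not the paper's exact pseudocode; the dataset and names here are made up):

```python
# Toy illustration: keep, for each itemset, the set of transaction
# IDs (TIDs) it appears in -- the "found_in" list from the paper.
dataset = [
    {1, 2, 3},  # TID 0
    {1, 3, 4},  # TID 1
    {3, 4, 5},  # TID 2
]

# Build TID sets for every 1-itemset.
tid_lists = {}
for tid, transaction in enumerate(dataset):
    for item in transaction:
        tid_lists.setdefault((item,), set()).add(tid)

# To count a candidate pair, only the parents' TIDs need to be
# inspected, never the whole dataset.
candidate_tids = tid_lists[(3,)] & tid_lists[(4,)]
print(candidate_tids)  # {1, 2}, i.e. support count 2
```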

I implemented it, and it cut the time by a factor of about 4, which is amazing! However, it is still not fast enough for the assignment, since I am supposed to extract 40,000 frequent patterns.

So I am thinking that maybe the algorithm is fine, but the Python data structures I am using are too slow to keep up. Hence my questions:

  1. Can I make this algorithm faster by using better data structures? With ctypes, maybe?
  2. Is there something wrong in my code that makes it hang? The results of this algorithm look correct to me (checked against apyori, below).
  3. Are there any small tips or tricks that could improve it, or common pitfalls I might be hitting?

I know this question would take a lot of time to investigate properly, and I do not expect that. Any small tip is appreciated.
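On question 1, one cheap structural win that needs no ctypes (a sketch with my own names, not a drop-in patch): convert every transaction to a set once, up front. The per-item membership test then becomes a C-level hash lookup instead of a linear scan over a Python list.

```python
raw_dataset = [[1, 2, 3], [1, 3, 4], [3, 4, 5]]

# One-time conversion: list-of-lists -> list-of-sets.
dataset = [set(t) for t in raw_dataset]

# Candidate containment is now a handful of hash lookups per
# transaction rather than a scan of the whole list.
candidate = (3, 4)
support_count = sum(1 for t in dataset if t.issuperset(candidate))
print(support_count)  # 2
```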

The slow part of the code:

def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID]) -- written
            # with not any() as a small speed-up, but this is still the
            # slowest part of the code
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk

The whole code (sorry it is not well commented):

from itertools import combinations
import pickle
from time import time


def has_infrequent_subset(candidate: set, previous_L: list) -> bool:
    """
    A function to prune some of the candidates

    Parameters
    ----------
    candidate -- a set to check whether all of its subsets are frequent or not.
        if any subset is not frequent, the function will return True,
        otherwise it returns False.
    previous_L -- a list of tuples to check candidate's subsets against it.
        an instance of previous_L could be found in 'Examples' part.

    Returns
    -------
    a boolean value. True means some of the candidate's subsets are not
    frequent with respect to previous_L, so this candidate should not be
    included in the final Ck result. False means all subsets are frequent and
    we shall include this candidate in our Ck result.

    Examples
    --------
    >>> previous_L = [(1,2,4),(2,3,6),(2,3,8),(2,6,7),(2,6,8),(3,4,5),(3,6,8)]
    >>> has_infrequent_subset((2,3,6,8), previous_L)
    False
    >>> has_infrequent_subset((2,3,6,7), previous_L)
    True
    """
    subsets = combinations(candidate, len(candidate)-1)
    for subset in subsets:  # subset is a tuple
        if subset not in previous_L:
            return True
    return False


def apriori_gen(previous_L: dict) -> dict:
    """
    A function that generates candidates from Lk-1 (previous_L). It tries to
    prune the results with the help of has_infrequent_subset(): for every new
    candidate found, if any of its subsets of length k-1 is not frequent in
    Lk-1 (previous_L), the candidate will not be added to the result.
    """
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                new_TIDs = TIDs1 if len(TIDs1) < len(TIDs2) else TIDs2
                Ck[new_item] = new_TIDs
    return Ck
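The join condition in apriori_gen (`item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]`) merges two sorted (k-1)-itemsets that share all but their last item. A tiny standalone check of just that rule (toy data, my own names):

```python
# Keys of a toy previous_L (frequent 2-itemsets, kept sorted).
previous_L_keys = [(1, 2), (1, 3), (2, 3)]

candidates = []
for a in previous_L_keys:
    for b in previous_L_keys:
        # Join two sorted (k-1)-tuples sharing everything but the
        # last item; the a[-1] < b[-1] check avoids duplicate joins.
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            candidates.append((*a, b[-1]))
print(candidates)  # [(1, 2, 3)]
```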


def generate_L1(dataset: list, min_support_count: int) -> dict:
    """
    Generates L1 itemset from given dataset with respect to min_support_count

    Parameters
    ----------
    dataset -- a list of lists. each inner list represents a transaction whose
        content is the items bought in that transaction. the outer list is the
        dataset, which contains all transactions.
    min_support_count -- an integer which is used to check whether one item is
        frequent or not.

    Returns
    -------
    a dictionary with keys representing the L1 frequent itemsets found and
    values that are sets of the transactions each itemset appeared in.
    the values will be useful later as this paper demonstrates:
    https://arxiv.org/pdf/1403.3948.pdf

    Examples
    --------
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 3)
    {(3,): {0, 1, 2}}
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 2)
    {(1,): {0, 1}, (3,): {0, 1, 2}, (4,): {1, 2}}
    """
    L1 = {}
    for TID, transaction in enumerate(dataset):
        for item in transaction:
            if (item,) not in L1:
                L1[(item,)] = set()
            L1[(item,)].add(TID)
    return {item: TIDs for item, TIDs in L1.items()
            if len(TIDs) >= min_support_count}


def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID]) -- written
            # with not any() as a small speed-up, but this is still the
            # slowest part of the code
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk


def apriori(min_support_count: int, dataset: list):
    st = time()
    L1 = generate_L1(dataset, min_support_count)
    L = {1: L1}
    for k in range(2, 1000):
        if len(L[k-1]) < 2:
            break
        Ck = apriori_gen(L[k-1])
        L[k] = gen_Lk(Ck, dataset, min_support_count)
    return L


if __name__ == '__main__':
    with open(paths.q1.listed_ds, 'rb') as f:
        dataset = pickle.load(f)
    L = apriori(len(dataset)*0.03, dataset)

    result = []
    for _, Lk in L.items():
        result.extend(Lk.keys())
    print(result, len(result))
1 Answer

Maybe a bit late for your assignment, but:

The new TIDs are already computed in apriori_gen:

new_TIDs = TIDs1.intersection(TIDs2)

so you can simply reuse those new TIDs like this:

def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, new_TIDs in Ck.items():
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
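Why this works: once apriori_gen stores the exact intersection `TIDs1.intersection(TIDs2)` instead of just the smaller parent's TID set, the TID set attached to each candidate already is its exact support set, so gen_Lk never needs to rescan the dataset. A quick toy check (the data and names here are mine):

```python
# TID sets of two frequent (k-1)-itemsets that share a prefix.
TIDs1 = {0, 1, 2, 5}  # transactions containing the first parent
TIDs2 = {1, 2, 4}     # transactions containing the second parent

# A transaction contains the joined candidate iff it contains both
# parents, so the intersection is the candidate's exact TID set.
new_TIDs = TIDs1.intersection(TIDs2)
print(new_TIDs)  # {1, 2}

# Filtering on min_support_count is then just a len() check.
min_support_count = 2
is_frequent = len(new_TIDs) >= min_support_count
print(is_frequent)  # True
```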