将 Apache Beam Python 与 GCP Dataflow 结合使用时，是否具体化 GroupByKey 的结果是否重要？

Question

将 Apache Beam Python 与 GCP Dataflow 结合使用时，具体化 GroupByKey 的结果是否存在缺点，例如，计算元素的数量。例如：

def consume_group_by_key(element):
    season, fruits = element

    for fruit in fruits:
        yield f"{fruit} grows in {season}"


def consume_group_by_key_materialize(element):
    season, fruits = element

    num_fruits = len(list(fruits))
    print(f"There are {num_fruits} fruits grown in {season}")

    for fruit in fruits:
        yield f"{fruit} grows in {season}"


(
    pipeline
    | 'Create produce counts' >> beam.Create([
        ('spring', 'strawberry'),
        ('spring', 'carrot'),
        ('spring', 'eggplant'),
        ('spring', 'tomato'),
        ('summer', 'carrot'),
        ('summer', 'tomato'),
        ('summer', 'corn'),
        ('fall', 'carrot'),
        ('fall', 'tomato'),
        ('winter', 'eggplant'),
    ])
    | 'Group counts per produce' >> beam.GroupByKey()
    | beam.ParDo(consume_group_by_key_generator)
)

分组后的值

fruits

是否作为生成器传递给了我的DoFn？使用

consume_group_by_key_materialize

而不是

consume_group_by_key

是否会降低性能？或者换句话说，通过像

fruits

这样的东西来实现

len(list(fruits))

？如果有数十亿个水果，这会耗尽我所有的记忆吗？

Answer 1

你是对的，

len(list(fruits))

会在获取它的大小之前在内存中具体化整个列表，而你的

consume_group_by_key

函数迭代可迭代一次并且（在像 Dataflow 这样的分布式运行器上）不需要将整个值集放入内存一次。

将 Apache Beam Python 与 GCP Dataflow 结合使用时，是否具体化 GroupByKey 的结果是否重要？

问题描述投票：0回答：1

1个回答

最新问题

将 Apache Beam Python 与 GCP Dataflow 结合使用时，是否具体化 GroupByKey 的结果是否重要？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1