如何从Google Dataflow中的PCollection中获取元素列表并在管道中使用它来循环写入转换？

Question

我正在使用带有Python SDK的Google Cloud Dataflow。

我想要：

从主PCollection中获取唯一日期列表
循环遍历该列表中的日期以创建过滤的PCollections（每个都具有唯一的日期），并将每个过滤的PCollection写入BigQuery中时间分区表中的分区。

我怎样才能获得该列表？在下面的combine变换之后，我创建了一个ListPCollectionView对象但是我无法迭代该对象：

class ToUniqueList(beam.CombineFn):

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        return list(set(accumulators))

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):

    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))

我做错了吗？最好的方法是什么？

谢谢。

Answer 1

不可能直接获取PCollection的内容 - Apache Beam或Dataflow管道更像是应该进行哪些处理的查询计划，其中PCollection是计划中的逻辑中间节点，而不是包含数据。主程序组装计划（管道）并将其解决。

但是，最终您要尝试将数据写入按日期分片的BigQuery表。此用例目前仅支持in the Java SDK，仅适用于流媒体管道。

有关根据数据将数据写入多个目的地的更一般处理方法，请按照BEAM-92进行操作。

另见Creating/Writing to Parititoned BigQuery table via Google Cloud Dataflow

如何从Google Dataflow中的PCollection中获取元素列表并在管道中使用它来循环写入转换？

问题描述投票：2回答：1

1个回答

最新问题

如何从Google Dataflow中的PCollection中获取元素列表并在管道中使用它来循环写入转换？

问题描述 投票：2回答：1

1个回答

最新问题

问题描述投票：2回答：1