Boto3: download all files in a subdirectory

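I'm trying to use the boto3 Python SDK. I have a bucket named "tmp" with keys that look like "my_test1/logABC1.json", "my_test1/logABC2.json", "my_test1/logABC3.json", etc., plus a bunch of other stuff that is meaningless to me. What I want is to download all the files in the my_test1 directory. Here is what I tried: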

    import boto3

    counter = 1
    client = boto3.client("s3")  # access keys/secrets and endpoints omitted for brevity
    abc = client.list_objects(Bucket="tmp")
    # Each entry in "Contents" is a dict; the object key is under "Key"
    for x in abc["Contents"]:
        if "my_test1" in x["Key"]:
            location = "logABC" + str(counter) + ".json"
            client.download_file("tmp", x["Key"], location)
            counter += 1

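This "works" as long as there are fewer than 1000 items in my tmp bucket. Beyond that it doesn't work at all, because list_objects returns at most 1000 elements per the boto3 [documentation][1], and anything after that stays stuck in the cloud. My question is: how do I get around this limit? I see there is a list_objects_v2 that can (technically) start after the first 1000 keys (with some work), but am I missing something, or is that my best option? If it is my best option, do I just write a while loop that terminates once a response contains fewer than 1000 keys?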

As an aside, even if I directly call

    client.download_file("tmp", "my_test1/logABC2.json", "my_loc.json")

the file can't be found whenever "my_test1/logABC2.json" is a key past the first 1000. I see there is such a thing as a resource, and if I define

    rsce = boto3.resource("s3")  # access keys/secrets and endpoints omitted for brevity
    rsce.Bucket("tmp").download_file("my_test1/logABC2.json", "my_loc.json")

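this works even when "my_test1/logABC2.json" is not among the first 1000 keys (or at least it did in my sample test), but since I don't know the exact file names I'm looking for, it doesn't look like a good option.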

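My question is: how do I download all the files in a subdirectory when the bucket is very large? I feel like I must be missing something or doing something wrong, because this must come up for other people. (I use "subdirectory" loosely: I know buckets have no such thing as subdirectories, but with proper use of delimiters you can effectively fake them.)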

Thanks for any pointers.

  [1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects.html

You need to call list_objects multiple times to fetch each "page" of up to 1000 items. boto3 provides paginators to make this easier. For example:

    #!/usr/bin/env python3

    import boto3

    bucket = "tmp"
    counter = 1
    # Access keys/secrets should not be in code; use "aws configure" or instance profiles
    # See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    client = boto3.client("s3")
    # Create a paginator to page through the responses
    paginator = client.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=bucket):
        # Operate on each page. Technically a page may not return any
        # contents, so use .get() here to handle that case
        for x in page.get("Contents", []):
            # Each entry is a dict; the object key is stored under "Key"
            if "my_test1" in x["Key"]:
                location = "logABC" + str(counter) + ".json"
                client.download_file(bucket, x["Key"], location)
                counter += 1
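
If the bucket is large, it may also be worth letting S3 do the filtering instead of checking every key client-side. Here is a rough sketch of that idea using the `list_objects_v2` paginator with a `Prefix`. The helper name `download_prefix`, the `dest_dir` argument, and the choice to name local files after the last path component of each key are my own assumptions for illustration, not something from the question. The function takes the client as a parameter and expects it to behave like `boto3.client("s3")`.

```python
import os


def download_prefix(client, bucket, prefix, dest_dir):
    """Download every object whose key starts with `prefix` into `dest_dir`.

    `client` is assumed to behave like boto3.client("s3").
    Returns the list of local file paths that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paginator = client.get_paginator("list_objects_v2")
    written = []
    # Passing Prefix asks S3 to list only the matching keys,
    # so unrelated objects in the bucket are never returned.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page can come back without "Contents" (e.g. no matches)
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip "folder" placeholder objects, which have no file name
                continue
            # Hypothetical naming scheme: keep only the last path component
            local_path = os.path.join(dest_dir, os.path.basename(key))
            client.download_file(bucket, key, local_path)
            written.append(local_path)
    return written
```

With credentials configured, the call would look something like `download_prefix(boto3.client("s3"), "tmp", "my_test1/", "downloads")`. Note that using only the base name means two keys in different nested "folders" with the same file name would overwrite each other, so adjust the local path logic if your prefix has deeper nesting.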
