I'm trying to write a Python script that goes to an AWS S3 bucket link and counts how many files are there. There are many files, but I only want to count those whose names start with
file_
These files are incremental in nature. On top of that, I also have to traverse folders, because the files in them are chunks of a video, so I want to walk through each quality level and check its chunk count as well.
The path looks something like:
s3://url/144p/
Inside this path are the chunks that need to be counted.
I'm using the boto3 library. Here is my code:
import csv
import boto3

# Base S3 URL
base_s3_url = 's3://coursevideotesting/'  # Replace this with your base S3 URL

# Input and output CSV file names
input_csv_file = 'ldt_ffw_course_videos_temp.csv'  # Replace with your input CSV file name
output_csv_file = 'file_count_result.csv'  # Replace with your output CSV file name

# Function to count 'file_000.ts' objects in a specific S3 folder
def count_file_objects(s3_bucket, s3_folder):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_folder)
    # Count 'file_000.ts' objects in the folder
    count = sum(1 for obj in response.get('Contents', []) if obj['Key'].startswith('file_000.ts'))
    return count

# Read URLs from input CSV and check file counts
with open(input_csv_file, mode='r') as infile, open(output_csv_file, mode='w', newline='') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = ['URL', 'Actual Files', 'Expected Files']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        s3_url = base_s3_url + row['course_video_s3_url']
        expected_files = int(row['course_video_ts_file_cnt'])
        actual_files = count_file_objects('coursevideotesting', s3_url)
        writer.writerow({'URL': s3_url, 'Actual Files': actual_files, 'Expected Files': expected_files})
What I'm getting:
URL,Actual Files,Expected Files
s3://coursevideotesting/.../144p/,0,28
s3://coursevideotesting/.../144p/,0,34
s3://coursevideotesting/.../144p/,0,54
s3://coursevideotesting/.../144p/,0,57
What I expect:
URL,Actual Files,Expected Files
s3://coursevideotesting/.../144p/,28,28
s3://coursevideotesting/.../144p/,34,34
s3://coursevideotesting/.../144p/,52,54
s3://coursevideotesting/.../144p/,57,57
If files are missing or corrupted, Actual Files will be smaller than Expected Files. That way I can deal with those chunks directly instead of checking all of them manually. I have 5 quality variants, and each folder contains several chunks.
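In other words, the per-quality counting I'm after looks like this pure-Python sketch (the keys below are made up for illustration; in the real script they would come from listing the bucket):

```python
from collections import Counter
import posixpath

# Hypothetical object keys for illustration only
keys = [
    "144p/file_000.ts",
    "144p/file_001.ts",
    "240p/file_000.ts",
    "144p/index.m3u8",
]

# Count chunks per quality folder, keeping only names that start with 'file_'
counts = Counter(
    posixpath.dirname(k)
    for k in keys
    if posixpath.basename(k).startswith("file_")
)
print(dict(counts))  # {'144p': 2, '240p': 1}
```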
You might find the
resource
method easier to use than the
client
method. It handles pagination for you (dealing with more than 1000 objects), and the functions are more Pythonic.
For example, you can count the files like this:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('BUCKET-NAME')

prefix = 'foo/'

count = 0
for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('/file_000.ts'):
        count += 1

print(count)
Note that this example uses
endswith('/file_000.ts')
because an object's key can look like this:
144p/something/file_000.ts
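To see why this matters, here is a quick offline check (no AWS call; the keys are made-up examples). Matching the full key with startswith never succeeds, because the key carries the folder prefix, while checking only the basename counts the chunks as intended:

```python
import posixpath

# Hypothetical keys, shaped like what a bucket listing would return
keys = [
    "144p/lesson-01/file_000.ts",
    "144p/lesson-01/file_001.ts",
    "144p/lesson-01/playlist.m3u8",
]

# The original check: full keys never start with 'file_', so this is always 0
naive = sum(1 for k in keys if k.startswith("file_"))
print(naive)  # 0

# Checking the basename instead counts the chunk files correctly
chunks = sum(1 for k in keys if posixpath.basename(k).startswith("file_"))
print(chunks)  # 2
```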