I'm trying to write a Python script that goes to an AWS S3 bucket link and counts how many files are there. There are many files, but I only want to count those whose names start with
file_
These files are incremental in nature. On top of that, I also have to traverse folders, because the files in them are chunks of a video, so I want to walk through each quality level and check its chunk count as well.
The path looks something like:
s3://url/144p/
Inside this path are the chunks that need to be counted.
I'm using the boto3 library. Here is my code:
import csv
import boto3

# Base S3 URL
base_s3_url = 's3://coursevideotesting/'  # Replace this with your base S3 URL

# Input and output CSV file names
input_csv_file = 'ldt_ffw_course_videos_temp.csv'  # Replace with your input CSV file name
output_csv_file = 'file_count_result.csv'  # Replace with your output CSV file name

# Function to count 'file_000.ts' objects in a specific S3 folder
def count_file_objects(s3_bucket, s3_folder):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_folder)
    # Count 'file_000.ts' objects in the folder
    count = sum(1 for obj in response.get('Contents', []) if obj['Key'].startswith('file_000.ts'))
    return count

# Read URLs from input CSV and check file counts
with open(input_csv_file, mode='r') as infile, open(output_csv_file, mode='w', newline='') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = ['URL', 'Actual Files', 'Expected Files']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        s3_url = base_s3_url + row['course_video_s3_url']
        expected_files = int(row['course_video_ts_file_cnt'])
        actual_files = count_file_objects('coursevideotesting', s3_url)
        writer.writerow({'URL': s3_url, 'Actual Files': actual_files, 'Expected Files': expected_files})
What I'm getting:
URL,Actual Files,Expected Files
s3://coursevideotesting/.../144p/,0,28
s3://coursevideotesting/.../144p/,0,34
s3://coursevideotesting/.../144p/,0,54
s3://coursevideotesting/.../144p/,0,57
What I expect:
URL,Actual Files,Expected Files
s3://coursevideotesting/.../144p/,28,28
s3://coursevideotesting/.../144p/,34,34
s3://coursevideotesting/.../144p/,52,54
s3://coursevideotesting/.../144p/,57,57
If files are missing or corrupted, Actual Files will be smaller than Expected Files. That way I can deal with those chunks directly instead of checking all of them manually. I have 5 quality variants, and each folder contains several chunks.
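In other words, the per-quality counting I'm after looks like this pure-Python sketch (the keys below are made up for illustration; in the real script they would come from listing the bucket):

```python
from collections import Counter
import posixpath

# Hypothetical object keys for illustration only
keys = [
    "144p/file_000.ts",
    "144p/file_001.ts",
    "240p/file_000.ts",
    "144p/index.m3u8",
]

# Count chunks per quality folder, keeping only names that start with 'file_'
counts = Counter(
    posixpath.dirname(k)
    for k in keys
    if posixpath.basename(k).startswith("file_")
)
print(dict(counts))  # {'144p': 2, '240p': 1}
```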
You might find the
resource
method easier to use than the
client
method. It handles pagination for you (dealing with more than 1000 objects), and the functions are more Pythonic.
For example, you can count the files like this:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('BUCKET-NAME')

prefix = 'foo/'

count = 0
for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('/file_000.ts'):
        count += 1

print(count)
Note that this example uses
endswith('/file_000.ts')
because an object's key can look like this:
144p/something/file_000.ts
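To see why this matters, here is a quick offline check (no AWS call; the keys are made-up examples). Matching the full key with startswith never succeeds, because the key carries the folder prefix, while checking only the basename counts the chunks as intended:

```python
import posixpath

# Hypothetical keys, shaped like what a bucket listing would return
keys = [
    "144p/lesson-01/file_000.ts",
    "144p/lesson-01/file_001.ts",
    "144p/lesson-01/playlist.m3u8",
]

# The original check: full keys never start with 'file_', so this is always 0
naive = sum(1 for k in keys if k.startswith("file_"))
print(naive)  # 0

# Checking the basename instead counts the chunk files correctly
chunks = sum(1 for k in keys if posixpath.basename(k).startswith("file_"))
print(chunks)  # 2
```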