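I have a large number of files (around 10 million) in an S3 bucket, and I want to write a list of them to a text file for further processing.
The question is how to do this efficiently. Is
aws s3 ls s3://bucketname > out.txt
the only option? I only need the file URLs, though. How can I achieve that?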
Have you tried S3 Inventory reports? You would still need some post-processing to get the format you want, though.
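The post-processing mentioned above could look roughly like the sketch below. It assumes a CSV-format inventory report whose first two columns are the bucket name and the URL-encoded object key, delivered as a gzipped file that has already been downloaded locally. The function names, the `s3://` URL form, and the use of `unquote_plus` (which treats `+` as a space) are assumptions for illustration; check them against a sample of your own report before relying on the output.

```python
import csv
import gzip
from urllib.parse import unquote_plus


def inventory_rows_to_urls(rows):
    """Yield an s3:// URL for each (bucket, key, ...) inventory row."""
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        bucket, encoded_key = row[0], row[1]
        # Assumption: keys are URL-encoded in the CSV and need decoding.
        yield "s3://{}/{}".format(bucket, unquote_plus(encoded_key))


def convert_inventory_file(inventory_path, out_path):
    """Read one gzipped inventory CSV and write one URL per line."""
    with gzip.open(inventory_path, "rt", newline="") as src, \
            open(out_path, "w") as dst:
        for url in inventory_rows_to_urls(csv.reader(src)):
            dst.write(url + "\n")
```

Calling `convert_inventory_file("inventory.csv.gz", "out.txt")` (both paths are placeholders) would process one report file. An inventory delivery usually consists of several such files, so the call would be repeated for each of them.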
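Write a Python shell job that gets the list of folders and file names from S3. The following Python job collects the folders and their file names, then writes them as a CSV file to another S3 location.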
import os
import boto3
# Specify the S3 bucket and path
s3_bucket = "your_source_bucket"
specific_file = ""  # Optional key prefix; set it only to restrict the listing to one folder or file
# Create an S3 client
s3_client = boto3.client('s3')
# List objects in the bucket with pagination (each page holds at most 1,000 keys)
paginator = s3_client.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(Bucket=s3_bucket, Prefix=specific_file)
# Extract folder names and file names
folder_names = set()
file_names = []
for response in response_iterator:
    for content in response.get('Contents', []):
        key = content['Key']
        if '/' in key:
            folder_name, file_name = os.path.split(key)
            folder_names.add(folder_name)
            file_names.append((folder_name, file_name))
        else:
            file_names.append(('', key))
# Save the list of file names to a CSV file
csv_output_path = "/tmp/s3_files.csv" # Use a local temporary file
with open(csv_output_path, 'w') as file:
    file.write('Folder,File\n')
    for folder_name, file_name in file_names:
        file.write('{},{}\n'.format(folder_name, file_name))
# Upload the CSV file to S3
s3_client.upload_file(csv_output_path, "your_bucket", "your_prefix_and_file_name")
# Clean up the temporary file
os.remove(csv_output_path)
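Since the question asks only for URLs and mentions around 10 million objects, a variation that streams each key straight to the output file may be worth considering, because it avoids holding the whole listing in memory. The following is a minimal sketch, not a tested job: the function names and the `s3://bucket/key` URL form are assumptions, and `boto3` is imported inside the listing function only so that the small helper can be used without it.

```python
def iter_object_urls(pages, bucket):
    """Yield an s3:// URL for every key in ListObjectsV2-style pages."""
    for page in pages:
        # A page with no matching objects has no 'Contents' entry.
        for obj in page.get("Contents", []):
            yield "s3://{}/{}".format(bucket, obj["Key"])


def dump_bucket_urls(bucket, out_path, prefix=""):
    """List the bucket page by page and write one URL per line."""
    import boto3  # local import: the helper above does not need boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    count = 0
    with open(out_path, "w") as out:
        for url in iter_object_urls(pages, bucket):
            out.write(url + "\n")
            count += 1
    return count  # number of URLs written
```

A call such as `dump_bucket_urls("your_source_bucket", "out.txt")` would produce the text file. With about 10 million objects the listing still needs on the order of 10,000 sequential requests, so it will take a while to finish.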