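I want to merge multiple CSV files stored in Google Cloud Storage using a bash script.
I wrote a script, but it does not work correctly: it merges only the first and last files and ignores everything in between. With 2 files it works fine, but as soon as there are more than 2 files it merges only file1 and file3 and simply ignores file2.
Here is the script for reference: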
#!/bin/bash
# Check if the bucket name is provided as an argument
if [ -z "$1" ]; then
echo "Usage: $0 <bucket_name>"
exit 1
fi
# Assign the first argument (bucket name) to BUCKET_NAME
BUCKET_NAME="$1"
STARTING_TEXT="$2"
TARGET_FILE_NAME="$3"
# Generate a timestamp
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
# List all files starting with 'STARTING_TEXT'
FILES=($(gsutil ls gs://$BUCKET_NAME/$STARTING_TEXT*.csv))
# Initialize the merged file with the first file's content (including header)
gsutil cp ${FILES[0]} /tmp/merged_file_with_header.csv
# Print the names of the files being merged
echo "Merging the following files:"
for FILE in "${FILES[@]}"
do
echo "$FILE"
done
# Loop through the remaining files (starting from the second file)
for ((i=1; i<${#FILES[@]}; i++))
do
# Download the current file without the header
gsutil cat ${FILES[i]} | tail -n +2 > /tmp/temp_file.csv
# Append the content to the merged file
cat /tmp/merged_file_with_header.csv /tmp/temp_file.csv > /tmp/merged_file.csv
# Upload the merged file back to the bucket
gsutil cp /tmp/merged_file.csv gs://$BUCKET_NAME/$TARGET_FILE_NAME_$TIMESTAMP.csv
# # Update the merged file with header for next iteration
# gsutil cp gs://$BUCKET_NAME/file_with_header.csv /tmp/merged_file_with_header.csv
done
# Clean up temporary files
rm /tmp/temp_file.csv /tmp/merged_file.csv /tmp/merged_file_with_header.csv
Please help me find what I am missing.
Based on your comment, you want to append, but you are actually overwriting:
cat /tmp/merged_file_with_header.csv /tmp/temp_file.csv > /tmp/merged_file.csv
Use
cat /tmp/merged_file_with_header.csv /tmp/temp_file.csv >> /tmp/merged_file.csv
instead.
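The difference is easy to reproduce locally. The following is a minimal sketch that replaces the gsutil downloads with small local files (the file names and contents are made up for illustration). With `>`, every loop iteration rebuilds the output from the first file plus the current one, so only the first and last files survive. Note that appending the first file again on every pass would repeat its header and rows, so the appending variant here writes the first file once and then appends only the data rows of each later file.

```shell
#!/bin/bash
# Local stand-ins for the bucket objects (hypothetical names and contents).
WORK_DIR=$(mktemp -d)
printf 'id,name\n1,a\n' > "$WORK_DIR/file1.csv"
printf 'id,name\n2,b\n' > "$WORK_DIR/file2.csv"
printf 'id,name\n3,c\n' > "$WORK_DIR/file3.csv"
FILES=("$WORK_DIR/file1.csv" "$WORK_DIR/file2.csv" "$WORK_DIR/file3.csv")

# Overwriting: every iteration rebuilds the output from file1 plus the current file.
for ((i=1; i<${#FILES[@]}; i++)); do
  tail -n +2 "${FILES[i]}" > "$WORK_DIR/temp.csv"
  cat "${FILES[0]}" "$WORK_DIR/temp.csv" > "$WORK_DIR/overwritten.csv"
done

# Appending: start from file1 once, then add only the data rows of each later file.
cat "${FILES[0]}" > "$WORK_DIR/appended.csv"
for ((i=1; i<${#FILES[@]}; i++)); do
  tail -n +2 "${FILES[i]}" >> "$WORK_DIR/appended.csv"
done

OVERWRITTEN=$(cat "$WORK_DIR/overwritten.csv")
APPENDED=$(cat "$WORK_DIR/appended.csv")
echo "With >:"
echo "$OVERWRITTEN"
echo "With >>:"
echo "$APPENDED"
rm -r "$WORK_DIR"
```

The first listing contains the header plus rows 1 and 3 only, which matches the symptom in the question; the second contains the header plus rows 1, 2 and 3.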
Here is the working code. I found that the way I was merging the files inside the loop was incorrect.
#!/bin/bash
# Check that all three arguments are provided
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <bucket_name> <starting_text> <target_file_name>"
exit 1
fi
# Assign the first argument (bucket name) to BUCKET_NAME
BUCKET_NAME="$1"
STARTING_TEXT="$2"
TARGET_FILE_NAME="$3"
# Generate a timestamp
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
# List all files starting with 'STARTING_TEXT'
FILES=($(gsutil ls "gs://${BUCKET_NAME}/${STARTING_TEXT}*.csv"))
# Initialize the merged file with the first file's content (including header)
gsutil cp "${FILES[0]}" /tmp/header.csv
sleep 5
cat /tmp/header.csv
# Print the names of the files being merged
echo "Merging the following files:"
for FILE in "${FILES[@]}"
do
echo "$FILE"
done
# Start the merged file (the final output) from the first file, header included;
# using > discards any stale content left over from an earlier run
cat /tmp/header.csv > /tmp/merged_file.csv
# Loop through the remaining files (starting from the second file)
for ((i=1; i<${#FILES[@]}; i++))
do
# Download the current file without the header
gsutil cat "${FILES[i]}" | tail -n +2 > /tmp/temp_file.csv
# Append the content to the merged file
cat /tmp/temp_file.csv >> /tmp/merged_file.csv
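done
# Upload the merged file back to the bucket once, after all files are appended.
# The braces are required: $TARGET_FILE_NAME_ would be read as a different (unset) variable.
gsutil cp /tmp/merged_file.csv "gs://${BUCKET_NAME}/${TARGET_FILE_NAME}_${TIMESTAMP}.csv"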
echo "printing final file"
cat /tmp/merged_file.csv
# Clean up temporary files
rm -f /tmp/temp_file.csv /tmp/merged_file.csv /tmp/header.csv
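The merge step itself does not depend on gsutil, so it can be tried out locally. The sketch below is one possible way to write it and is not part of the script above: `merge_csv` is a hypothetical helper that copies the first file whole and then appends every later file without its header line. It assumes all inputs share the same single-line header and end with a newline; the demo file names and contents are made up.

```shell
#!/bin/bash
# merge_csv OUTPUT INPUT... (hypothetical helper, not part of the script above)
# Writes the first input in full, then every other input minus its header line.
merge_csv() {
  local output="$1"
  shift
  if [ "$#" -eq 0 ]; then
    echo "merge_csv: no input files" >&2
    return 1
  fi
  # First file, header included; > truncates any stale output.
  cat "$1" > "$output"
  shift
  local file
  for file in "$@"; do
    # Skip line 1 (the header) and append the remaining rows.
    tail -n +2 "$file" >> "$output"
  done
}

# Usage with made-up local files:
DEMO_DIR=$(mktemp -d)
printf 'id,name\n1,a\n' > "$DEMO_DIR/part1.csv"
printf 'id,name\n2,b\n' > "$DEMO_DIR/part2.csv"
printf 'id,name\n3,c\n' > "$DEMO_DIR/part3.csv"
merge_csv "$DEMO_DIR/merged.csv" "$DEMO_DIR"/part*.csv
MERGED=$(cat "$DEMO_DIR/merged.csv")
echo "$MERGED"
rm -r "$DEMO_DIR"
```

In the real script the inputs would first be downloaded from the bucket, and the merged file would be uploaded once after the helper returns.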