Merge multiple CSV files in Google Cloud Storage, keeping the header from only one file | Bash script

Question · Votes: 0 · Answers: 2

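I want to use a bash script to merge multiple CSV files stored in Google Cloud Storage.

I wrote a script, but it does not work: it merges only the first and last files and ignores everything in between. With 2 files it works fine, but once the file count goes above 2 it merges only file1 and file3 and simply ignores file2.

Here is the script for reference: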

#!/bin/bash

# Check if the bucket name is provided as an argument
if [ -z "$1" ]; then
  echo "Usage: $0 <bucket_name>"
  exit 1
fi

# Assign the first argument (bucket name) to BUCKET_NAME
BUCKET_NAME="$1"
STARTING_TEXT="$2"
TARGET_FILE_NAME="$3"
# Generate a timestamp
TIMESTAMP=$(date +"%Y%m%d%H%M%S")

# List all files starting with 'STARTING_TEXT'
FILES=($(gsutil ls gs://$BUCKET_NAME/$STARTING_TEXT*.csv))

# Initialize the merged file with the first file's content (including header)
gsutil cp ${FILES[0]} /tmp/merged_file_with_header.csv

# Print the names of the files being merged
echo "Merging the following files:"
for FILE in "${FILES[@]}"
do
    echo "$FILE"
done

# Loop through the remaining files (starting from the second file)
for ((i=1; i<${#FILES[@]}; i++))
do
    # Download the current file without the header
    gsutil cat ${FILES[i]} | tail -n +2 > /tmp/temp_file.csv

    # Append the content to the merged file
    cat /tmp/merged_file_with_header.csv /tmp/temp_file.csv > /tmp/merged_file.csv

    # Upload the merged file back to the bucket
    gsutil cp /tmp/merged_file.csv gs://$BUCKET_NAME/$TARGET_FILE_NAME_$TIMESTAMP.csv

    # # Update the merged file with header for next iteration
    # gsutil cp gs://$BUCKET_NAME/file_with_header.csv /tmp/merged_file_with_header.csv
done
# Clean up temporary files
rm /tmp/temp_file.csv /tmp/merged_file.csv /tmp/merged_file_with_header.csv

Please help me fix whatever I am missing.

bash shell csv google-cloud-storage
2 Answers
1
vote

According to your comment, you want to append, but you actually overwrite:

cat /tmp/merged_file_with_header.csv /tmp/temp_file.csv > /tmp/merged_file.csv

Do a

cat /tmp/merged_file_with_header.csv /tmp/temp_file.csv >> /tmp/merged_file.csv

instead.
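
To make the difference concrete, here is a minimal local sketch that needs no bucket. The scratch directory, the file names, and the sample rows are all made up for the demo. It also appends only the new rows on each pass, since appending the first file again on every iteration would likely repeat its contents in the output.

```shell
#!/bin/bash
# Hypothetical scratch directory and file names, used only for this demo.
workdir=$(mktemp -d)
overwrite_out="$workdir/overwrite.csv"
append_out="$workdir/append.csv"

# Start both outputs with the same header row.
printf 'id,name\n' > "$overwrite_out"
printf 'id,name\n' > "$append_out"

for row in '1,a' '2,b' '3,c'; do
    # '>' truncates the target first, so only the most recent write survives.
    printf '%s\n' "$row" > "$overwrite_out"
    # '>>' keeps what is already there and adds the new row at the end.
    printf '%s\n' "$row" >> "$append_out"
done

# grep -c '' counts the lines in each file.
overwrite_lines=$(grep -c '' "$overwrite_out")
append_lines=$(grep -c '' "$append_out")
echo "overwrite: $overwrite_lines line(s), append: $append_lines line(s)"

rm -rf "$workdir"
```

The overwritten file ends up with a single line (the last row), while the appended file holds the header plus all three rows. That mirrors the symptom in the question, where only the first and last inputs appear in the result.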


0
votes

Working code is below. I found that I was merging the files inside the loop, which was incorrect.

#!/bin/bash

# Check that the bucket name, file prefix, and target name are all provided
if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
  echo "Usage: $0 <bucket_name> <starting_text> <target_file_name>"
  exit 1
fi

# Assign the arguments to named variables
BUCKET_NAME="$1"
STARTING_TEXT="$2"
TARGET_FILE_NAME="$3"
# Generate a timestamp
TIMESTAMP=$(date +"%Y%m%d%H%M%S")

# List all files starting with 'STARTING_TEXT'
# (word splitting here assumes object names contain no whitespace)
FILES=($(gsutil ls "gs://$BUCKET_NAME/$STARTING_TEXT*.csv"))
if [ "${#FILES[@]}" -eq 0 ]; then
  echo "No files found matching gs://$BUCKET_NAME/$STARTING_TEXT*.csv"
  exit 1
fi

# Print the names of the files being merged
echo "Merging the following files:"
for FILE in "${FILES[@]}"
do
    echo "$FILE"
done

# Start the merged file with the first file's full content (including header).
# cp overwrites any stale /tmp/merged_file.csv left behind by an earlier run.
gsutil cp "${FILES[0]}" /tmp/merged_file.csv

# Loop through the remaining files (starting from the second file)
for ((i=1; i<${#FILES[@]}; i++))
do
    # Stream the current file, drop its header line, and append the rest
    gsutil cat "${FILES[i]}" | tail -n +2 >> /tmp/merged_file.csv
done

# Upload the merged file once, after every file has been appended.
# The braces matter: $TARGET_FILE_NAME_ would be read as a different, unset variable.
gsutil cp /tmp/merged_file.csv "gs://$BUCKET_NAME/${TARGET_FILE_NAME}_${TIMESTAMP}.csv"

echo "printing final file"
cat /tmp/merged_file.csv

# Clean up temporary files
rm /tmp/merged_file.csv
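
As an alternative to the explicit loop, the header-skipping step can be sketched with a single awk call over files that are already on local disk. This is only an illustration under assumptions: `merge_csv_keep_first_header` is a hypothetical helper name, and the sample files and scratch directory are invented for the demo. It also assumes every input has exactly one header line.

```shell
#!/bin/bash
# merge_csv_keep_first_header is a hypothetical helper, not part of gsutil.
# It prints every line of the first file, and every line except line 1
# of each later file.
merge_csv_keep_first_header() {
    # FNR is the line number within the current file, NR the overall line number.
    # A line with FNR == 1 but NR != 1 is the header of a later file, so skip it.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Demo with local sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'id,name\n1,a\n' > "$workdir/part1.csv"
printf 'id,name\n2,b\n' > "$workdir/part2.csv"
printf 'id,name\n3,c\n' > "$workdir/part3.csv"

merged=$(merge_csv_keep_first_header "$workdir"/part*.csv)
echo "$merged"

rm -rf "$workdir"
```

For the bucket case, one option would be to download the matching objects into a local directory with gsutil cp first, run the helper on them, and upload the result once at the end.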