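Change output column names: name each column after its input file when copying a specific column from multiple CSV files into a new CSV file. Shell script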

Question (votes: 0, answers: 1)

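I am trying to extract a specific column (column 4) from many (several thousand) CSV files and write the values to a new file, in the same order in which the files are sorted in the folder. The first CSV file supplies all four of its columns. Now I would like to know how to make the column names of the output file match the names of the input CSV files.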

The CSV file names look like this:

> EE85723.R.csv
> EE85727.R.csv
> EE87894.R.csv
> EE88810.R.csv
> .......
> .......
> 

#!/bin/bash
rm -f out.csv
set -- *R.csv

cut -d , -f 1,2,3 -- "$1" >out.csv

for file do
    cut -d , -f 4 -- "$file" | paste -d , out.csv - >out.tmp &&
    mv out.tmp out.csv
done

Every value column in the output file I get is named "copy", so it is hard to tell which column came from which file.

> chr,start,end,copy,copy,copy
> chr1,1,10000,0,0,0
> chr1,10001,20000,3.02441583128188,3.06941544044942,3.09651371393489
> chr1,20001,30000,1.87088110683025,1.83912070977027,1.91248096145222
> chr1,30001,40000,1.94510909384639,1.90006068018602,1.96470746277162
> chr1,40001,50000,0.576139127131562,0.588528490660998,0.635347605084456
> chr1,50001,60000,1.51250200836185,1.50849932321034,1.52994133230921
> chr1,60001,70000,0.681365714967938,0.676156428892953,0.699545565388925
> chr1,70001,80000,0.436354857763045,0.449640001550081,0.497235183366175
> chr1,80001,90000,1.05269567207548,1.04655014589231,1.06732707247313
> 

The expected result looks like this, where each column header is the file name without its extension.

> chr,start,end,EE85723.R,EE85727.R,EE87894.R

If anyone could kindly suggest a way to solve this in Bash, it would be a huge help.

shell csv extract multiple-columns rename
1 answer
0 votes

I have an awk script that I use for a similar purpose; would this solve your problem?

Example files:

tail EE*.R.csv
==> EE85723.R.csv <==
chr,start,end,value
chr1,1,10000,0
chr1,10001,20000,3.02441583128188
chr1,20001,30000,1.87088110683025
chr1,30001,40000,1.94510909384639
chr1,40001,50000,0.576139127131562
chr1,50001,60000,1.51250200836185
chr1,60001,70000,0.681365714967938
chr1,70001,80000,0.436354857763045
chr1,80001,90000,1.05269567207548

==> EE85727.R.csv <==
chr,start,end,value
chr1,1,10000,0
chr1,10001,20000,3.06941544044942
chr1,20001,30000,1.83912070977027
chr1,30001,40000,1.90006068018602
chr1,40001,50000,0.588528490660998
chr1,50001,60000,1.50849932321034
chr1,60001,70000,0.676156428892953
chr1,70001,80000,0.449640001550081
chr1,80001,90000,1.04655014589231

==> EE87894.R.csv <==
chr,start,end,value
chr1,1,10000,0
chr1,10001,20000,3.09651371393489
chr1,20001,30000,1.91248096145222
chr1,30001,40000,1.96470746277162
chr1,40001,50000,0.635347605084456
chr1,50001,60000,1.52994133230921
chr1,60001,70000,0.699545565388925
chr1,70001,80000,0.497235183366175

Script (requires GNU awk):

cat test.sh
#!/bin/bash

awk 'BEGIN {
    FS = OFS = ","
    PROCINFO["sorted_in"] = "@val_str_asc"
}

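FNR == 1 {
    # First line of each input file: count the file and build the header.
    filecount++
    numfields[filecount] = NF
    name = FILENAME
    sub(/\.csv$/, "", name)    # strip only a literal trailing ".csv"
    if (NR == 1) {
        # The very first file also supplies the three key column names.
        header[++a] = $1
        header[++a] = $2
        header[++a] = $3
    }
    header[++a] = name
}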

FNR > 1 {
    for (j = 4; j <= NF; j++) {
        b[$1 FS $2 FS $3][filecount, j] = $j
    }
}

END {
    for (k = 1; k <= length(header); k++) {
        printf "%s%s", header[k], ((k < length(header)) ? OFS : ORS)
    }
    for (l in b) {
        printf "%s", l OFS
        for (m = 1; m <= filecount; m++) {
            for (n = 4; n <= numfields[m]; n++) {
                printf "%s%s", (b[l][m, n] == "" ? "NA" : b[l][m, n]), ((m + n < filecount + numfields[m]) ? OFS : ORS)
            }
        }
    }
}' EE*.R.csv

Running the script:

./test.sh > output.csv
cat output.csv
chr,start,end,EE85723.R,EE85727.R,EE87894.R
chr1,1,10000,0,0,0
chr1,10001,20000,3.02441583128188,3.06941544044942,3.09651371393489
chr1,20001,30000,1.87088110683025,1.83912070977027,1.91248096145222
chr1,30001,40000,1.94510909384639,1.90006068018602,1.96470746277162
chr1,40001,50000,0.576139127131562,0.588528490660998,0.635347605084456
chr1,50001,60000,1.51250200836185,1.50849932321034,1.52994133230921
chr1,60001,70000,0.681365714967938,0.676156428892953,0.699545565388925
chr1,70001,80000,0.436354857763045,0.449640001550081,0.497235183366175
chr1,80001,90000,1.05269567207548,1.04655014589231,NA

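Note that this is a different approach from using tools such as cut: the script keys each row on the values of the first three fields (chr, start, end) and prints a file's value when those fields match the ones in the first file. If the first three fields do not match, for example because a file is missing a row, the value is replaced with NA. In the example above, the file "EE87894.R" is missing one row (it has no chr1,80001,90000), so that cell is "NA" in the output file. I think this is the result you want, but if not, you can adapt the script to suit.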
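If every file is guaranteed to contain exactly the same rows in the same order, the cut/paste loop from the question could also be kept, with only the header cell of each pasted column replaced by the file name. Below is a minimal sketch under that assumption; the function name `merge_col4` and its argument order are made up for illustration and are not part of the original script.

```bash
#!/bin/bash
# merge_col4 OUTFILE FILE...   (hypothetical helper name)
# Assumes each FILE has a one-line header and identical rows in identical order.
merge_col4() {
    local out=$1 tmp file name
    shift
    tmp=$out.tmp

    # Columns 1-3 (chr,start,end) are taken from the first file only.
    cut -d , -f 1-3 -- "$1" > "$out" || return

    for file do
        name=${file##*/}      # drop any directory part
        name=${name%.csv}     # drop the .csv extension
        {
            printf '%s\n' "$name"                    # new header cell
            cut -d , -f 4 -- "$file" | tail -n +2    # data rows, old header skipped
        } | paste -d , "$out" - > "$tmp" &&
        mv -- "$tmp" "$out" || return
    done
}
```

It could be called as `merge_col4 out.csv *.R.csv` in the folder that holds the files. Unlike the awk script, this does not align rows on chr/start/end, so a file with a missing row would end up with misaligned values instead of NA.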
