awk does not find all matches for an index


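I have two files that use a pipe as the delimiter, which yields two columns in each file. Whenever the second column of the first file matches the first column of the second file, I want to append the second column from the second file. My awk only seems to recognize some of the matches and not others, and I can't figure out why it isn't working. In the test dataset, it should match every instance of g1001.t1.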

Contents of test.addgene.txt:

ptg000013l  AUGUSTUS    gene    7594135 7594636 0.57    +   .   ID=g1000;|
ptg000013l  AUGUSTUS    mRNA    7594135 7594636 0.57    +   .   ID=g1000.t1;Parent=g1000;|g1000
ptg000013l  AUGUSTUS    start_codon 7594135 7594137 .   +   0   ID=g1000.t1.start1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    CDS 7594135 7594312 0.6 +   0   ID=g1000.t1.CDS1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    exon    7594135 7594312 .   +   .   ID=g1000.t1.exon1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    intron  7594313 7594367 0.68    +   .   ID=g1000.t1.intron1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    CDS 7594368 7594636 0.68    +   2   ID=g1000.t1.CDS2;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    exon    7594368 7594636 .   +   .   ID=g1000.t1.exon2;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    stop_codon  7594634 7594636 .   +   0   ID=g1000.t1.stop1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    gene    7594770 7599695 0.46    +   .   ID=g1001;|
ptg000013l  AUGUSTUS    mRNA    7594770 7599695 0.46    +   .   ID=g1001.t1;Parent=g1001;|g1001
ptg000013l  AUGUSTUS    start_codon 7594770 7594772 .   +   0   ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    CDS 7594770 7594848 0.9 +   0   ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    exon    7594770 7594848 .   +   .   ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    intron  7594849 7599270 0.8 +   .   ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    CDS 7599271 7599695 0.48    +   2   ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    exon    7599271 7599695 .   +   .   ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    stop_codon  7599693 7599695 .   +   0   ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    gene    7611253 7611658 0.68    +   .   ID=g1002;|
ptg000013l  AUGUSTUS    mRNA    7611253 7611658 0.68    +   .   ID=g1002.t1;Parent=g1002;|g1002
ptg000013l  AUGUSTUS    start_codon 7611253 7611255 .   +   0   ID=g1002.t1.start1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    CDS 7611253 7611390 0.72    +   0   ID=g1002.t1.CDS1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    exon    7611253 7611390 .   +   .   ID=g1002.t1.exon1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    intron  7611391 7611439 0.78    +   .   ID=g1002.t1.intron1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    CDS 7611440 7611658 0.84    +   0   ID=g1002.t1.CDS2;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    exon    7611440 7611658 .   +   .   ID=g1002.t1.exon2;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    stop_codon  7611656 7611658 .   +   0   ID=g1002.t1.stop1;Parent=g1002.t1;|g1002.t1

Contents of test.names.txt:

g1001.t1|sorting nexin-4
g10010.t1|2-methoxy-6-polyprenyl-1,4-benzoquinol methylase [EC:2.1.1.201]
g10012.t2|small nuclear ribonucleoprotein D3
g10013.t1|tetratricopeptide repeat protein 4
g10024.t1|ATP-binding cassette, subfamily C (CFTR/MRP), member 4
g10027.t1|synaptosomal-associated protein 29
g10032.t1|serine/threonine-protein phosphatase PP1 catalytic subunit [EC:3.1.3.16]
g10033.t1|ligand of Numb protein X 1/2 [EC:2.3.2.27]
g10034.t1|PAX-interacting protein 1
g10038.t1|zinc finger SWIM domain-containing protein 7
g10041.t1|neuronal cell adhesion molecule
g10045.t1|peptidyl-tRNA hydrolase, PTH2 family [EC:3.1.1.29]
g10060.t1|endonuclease G, mitochondrial
g1007.t2|protocadherin-16/23
g10072.t1|fatty acid synthase, animal type [EC:2.3.1.85]
g10078.t1|cathepsin B [EC:3.4.22.1]
g1009.t1|gem associated protein 8
g10090.t1|KRAB domain-containing zinc finger protein
g1010.t1|translation initiation factor 3 subunit K
g10117.t1|kinetochore protein NDC80
g1012.t1|T-complex protein 1 subunit epsilon

Code used:

awk 'BEGIN { FS=OFS="|"; }; NR==FNR{a[$2]=$1} ($1 in a){print a[$1], $0}' test.addgene.txt test.names.txt

The resulting output matches only the line containing stop_codon:

ptg000013l  AUGUSTUS    stop_codon  7599693 7599695 .   +   0   ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4

Expected output:

ptg000013l  AUGUSTUS    start_codon 7594770 7594772 .   +   0   ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    CDS 7594770 7594848 0.9 +   0   ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    exon    7594770 7594848 .   +   .   ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    intron  7594849 7599270 0.8 +   .   ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    CDS 7599271 7599695 0.48    +   2   ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    exon    7599271 7599695 .   +   .   ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    stop_codon  7599693 7599695 .   +   0   ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4

Checking the input file for odd characters shows nothing unexpected:

LC_ALL=C sed -n l test.addgene.txt

ptg000013l\tAUGUSTUS\tgene\t7594135\t7594636\t0.57\t+\t.\tID=g1000;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594135\t7594636\t0.57\t+\t.\tID=g1000.t1;Parent=g1\
000;|g1000$
ptg000013l\tAUGUSTUS\tstart_codon\t7594135\t7594137\t.\t+\t0\tID=g1000.t1.start\
1;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594135\t7594312\t0.6\t+\t0\tID=g1000.t1.CDS1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594135\t7594312\t.\t+\t.\tID=g1000.t1.exon1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tintron\t7594313\t7594367\t0.68\t+\t.\tID=g1000.t1.intron1\
;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594368\t7594636\t0.68\t+\t2\tID=g1000.t1.CDS2;Paren\
t=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594368\t7594636\t.\t+\t.\tID=g1000.t1.exon2;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7594634\t7594636\t.\t+\t0\tID=g1000.t1.stop1;\
Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tgene\t7594770\t7599695\t0.46\t+\t.\tID=g1001;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594770\t7599695\t0.46\t+\t.\tID=g1001.t1;Parent=g1\
001;|g1001$
ptg000013l\tAUGUSTUS\tstart_codon\t7594770\t7594772\t.\t+\t0\tID=g1001.t1.start\
1;Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594770\t7594848\t0.9\t+\t0\tID=g1001.t1.CDS1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7594770\t7594848\t.\t+\t.\tID=g1001.t1.exon1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tintron\t7594849\t7599270\t0.8\t+\t.\tID=g1001.t1.intron1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7599271\t7599695\t0.48\t+\t2\tID=g1001.t1.CDS2;Paren\
t=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7599271\t7599695\t.\t+\t.\tID=g1001.t1.exon2;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7599693\t7599695\t.\t+\t0\tID=g1001.t1.stop1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tgene\t7611253\t7611658\t0.68\t+\t.\tID=g1002;|$
ptg000013l\tAUGUSTUS\tmRNA\t7611253\t7611658\t0.68\t+\t.\tID=g1002.t1;Parent=g1\
002;|g1002$
ptg000013l\tAUGUSTUS\tstart_codon\t7611253\t7611255\t.\t+\t0\tID=g1002.t1.start\
1;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611253\t7611390\t0.72\t+\t0\tID=g1002.t1.CDS1;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611253\t7611390\t.\t+\t.\tID=g1002.t1.exon1;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tintron\t7611391\t7611439\t0.78\t+\t.\tID=g1002.t1.intron1\
;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611440\t7611658\t0.84\t+\t0\tID=g1002.t1.CDS2;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611440\t7611658\t.\t+\t.\tID=g1002.t1.exon2;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7611656\t7611658\t.\t+\t0\tID=g1002.t1.stop1;\
Parent=g1002.t1;|g1002.t1$

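This is a test dataset, and I am not looking for an answer specific to matching g1001.t1. I'm hoping someone can suggest how to fix awk failing to find and print all of the matches. (This is on a MacBook Pro.)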

bash awk bioinformatics
1 answer

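You are replacing the value of a["g1001.t1"] on every matching line in the NR==FNR block, so you end up capturing only the stop_codon line, which is the last line with that key, in a["g1001.t1"].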
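The overwrite can be reproduced in isolation. The following is a minimal sketch, not part of the original answer: the three inline records and the `seen` counter are made up for illustration and only mimic the shape of test.addgene.txt. It shows that an awk array keeps one value per key, so the last assignment wins.

```shell
# Three made-up records share the key "g1001.t1" in field 2,
# mimicking the shape of test.addgene.txt (assumption for illustration).
out=$(printf '%s\n' 'start_codon|g1001.t1' 'CDS|g1001.t1' 'stop_codon|g1001.t1' |
  awk -F'|' '
    # Each assignment replaces whatever was stored under the same key.
    { a[$2] = $1; seen[$2]++ }
    # Report the surviving value and how many times the key was assigned.
    END { for (k in a) printf "%s kept=%s assignments=%d\n", k, a[k], seen[k] }')
echo "$out"
# prints: g1001.t1 kept=stop_codon assignments=3
```

Counting assignments per key like this is also a quick way to check whether a file is safe to use as the lookup table: any count above 1 means earlier lines are being discarded.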

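The logic seems backwards; you probably want to read the names into memory, then append them to any line in the addgene file where $2 is the same as $1 in the names file.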

awk 'BEGIN { FS=OFS="|" }
    NR==FNR { a[$1]=$2; next }
    ($2 in a) { print $0, a[$2] }' test.names.txt test.addgene.txt

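Notice also the next in the NR==FNR block, which prevents the script from falling through and accidentally printing something from the first input file; it also brings a minor efficiency gain.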
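To see what falling through can look like, here is a small sketch with contrived data (the file names names.txt and data.txt and their contents are hypothetical, chosen so that a description in the lookup file happens to equal one of its keys). It runs the same script with and without next.

```shell
# Contrived inputs: in names.txt, the second field of line 2 ("beta")
# is also a key defined on line 1. Purely illustrative data.
tmp=$(mktemp -d)
printf '%s\n' 'beta|gamma' 'alpha|beta' > "$tmp/names.txt"
printf '%s\n' 'row1|alpha' > "$tmp/data.txt"

# With next: lines of the first file only populate the array.
with_next=$(awk 'BEGIN { FS=OFS="|" }
  NR==FNR { a[$1]=$2; next }
  ($2 in a) { print $0, a[$2] }' "$tmp/names.txt" "$tmp/data.txt")

# Without next: lines of the first file also reach the second rule.
without_next=$(awk 'BEGIN { FS=OFS="|" }
  NR==FNR { a[$1]=$2 }
  ($2 in a) { print $0, a[$2] }' "$tmp/names.txt" "$tmp/data.txt")

printf 'with next:\n%s\n' "$with_next"
printf 'without next:\n%s\n' "$without_next"
rm -rf "$tmp"
```

With next, only row1|alpha|beta is printed. Without it, the lookup line alpha|beta leaks into the output as alpha|beta|gamma ahead of the real result. With the files in the question such a collision is unlikely, but next makes that guaranteed rather than accidental.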

Example output:

ptg000013l      AUGUSTUS        start_codon     7594770 7594772 .       +             0       ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        CDS     7594770 7594848 0.9     +       0             ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        exon    7594770 7594848 .       +       .             ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        intron  7594849 7599270 0.8     +       .             ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        CDS     7599271 7599695 0.48    +       2             ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        exon    7599271 7599695 .       +       .             ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        stop_codon      7599693 7599695 .       +             0       ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4