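I have two files that use a pipe as the delimiter, giving two columns in each file. Whenever the second column of the first file matches the first column of the second file, I want to append the second column from the second file. My awk only seems to recognise some of the matches and not others, and I can't work out why. In the test dataset, it should match every instance of g1001.t1.
Contents of test.addgene.txt: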
ptg000013l AUGUSTUS gene 7594135 7594636 0.57 + . ID=g1000;|
ptg000013l AUGUSTUS mRNA 7594135 7594636 0.57 + . ID=g1000.t1;Parent=g1000;|g1000
ptg000013l AUGUSTUS start_codon 7594135 7594137 . + 0 ID=g1000.t1.start1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS CDS 7594135 7594312 0.6 + 0 ID=g1000.t1.CDS1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS exon 7594135 7594312 . + . ID=g1000.t1.exon1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS intron 7594313 7594367 0.68 + . ID=g1000.t1.intron1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS CDS 7594368 7594636 0.68 + 2 ID=g1000.t1.CDS2;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS exon 7594368 7594636 . + . ID=g1000.t1.exon2;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS stop_codon 7594634 7594636 . + 0 ID=g1000.t1.stop1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS gene 7594770 7599695 0.46 + . ID=g1001;|
ptg000013l AUGUSTUS mRNA 7594770 7599695 0.46 + . ID=g1001.t1;Parent=g1001;|g1001
ptg000013l AUGUSTUS start_codon 7594770 7594772 . + 0 ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS CDS 7594770 7594848 0.9 + 0 ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS exon 7594770 7594848 . + . ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS intron 7594849 7599270 0.8 + . ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS CDS 7599271 7599695 0.48 + 2 ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS exon 7599271 7599695 . + . ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS gene 7611253 7611658 0.68 + . ID=g1002;|
ptg000013l AUGUSTUS mRNA 7611253 7611658 0.68 + . ID=g1002.t1;Parent=g1002;|g1002
ptg000013l AUGUSTUS start_codon 7611253 7611255 . + 0 ID=g1002.t1.start1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS CDS 7611253 7611390 0.72 + 0 ID=g1002.t1.CDS1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS exon 7611253 7611390 . + . ID=g1002.t1.exon1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS intron 7611391 7611439 0.78 + . ID=g1002.t1.intron1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS CDS 7611440 7611658 0.84 + 0 ID=g1002.t1.CDS2;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS exon 7611440 7611658 . + . ID=g1002.t1.exon2;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS stop_codon 7611656 7611658 . + 0 ID=g1002.t1.stop1;Parent=g1002.t1;|g1002.t1
Contents of test.names.txt:
g1001.t1|sorting nexin-4
g10010.t1|2-methoxy-6-polyprenyl-1,4-benzoquinol methylase [EC:2.1.1.201]
g10012.t2|small nuclear ribonucleoprotein D3
g10013.t1|tetratricopeptide repeat protein 4
g10024.t1|ATP-binding cassette, subfamily C (CFTR/MRP), member 4
g10027.t1|synaptosomal-associated protein 29
g10032.t1|serine/threonine-protein phosphatase PP1 catalytic subunit [EC:3.1.3.16]
g10033.t1|ligand of Numb protein X 1/2 [EC:2.3.2.27]
g10034.t1|PAX-interacting protein 1
g10038.t1|zinc finger SWIM domain-containing protein 7
g10041.t1|neuronal cell adhesion molecule
g10045.t1|peptidyl-tRNA hydrolase, PTH2 family [EC:3.1.1.29]
g10060.t1|endonuclease G, mitochondrial
g1007.t2|protocadherin-16/23
g10072.t1|fatty acid synthase, animal type [EC:2.3.1.85]
g10078.t1|cathepsin B [EC:3.4.22.1]
g1009.t1|gem associated protein 8
g10090.t1|KRAB domain-containing zinc finger protein
g1010.t1|translation initiation factor 3 subunit K
g10117.t1|kinetochore protein NDC80
g1012.t1|T-complex protein 1 subunit epsilon
Code used:
awk 'BEGIN { FS=OFS="|"; }; NR==FNR{a[$2]=$1} ($1 in a){print a[$1], $0}' test.addgene.txt test.names.txt
The resulting output only contains the stop_codon line:
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
Expected output:
ptg000013l AUGUSTUS start_codon 7594770 7594772 . + 0 ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7594770 7594848 0.9 + 0 ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7594770 7594848 . + . ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS intron 7594849 7599270 0.8 + . ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7599271 7599695 0.48 + 2 ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7599271 7599695 . + . ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
Checking the input file for odd characters shows nothing unexpected:
LC_ALL=C sed -n l test.addgene.txt
ptg000013l\tAUGUSTUS\tgene\t7594135\t7594636\t0.57\t+\t.\tID=g1000;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594135\t7594636\t0.57\t+\t.\tID=g1000.t1;Parent=g1\
000;|g1000$
ptg000013l\tAUGUSTUS\tstart_codon\t7594135\t7594137\t.\t+\t0\tID=g1000.t1.start\
1;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594135\t7594312\t0.6\t+\t0\tID=g1000.t1.CDS1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594135\t7594312\t.\t+\t.\tID=g1000.t1.exon1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tintron\t7594313\t7594367\t0.68\t+\t.\tID=g1000.t1.intron1\
;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594368\t7594636\t0.68\t+\t2\tID=g1000.t1.CDS2;Paren\
t=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594368\t7594636\t.\t+\t.\tID=g1000.t1.exon2;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7594634\t7594636\t.\t+\t0\tID=g1000.t1.stop1;\
Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tgene\t7594770\t7599695\t0.46\t+\t.\tID=g1001;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594770\t7599695\t0.46\t+\t.\tID=g1001.t1;Parent=g1\
001;|g1001$
ptg000013l\tAUGUSTUS\tstart_codon\t7594770\t7594772\t.\t+\t0\tID=g1001.t1.start\
1;Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594770\t7594848\t0.9\t+\t0\tID=g1001.t1.CDS1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7594770\t7594848\t.\t+\t.\tID=g1001.t1.exon1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tintron\t7594849\t7599270\t0.8\t+\t.\tID=g1001.t1.intron1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7599271\t7599695\t0.48\t+\t2\tID=g1001.t1.CDS2;Paren\
t=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7599271\t7599695\t.\t+\t.\tID=g1001.t1.exon2;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7599693\t7599695\t.\t+\t0\tID=g1001.t1.stop1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tgene\t7611253\t7611658\t0.68\t+\t.\tID=g1002;|$
ptg000013l\tAUGUSTUS\tmRNA\t7611253\t7611658\t0.68\t+\t.\tID=g1002.t1;Parent=g1\
002;|g1002$
ptg000013l\tAUGUSTUS\tstart_codon\t7611253\t7611255\t.\t+\t0\tID=g1002.t1.start\
1;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611253\t7611390\t0.72\t+\t0\tID=g1002.t1.CDS1;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611253\t7611390\t.\t+\t.\tID=g1002.t1.exon1;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tintron\t7611391\t7611439\t0.78\t+\t.\tID=g1002.t1.intron1\
;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611440\t7611658\t0.84\t+\t0\tID=g1002.t1.CDS2;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611440\t7611658\t.\t+\t.\tID=g1002.t1.exon2;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7611656\t7611658\t.\t+\t0\tID=g1002.t1.stop1;\
Parent=g1002.t1;|g1002.t1$
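This is a test dataset, and I am not looking for an answer specific to matching g1001.t1. I am hoping someone can suggest how to fix the problem of awk not finding and printing all matches. (This is on a MacBook Pro.)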
In the NR==FNR block you replace the value of a["g1001.t1"] on every line that has that key, so a["g1001.t1"] ends up holding only the last such line, the stop_codon line.
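The overwrite can be reproduced in isolation. The snippet below is a minimal sketch, assuming a three-line stand-in for test.addgene.txt (the shortened first column is made up for illustration); it runs the same a[$2]=$1 assignment as the question and prints what is left in the array at the end.

```shell
# Three made-up lines that share the key g1001.t1 in column 2.
# a[$2]=$1 is the assignment from the question; each line overwrites the last.
overwritten=$(printf '%s\n' 'start_codon|g1001.t1' 'CDS|g1001.t1' 'stop_codon|g1001.t1' |
  awk 'BEGIN { FS="|" } { a[$2]=$1 } END { print a["g1001.t1"] }')

echo "$overwritten"   # prints: stop_codon
```

An awk array holds one value per key, so only the last assignment survives, which matches the single stop_codon line in the question's output.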
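The logic seems to be backwards; you probably want to read the names into memory, and then append them to any line in the addgene file where $2 is the same as $1 from the names file.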
awk 'BEGIN { FS=OFS="|" }
NR==FNR { a[$1]=$2; next }
($2 in a) { print $0, a[$2] }' test.names.txt test.addgene.txt
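Note also the next in the NR==FNR block, which prevents the script from falling through to the second rule and accidentally printing something from the first input file; obviously, it is also a tiny efficiency gain.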
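To see what next guards against, here is a small sketch with contrived data: a names file whose description happens to equal its own ID, so that the second rule can match a line from the first file. The file names, the g1 key and the temporary directory are assumptions made for the demonstration, not part of the original data.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'g1|g1'   > "$tmp/names.txt"     # contrived: description equals the ID
printf '%s\n' 'exon|g1' > "$tmp/addgene.txt"

# Without next, the names line falls through to the second rule and is printed too.
without_next=$(awk 'BEGIN { FS=OFS="|" }
  NR==FNR { a[$1]=$2 }
  ($2 in a) { print $0, a[$2] }' "$tmp/names.txt" "$tmp/addgene.txt")

# With next, awk skips the remaining rules for every line of the first file.
with_next=$(awk 'BEGIN { FS=OFS="|" }
  NR==FNR { a[$1]=$2; next }
  ($2 in a) { print $0, a[$2] }' "$tmp/names.txt" "$tmp/addgene.txt")

printf 'without next:\n%s\n' "$without_next"   # g1|g1|g1 and exon|g1|g1
printf 'with next:\n%s\n' "$with_next"         # exon|g1|g1 only
rm -rf "$tmp"
```

With the test files from the question no description matches an ID, so the stray output would not appear there; next simply makes that guarantee explicit.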
Example output:
ptg000013l AUGUSTUS start_codon 7594770 7594772 . + 0 ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7594770 7594848 0.9 + 0 ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7594770 7594848 . + . ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS intron 7594849 7599270 0.8 + . ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7599271 7599695 0.48 + 2 ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7599271 7599695 . + . ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4