Here is a sample of my initial output file:
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial
Ideally, I would like the output to separate the species name from the rest of the voucher name, so the output would be:
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata", "ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp.", "4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
The awk command I can use to add quotes around the last column is:
mawk '$2 = "\"" $2' FS=, OFS=, ORS='"\n'
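But I can't figure out how to change this so that the first two terms (the genus and species name) are quoted as a separate field. I need this so that I can use the species name to manipulate the data further.
Thanks!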
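I suggest you clarify the points raised in the comments below your question.
Based on your previous question and my interpretation of it, I think this awk should do what you want: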
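awk 'match($0, /^([^,]*,){10}/) {
    p = substr($0, 1, RLENGTH)        # first 10 comma-terminated fields
    s = substr($0, RLENGTH+1)         # remaining description text
    if (match(s, /^[^ ]+ +[^ ]+/)) {  # first two space-separated words: genus and species
        species = substr(s, 1, RLENGTH)
        voucher = substr(s, RLENGTH+1)
        sub(/^ +/, "", voucher)       # strip the spaces left in front of the remainder
        print p "\"" species "\",\"" voucher "\""
    }
}' file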
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum","voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum","voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata","ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp.","4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
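The command above relies on the `{10}` interval expression in the regex, which some older mawk releases reportedly do not support, and it silently drops any line that does not match. If either of those matters to you, here is a minimal sketch of a variant that walks past the first 10 commas with `index()` instead and prints non-matching lines unchanged. The function name `split_species` is only a hypothetical wrapper for illustration, and the pass-through behaviour is my assumption about what you would want, not something from your question.

```shell
# Hypothetical helper: reads the BLAST CSV on stdin, writes the split CSV on stdout.
split_species() {
  awk '{
    # Walk past the first 10 commas by hand instead of using /([^,]*,){10}/.
    rest = $0
    prefix = ""
    for (i = 1; i <= 10; i++) {
      c = index(rest, ",")
      if (c == 0) { print; next }     # fewer than 10 commas: pass the line through
      prefix = prefix substr(rest, 1, c)
      rest = substr(rest, c + 1)
    }
    # The first two space-separated words are taken as genus and species.
    if (match(rest, /^[^ ]+ +[^ ]+/)) {
      species = substr(rest, 1, RLENGTH)
      voucher = substr(rest, RLENGTH + 1)
      sub(/^ +/, "", voucher)
      print prefix "\"" species "\",\"" voucher "\""
    } else {
      print                           # description has fewer than two words
    }
  }'
}

# Example call on a shortened, made-up record:
printf '%s\n' 'a,b,c,d,e,f,g,h,i,j,Plantago lanceolata rbcL gene, partial cds' | split_species
```

You would run it on your real data as `split_species < file`. Like the original, it assumes the species name is exactly the first two words of the description, so a three-part name such as a subspecies or variety would still be cut after the second word.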