Quoting separate terms within the last column?


Here is a sample of my initial output file:

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial

Ideally, I would like the output to separate the species name from the rest of the voucher name, so that it looks like this:

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"   
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"        
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata", "ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp.", "4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"

The awk command I can use to add quotes around the last column is:

mawk '$2 = "\"" $2' FS=, OFS=, ORS='"\n'

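But I don't know how to change it so that the first two terms (the genus and species names) are quoted separately from the rest. I need this so that I can manipulate the data further using the species name.

Thanks!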

awk
1 Answer

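I suggest you clarify the points raised in the comments below your question.

Based on your previous question and my interpretation of this one, I think this awk should do what you want: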

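awk 'match($0, /^([^,]*,){10}/) {
   # p: the first 10 comma-terminated fields; s: the free-text description
   p = substr($0, 1, RLENGTH)
   s = substr($0, RLENGTH+1)
   # the species is taken to be the first two space-separated words of s
   if (match(s, /^[^ ]+ +[^ ]+/)) {
      species = substr(s, 1, RLENGTH)
      voucher = substr(s, RLENGTH+1)
      sub(/^ +/, "", voucher)   # drop the space(s) left after the species
      print p "\"" species "\",\"" voucher "\""
   }
}' file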

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum","voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum","voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata","ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp.","4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
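
Since the question uses mawk: the `{10}` interval in the regex above may not be recognized by some older mawk builds. If that turns out to be a problem, here is a rough sketch of the same idea that steps past the first 10 commas with index() instead. The function name `split_species` is just a hypothetical wrapper for illustration, and, unlike the command above, this variant prints lines it cannot split unchanged rather than dropping them (that is my assumption about what would be convenient, not something the question asks for).

```shell
# Hypothetical helper: split_species [FILE...]  (reads stdin when no file is given)
split_species() {
  awk '{
    # Walk past the first 10 commas with index(); no {10} interval needed.
    rest = $0; pos = 0
    for (i = 1; i <= 10; i++) {
      k = index(rest, ",")
      if (k == 0) { print; next }   # fewer than 11 fields: pass the line through
      pos += k
      rest = substr(rest, k + 1)
    }
    prefix = substr($0, 1, pos)     # first 10 fields, including the trailing comma
    # Assumption: the species is the first two space-separated words.
    if (match(rest, /^[^ ]+ +[^ ]+/)) {
      species = substr(rest, 1, RLENGTH)
      remainder = substr(rest, RLENGTH + 1)
      sub(/^ +/, "", remainder)
      print prefix "\"" species "\",\"" remainder "\""
    } else {
      print                         # description has fewer than two words
    }
  }' "$@"
}
```

Call it as `split_species file`, or pipe the data into it. Note that the two-word assumption is the same as in the command above, so a name such as "Brassica sp. 4 KS-2019" is still cut after "sp.".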