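I want to compare the second column of file1 with the last column of file2 (the species name) and, where they match, print the first column of file1 followed by all columns of file2. However, the files use different field separators, and the second file has an unequal number of columns when __ is used as the delimiter. Both files contain only unique lines. I tried to solve this with grep and partial line matching, but awk seems better suited to it.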
file1:
AF074611.1 Yersinia pestis
AE002160.2 Chlamydia muridarum
AE002162.1 Chlamydia muridarum
AE003849.1 Xylella fastidiosa
file2:
o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__
o__Methylococcales;f__Crenotrichaceae;g__Crenothrix;s__Crenothrix polyspora
o__Methylococcales;f__;g__;s__
o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__
o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__Xylella fastidiosa
o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__Xylella taiwanensis
Expected output:
AF074611.1 o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
AE003849.1 o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__Xylella fastidiosa
How can I do this? Thanks.
This will do what I expect you want, including when there are matching duplicate key values in one or both files:
$ awk '
NR==FNR { a[$2][$1]; next }
$NF in a { for (val in a[$NF]) print val, $0 }
' FS='\t' file1 FS='__' file2
e.g.:
$ cat file1
AF074611.1 Yersinia pestis
AE002160.2 Chlamydia muridarum
AE002162.1 Chlamydia muridarum
AE003849.1 Xylella fastidiosa
added_value Yersinia pestis
$ cat file2
o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__
o__Methylococcales;f__Crenotrichaceae;g__Crenothrix;s__Crenothrix polyspora
o__Methylococcales;f__;g__;s__
o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__
o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__Xylella fastidiosa
o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__Xylella taiwanensis
o__added_here_too;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
$ awk 'NR==FNR{a[$2][$1];next} $NF in a{for (val in a[$NF]) print val, $0}' FS='\t' file1 FS='__' file2
AF074611.1 o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
added_value o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
AE003849.1 o__Xanthomonadales;f__Xanthomonadaceae;g__Xylella;s__Xylella fastidiosa
AF074611.1 o__added_here_too;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
added_value o__added_here_too;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis
The above uses GNU awk for true multi-dimensional arrays. If you don't have gawk, it's a simple tweak to make it work with any awk.
awk 'FNR==NR{a[$2]=$1;next} $5 in a {print a[$5],$0}' FS='\t' file1 FS='__' file2
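In this script, file1 is read first and its fields are recorded in the array a, keyed by species name. The second file is then processed with a different field separator.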
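Since the question says file2 can have an unequal number of columns, the hard-coded $5 may be worth swapping for $NF. The following is a sketch of that variation, not the original answer: the third file2 line is a hypothetical shortened record added to show a line with fewer fields, and `tmp` and `matched` are illustrative names.

```shell
# Sketch only: the one-liner above, spread out and using $NF instead of $5.
tmp=$(mktemp -d)
printf 'AF074611.1\tYersinia pestis\nAE003849.1\tXylella fastidiosa\n' > "$tmp/file1"
printf '%s\n' \
  'o__Enterobacterales;f__Yersiniaceae;g__Yersinia;s__Yersinia pestis' \
  'o__Methylococcales;f__;g__;s__' \
  'g__Xylella;s__Xylella fastidiosa' > "$tmp/file2"

matched=$(awk '
FNR == NR { a[$2] = $1; next }   # file1: species name -> ID (a later duplicate overwrites an earlier one)
$NF in a  { print a[$NF], $0 }   # file2: look up the last field, however many fields the line has
' FS='\t' "$tmp/file1" FS='__' "$tmp/file2")
printf '%s\n' "$matched"
rm -rf "$tmp"
```

Note that `a[$2] = $1` keeps only one ID per species, so if file1 lists the same species under several IDs (as with Chlamydia muridarum in the sample), only the last one would be printed for a match.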