I have this file:
$ cat test.txt
49808830/ccs 9492 TACA 3
175833950/ccs 971 ACCC 1
180422692/ccs 971 ACCC 10
110952448/ccs 9714 TAGAG 2
117309969/ccs 9714 TAGAG 4
119998610/ccs 9714 TAGAG 5
171509463/ccs 9714 TAGAT 4
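For lines that share the same column 2 value, I need to do two things: sum column 4, and report whether the column 3 values are the same or different.
I did the first part, which should be fine: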
$ awk -F ' ' 'NR==FNR{a[$2] += $4} END{for (i in a) print $1, i, $3, a[i]}' test.txt
171509463/ccs 9492 TAGAT 3
171509463/ccs 9714 TAGAT 15
171509463/ccs 971 TAGAT 11
or
awk '{ seen[$2] += $4 } END { for (i in seen) print $1, i, $3, seen[i] }' test.txt
For the second part, I should get:
49808830/ccs 9492 TACA 3 same #print "same" for line when only one occurrence of column 2
175833950/ccs 971 ACCC 1 same
180422692/ccs 971 ACCC 10 same
110952448/ccs 9714 TAGAG 2 diff
117309969/ccs 9714 TAGAG 4 diff
119998610/ccs 9714 TAGAG 5 diff
171509463/ccs 9714 TAGAT 4 diff #because "TAGAT" is different from "TAGAG"
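A 2-pass solution using GNU awk (for multidimensional arrays):
awk '
FNR==NR { a[$2][$3]          # 1st pass: record each column 3 value seen per column 2
          next
        }
        { print $0, (length(a[$2]) == 1 ? "same" : "diff") }   # 2nd pass
' test.txt test.txt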
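A 1-pass solution using GNU awk (for multidimensional arrays):
awk '
    { lines[NR] = $0         # save every line in input order
      col2[NR]  = $2         # and its column 2 value
      a[$2][$3]              # record each column 3 value seen per column 2
    }
END { for (i=1;i<=NR;i++)
          print lines[i], (length(a[col2[i]]) == 1 ? "same" : "diff")
    }
' test.txt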
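A 2-pass solution that works with all awk versions:
awk '
FNR==NR { key = $2 SUBSEP $3
          if (! seen[key]++)     # first time this (column 2, column 3) pair appears
              count[$2]++        # one more distinct column 3 value for this column 2
          next
        }
        { print $0, (count[$2] == 1 ? "same" : "diff") }
' test.txt test.txt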
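A 1-pass solution that works with all awk versions:
awk '
    { lines[NR] = $0             # save every line in input order
      col2[NR]  = $2             # and its column 2 value
      key = $2 SUBSEP $3
      if (! seen[key]++)         # first time this (column 2, column 3) pair appears
          count[$2]++            # one more distinct column 3 value for this column 2
    }
END { for (i=1;i<=NR;i++)
          print lines[i], (count[col2[i]] == 1 ? "same" : "diff")
    }
' test.txt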
All of these produce:
49808830/ccs 9492 TACA 3 same
175833950/ccs 971 ACCC 1 same
180422692/ccs 971 ACCC 10 same
110952448/ccs 9714 TAGAG 2 diff
117309969/ccs 9714 TAGAG 4 diff
119998610/ccs 9714 TAGAG 5 diff
171509463/ccs 9714 TAGAT 4 diff
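As a rough illustration of what the portable versions rely on, the sketch below prints how many distinct column 3 values each column 2 value has; a count of 1 corresponds to "same" and anything larger to "diff". The function name count_distinct and the trailing sort -n are additions made for this example only, not part of the solutions above.

```shell
# Recreate the sample input from the question.
cat > test.txt <<'EOF'
49808830/ccs 9492 TACA 3
175833950/ccs 971 ACCC 1
180422692/ccs 971 ACCC 10
110952448/ccs 9714 TAGAG 2
117309969/ccs 9714 TAGAG 4
119998610/ccs 9714 TAGAG 5
171509463/ccs 9714 TAGAT 4
EOF

# count_distinct is a hypothetical helper name used for this sketch.
# It prints "<column 2> <number of distinct column 3 values>" per group.
count_distinct() {
    awk '
        # Count a (column 2, column 3) pair only the first time it is seen.
        !seen[$2 SUBSEP $3]++ { count[$2]++ }
        END { for (k in count) print k, count[k] }
    ' "$1" | sort -n    # "for (k in count)" has no defined order, so sort the result
}

count_distinct test.txt
```

With the sample file, only 9714 ends up with a count above 1, which is why its four lines are the ones labelled "diff".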
For the second part of your question (since the first part seems to be solved already):
$ awk -F ' ' '{print $0, ($3==p || p==0?"same":"diff"); p=$3; }' test.txt
49808830/ccs 9492 TACA 3 same
175833950/ccs 971 ACCC 1 diff
180422692/ccs 971 ACCC 10 same
110952448/ccs 9714 TAGAG 2 diff
117309969/ccs 9714 TAGAG 4 same
119998610/ccs 9714 TAGAG 5 same
171509463/ccs 9714 TAGAT 4 diff
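print $0 prints the current record.
($3==p || p==0?"same":"diff") appends "same" when column 3 equals the previous line's value (or on the first line), otherwise "diff".
p=$3 ensures that the next comparison is made against the current value of $3.
NOTE: the check for the first line currently tests whether p equals 0, which could be improved.
NOTE2: the output differs from your expected output, but it seems valid given that I read: "if column 3 value is the same or different".
NOTE3: -F ' ' is awk's default behavior, so you don't need to specify it.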
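One possible way to tidy up the first-line check in the one-liner above is to test NR==1 instead of p==0, since p==0 would also be true whenever the previous column 3 value happened to be numerically zero. This is only a sketch of that idea; the wrapper function flag_adjacent is a hypothetical name used for the example, and the comparison is still against the previous line only.

```shell
# Recreate the sample input from the question.
cat > test.txt <<'EOF'
49808830/ccs 9492 TACA 3
175833950/ccs 971 ACCC 1
180422692/ccs 971 ACCC 10
110952448/ccs 9714 TAGAG 2
117309969/ccs 9714 TAGAG 4
119998610/ccs 9714 TAGAG 5
171509463/ccs 9714 TAGAT 4
EOF

# flag_adjacent is a hypothetical helper name used for this sketch.
flag_adjacent() {
    # NR == 1 handles the first line explicitly; every other line is
    # compared with the column 3 value saved from the previous line.
    awk '{ print $0, (NR == 1 || $3 == p ? "same" : "diff"); p = $3 }' "$1"
}

flag_adjacent test.txt
```

On the sample file this prints the same labels as the p==0 version.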
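Using any awk, in 1 pass, and storing the lines for only one $2 value at a time in memory (this relies on lines with the same $2 being adjacent, as they are in the sample):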
$ cat tst.awk
# New column 2 value: print the previous group, then start a new one.
$2 != prev[2] {
    prt()
    numVals = 0
    numUniq = 1     # the first line of a group contributes one column 3 value
}
# Same column 2 value: count a change in column 3 relative to the previous line.
$2 == prev[2] {
    numUniq += ( $3 == prev[3] ? 0 : 1 )
}
{
    vals[++numVals] = $0
    split($0,prev)
}
END {
    prt()
}
function prt(    i) {
    for ( i=1; i<=numVals; i++ ) {
        print vals[i], ( numUniq == 1 ? "same" : "diff" )
    }
}
$ awk -f tst.awk test.txt
49808830/ccs 9492 TACA 3 same
175833950/ccs 971 ACCC 1 same
180422692/ccs 971 ACCC 10 same
110952448/ccs 9714 TAGAG 2 diff
117309969/ccs 9714 TAGAG 4 diff
119998610/ccs 9714 TAGAG 5 diff
171509463/ccs 9714 TAGAT 4 diff