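I am trying to merge lists of email addresses, but I want to uniq (or uniq -i -u) on the email address rather than on the whole line, so that we end up with no duplicates.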
List 1:
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
List 2:
firstname lastname <[email protected]>
Fake Person <[email protected]>
Joe lastnanme <[email protected]>
The current output is
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Fake Person <[email protected]>
Joe lastnanme <[email protected]>
The desired output would be
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Joe lastnanme <[email protected]>
(because [email protected] is listed in both)
How can I achieve this?
Here is one in awk:
$ awk '
match($0,/[a-z0-9.]+@[a-z.]+/) { # look for emailish string *
a[substr($0,RSTART,RLENGTH)]=$0 # and hash the record using the address as key
}
END { # after all are processed
for(i in a) # output them in no particular order
print a[i]
}' file2 file1 # switch order to see how it affects output
Output:
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
Joe lastnanme <[email protected]>
firstname lastname <[email protected]>
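The script looks for a very simple email-like string (* see the regex in the script and tune it to your liking) and uses it as the key to hash the whole record. The last instance wins, because earlier ones are overwritten.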
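If the first instance should win instead, and the input order should be kept, the same regex key can be combined with a "seen" check. This is only a minimal sketch: the files list1.txt and list2.txt, the names in them, the example.com addresses, and the helper name dedupe_first are all made up for illustration and are not from the question.

```shell
# Hypothetical sample data; names and addresses are invented for this sketch.
tmpdir=$(mktemp -d)
printf '%s\n' 'Company A <a@example.com>' 'Company B <b@example.com>' > "$tmpdir/list1.txt"
printf '%s\n' 'Fake Person <b@example.com>' 'Joe Bloggs <joe@example.com>' > "$tmpdir/list2.txt"

# Print a record only the first time its email-like key is seen.
# Records without an email-like string are skipped, as in the script above.
dedupe_first() {
    awk 'match($0, /[a-z0-9.]+@[a-z.]+/) {
        key = substr($0, RSTART, RLENGTH)   # the matched address
        if (!(key in seen)) {               # first time this address appears
            seen[key]                       # referencing the element creates it
            print
        }
    }' "$@"
}

first_wins=$(dedupe_first "$tmpdir/list1.txt" "$tmpdir/list2.txt")
printf '%s\n' "$first_wins"
rm -rf "$tmpdir"
```

With this sample data the "Fake Person" line is dropped, because its address already appeared in the first file.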
Given your file format,
$ awk -F'[<>]' '!a[$2]++' files
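will print the first instance of each line, treating lines as duplicates when the content inside the angle brackets is the same. Or, if nothing follows the email address, there is no need to unwrap the angle brackets: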
$ awk '!a[$NF]++' files
The same can be done with sort:
$ sort -t'<' -k2,2 -u files
As a side effect the output will be sorted, which may (or may not) be desired.
Note: both alternatives assume that angle brackets do not appear anywhere other than around the email address.
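Since the question mentions uniq -i, here is a hedged sketch of a case-insensitive variant of the first command. The helper name dedupe_nocase and the sample lines are hypothetical; tolower() is a standard awk function.

```shell
# Keep the first line for each address, ignoring letter case.
dedupe_nocase() {
    # Split on < or >, so $2 is the text inside the angle brackets;
    # tolower() folds case before the "seen" check.
    awk -F'[<>]' '!seen[tolower($2)]++' "$@"
}

# Hypothetical input: the first two addresses differ only in case.
nocase_out=$(printf '%s\n' \
    'Company B <B@Example.com>' \
    'Fake Person <b@example.com>' \
    'Joe Bloggs <joe@example.com>' | dedupe_nocase)
printf '%s\n' "$nocase_out"
```

The same assumption as above applies: angle brackets must appear only around the address.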
uniq has a -f option to skip a number of blank-separated fields, so we can sort on the third field and then have uniq skip the first two fields:
$ sort -k 3,3 infile | uniq -f 2
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Joe lastnanme <[email protected]>
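However, this is not very robust: it breaks as soon as there are not exactly two fields before the email address, because the sort will be on the wrong field and uniq will compare the wrong fields.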
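To illustrate the fragility, here is a small sketch with hypothetical input in which one name is a single word, so the address sits in field 2 rather than field 3. It compares the field-position approach with a check on the last field only; the names and addresses are made up.

```shell
# Hypothetical input: "Alice" is a single word, so her address is field 2, not 3.
sample='Company B <b@example.com>
Alice <b@example.com>'

# Field-position approach: field 3 is empty for the Alice line,
# so sort and uniq compare different things and the duplicate survives.
by_position=$(printf '%s\n' "$sample" | sort -k 3,3 | uniq -f 2)

# Last-field approach: the address is always $NF, so the duplicate is dropped.
by_last_field=$(printf '%s\n' "$sample" | awk '!e[$NF]++')

printf 'sort | uniq -f 2:\n%s\n\nawk on $NF:\n%s\n' "$by_position" "$by_last_field"
```

The first pipeline prints both lines, while the awk command keeps only the first one.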
Check karakfa's answer to see that uniq is not even needed here.
Alternatively, check only the last field for uniqueness:
awk '!e[$NF] {print; ++e[$NF]}' infile
Or, even shorter, stolen from karakfa:
awk '!e[$NF]++' infile
Could you please try the following?
awk '
{
match($0,/<.*>/)
val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
a[val]=$0
print
next
}
!(val in a)
' list1 list2
Explanation: the same code with comments added.
awk ' ##Starting awk program here.
{ ##Starting BLOCK which will be executed for both of the Input_files.
match($0,/<.*>/) ##Using match function of awk where giving regex to match everything from < to till >
val=substr($0,RSTART,RLENGTH) ##Creating variable named val whose value is substring of current line starting from RSTART to value of RLENGTH, basically matched string.
} ##Closing above BLOCK here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1st Input_file named list1 will be read.
a[val]=$0 ##Creating an array named a whose index is val and value is current line.
print $0 ##Printing current line here.
next ##next will skip all further statements from here.
}
!(val in a) ##Checking condition if variable val is NOT present in array a if it is NOT present then do printing of current line.
' list1 list2 ##Mentioning Input_file names here.
The output will be as follows.
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Joe lastnanme <[email protected]>
Maybe I don't understand the question! But you can try this awk:
awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2