uniq on only part of a line

Question

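I'm trying to merge lists of email addresses, but I want to uniq (or uniq -i -u) on the email address rather than on the whole line, so that we don't end up with duplicates.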

List 1:

Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>

List 2:

firstname lastname <[email protected]>
Fake Person <[email protected]>
Joe lastnanme <[email protected]>

The current output is

Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Fake Person <[email protected]>
Joe lastnanme <[email protected]>

The desired output would be

Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Joe lastnanme <[email protected]>

(because [email protected] is listed in both)

How can I do this?

Answer 1

Here's one in awk:

$ awk '
match($0,/[a-z0-9.]+@[a-z.]+/) {      # look for emailish string *
    a[substr($0,RSTART,RLENGTH)]=$0   # and hash the record using the address as key
}
END {                                 # after all are processed
    for(i in a)                       # output them in no particular order
        print a[i]
}' file2 file1                        # switch order to see how it affects output

Output:

Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
Joe lastnanme <[email protected]>
firstname lastname <[email protected]>

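The script looks for a very simple email-like string (* see the regex in the script and tune it to your liking) and uses it as the key under which the whole record is hashed. The last instance wins, because earlier ones are overwritten.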
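If the first instance should win instead, and the input order should be kept, a minimal sketch along the same lines might look like this. The function name `dedupe_first_wins` and the sample addresses are made up for illustration; the regex is the same deliberately simple one as in the script above.

```shell
# Hypothetical helper: keep the first record seen for each email-like string.
dedupe_first_wins() {
    awk '
    match($0, /[a-z0-9.]+@[a-z.]+/) {        # same simple "emailish" regex as above
        key = substr($0, RSTART, RLENGTH)    # the matched address is the key
        if (!(key in seen)) {                # first time this address shows up
            seen[key] = 1                    # remember it
            print                            # and print the record right away
        }
    }' "$@"
}

# Made-up sample data (an assumption, not taken from the question).
printf '%s\n' \
    'Company A <a@example.com>' \
    'Fake Person <fake@example.com>' \
    'Other Name <fake@example.com>' | dedupe_first_wins
```

Because each record is printed as soon as its address is first seen, no END block is needed and the lines come out in input order.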

Answer 2

Given your file format,

$ awk -F'[<>]' '!a[$2]++' files

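will print only the first line for each distinct value inside the angle brackets. Or, if nothing follows the email address, there is no need to unwrap the angle brackets: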

$ awk '!a[$NF]++' files

The same can be done with sort:

$ sort -t'<' -k2,2 -u files

As a side effect, the output will be sorted, which may or may not be desired.

Note: both alternatives assume that angle brackets do not appear anywhere other than around the email address.
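The question mentions uniq -i, which suggests that case should be ignored when comparing addresses. A minimal sketch under that assumption lower-cases the key with awk's `tolower` before using it as the array index. The function name `dedupe_ignore_case` and the sample addresses are made up for illustration.

```shell
# Hypothetical helper: first line wins, addresses compared case-insensitively.
dedupe_ignore_case() {
    # $2 is the text between < and >; tolower() normalizes it for the lookup only,
    # so the printed line keeps its original spelling.
    awk -F'[<>]' '!seen[tolower($2)]++' "$@"
}

# Made-up sample data (an assumption, not taken from the question).
printf '%s\n' \
    'Fake Person <Fake@Example.com>' \
    'fake person <fake@example.com>' \
    'Joe Lastname <joe@example.com>' | dedupe_ignore_case
```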

Answer 3

uniq has an -f option to skip a number of blank-separated fields, so we can sort on the third field and then ignore the first two:

$ sort -k 3,3 infile | uniq -f 2
Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Joe lastnanme <[email protected]>

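However, this is not very robust: it breaks as soon as there are not exactly two fields before the email address, because sort will then sort on the wrong field and uniq will compare the wrong fields.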
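To make that failure mode concrete, here is a small demonstration with made-up data in which one name has three words. The function names `sample`, `by_position`, and `by_last_field` are hypothetical and only serve this example.

```shell
# Made-up sample: two lines share an address, but one name has three words.
sample() {
    printf '%s\n' \
        'Company A <a@example.com>' \
        'Fake Person <fake@example.com>' \
        'Another Fake Person <fake@example.com>'
}

# Field-position approach: on the last line the address is field 4, so
# sort and uniq look at the wrong field and the duplicate survives.
by_position() { sort -k 3,3 | uniq -f 2; }

# Last-field approach: the address is always $NF, whatever the name length.
by_last_field() { awk '!seen[$NF]++'; }

sample | by_position | wc -l      # counts 3 lines: the duplicate is kept
sample | by_last_field | wc -l    # counts 2 lines: the duplicate is dropped
```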

Check karakfa's answer to see that uniq is not even needed here.

Alternatively, check only the last field for uniqueness:

awk '!e[$NF] {print; ++e[$NF]}' infile

Or even shorter, stolen from karakfa: awk '!e[$NF]++' infile

Answer 4

Could you please try the following?

awk '
{
   match($0,/<.*>/)
   val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
   a[val]=$0
   print
   next
}
!(val in a)
' list1 list2

Explanation: the same code as above, with comments added.

awk '                                    ##Starting awk program here.
{                                        ##Starting BLOCK which will be executed for both of the Input_files.
   match($0,/<.*>/)                      ##Using match function of awk where giving regex to match everything from < to till >
   val=substr($0,RSTART,RLENGTH)         ##Creating variable named val whose value is substring of current line starting from RSTART to value of RLENGTH, basically matched string.
}                                        ##Closing above BLOCK here.
FNR==NR{                                 ##Checking condition FNR==NR which will be TRUE when 1st Input_file named list1 will be read.
   a[val]=$0                             ##Creating an array named a whose index is val and value is current line.
   print $0                              ##Printing current line here.
   next                                  ##next will skip all further statements from here.
}
!(val in a)                              ##Checking condition if variable val is NOT present in array a if it is NOT present then do printing of current line.
' list1 list2                            ##Mentioning Input_file names here.

The output will be as follows.

Company A <[email protected]>
Company B <[email protected]>
Company C <[email protected]>
firstname lastname <[email protected]>
Joe lastnanme <[email protected]>

Answer 5

Maybe I don't understand the question! But you can try this awk:

awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2
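Spelled out with comments, the one-liner reads roughly as follows. This is only a sketch: the wrapper name `dedupe_third_field` and the sample addresses are made up, and it assumes the address is always the third blank-separated field, that is, every name has exactly two words.

```shell
# Hypothetical wrapper around the one-liner; expects two file arguments.
dedupe_third_field() {
    awk '
    NR != FNR && ($3 in a) { next }   # past the first file: skip addresses already seen
    { a[$3] }                         # otherwise remember the address (field 3)
    1                                 # and print the line
    ' "$1" "$2"
}

# Made-up sample files (an assumption, not taken from the question).
dir=$(mktemp -d)
printf '%s\n' 'Company A <a@example.com>' 'Fake Person <fake@example.com>' > "$dir/list1"
printf '%s\n' 'Other Person <fake@example.com>' 'Joe Lastname <joe@example.com>' > "$dir/list2"
dedupe_third_field "$dir/list1" "$dir/list2"
```

Every line of the first file is printed unconditionally, so duplicates within list1 itself are not removed.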
