Is it possible to extract all duplicate records based on a specific column?

Question

I am trying to extract all (and only) the duplicate values from a pipe-delimited file.

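My data file has 800,000 rows and multiple columns, and I am particularly interested in column 3. I need to find the values that occur more than once in column 3 and extract every row containing them from the file.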

I was able to achieve this as shown below.

cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt

I then fed the result into a loop, as follows.

while read dup
do
   grep "$dup" Report.txt >>only_dup.txt
done <dup.txt

I also tried an awk approach:

while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt

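However, because the file contains a large number of records, this takes a very long time to complete, so I am looking for a simpler and faster option.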

For example, I have data like this:

1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements

And my expected output, which excludes the unique records:

2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements

Answer 1

Another awk:

$ awk '{
    n=$1                       # store number
    # sub("^" n,"",$0)           # remove from $0 (not my brightest moment)
    sub(/^[^ ]*/,"",$0)        # better, see above :D
    if($0 in a) {              # if $0 in a
        if(a[$0]==1)           # if $0 seen the second time
            print b[$0] $0     # print number and rest
        print n $0             # also print current
    }
    a[$0]++                    # increase match count for $0
    b[$0]=n                    # number stored to b and only needed once
}' file

Output for the sample data:

2 learning Unix Business Team
4 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
6 learning Unix Business Team
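
For the pipe-delimited sample in the question, the same one-pass idea can be keyed on the third field rather than on the rest of the line. The following is only a minimal sketch under that assumption, not the answerer's code: the scratch file created with mktemp and the array names first and printed are illustrative. It holds back the first line seen for each column-3 value and prints it once a second line with the same value appears.

```shell
# Write the question's sample data to a scratch file (hypothetical location).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
EOF

# One pass over the file, keyed on column 3.
single_pass_dups=$(awk -F'|' '
    $3 in first {                      # this key has been seen before
        if (!printed[$3]++)            # emit the held-back first line once
            print first[$3]
        print                          # emit the current line
        next
    }
    { first[$3] = $0 }                 # first time this key is seen: hold the line
' "$sample")

printf '%s\n' "$single_pass_dups"
rm -f "$sample"
```

On the sample this prints rows 1, 2, 4, 3, 5, 6 in that order. Row 1 is included because its third column is also Unix, and the output follows the order in which duplicates are detected rather than the original file order. Memory use grows with the number of distinct column-3 values.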

Answer 2

The example in your question isn't clear, but given this input file:

$ cat file
1 | whatever | learning Unix Business Requirements
2 | whatever | learning Unix Business Team
3 | whatever | learning Linux Business Requirements
4 | whatever | learning Unix Business Team
5 | whatever | learning Linux Business Requirements
6 | whatever | learning Unix Business Team
7 | whatever | learning Windows Business Requirements

this may be what you want:

$ cat tst.awk
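BEGIN { FS="|" }
{ currKey = $3 }                  # key on the third pipe-delimited field
currKey == prevKey {              # same key as the previous (sorted) line
    if ( !prevPrinted++ ) {       # print the first line of the group only once
        print prevRec
    }
    print
    next
}
{                                 # new key: remember this line as a candidate
    prevKey = currKey
    prevRec = $0
    prevPrinted = 0
}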

$ sort -t'|' -k3,3 file | awk -f tst.awk
3 | whatever | learning Linux Business Requirements
5 | whatever | learning Linux Business Requirements
2 | whatever | learning Unix Business Team
4 | whatever | learning Unix Business Team
6 | whatever | learning Unix Business Team

Running the above with the newly posted sample input:

$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
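
If keeping the original line order matters and sorting 800,000 rows is undesirable, a common alternative is to read the file twice in a single awk call: the first pass counts each column-3 value and the second pass prints the lines whose value was counted more than once. This is a hedged sketch rather than part of the answer above; the scratch file and the array name count are assumptions made for illustration.

```shell
# Write the question's sample data to a scratch file (hypothetical location).
report=$(mktemp)
cat > "$report" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
EOF

# The same file is named twice so awk reads it in two passes.
two_pass_dups=$(awk -F'|' '
    NR == FNR { count[$3]++; next }    # pass 1: tally each column-3 value
    count[$3] > 1                      # pass 2: print lines with a repeated value
' "$report" "$report")

printf '%s\n' "$two_pass_dups"
rm -f "$report"
```

On the sample this prints rows 1 through 6 in file order and drops only row 7. The trade-off is that the input must be a regular file that can be read twice, so this form does not work on a pipe.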
