Is it possible to extract all duplicate records based on a specific column?


I am trying to extract all (and only) the duplicate values from a pipe-delimited file.

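My data file has 800,000 rows and multiple columns, and I am specifically interested in column 3. So I need to find the duplicate values in column 3 and extract every row of the file that contains one of them.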

I can already do this, as shown below.

cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt

I then feed the result into a loop, as follows.

while read dup
do
   grep "$dup" Report.txt >>only_dup.txt
done <dup.txt

I also tried an awk approach:

while read dup
do
awk -F'|' -v a="$dup" '$3 == a { print $0 }' Report.txt >>only_dup.txt
done <dup.txt

However, because the file contains so many records, this takes a very long time to finish. So I am looking for a quick and simple alternative.

For example, I have data like this:

1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements

And this is my expected output, which excludes the unique records:

2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
bash unix awk ksh
3 answers

0 votes

Another awk. It expects space-separated input and treats everything after the leading number as the key:

$ awk '{
    n=$1                       # store number
    # sub("^" n,"",$0)           # remove from $0 (not my brightest moment)
    sub(/^[^ ]*/,"",$0)        # better, see above :D
    if($0 in a) {              # if $0 in a
        if(a[$0]==1)           # if $0 seen the second time
            print b[$0] $0     # print number and rest
        print n $0             # also print current
    }
    a[$0]++                    # increase match count for $0
    b[$0]=n                    # number stored to b and only needed once
}' file

Output for the sample data:

2 learning Unix Business Team
4 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
6 learning Unix Business Team
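
The sample in the question is pipe-delimited, so the script above would need a small change to run on it. Here is a minimal sketch of the same idea, assuming the key is everything after the first `|`-separated field. The function name `dups_by_rest` is hypothetical, used only for illustration.

```shell
# Hypothetical helper: print every row whose content after the first
# pipe-separated field occurs more than once in the file given as $1.
dups_by_rest() {
    awk '
    {
        key = $0
        sub(/^[^|]*\|/, "", key)      # drop the first field and its delimiter
        n = ++count[key]              # how many times this key has been seen
        if (n == 1) {                 # first sighting: remember the row, print nothing yet
            first[key] = $0
            next
        }
        if (n == 2)                   # second sighting: emit the remembered first row
            print first[key]
        print                         # emit the current row
    }' "$1"
}
```

Called as `dups_by_rest Report.txt`, this should print rows 2, 4, 3, 5 and 6 of the question's sample in a single pass, which matches the expected output there. It keeps one stored row and one counter per distinct key in memory.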

0 votes

The example in your question isn't clear, but given this input file:

$ cat file
1 | whatever | learning Unix Business Requirements
2 | whatever | learning Unix Business Team
3 | whatever | learning Linux Business Requirements
4 | whatever | learning Unix Business Team
5 | whatever | learning Linux Business Requirements
6 | whatever | learning Unix Business Team
7 | whatever | learning Windows Business Requirements

this may be what you want:

$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
    if ( !prevPrinted++ ) {
        print prevRec
    }
    print
    next
}
{
    prevKey = currKey
    prevRec = $0
    prevPrinted = 0
}

$ sort -t'|' -k3,3 file | awk -f tst.awk
3 | whatever | learning Linux Business Requirements
5 | whatever | learning Linux Business Requirements
2 | whatever | learning Unix Business Team
4 | whatever | learning Unix Business Team
6 | whatever | learning Unix Business Team

Running the above on the newly posted sample input:

$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
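
If the original row order should be kept and sorting 800,000 rows is undesirable, a common alternative is to have awk read the file twice: the first pass counts each column-3 value, and the second prints the rows whose value was counted more than once. A sketch under that assumption follows. The function name `dups_by_col3` is hypothetical.

```shell
# Hypothetical helper: print every row of the file given as $1 whose
# third pipe-separated field occurs more than once in that file.
dups_by_col3() {
    # The same file is passed twice, so awk reads it in two passes.
    awk -F'|' '
    NR == FNR { count[$3]++; next }   # pass 1: count each column-3 value
    count[$3] > 1                     # pass 2: print rows whose value repeats
    ' "$1" "$1"
}
```

`dups_by_col3 Report.txt > only_dup.txt` should print rows 1 to 6 of the new sample in input order. Like the sort-based script, it treats row 1 as a duplicate because its third column is `Unix`. The input has to be a regular file rather than a pipe, since it is read twice, and only one counter per distinct key is held in memory.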