How can I detect runs of more than n "holes" (lines that do not match a pattern) in a text file?

Question Votes: 2 Answers: 3

Case scenario:

$ cat Status.txt
1,connected
2,connected
3,connected
4,connected
5,connected
6,connected
7,disconnected
8,disconnected
9,disconnected
10,disconnected
11,disconnected
12,disconnected
13,disconnected
14,connected
15,connected
16,connected
17,disconnected
18,connected
19,connected
20,connected
21,disconnected
22,disconnected
23,disconnected
24,disconnected
25,disconnected
26,disconnected
27,disconnected
28,disconnected
29,disconnected
30,connected

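As you can see, there are "holes", meaning lines in the sequence file whose value is "disconnected".

I want to detect these holes, but it would be useful to be able to set a minimum number n of missing entries in the sequence. For example, with 'n = 5' a detectable hole would be the 7...13 part, because there are at least 5 "disconnected" lines in a row. The missing 17, however, should not be considered detectable in this case. A valid disconnection is found again starting at line 21.

Something like: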

$ detector Status.txt -n 5 --pattern connected
7
21

...which could be read as:

- Missing more than 5 "connected" starting at 7.
- Missing more than 5 "connected" starting at 21.

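I need to script this in a Linux shell, so I was thinking of writing some loops, parsing strings and so on, but I suspect it could be done with standard Linux shell tools and some simpler programming. Is there a way?

Small programs like csvtool would be a valid solution, but since I am working with embedded devices, an approach built on the more common Linux commands (grep, cut, awk, sed, wc, etc.) would be more useful to me.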

linux shell text-processing
3 Answers
4
votes
#!/usr/bin/env bash
last_connected=0
last_num=0
last_state=
min_hole_size=${1:-5}  # default to 5, or take an argument from the command line
while IFS=, read -r num state; do
  if [[ $state = connected ]]; then
    if (( (num - last_connected) > (min_hole_size + 1) )); then
      echo "Found a hole running from $((last_connected + 1)) to $((num - 1))"
    fi
    last_connected=$num
  fi
  # read empties its variables when it hits EOF, so remember the last line seen.
  last_num=$num last_state=$state
done

# Special case: Need to also handle a hole that's still open at EOF.
if [[ $last_state != connected ]] && (( last_num - last_connected > min_hole_size )); then
  echo "Found a hole running from $((last_connected + 1)) to $last_num"
fi

...given your file on stdin (./detect-holes <in.txt), this prints:

Found a hole running from 7 to 13
Found a hole running from 21 to 29

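See:

  • BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
  • The conditional expression - the [[ ]] syntax, which lets string comparisons be done safely without quoting expansions.
  • Arithmetic comparison syntax - valid inside $(( )) in all POSIX-compliant shells; also available without expansion side effects as (( )), which is a bash extension.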
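
Since the question mentions embedded devices, where bash may not be installed, the same loop can be sketched using only the POSIX constructs named above ($(( )) arithmetic and the [ ] test). This is a minimal sketch rather than the answer's own script: the function name detect_holes is hypothetical, it prints only the starting line of each hole, and it assumes the first field is a consecutive integer line number.

```shell
#!/bin/sh
# detect_holes MIN: read "num,state" lines on stdin and print the first line
# number of every run of more than MIN lines that are not "connected".
# (Hypothetical name; assumes consecutive integer line numbers in field 1.)
detect_holes() {
  min=${1:-5}
  last_connected=0   # number of the most recent "connected" line
  last_num=0         # number of the most recent line of any kind
  while IFS=, read -r num state; do
    if [ "$state" = connected ]; then
      # Lines strictly between two "connected" lines form the hole.
      if [ $((num - last_connected - 1)) -gt "$min" ]; then
        echo $((last_connected + 1))
      fi
      last_connected=$num
    fi
    last_num=$num
  done
  # A hole still open at end of input ends at the last line read.
  if [ $((last_num - last_connected)) -gt "$min" ]; then
    echo $((last_connected + 1))
  fi
}
```

Run as detect_holes 5 < Status.txt on the sample file from the question, this should print 7 and 21.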

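3
votes

This is a perfect use case for awk, since the machinery for reading lines, splitting columns, and matching is all built in. The only tricky bit is getting the command-line argument into the script, but that is not too bad: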

#!/usr/bin/env bash
awk -v window="$1" -F, '
BEGIN { if (window=="") {window = 1} }

$2=="disconnected"{if (consecutive==0){start=NR}; consecutive++}
$2!="disconnected"{if (consecutive>window){print start}; consecutive=0}

END {if (consecutive>window){print start}}'

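The window value is given as the first command-line argument; if omitted, it defaults to 1, which means "show the start of gaps with at least two consecutive disconnections". There is probably a better name for it. You can pass 0 to include single disconnections. Example output is below. (Note that I appended a run of 2 disconnections at the end of the file to test the failure Charles mentioned.)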

njv@organon:~/tmp$ ./tst.sh 0 < status.txt # any number of disconnections
7
17
21
31
njv@organon:~/tmp$ ./tst.sh < status.txt # at least 2 disconnections
7
21
31
njv@organon:~/tmp$ ./tst.sh 8 < status.txt # at least 9 disconnections
21
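
The question also asked for a configurable pattern. One way to sketch that is a small wrapper that passes both the threshold and the pattern to awk with -v. The function name detector and its positional arguments (file, n, pattern) are assumptions made for illustration, not the interface from the question, and the sketch reports the first field of the starting line rather than NR.

```shell
#!/bin/sh
# detector FILE N PATTERN (hypothetical wrapper): print the first field of the
# line that starts each run of more than N lines whose second field is not
# PATTERN.
detector() {
  awk -F, -v n="$2" -v pattern="$3" '
    # A non-matching line extends the current run (or starts a new one).
    $2 != pattern { if (run == 0) start = $1; run++; next }
    # A matching line closes the run; report it if it was long enough.
    { if (run > n) print start; run = 0 }
    # Report a run that is still open at end of input.
    END { if (run > n) print start }
  ' "$1"
}
```

With the sample file from the question, detector Status.txt 5 connected should print 7 and 21.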

2
votes

Awk solution:

The detector.awk script:

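#!/bin/awk -f

# f: inside a hole; c: length of the current hole; nr: line where it started.
# n is the threshold, passed on the command line with -v n=...
BEGIN { FS="," }
$2 == "disconnected"{
    if (f && NR-c==nr) c++;
    else { f=1; c++; nr=NR }
}
$2 == "connected"{
    if (f) {
        if (c > n) {
            printf "- Missing more than %d \042connected\042 starting at %d.\n", n, nr
        }
        f=c=0
    }
}
# Also report a hole that is still open at the end of the input.
END {
    if (f && c > n) {
        printf "- Missing more than %d \042connected\042 starting at %d.\n", n, nr
    }
}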

Usage:

awk -f detector.awk -v n=5 status.txt

Output:

- Missing more than 5 "connected" starting at 7.
- Missing more than 5 "connected" starting at 21.