Get a substring from a column: the text up to the 3rd occurrence of "/"

Question

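I have searched everywhere but cannot find a solution that matches my problem exactly.

In Bash, I have a tab-delimited file. It may have several million lines. Column 27 contains a string of colors separated by forward slashes. My end goal is to trim column 27 so that only the first three colors remain and the rest of the colors in that column are removed.

i.e.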

    column1.    column2.    column 3.    colors
        abc.        abc.         abc.    green/yellow/red/orange/blue

should become:

    column1.    column2.   column 3.   colors
        abc.        abc.        abc.   green/yellow/red

I have been trying to do this with awk, but I can't seem to get it to work. Here is my attempt:

awk 'NR>1 BEGIN{FS=OFS="\t"} {gsub(/^(?:[^\/]*[\/]){2}[^\/]*(.*)/,"",$27); print $0}' ${filename} > "${filename}.tmp" && mv "${filename}.tmp" "${filename}"

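I am very unfamiliar with regex, and this is just what I could get working on a regex builder site, but I'm still not sure whether it is correct. To clarify again, I want all other columns left as they are; I only want to trim the colors column (column 27) so that only the first 3 colors remain. This file can get pretty large, so I'd like to keep it to a single command (e.g. awk) if possible so that I don't slow things down.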

Answer 1

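Given this input file, which has varying numbers of colors in the target column to better test the actual requirements the OP provided in a comment: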

$ cat file
column1.        column2.        column3.        colors
abc.    abc.    abc.    green/yellow/red/orange/blue
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow
abc.    abc.    abc.    green
abc.    abc.    abc.

Then, using GNU awk for the third argument to match():

$ awk 'BEGIN{FS=OFS="\t"} match($4,"([^/]*/){2}[^/]*",a){$4=a[0]} 1' file
column1.        column2.        column3.        colors
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow
abc.    abc.    abc.    green
abc.    abc.    abc.

Or using any POSIX awk:

$ awk 'BEGIN{FS=OFS="\t"} match($4,"([^/]*/){2}[^/]*"){$4=substr($4,1,RLENGTH)} 1' file
column1.        column2.        column3.        colors
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow
abc.    abc.    abc.    green
abc.    abc.    abc.

The above works no matter how many colors you have in the target column.
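To apply the same match()/substr() idea to the OP's real file, the column number can be passed in rather than hard-coded. The following is only a sketch under some assumptions: the function name trim_col and its two arguments are made up for illustration, the first line is assumed to be a header that should be left alone, and the `{2}` repetition is written out longhand because some older awk builds reportedly lack interval expressions.

```shell
# trim_col FILE COL  (hypothetical helper, not from the answer above)
# Keeps at most the first three "/"-separated items in tab-separated column COL.
trim_col() {
  awk -v col="$2" '
    BEGIN { FS = OFS = "\t" }
    # NR > 1 skips the header; the regex only matches when the field
    # contains at least two slashes, so shorter lists are left untouched.
    NR > 1 && match($col, "^[^/]*/[^/]*/[^/]*") {
      $col = substr($col, 1, RLENGTH)
    }
    { print }
  ' "$1"
}
```

It could then be used the way the question does, for example: trim_col "$filename" 27 > "$filename.tmp" && mv "$filename.tmp" "$filename"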

Answer 2

Given:

$ cat file
column1.    column2.    column 3.   colors
abc.    abc.    abc.    green
abc.    abc.    abc.    green/yellow
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow/red/orange/blue

You can do:

awk  'BEGIN{FS=OFS="\t"}
split($4,a,"/")>3{$4=a[1] "/" a[2] "/" a[3]} 1' file

Set $4 to the column you want to change ...

If you have a variable number of colors and want to print at most max of them, you can do:

awk  '
BEGIN{FS=OFS="\t"; max=3}
# n holds the number of colors; rebuild the field only when it exceeds max
(n=split($4,a,"/"))>max{
        s=a[1]
        for(i=2; i<=max; i++) s=s "/" a[i]
        $4=s
} 1' file

With that input, either of these prints:

column1.    column2.    column 3.   colors
abc.    abc.    abc.    green
abc.    abc.    abc.    green/yellow
abc.    abc.    abc.    green/yellow/red
abc.    abc.    abc.    green/yellow/red
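For the OP's file, both the column and the limit can be supplied from the shell. This is a rough sketch of the split() approach, assuming a header on line 1 and max of at least 1; the name trim_col_max and its arguments are hypothetical. It uses the return value of split() instead of length(a), since length() on an array is not available in every awk.

```shell
# trim_col_max FILE COL MAX  (hypothetical helper for illustration)
# Keeps at most MAX "/"-separated items in tab-separated column COL.
trim_col_max() {
  awk -v col="$2" -v max="$3" '
    BEGIN { FS = OFS = "\t" }
    NR > 1 {
      n = split($col, parts, "/")      # n is the number of items found
      if (n > max) {                   # only rewrite fields that are too long
        s = parts[1]
        for (i = 2; i <= max; i++) s = s "/" parts[i]
        $col = s
      }
    }
    { print }
  ' "$1"
}
```

For the question as asked, that would be something like: trim_col_max "$filename" 27 3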

Answer 3

If Perl is allowed:

$ perl -pe 's@\b(\w+/\w+/\w+).*@$1@' file
    column1.    column2.    column 3.    colors
        abc.        abc.         abc.    green/yellow/red

Answer 4

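I'd like to explain why your attempt failed.

this is just what I could get working on a regex builder site, but I'm still not sure whether it is correct

First, there are different regex flavors; see the Regular Expression Engine Comparison Chart for an overview and a comparison of which features each one has.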

{gsub(/^(?:[^\/]*[\/]){2}[^\/]*(.*)/,"",$27); print $0}

You are trying to use (?:regex) (a non-capturing group). As we can learn from the linked site, AWK uses POSIX ERE, and as the chart shows, that flavor does not support this particular feature.

Answer 5

Try this, using perl

perl -ne 'BEGIN { $filename = "input.csv"; open($in, "<", $filename) or die "Cannot open $filename: $!"; open($out, ">", "$filename.tmp") or die "Cannot open $filename.tmp: $!"; } chomp; if ($. == 1) { print $out "$_\n"; next; } @fields = split("\t", $_, -1); @colors = split("\/", $fields[26]); $fields[26] = join("\/", @colors[0..2]) if @colors > 3; print $out join("\t", @fields) . "\n"; END { close $in; close $out; }' input.csv

Output:

column1.    column2.    column 3.   colors
abc.    abc.    abc.    /green/yellow/red
abc.    abc.    abc.    /grewn/yeldow/red
abc.    abc.    abc.    /grecn/yelvow/red
abc.    abc.    abc.    /grezn/yelfow/red
abc.    abc.    abc.    /greqn/yelwow/red
