Parse a string on commas, but ignore commas inside quotes and parentheses

Question

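I'm trying to extract all of the comma-separated entries from a string, but I'm running into problems with values that contain commas inside quoted data or inside parentheses. I tried Parse::CSV and Text::ParseWords, but they don't keep the comma-separated content between parentheses together. I also have a piece of code that does the opposite: it handles the commas between parentheses but misses the commas between quotes. I inherited this code, so I'm not sure how to extend the regex block. I've attached the code along with its output. There are two examples: Example 1 uses Text::ParseWords and Example 2 uses the regex. Any help would be appreciated.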

use Text::ParseWords;  # provides parse_line(), used for Example 1

my $string1 = "10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT";

my $string2 = "10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)";

for (my $i=0; $i<2; $i++) {
  my $string;
  $string = $string1 if ($i == 0);
  $string = $string2 if ($i == 1);
  my @fields1 = ();
  my @fields2 = ();

  @fields1 = parse_line(q{,}, 0, $string);
  push @fields2, $1 while $string =~ /
                      ((?:
                       [^(),]+ |
                       ( \(
                       (?: [^()]+ | (?2) )*
                       \) )
                       )*)
                       (?: ,\s* | $)
                       /xg;

  print "\n\nProcessing String - $string\n";

  my $cnt1 = 0;
  print "\n";
  foreach my $field (@fields1) {
    $field =~ s/^\s+|\s+$//g;
    print "\nExample 1:  field $cnt1 is $field\n";
    $cnt1++;
  }

  my $cnt2 = 0;
  print "\n";
  foreach my $field (@fields2) {
    $field =~ s/^\s+|\s+$//g;
    print "\nExample 2:  field $cnt2 is $field\n";
    $cnt2++;
  }
}

Here is the output:

Processing String - 10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT


Example 1:  field 0 is 10507

Example 1:  field 1 is STAGE

Example 1:  field 2 is SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM

Example 1:  field 3 is EXPORT_LOG_MSG

Example 1:  field 4 is EXPORT_LOG_MSG_COUNT


Example 2:  field 0 is 10507

Example 2:  field 1 is STAGE

Example 2:  field 2 is 'SELECT CNTC_KEY

Example 2:  field 3 is CUST_FST_NME

Example 2:  field 4 is CUST_LST_NM'

Example 2:  field 5 is EXPORT_LOG_MSG

Example 2:  field 6 is EXPORT_LOG_MSG_COUNT

Example 2:  field 7 is


Processing String - 10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)


Example 1:  field 0 is 10507

Example 1:  field 1 is STAGE

Example 1:  field 2 is SELECT,PRTNR_CNTC_KEY

Example 1:  field 3 is SQL_EXPORT_COUNT

Example 1:  field 4 is EXPORT_LOG_MSG

Example 1:  field 5 is NAME(FIRST

Example 1:  field 6 is LAST)


Example 2:  field 0 is 10507

Example 2:  field 1 is STAGE

Example 2:  field 2 is 'SELECT

Example 2:  field 3 is PRTNR_CNTC_KEY'

Example 2:  field 4 is SQL_EXPORT_COUNT

Example 2:  field 5 is EXPORT_LOG_MSG

Example 2:  field 6 is NAME(FIRST,LAST)

Example 2:  field 7 is
regex perl
1 Answer

Your strings are a bit long, so I've shortened them to fit nicely on the screen. It's the last field that is giving you the problem:

10507, STAGE, 'SELECT CNTC_KEY,CUST_LST_NM', EXPORT_LOG_MSG_COUNT

10507, STAGE, 'SELECT,PRTNR_CNTC_KEY', NAME(FIRST,LAST)

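Handling the first example is easy because the single quotes delimit the third field. The second is harder because that field has nothing delimiting it. There's nothing to tell a general-purpose parser that NAME(FIRST,LAST) should be a single thing.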

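I don't think there's a way to do this with a regex that won't bend people's minds. It looks like a field such as NAME(FIRST,LAST) can show up in any position, and there may be plenty of other special cases too.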

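For these sorts of things, I hope the constraints are simple enough that I can write a parser (if you look inside Text::CSV_XS, you'll see the same thing going on). Basically, you walk along the string and remember what you're doing. Depending on what you see next, you do something and possibly change some state. You keep doing that until you're done.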

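The parse_line subroutine below uses Perl's global matching in scalar context with the /gc flags, which is a powerful feature once you get used to it (I may have covered this in Mastering Perl). The /g is the global part, and being in the condition supplies the scalar context. The /c is the trick: if a match fails, it tells Perl not to reset the match position, so the next match against the same string starts at the current position. The \G anchors the pattern to the position where the previous match left off. See the pos documentation.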
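To see those flags in isolation, here is a minimal sketch of a tiny tokenizer that is separate from the answer's code. The name tokenize and the NUM and WORD token types are made up for illustration. Each branch is anchored with \G, and because every match uses /c, a failed branch leaves pos() alone so the next branch tries at the same spot.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into [TYPE => value] tokens by trying
# one \G-anchored pattern after another at the current pos().
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;    # start matching at the beginning

    while (1) {
        if    ( $text =~ /\G \z /gcx )             { last }    # end of input
        elsif ( $text =~ /\G \s+ /gcx )            { next }    # skip whitespace
        elsif ( $text =~ /\G (\d+) /gcx )          { push @tokens, [ NUM  => $1 ] }
        elsif ( $text =~ /\G ([A-Za-z_]\w*) /gcx ) { push @tokens, [ WORD => $1 ] }
        else {
            # /c kept pos() intact, so it still points at the offending character
            die "Unexpected character at position " . pos($text) . "\n";
        }
    }

    return \@tokens;
}

my $tokens = tokenize("foo 42 bar_1");
print join( ' ', map { "$_->[0]:$_->[1]" } @$tokens ), "\n";
# prints: WORD:foo NUM:42 WORD:bar_1
```

Without /c, the first failed branch would reset pos() and every later branch would start again from the beginning of the string. The parse_line subroutine that follows applies the same pattern with branches for separators, quoted fields, and parentheses.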

use Carp;  # provides Carp::carp, called below if parsing stops unexpectedly

sub parse_line {
    local $_ = shift;
    my @parts = ();

    my $found_field;
    while( 1 ) {
        if( /\G $/gcx ) {  # at the end
            push @parts, '' unless $found_field;
            last;
            }
        elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
            push @parts, '' unless $found_field;
            $found_field = 0;
            }
        elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
            push @parts, $+{field};
            $found_field = 1;
            }
        elsif( /\G \(  (?<field> .*? )  (?<!\\) \) /gcxp ) { # partial field with parens
            if( $found_field ) { $parts[-1] .= ${^MATCH} }
            else               { push @parts, ${^MATCH} }
            $found_field = 1;
            }
        elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
            push @parts, ${^MATCH};
            $found_field = 1;
            }
        else {
            Carp::carp "Parsing ended abruptly at pos " . pos;
            last;
            }
        }

    return \@parts
    }

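In parse_line, you can insert additional branches if you want to handle other situations. For example, if you're dealing with Excel and its crazy way of escaping double quotes by doubling them ("field "" <--- embedded"), you'll have to handle that some other way. That's why this problem is so hard in general. If you can constrain it locally, it gets easier.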
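As a rough illustration of what such an extra branch might look like, here is a standalone sketch for the doubled-quote case. It is not part of the answer's parse_line, and the helper name match_doubled_quote_field is hypothetical. It assumes only double-quoted fields use this escaping and reads one such field at the current pos() of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: at the current pos() of the referenced string, match
# one double-quoted field in which "" stands for a literal quote.
# Returns the unescaped field, or nothing if no such field starts here.
sub match_doubled_quote_field {
    my ($text_ref) = @_;

    # The possessive *+ stops the regex from backtracking into a "" pair
    # and mistaking half of it for the closing quote.
    if ( $$text_ref =~ /\G " ( (?: [^"] | "" )*+ ) " /gcx ) {
        ( my $field = $1 ) =~ s/""/"/g;    # collapse "" into "
        return $field;
    }
    return;
}

my $line = q("field "" <--- embedded",next);
pos($line) = 0;
my $field = match_doubled_quote_field( \$line );
print "$field\n";                       # prints: field " <--- embedded
print substr( $line, pos($line) ), "\n";    # prints: ,next
```

Inside parse_line, the same pattern could sit in its own elsif ahead of the generic quoted-field branch, though whether that ordering suits your data is something you would need to check against your own inputs.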

Also, I think I may have made a few things harder than they need to be, but I just powered through, so you'll have to live with it :)

Here's a test program that covers the corner cases I could think of right away.

use v5.26;
use Carp;
use Data::Dumper qw(Dumper);
use Test::More;

my @table = (
    [
        'embedded commas in single quotes',
        q(10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
        [
          '10507',
          'STAGE',
          'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
          'EXPORT_LOG_MSG',
          'EXPORT_LOG_MSG_COUNT'
        ],
    ],
    [
        'embedded commas in double quotes',
        q(10507, STAGE, "SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM",EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
        [
          '10507',
          'STAGE',
          'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
          'EXPORT_LOG_MSG',
          'EXPORT_LOG_MSG_COUNT'
        ],
    ],
    [
        'embedded commas in parens, with prefix, last field',
        q(10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)),
        [
          '10507',
          'STAGE',
          'SELECT,PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG',
          'NAME(FIRST,LAST)'
        ],
    ],
    [
        'embedded commas in parens, with prefix, second field',
        q(10507, NAME(FIRST,LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
          '10507',
          'NAME(FIRST,LAST)',
          'STAGE',
          'SELECT,PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'embedded opening paren in quoted field',
        q(10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
          '10507',
          'NAME(FIRST',
          'LAST)',
          'STAGE',
          'SELECT,PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'escaped quote in quoted field',
        q|10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
          '10507',
          'NAME(FIRST',
          'LAST)',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'escaped closing paren in paren field',
        q|10507, NAME(FIRST, \\) LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
          '10507',
          'NAME(FIRST, \\) LAST)',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'empty parens',
        q|10507, NAME(), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
          '10507',
          'NAME()',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'empty parens only',
        q|10507, (), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
          '10507',
          '()',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'field starts with paren',
        q|10507, (FIRST, LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
          '10507',
          '(FIRST, LAST)',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'field is empty',
        q(10507, , STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
          '10507',
          '',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'first field is empty',
        q(, STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
          '',
          'STAGE',
          'SELECT,\\\'PRTNR_CNTC_KEY',
          'SQL_EXPORT_COUNT',
          'EXPORT_LOG_MSG'
        ],
    ],
    [
        'all empty',
        q(, , ,, ,),
        [
          '',
          '',
          '',
          '',
          '',
          '',
        ],
    ],

    );

foreach my $row ( @table ) {
    my( $label, $expected ) = $row->@[0,2];

    my $parts = parse_line($row->[1]);

    is_deeply $parts, $expected, $label or do {
        diag( $row->[1] );
        diag( Dumper( $parts ) )
        };
    }

sub parse_line {
    local $_ = shift;
    my @parts = ();

    my $found_field;
    while( 1 ) {
        if( /\G $/gcx ) {  # at the end
            push @parts, '' unless $found_field;
            last;
            }
        elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
            push @parts, '' unless $found_field;
            $found_field = 0;
            }
        elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
            push @parts, $+{field};
            $found_field = 1;
            }
        elsif( /\G \(  (?<field> .*? )  (?<!\\) \) /gcxp ) { # partial field with parens
            if( $found_field ) { $parts[-1] .= ${^MATCH} }
            else               { push @parts, ${^MATCH} }
            $found_field = 1;
            }
        elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
            push @parts, ${^MATCH};
            $found_field = 1;
            }
        else {
            Carp::carp "Parsing ended abruptly at pos " . pos;
            last;
            }
        }

    return \@parts
    }

done_testing;