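I'm trying to extract all the comma-separated entries from a string, but I'm running into problems with values that contain commas inside quoted data or inside parentheses. I tried Parse::CSV and Text::ParseWords, but they don't keep the comma-separated content between parentheses together. I also have a piece of code that does the opposite: it keeps commas between parentheses together but misses commas between quotes. I inherited this code, so I'm not sure how to extend the regex block. I've attached the code along with its output. There are two examples: one using Text::ParseWords and one using a regex. Any help would be appreciated.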
use Text::ParseWords;    # exports parse_line
my $string1 = "10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT";
my $string2 = "10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)";
for (my $i=0; $i<2; $i++) {
    my $string;
    $string = $string1 if ($i == 0);
    $string = $string2 if ($i == 1);
    my @fields1 = ();
    my @fields2 = ();
    # Example 1: Text::ParseWords (keeps quoted commas, splits inside parens)
    @fields1 = parse_line(q{,}, 0, $string);
    # Example 2: recursive regex (keeps commas inside parens, ignores quotes)
    push @fields2, $1 while $string =~ /
        ((?:
            [^(),]+ |
            ( \(
                (?: [^()]+ | (?2) )*
            \) )
        )*)
        (?: ,\s* | $)
    /xg;
    print "\n\nProcessing String - $string\n";
    my $cnt1 = 0;
    print "\n";
    foreach my $field (@fields1) {
        $field =~ s/^\s+|\s+$//g;
        print "\nExample 1: field $cnt1 is $field\n";
        $cnt1++;
    }
    my $cnt2 = 0;
    print "\n";
    foreach my $field (@fields2) {
        $field =~ s/^\s+|\s+$//g;
        print "\nExample 2: field $cnt2 is $field\n";
        $cnt2++;
    }
}
Here is the output:
Processing String - 10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT
Example 1: field 0 is 10507
Example 1: field 1 is STAGE
Example 1: field 2 is SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM
Example 1: field 3 is EXPORT_LOG_MSG
Example 1: field 4 is EXPORT_LOG_MSG_COUNT
Example 2: field 0 is 10507
Example 2: field 1 is STAGE
Example 2: field 2 is 'SELECT CNTC_KEY
Example 2: field 3 is CUST_FST_NME
Example 2: field 4 is CUST_LST_NM'
Example 2: field 5 is EXPORT_LOG_MSG
Example 2: field 6 is EXPORT_LOG_MSG_COUNT
Example 2: field 7 is
Processing String - 10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)
Example 1: field 0 is 10507
Example 1: field 1 is STAGE
Example 1: field 2 is SELECT,PRTNR_CNTC_KEY
Example 1: field 3 is SQL_EXPORT_COUNT
Example 1: field 4 is EXPORT_LOG_MSG
Example 1: field 5 is NAME(FIRST
Example 1: field 6 is LAST)
Example 2: field 0 is 10507
Example 2: field 1 is STAGE
Example 2: field 2 is 'SELECT
Example 2: field 3 is PRTNR_CNTC_KEY'
Example 2: field 4 is SQL_EXPORT_COUNT
Example 2: field 5 is EXPORT_LOG_MSG
Example 2: field 6 is NAME(FIRST,LAST)
Example 2: field 7 is
Your strings are too long, so I've shortened them to fit nicely on the screen, since it's the last field that's giving you trouble:
10507, STAGE, 'SELECT CNTC_KEY,CUST_LST_NM', EXPORT_LOG_MSG_COUNT
10507, STAGE, 'SELECT,PRTNR_CNTC_KEY', NAME(FIRST,LAST)
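Handling the first example is easy because the single quotes delimit the third field. The second is harder because nothing delimits that last field. There's nothing to tell any general-purpose parser that NAME(FIRST,LAST) should be one thing.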
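I don't think there's a way to do this with a regex that won't warp people's minds. It looks like a field such as NAME(FIRST,LAST) can show up in any position, and there may be plenty of other special cases too.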
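For these things, I hope the constraints are simple enough that I can write a parser (if you look inside Text::CSV_XS, you'll see the same thing going on). Basically, you walk along the string and remember what you're doing. Depending on what you see next, you do something and maybe change some state. You keep doing that until you're done.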
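To make "walk along the string and remember what you're doing" concrete, here is a minimal character-at-a-time sketch. It is not the approach used in the rest of this answer. The name split_top_level is hypothetical, and the sketch assumes that quotes and parentheses are the only things that protect a comma and that there are no backslash escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit outside quotes and
# outside parentheses, looking at one character at a time.
sub split_top_level {
    my ($line) = @_;
    my @fields = ('');
    my $quote  = '';    # the open quote character, or '' when not in quotes
    my $depth  = 0;     # current parenthesis nesting depth

    for my $ch ( split //, $line ) {
        if ( $quote ne '' ) {    # inside quotes: only the closing quote matters
            $quote = '' if $ch eq $quote;
        }
        elsif ( $ch eq q{'} or $ch eq q{"} ) { $quote = $ch }
        elsif ( $ch eq '(' )                 { $depth++ }
        elsif ( $ch eq ')' and $depth > 0 )  { $depth-- }
        elsif ( $ch eq ',' and $depth == 0 ) {    # a real field separator
            push @fields, '';
            next;
        }
        $fields[-1] .= $ch;    # everything else belongs to the current field
    }

    s/^\s+|\s+$//g for @fields;    # trim padding around each field
    return @fields;
}

my @fields = split_top_level(
    q{10507, STAGE, 'SELECT,PRTNR_CNTC_KEY', NAME(FIRST,LAST)}
);
print "$_\n" for @fields;
```

In this sketch the surrounding quotes stay in the field, so the third field comes back as 'SELECT,PRTNR_CNTC_KEY' with its quotes. The two state variables ($quote and $depth) are the "remembering" part. The regex-driven version below keeps the same kind of state but consumes bigger chunks at a time.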
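The parse_line subroutine below uses Perl's global matching in scalar context with the /gc flags, which is a powerful feature once you're comfortable with it (maybe I cover this in Mastering Perl). The /g is the global bit, and using the match as a condition supplies the scalar context. The /c is the trick: if a match fails, it tells Perl not to reset the match position, so the next match against the same string starts from the current position. The \G anchors the pattern to the position where the previous match left off. See the pos documentation.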
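Here is a small sketch of the \G plus /gc mechanics in isolation, on a toy input rather than the CSV-like data. The token labels NUM and WORD are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'abc 123 def';
my @tokens;

for ($text) {    # alias $_ to $text so every match shares one pos()
    while (1) {
        if    ( /\G \z       /gcx ) { last }    # reached the end of the string
        elsif ( /\G \s+      /gcx ) { next }    # skip whitespace
        elsif ( /\G ([0-9]+) /gcx ) { push @tokens, "NUM:$1" }
        elsif ( /\G ([a-z]+) /gcx ) { push @tokens, "WORD:$1" }
        else  { die "unexpected input at pos " . pos() }
    }
}

print join( ' ', @tokens ), "\n";    # WORD:abc NUM:123 WORD:def
```

Each failed branch leaves pos() where it was because of /c, so the next branch tries again from the same spot. Without /c, the first failed match would reset the position to the start of the string.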
sub parse_line {
    local $_ = shift;
    my @parts = ();
    my $found_field;
    while( 1 ) {
        if( /\G $/gcx ) { # at the end
            push @parts, '' unless $found_field;
            last;
        }
        elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
            push @parts, '' unless $found_field;
            $found_field = 0;
        }
        elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
            push @parts, $+{field};
            $found_field = 1;
        }
        elsif( /\G \( (?<field> .*? ) (?<!\\) \) /gcxp ) { # partial field with parens
            if( $found_field ) { $parts[-1] .= ${^MATCH} }
            else { push @parts, ${^MATCH} }
            $found_field = 1;
        }
        elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
            push @parts, ${^MATCH};
            $found_field = 1;
        }
        else {
            Carp::carp "Parsing ended abruptly at pos " . pos;
            last;
        }
    }
    return \@parts
}
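In that code, you can insert other branches for any additional cases you want to handle. For example, if you're dealing with Excel and its crazy way of escaping double quotes ("field "" <--- embedded"), you'll have to handle that some other way. That's what makes this problem so hard in general. If you can constrain it locally, it gets easier.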
Also, I think I may have done some things the hard way, but I just pushed through, so you'll have to bear with me :)
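As a rough idea of what an extra branch for the Excel-style doubled quote could look like, here is a sketch. The helper name take_excel_quoted is hypothetical and not part of the code above. It assumes the field is wrapped in double quotes and that "" is the only escape. It takes a reference to the string so that pos() is shared with the caller, the same way the branches in parse_line share it through $_.

```perl
use strict;
use warnings;

# Hypothetical helper: at the current pos() of the referenced string,
# consume one double-quoted field in which "" stands for a literal ".
sub take_excel_quoted {
    my ($ref) = @_;
    return unless $$ref =~ /\G " ( (?: [^"] | "" )* ) " /gcx;
    ( my $field = $1 ) =~ s/""/"/g;    # collapse each "" into a single "
    return $field;
}

my $line  = q{"field "" <--- embedded",next};
my $field = take_excel_quoted( \$line );

print "$field\n";                         # field " <--- embedded
print substr( $line, pos($line) ), "\n";  # what is left to parse: ,next
```

Inside parse_line, the same pattern would go in its own elsif ahead of the generic quoted-field branch, because that branch would otherwise stop at the first of the two doubled quotes.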
Here's a test program that covers the corner cases I could think of off the top of my head.
use v5.26;
use Carp;
use Data::Dumper qw(Dumper);
use Test::More;
my @table = (
    [
        'embedded commas in single quotes',
        q(10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
        [
            '10507',
            'STAGE',
            'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
            'EXPORT_LOG_MSG',
            'EXPORT_LOG_MSG_COUNT'
        ],
    ],
    [
        'embedded commas in double quotes',
        q(10507, STAGE, "SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM",EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
        [
            '10507',
            'STAGE',
            'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
            'EXPORT_LOG_MSG',
            'EXPORT_LOG_MSG_COUNT'
        ],
    ],
    [
        'embedded commas in parens, with prefix, last field',
        q(10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)),
        [
            '10507',
            'STAGE',
            'SELECT,PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG',
            'NAME(FIRST,LAST)'
        ],
    ],
    [
        'embedded commas in parens, with prefix, second field',
        q(10507, NAME(FIRST,LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
            '10507',
            'NAME(FIRST,LAST)',
            'STAGE',
            'SELECT,PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'embedded opening paren in quoted field',
        q(10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
            '10507',
            'NAME(FIRST',
            'LAST)',
            'STAGE',
            'SELECT,PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'escaped quote in quoted field',
        q|10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
            '10507',
            'NAME(FIRST',
            'LAST)',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'escaped closing paren in paren field',
        q|10507, NAME(FIRST, \\) LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
            '10507',
            'NAME(FIRST, \\) LAST)',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'empty parens',
        q|10507, NAME(), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
            '10507',
            'NAME()',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'empty parens only',
        q|10507, (), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
            '10507',
            '()',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'field starts with paren',
        q|10507, (FIRST, LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
        [
            '10507',
            '(FIRST, LAST)',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'field is empty',
        q(10507, , STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
            '10507',
            '',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'first field is empty',
        q(, STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
        [
            '',
            'STAGE',
            'SELECT,\\\'PRTNR_CNTC_KEY',
            'SQL_EXPORT_COUNT',
            'EXPORT_LOG_MSG'
        ],
    ],
    [
        'all empty',
        q(, , ,, ,),
        [
            '',
            '',
            '',
            '',
            '',
            '',
        ],
    ],
);
foreach my $row ( @table ) {
    my( $label, $expected ) = $row->@[0,2];    # test name and expected fields
    my $parts = parse_line($row->[1]);
    is_deeply $parts, $expected, $label or do {
        diag( $row->[1] );
        diag( Dumper( $parts ) )
    };
}
sub parse_line {
    local $_ = shift;
    my @parts = ();
    my $found_field;
    while( 1 ) {
        if( /\G $/gcx ) { # at the end
            push @parts, '' unless $found_field;
            last;
        }
        elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
            push @parts, '' unless $found_field;
            $found_field = 0;
        }
        elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
            push @parts, $+{field};
            $found_field = 1;
        }
        elsif( /\G \( (?<field> .*? ) (?<!\\) \) /gcxp ) { # partial field with parens
            if( $found_field ) { $parts[-1] .= ${^MATCH} }
            else { push @parts, ${^MATCH} }
            $found_field = 1;
        }
        elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
            push @parts, ${^MATCH};
            $found_field = 1;
        }
        else {
            Carp::carp "Parsing ended abruptly at pos " . pos;
            last;
        }
    }
    return \@parts
}
done_testing;