Replacing millions of regexes (Perl)

Question

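I have a text file with more than a million lines. Each line contains an alphanumeric code that needs to be replaced with a name. I have tried doing this with several different Perl scripts, but every time the script dies because it uses too much memory. I am new to Perl, so I assume I am doing something wrong that makes the job more complicated than it needs to be. So far I have tried: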

use strict;
use warnings;

my $filename = 'names.txt';

my $data = read_file($filename);

$data =~ s/88tx0p/Author1/g;
##and then there are 1,000,000+ other substitution regexes.

write_file($filename, $data);
exit;

sub read_file {
    my ($filename) = @_;

    open my $in, '<:encoding(UTF-8)', $filename
        or die "Could not open '$filename' for reading: $!";
    local $/ = undef;    # slurp mode: read the whole file in one go
    my $all = <$in>;
    close $in;

    return $all;
}

sub write_file {
    my ($filename, $content) = @_;

    open my $out, '>:encoding(UTF-8)', $filename
        or die "Could not open '$filename' for writing: $!";
    print $out $content;
    close $out;

    return;
}

But then I realized that this script writes its output back to the original file, which I guess uses even more memory? So I tried the following:

use strict;
use utf8;
use warnings;

open(FILE, 'names.txt') || die "File not found";
my @lines = <FILE>;
close(FILE);

my @newlines;
foreach (@lines) {
    $_ =~ s/88tx0p/Author1/g;
    ##and then there are approximately 1,000,000 other substitution regexes.
    push(@newlines, $_);
}

open(FILE, '>names_edited.txt') || die "Could not open names_edited.txt for writing: $!";
print FILE @newlines;
close(FILE);

But again, this uses too much memory. Could I get some help with a way to do this that uses as little memory as possible? Thank you all.

Answer 1

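Your problem is the foreach loop: it iterates over @lines, so every line of the file has to be loaded into memory first, and @newlines then holds a second copy. That is the root of your problem.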

Try it with a while loop instead:

open ( my $file, '<', 'names.txt' ) or die $!; 
open ( my $output, '>', 'names_edited.txt' ) or die $!;
select $output; #destination for print; 
while ( <$file> ) {  #reads one line at a time, sets $_
    s/88tx0p/Author1/g;   #acts on $_ by default
    print; #defaults to printing $_ to the selected filehandle $output
}

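This still processes the file line by line, as your original code did, but it only reads one line at a time, so the memory footprint is much lower.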
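The loop above still leaves open how to handle the million-plus substitutions themselves. One possible approach, sketched here under assumptions that are not in the question, is to keep the codes in a hash and do a single substitution per line that looks each token up. This assumes every code appears as a whole alphanumeric token (not embedded inside a longer word), and the names %name_for, replace_codes and translate_stream, as well as the second sample code, are hypothetical. The demonstration uses in-memory filehandles so it runs without any files.

```perl
use strict;
use warnings;

# Hypothetical code-to-name table; in practice it might be loaded
# from a mapping file with one "code<TAB>name" pair per line.
my %name_for = (
    '88tx0p' => 'Author1',
    'k3z9qa' => 'Author2',
);

# Replace each whole alphanumeric token that is a known code.
# Tokens that are not in the table are left unchanged.
sub replace_codes {
    my ($line, $map) = @_;
    $line =~ s/\b([A-Za-z0-9]+)\b/exists $map->{$1} ? $map->{$1} : $1/ge;
    return $line;
}

# Copy $in to $out one line at a time, so only one line is held in memory.
sub translate_stream {
    my ($in, $out, $map) = @_;
    while (my $line = <$in>) {
        print {$out} replace_codes($line, $map);
    }
    return;
}

# Small demonstration using in-memory filehandles instead of real files.
my $input  = "88tx0p wrote chapter 1\nzz99 is not a known code\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$output or die "Cannot write output: $!";
translate_stream($in, $out, \%name_for);
close $in;
close $out;
print $output;
```

With real files, the same translate_stream call would take the $file and $output handles opened in the answer above. The hash itself has to fit in memory, but a hash of a million short strings is usually far smaller than holding the whole text file plus a copy of it, and each line is scanned once rather than a million times.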
