可以连接两个使用不同输入记录分隔符的perl脚本

问题描述 投票:2回答:2

两个使用不同输入记录分隔符的perl脚本一起工作,将LaTeX文件转换为易于搜索的人类可读短语和句子。当然,它们可以通过单个shell脚本包装在一起。但我很好奇它们是否可以合并到一个perl脚本中。

这些脚本的原因:例如,在short.tex中找到“two three”会很麻烦。但转换后,grep'two three'将返回第一段。

对于任何LaTeX文件(此处为short.tex),将按如下方式调用脚本。

cat short.tex | try1.pl | try2.pl

try1.pl适用于段落。它摆脱了LaTeX的评论。它确保每个单词与其邻居分隔一个空格,这样就不会隐藏在单词之间的偷偷摸摸的标签,换页等。结果段落占用一行,由单个空格分隔的可见字符组成---最后是至少两个换行符的序列。

try2.pl啜饮整个文件。它确保段落之间只有两个换行符。并且它确保文件的最后一行是非平凡的,包含可见字符。

可以优雅地将两个依赖于不同输入记录分隔符的操作连接成一个perl脚本,比如big.pl吗?例如,try1.pl和try2.pl的工作是否可以通过较大脚本中的两个函数或括号内的段来完成?

顺便说一下,“输入记录分隔符”是否有stackoverflow关键字?

try1.评论:

#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
while (<>) {
    s/[\x25].*\n/ /g; # remove all LaTeX comments. They start with % 
    s/[\t\f\r ]+/ /g; # collapse each "run" of whitespace to one single space
    s/^\s*\n/\n/g; # any line that looks blank is converted to a pure newline; 
    s/(.)\n/$1/g; # Any line that does not look blank is joined to the subsequent line
    print;
    print "\n\n"; # make sure each paragraph is separated from its fellows by newlines 
}

try2.评论:

#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = undef; # input record separator: entire text or file is a single record.
while (<>) {
    s/[\n][\n]+/\n\n/g; # exactly 2 blank lines separate paragraphs. Like cat -s
    s/[\n]+$/\n/; # last line is nontrivial; no blank line at the end
    print;
}

short.tex:

\paragraph{One}
% comment 
two % also 2
three % or 3

% comment
% comment

% comment
% comment

% comment

% comment

So they said%
that they had done it.

% comment
% comment
% comment





Fleas.

% comment

% comment




转换后:

\paragraph{One} two three 

So they said that they had done it.

Fleas.
perl concatenation record paragraph slurp
2个回答
1
投票

要将try1.pltry2.pl组合成一个脚本,您可以尝试:

local $/ = "";
my @lines;
while (<>) {
    [...]    # Same code as in try1.pl except print statements
    push @lines, $_;
}

$lines[-1] =~ s/\n+$/\n/;
print for @lines;

1
投票

激动人心的问题是生成一个LaTeX文件的“清理”版本,使用正则表达式可以很容易地搜索复杂的短语或句子。

下面的单个perl脚本完成了这项工作,而之前我需要一个shell脚本和两个perl脚本,需要三次调用perl。这个新的单个脚本包含三个连续循环,每个循环都有一个不同的输入记录分隔符。

  1. 第一个循环:input = STDIN,或作为参数传递的文件;记录separator = default,逐行;将结果打印到fileafterperlLIN,硬盘驱动器上的临时文件。
  2. 第二个循环:input = fileafterperlLIN;记录separator =“”,逐段;将结果打印到fileafterperlPRG,硬盘驱动器上的临时文件。
  3. 第三循环:input = fileafterperlPRG;记录separator = undef,将整个文件打印结果粘贴到STDOUT

这样做的缺点是打印到硬盘驱动器上的两个文件并从中读取,这可能会降低它的速度。优点是操作似乎只需要一个过程;并且所有代码都驻留在一个文件中,这样可以更容易维护。

#!/usr/bin/perl
# 2019v04v05vFriv17h18m41s 
use strict;
use warnings;
use 5.18.2;
my $diagnose;
my $diagnosticstring;
my $exitcode;
my $userName =  $ENV{'LOGNAME'}; 
my $scriptpath; 
my $scriptname; 
my $scriptdirectory;
my $cdld;
my $fileafterperlLIN;
my $fileafterperlPRG;
my $handlefileafterperlLIN;
my $handlefileafterperlPRG;
my $encoding; 
my $count; 
sub diagnosticmessage {
    return unless ( $diagnose );
    print STDERR "$scriptname: ";
    foreach $diagnosticstring (@_) {
        printf STDERR "$diagnosticstring\n";
    }
}

# routine setup 
$scriptpath= $0;
$scriptname=$scriptpath; 
$scriptname=~s|.*\x2f([^\x2f]+)$|$1|;
$cdld="$ENV{'cdld'}"; # a directory to hold temporary files used by scripts
$exitcode=system("test -d $cdld && test -w $cdld || { printf '%\n' 'cdld not a writeable directory'; exit 1; }");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$scriptdirectory="$cdld/$scriptname"; # to hold temporary files used by this script 
$exitcode=system("test -d $scriptdirectory || mkdir $scriptdirectory");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
diagnosticmessage ( "scriptdirectory=$scriptdirectory" );
$exitcode=system("test -w $scriptdirectory && test -x $scriptdirectory || exit 1;");
die "$scriptname: system returned exitcode=$exitcode: $scriptdirectory not writeable or not executable. bail\n" unless $exitcode == 0;
$fileafterperlLIN="$scriptdirectory/afterperlLIN.tex";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
$exitcode=system("printf '' > $fileafterperlLIN;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$fileafterperlPRG="$scriptdirectory/afterperlPRG.tex";
diagnosticmessage ( "fileafterperlPRG=$fileafterperlPRG" );
$exitcode=system("printf '' > $fileafterperlPRG;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;

# this script's job: starting with a LaTeX file, which may compile beautifully in pdflatex but be difficult 
# to read visually or search automatically,
# (1) convert any line that looks blank --- a "trivial line", containing only whitespace --- to a pure newline. This is because 
#     (a) LaTeX interprets any whitespace line following a non-blank or "nontrivial" line as end of paragraph, whereas
#     (b) perl needs 2 consecutive newlines to signal end of paragraph. 
# (2) remove all LaTeX comments; 
# (3) deal with the \unskip LaTeX construct, etc.
# The result will be 
# (4) each LaTeX paragraph will occupy a unique line
# (5) exactly one pair of newlines --- visually, one blank line --- will divide each pair of consecutive paragraphs 
# (6) first paragraph will be on first line (no opening blank line) and last paragraph will be on last line (no ending blank line)
# (7) whitespace in output will consist of only 
#     (a) a single space between readable strings, or
#     (b) double newline between paragraphs 
#
$handlefileafterperlLIN = undef;
$handlefileafterperlPRG = undef;
$encoding = ":encoding(UTF-8)";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" ); 
open($handlefileafterperlLIN, ">> $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for appending: $!";
# Loop 1 / line: 
# default input record separator: loop through one line at a time, delimited by \n
$count = 0;
while (<>) {
    $count = $count + 1;
    diagnosticmessage ( "line $count" ); 
    s/^\s*\n/\n/mg; # convert any trivial line to a pure newline. 
    print $handlefileafterperlLIN $_;
}
close($handlefileafterperlLIN);
open($handlefileafterperlLIN, "< $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for reading: $!";
open($handlefileafterperlPRG, ">> $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for appending: $!";
# Loop PRG / paragraph: 
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
$count = 0;
while (<$handlefileafterperlLIN>) {
    $count = $count + 1;
    diagnosticmessage ( "paragraph $count" );
    s/(?<!\x5c)[\x25].*\n/ /g; # remove all LaTeX comments.
   #    They start with % not \% and extend to end of line or newline character. Join to next line.
    #   s/(?<!\x5c)([\x24])/\x2a/g; # 2019v04v01vMonv13h44m09s any $ not preceded by backslash \, replace $ by * or something. 
    #   This would be only if we are going to run detex on the output.
    s/(.)\n/$1 /g; # Any line that has something other than newline, and then a newline, is joined to the subsequent line
    s|([^\x2d])\s*(\x2d\x2d\x2d)([^\x2d])|$1 $2$3|g; # consistent treatment of triple hyphen as em dash
    s|([^\x2d])(\x2d\x2d\x2d)\s*([^\x2d])|$1$2 $3|g; # consistent treatment of triple hyphen as em dash, continued
    s/[\x0b\x09\x0c\x20]+/ /gm; # collapse each "run" of whitespace other than newline, to a single space.
    s/\s*[\x5c]unskip(\x7b\x7d)?\s*(\S)/$2/g; # LaTeX whitespace-collapse across newlines
    s/^\s*//; # Any nontrivial line: No indenting. No whitespace in first column.
    print $handlefileafterperlPRG $_;
    print $handlefileafterperlPRG "\n\n"; # make sure each paragraph ends with 2 newlines, hence at least 1 blank line.
}
close($handlefileafterperlPRG);

open($handlefileafterperlPRG, "< $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for reading: $!";
# Loop slurp 
local $/ = undef;   # input record separator: entire file is a single record.
$count = 0;
while (<$handlefileafterperlPRG>) {
    $count = $count + 1;
    diagnosticmessage ( "slurp $count" );
    s/[\n][\n]+/\n\n/g; # exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
    s/[\n]+$/\n/;           # last line is visible or "nontrivial"; no trivial (blank) line at the end
    s/^[\n]+//;             # no trivial (blank) line at the start. The first line is "nontrivial."
    print STDOUT;
}
© www.soinside.com 2019 - 2024. All rights reserved.