I have a text or log file which typically looks like this:
First line which is also a paragraph.
Another line that is its own paragraph.
etc. etc.
But sometimes it spills over into multi-line paragraphs:
First line which is also a paragraph.
Another line that is its own paragraph.
Now, this paragraph encompasses more than a single line
    with its second line onwards being indented by spaces
    to distinguish it from the paragraph-opener, although
    it could just as well have been tabs etc.
This is another paragraph.
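I want to sort these paragraphs lexicographically; I don't mind whether the sort key is just the first line or the entire paragraph. If these were all one-line paragraphs, then Bob's your uncle: we have sort for that. But what can I do otherwise?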
I know that, in principle, I could:
But that seems a bit cumbersome. Can I do better?
Note: I realize this can be done in a straightforward way with an awk or perl script, but the closer the answer is to a one-liner, the better.
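Here is a one-line pipeline. It uses perl to read the whole file and insert a 0 byte at the start of each paragraph (that is, before every line that begins with a non-whitespace character), sort to sort the resulting 0-delimited records, and finally tr to strip those 0 bytes from the final output. It is basically a simpler version of your idea.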
$ perl -0777 -pe 's/^(?=\S)/\0/gm' input.txt | sort -z | tr -d '\0'
Another line that is its own paragraph.
First line which is also a paragraph.
Now, this paragraph encompasses more than a single line
    with its second line onwards being indented by spaces
    to distinguish it from the paragraph-opener, although
    it could just as well have been tabs etc.
This is another paragraph.
(This requires a version of sort that supports the -z option.)
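If sort on your system has no -z option, the same idea can be sketched with a placeholder byte instead of the 0 byte. The following is only an illustration under stated assumptions: the function name sort_paragraphs_portable is made up for this example, and it assumes that the byte \001 never occurs in the input. Each paragraph is folded onto one line, the lines are sorted, and the placeholder is turned back into newlines.

```shell
# Hypothetical helper (name invented for this sketch).
# Assumption: the input never contains the byte \001.
sort_paragraphs_portable() {
    awk '
        # Before every line except the first, emit a separator:
        # a newline if the line opens a new paragraph (no leading
        # space or tab), otherwise the \001 placeholder.
        NR > 1 { printf "%s", (/^[^ \t]/ ? "\n" : "\001") }
        { printf "%s", $0 }
        # Terminate the last paragraph.
        END { if (NR) printf "\n" }
    ' "$@" | LC_ALL=C sort | tr '\001' '\n'
}

# Example run on a small sample fed through stdin.
sort_paragraphs_portable <<'EOF'
First line which is also a paragraph.
Now, this paragraph spans more than one line
    with an indented continuation line.
Another line that is its own paragraph.
EOF
```

LC_ALL=C is set here only to make the ordering byte-wise and reproducible; drop it if you want the sort order of your own locale.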
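Alternatively, if you can install extra software, I found a nifty program written in perl called ptp (install it through your OS package manager if available, or with cpan App::PTP, cpanm App::PTP, or your preferred CPAN client):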
$ ptp --input-separator '\n(?=\S)' --sort input.txt
Another line that is its own paragraph.
First line which is also a paragraph.
Now, this paragraph encompasses more than a single line
    with its second line onwards being indented by spaces
    to distinguish it from the paragraph-opener, although
    it could just as well have been tabs etc.
This is another paragraph.