Truncate a multibyte string to n characters

Question (votes: 7, answers: 4)

I am trying to get this method in a string filter working:

public function truncate($string, $chars = 50, $terminator = ' …');

I expect this

$in  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890";
$out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …";

and also this

$in  = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ";
$out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …";

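In both cases the result is cut to $chars minus the number of characters in the $terminator string.

In addition, the filter should cut at the first word boundary below the $chars limit, e.g.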

$in  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$out = "Answer to the Ultimate Question of Life, the …";

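I am pretty certain this should work with these steps:

  • Subtract the number of characters in the terminator from the maximum number of characters
  • Check that the string is longer than the calculated limit, or return it unaltered
  • Find the last space character in the string below the calculated limit to get the word boundary
  • Cut the string at the last space, or at the calculated limit if no space is found
  • Append the terminator to the string
  • Return the string

However, I have now tried various combinations of str* and mb_* functions, and all of them yielded wrong results. This can't be that difficult, so I am obviously missing something. Would someone share a working implementation, or point me to a resource where I can finally understand how to do it?

Thanks

P.S. Yes, I have checked https://stackoverflow.com/search?q=truncate+string+php before :)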
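The steps above can be sketched roughly as follows. This is only an illustration, not a tested library function: the name truncateAtWordBoundary is made up, UTF-8 input is assumed, and the search window is one character wider than the limit so that a space sitting exactly at the limit still counts as a boundary.

```php
<?php
// Hypothetical name; a sketch of the listed steps under the assumption of UTF-8 input.
function truncateAtWordBoundary(string $string, int $chars = 50, string $terminator = ' …'): string
{
    $encoding = 'UTF-8'; // assumption: all input is UTF-8

    // Step 1: subtract the terminator length from the maximum.
    $limit = max(0, $chars - mb_strlen($terminator, $encoding));

    // Step 2: return strings that are not longer than the limit unaltered.
    if (mb_strlen($string, $encoding) <= $limit) {
        return $string;
    }

    // Step 3: look for the last space within the first $limit + 1 characters.
    $window = mb_substr($string, 0, $limit + 1, $encoding);
    $lastSpace = mb_strrpos($window, ' ', 0, $encoding);

    // Step 4: cut at that space, or at the limit if no space was found.
    $cut = ($lastSpace === false) ? $limit : $lastSpace;

    // Steps 5 and 6: append the terminator and return.
    return mb_substr($string, 0, $cut, $encoding) . $terminator;
}
```

Note that this follows the list literally, so a string that is longer than the calculated limit but still within $chars is shortened as well.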

php string truncate multibyte
4 Answers
3
votes
Try this:

    function truncate($string, $chars = 50, $terminator = ' …') {
        $cutPos = $chars - mb_strlen($terminator);
        $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
        return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
    }

But you need to make sure that your internal encoding is set properly.
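The mb_* functions fall back to the internal encoding whenever no encoding argument is passed. A minimal sketch of the two usual ways to pin it down, assuming the input is UTF-8 (the variable names are only for illustration):

```php
<?php
// Assumption: all input strings are UTF-8 encoded.
mb_internal_encoding('UTF-8');        // option 1: set the default once

$s = 'âãäå';
$chars = mb_strlen($s);               // uses the internal encoding set above
$sameChars = mb_strlen($s, 'UTF-8');  // option 2: pass the encoding on each call
$bytes = strlen($s);                  // byte count, shown for contrast
```

Passing the encoding explicitly keeps a function independent of global configuration, at the cost of a little repetition.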

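5
votes
Just found out that PHP already has a multibyte truncate with mb_strimwidth.

It doesn't obey word boundaries, though. But handy nonetheless!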
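A short usage sketch of mb_strimwidth, assuming UTF-8 input. One thing to keep in mind is that the function works on display width rather than character count, so wide East Asian characters count as two columns; the sample string and variable names below are only illustrative.

```php
<?php
$in = 'Answer to the Ultimate Question of Life, the Universe, and Everything.';

// The width of 50 includes the trim marker; the cut ignores word boundaries.
$out = mb_strimwidth($in, 0, 50, ' …', 'UTF-8');

// Width is not the same as character count for wide characters.
$wide = '日本語';
$length = mb_strlen($wide, 'UTF-8');   // number of characters
$width = mb_strwidth($wide, 'UTF-8');  // number of display columns
```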

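0
votes
I normally don't like to code up a complete answer to a question like this. But I also just woke up, and I thought maybe your question would put me in a good mood to keep programming for the rest of the day.

I did not try to run this, but it should work, or at least get you 90% of the way there.

    function truncate( $string, $chars = 50, $terminate = ' ...' ) {
        $chars -= mb_strlen($terminate);
        if ( $chars <= 0 )
            return $terminate;

        $string = mb_substr($string, 0, $chars);
        $space = mb_strrpos($string, ' ');

        if ($space < mb_strlen($string) / 2)
            return $string . $terminate;
        else
            return mb_substr($string, 0, $space) . $terminate;
    }
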
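0
votes
tldr;

  • Strings that are short enough should not have an ellipsis appended.
  • Newline characters should also be qualifying breakpoints.
  • Regex, once broken down and explained, isn't too scary.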

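I think there are a few important things to point out regarding this question and the current set of answers. I will compare the other answers with my regex answer, using Gordon's sample data and some additional cases, to expose some differing results.

First, to clarify the quality of the input values. Gordon says that the function must be multibyte safe and honor word boundaries. The sample data does not show how non-space, non-word characters (e.g. punctuation) should be treated when determining the truncation position, so we must assume that targeting whitespace characters is sufficient. That is sensible, because most "read more" strings do not worry about respecting punctuation when truncating.

Second, it is fairly common to have to apply an ellipsis to large bodies of text that contain newline characters.

Third, let's just arbitrarily agree on some basic standardization of the data: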

  • Strings are already trimmed of all leading/trailing whitespace characters
  • The value of $chars will always be greater than the mb_strlen() of $terminator
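The assumptions above could be enforced before truncating. The helper below is hypothetical and not part of any answer; it is a sketch that assumes UTF-8 input and uses trim(), which only strips ASCII whitespace.

```php
<?php
// Hypothetical helper: makes the two stated assumptions explicit.
function normalizeForTruncate(string $string, int $chars, string $terminator): string
{
    // Assumption 2: $chars must be greater than the terminator length.
    if ($chars <= mb_strlen($terminator, 'UTF-8')) {
        throw new InvalidArgumentException('$chars must exceed the terminator length');
    }

    // Assumption 1: no leading/trailing whitespace (ASCII whitespace only).
    return trim($string);
}
```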

Functions:

    function truncateGumbo($string, $chars = 50, $terminator = ' …') {
        $cutPos = $chars - mb_strlen($terminator);
        $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
        return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
    }

    function truncateGordon($string, $chars = 50, $terminator = ' …') {
        return mb_strimwidth($string, 0, $chars, $terminator);
    }

    function truncateSoapBox($string, $chars = 50, $terminate = ' …') {
        $chars -= mb_strlen($terminate);
        if ( $chars <= 0 )
            return $terminate;

        $string = mb_substr($string, 0, $chars);
        $space = mb_strrpos($string, ' ');

        if ($space < mb_strlen($string) / 2)
            return $string . $terminate;
        else
            return mb_substr($string, 0, $space) . $terminate;
    }

    function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
        $trunc = $max - mb_strlen($terminator, 'UTF-8');
        return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
    }

Test cases:

    $tests = [
        [
            'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
            // 50th char ---------------------------------------------------^
            'expected' => "Answer to the Ultimate Question of Life, the …",
        ],
        [
            'testCase' => "A single line of text to be followed by another\nline of text",
            // 50th char ----------------------------------------------------^
            'expected' => "A single line of text to be followed by another …",
        ],
        [
            'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
            // 50th char ---------------------------------------------------^
            'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …",
        ],
        [
            'testCase' => "123456789 123456789 123456789 123456789 123456789",
            // 50th char doesn't exist -------------------------------------^
            'expected' => "1234567890123456789012345678901234567890123456789",
        ],
        [
            'testCase' => "Hello worldly world",
            // 50th char doesn't exist -------------------------------------^
            'expected' => "Hello worldly world",
        ],
        [
            'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
            // 50th char ---------------------------------------------------^
            'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …",
        ],
    ];

Execution:

    foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
        echo "\tSample Input:\t\t$testCase\n";
        echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
        echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
        echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
        echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
        echo "\n\tExpected Result:\t{$expected}";
        echo "\n-----------------------------------------------------\n";
    }

Output:

    Sample Input:        Answer to the Ultimate Question of Life, the Universe, and Everything.

    truncateGumbo:       Answer to the Ultimate Question of Life, the …
    truncateGordon:      Answer to the Ultimate Question of Life, the Uni …
    truncateSoapBox:     Answer to the Ultimate Question of Life, the …
    truncateMickmackusa: Answer to the Ultimate Question of Life, the …
    Expected Result:     Answer to the Ultimate Question of Life, the …
    -----------------------------------------------------
    Sample Input:        A single line of text to be followed by another line of text

    truncateGumbo:       A single line of text to be followed by …
    truncateGordon:      A single line of text to be followed by another …
    truncateSoapBox:     A single line of text to be followed by …
    truncateMickmackusa: A single line of text to be followed by another …
    Expected Result:     A single line of text to be followed by another …
    -----------------------------------------------------
    Sample Input:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ

    truncateGumbo:       âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateGordon:      âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateSoapBox:     âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateMickmackusa: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    Expected Result:     âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    -----------------------------------------------------
    Sample Input:        123456789 123456789 123456789 123456789 123456789

    truncateGumbo:       123456789 123456789 123456789 123456789 12345678 …
    truncateGordon:      123456789 123456789 123456789 123456789 123456789
    truncateSoapBox:     123456789 123456789 123456789 123456789 …
    truncateMickmackusa: 123456789 123456789 123456789 123456789 123456789
    Expected Result:     1234567890123456789012345678901234567890123456789
    -----------------------------------------------------
    Sample Input:        Hello worldly world

    truncateGumbo:
    Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
    Hello worldly world …
    truncateGordon:      Hello worldly world
    truncateSoapBox:     Hello worldly …
    truncateMickmackusa: Hello worldly world
    Expected Result:     Hello worldly world
    -----------------------------------------------------
    Sample Input:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890

    truncateGumbo:       abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateGordon:      abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateSoapBox:     abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateMickmackusa: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    Expected Result:     abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    -----------------------------------------------------

Explanation of my pattern:

Although it admittedly looks rather unsightly, most of the garbled pattern syntax comes from inserting the numeric values as dynamic quantifiers.

I could also have written it as:

    '~(?:\S{' . $trunc . '}|(?=.{' . $max . '}).{0,' . $trunc . '}(?=\s))\K.+~us'

For simplicity, I will replace $trunc with 48 and $max with 50.
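To make the pieces easier to read, the pattern from truncateMickmackusa can be written in PCRE's extended mode (the x modifier) with inline comments. This is a sketch with the values 48 and 50 hard-coded; the function name truncateVerbose is made up, and the comments are one reading of the pattern rather than the answer author's own breakdown.

```php
<?php
// Hypothetical name; same pattern as truncateMickmackusa with 48 and 50 filled in.
function truncateVerbose(string $string, string $terminator = ' …'): string
{
    $pattern = '~
        (?=.{50})            # only act if at least 50 characters remain
        (?:
            \S{48}           # either 48 consecutive non-whitespace characters
          |
            .{0,48}(?=\s)    # or up to 48 characters followed by whitespace
        )
        \K                   # keep everything matched so far out of the replacement
        .+                   # the rest of the string is what gets replaced
    ~usx';

    // u: treat pattern and subject as UTF-8; s: let the dot match newlines;
    // x: ignore pattern whitespace and allow the comments above.
    return preg_replace($pattern, $terminator, $string);
}
```

Because the leading lookahead fails on anything shorter than 50 characters, short strings pass through without an ellipsis, and the \s lookahead lets a newline act as a breakpoint just like a space.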
