Detecting the EOL type using PHP

Question

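Note: this is a self-answered question, posted Q&A-style to share knowledge.

How can I detect the type of end-of-line character in PHP?

PS: I have been writing this code from scratch for too long now, so I decided to share it on SO. Besides, I am sure someone will find ways to improve it.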

Answer 1

/**
 * Detects the end-of-line character of a string.
 * @param string $str The string to check.
 * @param string $default Default EOL (if not detected).
 * @return string The detected EOL, or default one.
 */
function detectEol($str, $default=''){
    // Note: "\0x0D0A" in a PHP string is a NUL byte followed by literal text,
    // so the sequences are written with real escapes here.
    // Multi-character sequences come first: on a tie the earlier entry wins.
    // The [UTF-8] entries assume $str is UTF-8 encoded.
    static $eols = array(
        "\r\n",         // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\n\r",         // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\n",           // [ASCII] LF (U+000A): Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\r",           // [ASCII] CR (U+000D): Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\x0B",         // [ASCII] VT: Vertical Tab, U+000B
        "\x0C",         // [ASCII] FF: Form Feed, U+000C
        "\xC2\x85",     // [UTF-8] NEL: Next Line, U+0085
        "\xE2\x80\xA8", // [UTF-8] LS: Line Separator, U+2028
        "\xE2\x80\xA9", // [UTF-8] PS: Paragraph Separator, U+2029
        "\x1E",         // [ASCII] RS: QNX (pre-POSIX)
        //"\x76",       // [?????] NEWLINE: ZX80, ZX81 [DEPRECATED]
        "\x15",         // [EBCDIC] NEL: OS/390, OS/400
    );
    $cur_cnt = 0;
    $cur_eol = $default;
    foreach($eols as $eol){
        if(($count = substr_count($str, $eol)) > $cur_cnt){
            $cur_cnt = $count;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}

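Notes:

It needs to check the encoding type.
It needs to know somehow that we may be on an exotic system such as a ZX8x (because ASCII x76 is a regular letter). @radu raised a good point: for my purposes, it is not worth the effort to handle ZX8x systems properly.
Should I split the function in two? mb_detect_eol() (multibyte) and detect_eol()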
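
The counting loop in detectEol() only replaces the current candidate when a later one has a strictly greater count, so the order of the candidates decides ties. The following is a minimal sketch of that rule, reduced to three candidates; the sample text and the variable names are made up for illustration.

```php
<?php
// Made-up sample text using CR+LF line endings.
$sample = "one\r\ntwo\r\nthree\r\n";

// Each candidate is counted independently; CR+LF text also contains
// the same number of bare "\n" and "\r" characters.
$counts = array(
    'CRLF' => substr_count($sample, "\r\n"),
    'LF'   => substr_count($sample, "\n"),
    'CR'   => substr_count($sample, "\r"),
);

// Same selection rule as the loop in detectEol(): a later candidate
// wins only if its count is strictly greater than the best so far.
$best = '';
$bestCount = 0;
foreach ($counts as $name => $count) {
    if ($count > $bestCount) {
        $bestCount = $count;
        $best = $name;
    }
}

echo $best, PHP_EOL;
```

All three counts are equal here, so the first candidate (CR+LF) is kept. If the single-character candidates were listed first, the same text would be reported as LF.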

Answer 2

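Wouldn't it be easier to just replace everything except new lines using regex?

The dot matches a single character, without caring what that character is. The only exception is newline characters.

With that in mind, we do some magic: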

$string = 'some string with new lines';
$newlines = preg_replace('/.*/', '', $string);
// $newlines is now filled with new lines, we only need one
$newline = substr($newlines, 0, 1);

I am not sure whether we can trust regex to do all this, but I have nothing to test it with.

Answer 3

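The answers already given here provide enough information. The following code (based on the answers already given) may help even more:

It provides a reference to the EOL that was found.

The detection also sets a key that an application can use to refer to that EOL.

It shows how to use the reference in a utility class.

It shows how to detect the EOL of a file, returning the key name of the EOL found.

I hope this is of use to all of you.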

/**
Newline characters in different Operating Systems
The names given to the different sequences are:
============================================================================================
NewL  Chars       Name     Description
----- ----------- -------- ------------------------------------------------------------------
LF    0x0A        UNIX     Apple OSX, UNIX, Linux
CR    0x0D        TRS80    Commodore, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, etc
LFCR  0x0A 0x0D   ACORN    Acorn BBC and RISC OS spooled text output.
CRLF  0x0D 0x0A   WINDOWS  Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix
                          and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), OS/2,
----- ----------- -------- ------------------------------------------------------------------
*/
const EOL_UNIX    = 'lf';        // Code: \n
const EOL_TRS80   = 'cr';        // Code: \r
const EOL_ACORN   = 'lfcr';      // Code: \n \r
const EOL_WINDOWS = 'crlf';      // Code: \r \n

Then use the following code in a static class (Util in this example) to perform the detection:

/**
Detects the end-of-line character of a string.
@param string $str      The string to check.
@param string $key      [io] Name of the detected eol key.
@return string The detected EOL, or default one.
*/
public static function detectEOL($str, &$key) {
   static $eols = array(
     Util::EOL_ACORN   => "\n\r",  // 0x0A - 0x0D - acorn BBC
     Util::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
     Util::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
     Util::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
  );

  $key = "";
  $curCount = 0;
  $curEol = '';
  foreach($eols as $k => $eol) {
     if( ($count = substr_count($str, $eol)) > $curCount) {
        $curCount = $count;
        $curEol = $eol;
        $key = $k;
     }
  }
  return $curEol;
}  // detectEOL

Then, for a file:

/**
Detects the EOL of an file by checking the first line.
@param string  $fileName    File to be tested (full pathname).
@return boolean|string false on failure, otherwise the detected key: 'cr', 'lf', 'lfcr' or 'crlf' (empty string if none is found).
@uses detectEOL
*/
public static function detectFileEOL($fileName) {
   if (!file_exists($fileName)) {
     return false;
   }

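   // Read the first line, including its terminator
   $handle = @fopen($fileName, "r");
   if ($handle === false) {
      return false;
   }
   $line = fgets($handle);
   fclose($handle);
   if ($line === false) {
      return false;  // empty or unreadable file
   }
   $key = "";
   <Your-Class-Name>::detectEOL($line, $key);

   return $key;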
}  // detectFileEOL

Change <Your-Class-Name> to the name of your implementing class (all members are static).
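
To show how the pieces of this answer fit together, here is a condensed sketch of one possible arrangement. The class name Util follows the constants used in the detection code; the temporary file and its content are assumptions made only for this demo, and the array is a plain local variable rather than a static one.

```php
<?php
// "Util" is a placeholder class name; any class holding these static members works.
final class Util
{
    const EOL_UNIX    = 'lf';
    const EOL_TRS80   = 'cr';
    const EOL_ACORN   = 'lfcr';
    const EOL_WINDOWS = 'crlf';

    // Returns the most frequent EOL sequence in $str and stores its key name in $key.
    public static function detectEOL($str, &$key)
    {
        // Two-character sequences are listed first so that they win ties.
        $eols = array(
            self::EOL_ACORN   => "\n\r",
            self::EOL_WINDOWS => "\r\n",
            self::EOL_UNIX    => "\n",
            self::EOL_TRS80   => "\r",
        );
        $key = '';
        $curCount = 0;
        $curEol = '';
        foreach ($eols as $k => $eol) {
            $count = substr_count($str, $eol);
            if ($count > $curCount) {
                $curCount = $count;
                $curEol = $eol;
                $key = $k;
            }
        }
        return $curEol;
    }

    // Returns the key name of the EOL found in the first line, or false on failure.
    public static function detectFileEOL($fileName)
    {
        $handle = @fopen($fileName, 'r');
        if ($handle === false) {
            return false;
        }
        $line = fgets($handle);
        fclose($handle);
        if ($line === false) {
            return false;
        }
        $key = '';
        self::detectEOL($line, $key);
        return $key;
    }
}

// Demo: write a temporary file with CR+LF endings and detect its key.
$tmp = tempnam(sys_get_temp_dir(), 'eol');
file_put_contents($tmp, "first\r\nsecond\r\n");
$detected = Util::detectFileEOL($tmp);
unlink($tmp);

echo $detected, PHP_EOL;
```

Because only the first line is inspected, a file whose first line is not representative of the rest may be reported incorrectly.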

Answer 4

My answer, since I could get neither ohaal's nor transilvlad's to work, is:

function detect_newline_type($content) {
    $arr = array_count_values(
               explode(
                   ' ',
                   preg_replace(
                       '/[^\r\n]*(\r\n|\n|\r)/',
                       '\1 ',
                       $content
                   )
               )
           );
    arsort($arr);
    return key($arr);
}

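Explanation:

The general idea in both proposed solutions is good, but implementation details limit the usefulness of those answers.

The point of this function is to return the kind of newline used in a file, and that newline can be either one or two characters long.

This alone makes the use of str_split() incorrect. The only way to cut the tokens correctly is to use a function that splits a string into pieces of variable length, based on character detection. That is where explode() comes in.

But to give explode() useful markers, it is necessary to replace the right characters, in the right amount, with the right match. Most of the magic happens in the regular expression.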
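
The intermediate values of detect_newline_type() can be inspected step by step. This is a small sketch using the same calls as the function above; the sample content and the variable names are made up for illustration.

```php
<?php
// Made-up content: two CR+LF line endings followed by one LF.
$content = "a\r\nb\r\nc\n";

// Each line plus its terminator is replaced by the terminator and a space,
// so the space becomes the separator for explode().
$marked = preg_replace('/[^\r\n]*(\r\n|\n|\r)/', '\1 ', $content);
$tokens = explode(' ', $marked);

// Count how often each token occurs and sort by count, highest first.
$counts = array_count_values($tokens);
arsort($counts);

// bin2hex() makes the otherwise invisible characters readable.
echo bin2hex(key($counts)), PHP_EOL;
```

For this sample the tokens are two "\r\n", one "\n" and one trailing empty string, so the most frequent token is CR+LF. Note that text after the last newline, and any spaces in it, also end up as tokens.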

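Three points must be considered:

Using .* as suggested by ohaal will not work. While it is true that . will not match newline characters, on a system where \r is not a newline character, or part of one, . will incorrectly match it (reminder: we are detecting newlines because they may differ from the ones on our system; otherwise there is no point).
Replacing /[^\r\n]*/ with anything will "work" to make the text vanish, but it becomes a problem as soon as we want a separator (since we remove all characters except the newlines, any character that is not a newline would be a valid separator). Hence the idea of creating a match with the newline and using a backreference to that match in the replacement.
The content may contain several newlines in a row. However, we do not want to group them in that case, since the rest of the code would see them as different types of newline. That is why the list of newlines is stated explicitly in the match for the backreference.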
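
The first point can be checked with a short experiment. This sketch assumes PCRE's default newline convention (LF only), which is how PHP is typically built; the sample text and the variable names are made up for illustration.

```php
<?php
// Made-up sample text with CR+LF line endings.
$text = "first\r\nsecond\r\n";

// With default options "." excludes only LF, so each CR is removed
// together with the text in front of it.
$withDot = preg_replace('/.*/', '', $text);

// Excluding both CR and LF explicitly keeps the full two-character sequence.
$withClass = preg_replace('/[^\r\n]*/', '', $text);

// bin2hex() makes the otherwise invisible characters readable.
echo bin2hex($withDot), PHP_EOL;
echo bin2hex($withClass), PHP_EOL;
```

Under that assumption the first result contains only LF characters (0a0a), while the second keeps both CR+LF pairs (0d0a0d0a).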

Answer 5

Based on ohaal's answer.

This can return a one- or two-character EOL such as LF or CR+LF.

  $eols = array_count_values(str_split(preg_replace("/[^\r\n]/", "", $string)));
  $eola = array_keys($eols, max($eols));
  $eol = implode("", $eola);
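
As a usage sketch, the three lines above can be wrapped in a function and tried on two samples. The function name detect_eol_chars() and the guard for input without any CR or LF are additions for this demo, not part of the answer.

```php
<?php
// detect_eol_chars() is a hypothetical wrapper name, not part of the answer.
function detect_eol_chars($string)
{
    // Keep only CR and LF characters.
    $stripped = preg_replace("/[^\r\n]/", "", $string);
    if ($stripped === null || $stripped === '') {
        return '';  // nothing to count: no CR or LF in the input
    }
    // Count each character, then join every character that reaches the maximum.
    $eols = array_count_values(str_split($stripped));
    $eola = array_keys($eols, max($eols));
    return implode("", $eola);
}

$crlf = detect_eol_chars("one\r\ntwo\r\n");
$lf   = detect_eol_chars("one\ntwo\n");

// bin2hex() makes the otherwise invisible characters readable.
echo bin2hex($crlf), ' ', bin2hex($lf), PHP_EOL;
```

The two-character result relies on CR and LF occurring equally often, and the characters are joined in order of first appearance, so input that mixes line-ending styles may be misreported.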
