Extracting tags from TMX (XML) content with PHP


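I am building a browser-based TMX (translation memory) editor. My content-extraction script breaks when the source and/or target segment contains tags. Source/target strings that contain tags look like this: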

<tuv xml:lang="BG">
  <seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
</tuv>
<tuv xml:lang="EN-GB">
  <seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
</tuv>

When the source/target segments contain no tags, extraction is easy:

$sourceText = $TU['tuv'][0]['seg'];
$targetText = $TU['tuv'][1]['seg'];

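But when tags are present (here, at the start of the segment), I am stuck, mainly because a segment that contains tags comes back as an array instead of a string.

I don't know how to proceed: I can check whether the source/target value is an array, but I'm not sure what to do next. Ultimately, I need to print the text together with its tags, so that the user can edit the text and move or delete tags where necessary.

Here is my test code: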

$uploadedFile = "user_files/sample_tmx.tmx";

$xmlStr = file_get_contents($uploadedFile);
$xmlObj = simplexml_load_string($xmlStr);
$arrXml = $util->objectsIntoArray($xmlObj);
$TUs = $arrXml['body']['tu'];

$pattern = '~<.*?>~';

foreach($TUs as $TU) {

    if(is_array($TU['tuv'][0]['seg'])) {

        var_dump($TU['tuv'][0]['seg']);

        $pattern = '~<.*?>~';
        $uncleanSource = $TU['tuv'][0]['seg'];
        $uncleanTarget = $TU['tuv'][1]['seg'];

        foreach($uncleanSource as $unclean) {
            //var_dump($unclean);
        }

        //$sourceText = preg_replace($pattern, "&lt;TAG&gt;", $uncleanSource);
        //$targetText = preg_replace($pattern, "&lt;TAG&gt;", $uncleanTarget);
    }
    else {
        $sourceText = $TU['tuv'][0]['seg'];
        $targetText = $TU['tuv'][1]['seg'];
    }

    echo "<p>".$sourceText." = ".$targetText."</p>";
}

Here is the content of sample_tmx.tmx:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd" >
<tmx version="1.4">
<header creationtool="TDC Analysis Package" creationtoolversion="org.gs4tr.tm3.tmx.Version" segtype="sentence" o-tmf="unknown" adminlang="EN-US" srclang="BG" datatype="unknown" creationdate="20221006T184234Z" >
</header>
<body>
<tu creationdate="20201101T133734Z" creationid="Cheeseus" changedate="20201103T151745Z" changeid="Cheeseus" usagecount="0">
    <prop type="nextMd5Checksum">e82d0ed6d711aa59310d1e8f4478537e</prop>
    <prop type="previousMd5Checksum">39b279e324e9f6cd27351287502eefcb</prop>
    <tuv xml:lang="BG">
      <seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
    </tuv>
    <tuv xml:lang="EN-GB">
      <seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
    </tuv>
  </tu>
  <tu creationdate="20080812T111221Z" creationid="Cheeseus" changedate="20190825T065920Z" changeid="Cheeseus" usagecount="0">
    <tuv xml:lang="BG">
      <seg>ПАРТНЬОРИ</seg>
    </tuv>
    <tuv xml:lang="EN-GB">
      <seg>PARTNERS</seg>
    </tuv>
  </tu>
</body>
</tmx>
php xml-parsing
1 Answer

The problem seems to come from objectsIntoArray(), and without seeing its code it is hard to fix.

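If you drop that call and use the SimpleXML elements the way they are meant to be used, you can do the following (the rest of the code is omitted to focus on this issue):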

$xmlObj = simplexml_load_file($uploadedFile);

foreach($xmlObj->body->tu as $TU) {
    if(isset($TU->tuv[0]->seg)) {
        echo $TU->tuv[0]->seg->asXML();
    }
}

The asXML method reproduces the content as is, serialising all child elements back into markup. This gives

<seg>
                    <bpt x="1" i="1" type="italic"/>
Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/>
                </seg>

<seg>ПАРТНЬОРИ</seg>

You may need to do more work to get exactly the result you want, but hopefully this shows how it can be done.
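One possible way to do that extra work is sketched below. It is not part of the original answer: segInnerXml() is a hypothetical helper name, and the inline sample stands in for the uploaded file. The idea is to view the same <seg> node through the DOM API and serialise each child node, which should give the text and inline tags without the surrounding <seg> wrapper. The result is passed through htmlspecialchars() so the tags show up as visible text in the browser.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// returns the inner XML of a <seg> element, i.e. its text and inline
// tags, without the <seg>...</seg> wrapper itself.
function segInnerXml(SimpleXMLElement $seg): string
{
    // dom_import_simplexml() exposes the same node through the DOM API.
    $node = dom_import_simplexml($seg);
    $inner = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serialises text nodes and elements alike.
        $inner .= $node->ownerDocument->saveXML($child);
    }
    return $inner;
}

// Small inline sample in place of simplexml_load_file($uploadedFile).
$xmlStr = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
<body>
  <tu>
    <tuv xml:lang="EN-GB"><seg><bpt x="1" i="1" type="italic"/>Source text.<ept i="1"/></seg></tuv>
    <tuv xml:lang="EN-US"><seg><bpt x="1" i="1" type="italic"/>Target text.<ept i="1"/></seg></tuv>
  </tu>
  <tu>
    <tuv xml:lang="EN-GB"><seg>PARTNERS</seg></tuv>
    <tuv xml:lang="EN-US"><seg>PARTNERS</seg></tuv>
  </tu>
</body>
</tmx>
XML;

$xmlObj = simplexml_load_string($xmlStr);

foreach ($xmlObj->body->tu as $TU) {
    $sourceText = segInnerXml($TU->tuv[0]->seg);
    $targetText = segInnerXml($TU->tuv[1]->seg);
    // Escape for HTML so that <bpt .../> and <ept .../> are shown, not swallowed.
    echo '<p>' . htmlspecialchars($sourceText) . ' = ' . htmlspecialchars($targetText) . "</p>\n";
}
```

With this approach the same code path handles segments with and without tags, so the is_array() check from the question is no longer needed. How the tags are then presented for editing (for example as placeholders the user can move or delete) is a separate design decision.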
