使用 MSXML 以 UTF-8 格式保存 XML

问题描述 投票:0回答:4

我正在尝试加载一个简单的 Xml 文件(以 UTF-8 编码):

<?xml version="1.0" encoding="UTF-8"?>
<Test/>

并用vbscript中的MSXML保存:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")

xmlDoc.Load("C:\test.xml")

xmlDoc.Save "C:\test.xml" 

问题是,MSXML 以 ANSI 而不是 UTF-8 保存文件(尽管原始文件是用 UTF-8 编码的)。

MSXML 的 MSDN 文档 表示 save() 将以定义 XML 的任何编码写入文件:

字符编码基于XML声明中的encoding属性,例如.当未指定编码属性时,默认设置为 UTF-8。

但这显然至少在我的机器上不起作用。

MSXML如何保存为UTF-8?

xml localization vbscript utf-8 msxml
4个回答
3
投票

您的 XML 文件中没有任何非 ANSI 文本,因此无论是 UTF-8 还是 ASCII 编码,它都是相同的。在我的测试中,将非 ASCII 文本添加到 test.xml 后,MSXML 始终以 UTF-8 编码保存,并且如果一开始就有 BOM,也会写入 BOM。

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/Byte_order_mark


3
投票

您可以使用 MSXML 中的另外两个类将正确编码的 XML 写出到输出流。

这是我写入通用 IStream 的辅助方法:

class procedure TXMLHelper.WriteDocumentToStream(const Document60: IXMLDOMDocument2; const stream: IStream; Encoding: string = 'UTF-8');
var
    writer: IMXWriter;
    reader: IVBSAXXMLReader;
begin
{
    From http://support.microsoft.com/kb/275883
    INFO: XML Encoding and DOM Interface Methods

    MSXML has native support for the following encodings:
        UTF-8
        UTF-16
        UCS-2
        UCS-4
        ISO-10646-UCS-2
        UNICODE-1-1-UTF-8
        UNICODE-2-0-UTF-16
        UNICODE-2-0-UTF-8

    It also recognizes (internally using the WideCharToMultibyte API function for mappings) the following encodings:
        US-ASCII
        ISO-8859-1
        ISO-8859-2
        ISO-8859-3
        ISO-8859-4
        ISO-8859-5
        ISO-8859-6
        ISO-8859-7
        ISO-8859-8
        ISO-8859-9
        WINDOWS-1250
        WINDOWS-1251
        WINDOWS-1252
        WINDOWS-1253
        WINDOWS-1254
        WINDOWS-1255
        WINDOWS-1256
        WINDOWS-1257
        WINDOWS-1258
}

    if Document60 = nil then
        raise Exception.Create('TXMLHelper.WriteDocument: Document60 cannot be nil');
    if stream = nil then
        raise Exception.Create('TXMLHelper.WriteDocument: stream cannot be nil');

    // Set properties on the XML writer - including BOM, XML declaration and encoding
    writer := CoMXXMLWriter60.Create;
    writer.byteOrderMark := True; //Determines whether to write the Byte Order Mark (BOM). The byteOrderMark property has no effect for BSTR or DOM output. (Default True)
    writer.omitXMLDeclaration := False; //Forces the IMXWriter to skip the XML declaration. Useful for creating document fragments. (Default False)
    writer.encoding := Encoding; //Sets and gets encoding for the output. (Default "UTF-16")
    writer.indent := True; //Sets whether to indent output. (Default False)
    writer.standalone := True;

    // Set the XML writer to the SAX content handler.
    reader := CoSAXXMLReader60.Create;
    reader.contentHandler := writer as IVBSAXContentHandler;
    reader.dtdHandler := writer as IVBSAXDTDHandler;
    reader.errorHandler := writer as IVBSAXErrorHandler;
    reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);
    reader.putProperty('http://xml.org/sax/properties/declaration-handler', writer);


    writer.output := stream; //The resulting document will be written into the provided IStream

    // Now pass the DOM through the SAX handler, and it will call the writer
    reader.parse(Document60);

    writer.flush;
end;

为了保存到文件,我使用 FileStream 调用 Stream 版本:

class procedure TXMLHelper.WriteDocumentToFile(const Document60: IXMLDOMDocument2; const filename: string; Encoding: string='UTF-8');
var
    fs: TFileStream;
begin
    fs := TFileStream.Create(filename, fmCreate or fmShareDenyWrite);
    try
        TXMLHelper.WriteDocumentToStream(Document60, fs, Encoding);
    finally
        fs.Free;
    end;
end;

您可以将函数转换为您喜欢的任何语言。这些是德尔福。


2
投票

执行

load
时,msxml 不会将处理指令中的 encoding 复制到创建的文档中。所以它不包含任何编码,看起来 msxml 选择了它喜欢的东西。在我的环境中,它是 UTF-16,我不喜欢它。

解决方案是提供处理指令并在那里指定编码。如果您知道该文档没有处理指令,则代码很简单:

Set pi = xmlDoc.createProcessingInstruction("xml", _
         "version=""1.0"" encoding=""windows-1250""")
If xmlDoc.childNodes.Length > 0 Then
  Call xmlDoc.insertBefore(pi, xmlDoc.childNodes.Item(0))
End If

如果文档中可能包含其他处理指令,则必须先将其删除(因此下面的代码必须位于上面的代码之前)。我不知道如何使用

selectNode
来做到这一点,所以我只是迭代所有根节点:

For ich=xmlDoc.childNodes.Length-1 to 0 step -1
  Set ch = xmlDoc.childNodes.Item(ich)
  If ch.NodeTypeString = "processinginstruction" and ch.NodeName = "xml" Then
    xmlDoc.removeChild(ch)
  End If
Next ich

抱歉,如果代码不能直接执行,因为我修改了工作版本,这是用自定义的东西编写的,而不是vbscript。


0
投票

此解决方法似乎有效:将 xml 字符串传递到 ADODB.Stream 以保存文件。

更新:处理指令根据此丢失:

https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms755989(v=vs.85) xml 属性始终返回 Unicode 字符串。也就是说,DOMDocument 的 xml 属性将文档从其原始编码转换为 Unicode。这样一来,原来的编码属性就被移除了。

Sub Facturae()
On Error GoTo ExceptionHandling

Dim i As Integer

'Declare document objects
Dim xDoc As MSXML2.DOMDocument60
Dim xRoot As MSXML2.IXMLDOMElement

'Create new DOMDocument
Set xDoc = New DOMDocument60

''Add the XML declaration as a processing instruction:
Dim xmlDecl As MSXML2.IXMLDOMProcessingInstruction
Set xmlDecl = xDoc.createProcessingInstruction("xml", "version='1.0' encoding='UTF-8' standalone='yes'")
xDoc.appendChild xmlDecl

'Create the root element
Set xRoot = xDoc.createElement("fe:Facturae")
xDoc.appendChild xRoot

'The namespace declarations are attributes on the root element, so you can add them using:
xRoot.setAttribute "xmlns:ds", "http://www.w3.org/2000/09/xmldsig#"
xRoot.setAttribute "xmlns:fe", "http://www.facturae.es/Facturae/2009/v3.2/Facturae"
'xDoc.DocumentElement.setAttribute "xmlns:ds", "http://www.w3.org/2000/09/xmldsig#"
'xDoc.DocumentElement.setAttribute "xmlns:fe", "http://www.facturae.es/Facturae/2009/v3.2/Facturae"

'Add child to root
'Create security element
Dim objSecElem As MSXML2.IXMLDOMElement
Set objSecElem = xDoc.createElement("Security")
xRoot.appendChild objSecElem

Dim str(1 To 3) As String
str(1) = "A"
str(2) = "B"
str(3) = "C"

Dim objProp As IXMLDOMElement
For i = 1 To UBound(str)
    Set objProp = xDoc.createElement(str(i))
    objSecElem.appendChild objProp
    objProp.Text = i
Next i

Dim objStream As Stream, strData As String, sFilePath As String
'Debug.Print xDoc.XML
strData = xDoc.XML
sFilePath = ThisWorkbook.Path & "\my_file.xml"

Set objStream = CreateObject("ADODB.Stream")
objStream.Type = adTypeText
objStream.charset = "utf-8"
objStream.LineSeparator = adCRLF
objStream.Open
objStream.WriteText strData, adWriteChar
objStream.SaveToFile sFilePath, adSaveCreateOverWrite

'Save the XML file
'xDoc.Save sFilePath

CleanUp:
    On Error Resume Next
    objStream.Close
    Exit Sub
ExceptionHandling:
    MsgBox "Error: " & Err.Description
    Resume CleanUp
    'https://stackoverflow.com/questions/52205786/excel-vba-global-variables-are-assigned-when-workbook-is-opened-get-erased-if?rq=1
    'error handler
    'Note the unreachable Resume at the end of the error handler. It's unreachable because Resume ExitProcedure will always execute first.
    'So when you get the messagebox, you can use ctrl+break to break into code, which will then take you to the Resume ExitProcedure.
    'You can then drag the yellow arrow over to the unreachable Resume, press F8 to step once which will then take you back to the line that caused the error.
    Resume 'for debugging
End Sub
© www.soinside.com 2019 - 2024. All rights reserved.