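I have a C# application that uses OpenXML to read text from Word (.docx) files.
Typically, a document contains a set of paragraphs (p), each of which contains run elements (r). I can iterate over the run nodes using
    foreach (var run in para.Descendants<Run>())
    {
        ...
    }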
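One particular document contains the text "START", which is split into three parts: "ST", "AR", and "T". Each part is defined by a run node, but in two of the three cases the run node is wrapped in a smartTag node.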
    <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
      <w:r w:rsidRPr="00BF444F">
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
          <w:b/>
          <w:bCs/>
          <w:sz w:val="40"/>
          <w:szCs w:val="40"/>
        </w:rPr>
        <w:t>ST</w:t>
      </w:r>
    </w:smartTag>
    <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
      <w:r w:rsidRPr="00BF444F">
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
          <w:b/>
          <w:bCs/>
          <w:sz w:val="40"/>
          <w:szCs w:val="40"/>
        </w:rPr>
        <w:t>AR</w:t>
      </w:r>
    </w:smartTag>
    <w:r w:rsidRPr="00BF444F">
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
        <w:b/>
        <w:bCs/>
        <w:sz w:val="40"/>
        <w:szCs w:val="40"/>
      </w:rPr>
      <w:t xml:space="preserve">T</w:t>
    </w:r>
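As far as I can tell, OpenXML does not support the smartTag node. As a result, it only produces OpenXmlUnknownElement nodes for it.
What makes this difficult is that it also produces OpenXmlUnknownElement nodes for every descendant of the smartTag. That means I cannot simply take the first child node and cast it to Run.
Getting the text through the InnerText property is easy, but I also need the formatting information.
Is there a reasonably simple way to handle this?
At the moment, my best idea is to write a preprocessor that removes the smart tag nodes.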
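If read-only access to text and direct formatting is all that is needed, one option besides preprocessing is to query the part's XML with Linq to XML, where a w:smartTag wrapper does not hide the runs inside it. The following is only a minimal sketch under that assumption: RunFormatReader and ReadRuns are hypothetical names, only direct w:b and w:sz formatting is read (style inheritance is ignored), and Descendants would also pick up runs nested in other containers such as text boxes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class RunFormatReader
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns text, bold flag, and font size (in half-points) for every w:r
    // below the given w:p, including runs wrapped in w:smartTag.
    public static List<(string Text, bool Bold, int? HalfPoints)> ReadRuns(XElement paragraph)
    {
        return paragraph.Descendants(W + "r").Select(run =>
        {
            XElement rPr = run.Element(W + "rPr");
            XElement b = rPr?.Element(W + "b");

            // w:b without w:val means "on"; "0", "false", or "off" switch it off.
            string bVal = (string) b?.Attribute(W + "val");
            bool bold = b != null && bVal != "0" && bVal != "false" && bVal != "off";

            // w:sz stores the size in half-points, so 40 means 20 pt.
            string sz = (string) rPr?.Element(W + "sz")?.Attribute(W + "val");
            int? size = int.TryParse(sz, out int n) ? n : (int?) null;

            string text = string.Concat(run.Elements(W + "t").Select(t => t.Value));
            return (text, bold, size);
        }).ToList();
    }
}
```

The paragraph element would typically come from loading the main document part's stream into an XDocument and selecting its w:p descendants. For the fragment above, this sketch would report the runs "ST", "AR", and "T", each bold with a size of 40 half-points.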
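Edit
Following up on Cindy Meister's comment.
I am using OpenXml version 2.7.2. As Cindy pointed out, OpenXML 2.0 has a SmartTagRun class. I was not aware of that class.
I found the following information on the page What's new in the Open XML SDK 2.5 for Office:
Smart tags
Because smart tags were deprecated in Office 2010, the Open XML SDK 2.5 doesn't support smart tag related Open XML elements. The Open XML SDK 2.5 can still process smart tag elements as "unknown" elements; however, the Open XML SDK 2.5 Productivity Tool for Office validates those elements (see the following list) in Office document files as invalid tags.
So it sounds like a possible solution would be to use OpenXML 2.0.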
The solution is to use Linq to XML (or the System.Xml classes, if you prefer) to remove the w:smartTag elements, as shown in the following code:
using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using Xunit;

public class SmartTagTests
{
    private const string Xml =
        @"<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
    <w:body>
        <w:p>
            <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
                <w:r w:rsidRPr=""00BF444F"">
                    <w:rPr>
                        <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                        <w:b/>
                        <w:bCs/>
                        <w:sz w:val=""40""/>
                        <w:szCs w:val=""40""/>
                    </w:rPr>
                    <w:t>ST</w:t>
                </w:r>
            </w:smartTag>
            <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
                <w:r w:rsidRPr=""00BF444F"">
                    <w:rPr>
                        <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                        <w:b/>
                        <w:bCs/>
                        <w:sz w:val=""40""/>
                        <w:szCs w:val=""40""/>
                    </w:rPr>
                    <w:t>AR</w:t>
                </w:r>
            </w:smartTag>
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t xml:space=""preserve"">T</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>";

    [Fact]
    public void CanStripSmartTags()
    {
        // Say you have a WordprocessingDocument stored on a stream (e.g., read from a file).
        using Stream stream = CreateTestWordprocessingDocument();

        // Open the WordprocessingDocument and inspect it using the strongly typed classes.
        // This shows that OpenXmlUnknownElement instances are found and only a
        // single Run instance is recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            // Get the w:document as a strongly typed Document instance and demonstrate
            // that only one Run is recognized while the rest are unknown elements.
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;
            Assert.Single(document.Descendants<Run>());
            Assert.NotEmpty(document.Descendants<OpenXmlUnknownElement>());
        }

        // Now, open that WordprocessingDocument to make edits, using Linq to XML.
        // Do NOT use the strongly typed classes in this context.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
        {
            // Get the w:document as an XElement and demonstrate that this w:document contains
            // w:smartTag elements.
            MainDocumentPart part = wordDocument.MainDocumentPart;
            string xml = ReadString(part);
            XElement document = XElement.Parse(xml);
            Assert.NotEmpty(document.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Transform the w:document, stripping all w:smartTag elements, and demonstrate
            // that the transformed w:document no longer contains w:smartTag elements.
            var transformedDocument = (XElement) StripSmartTags(document);
            Assert.Empty(transformedDocument.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Write the transformed document back to the part.
            WriteString(part, transformedDocument.ToString(SaveOptions.DisableFormatting));
        }

        // Open the WordprocessingDocument again and inspect it using the strongly typed classes.
        // This demonstrates that all Run instances are now recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            // Get the w:document as a strongly typed Document instance and demonstrate
            // that the document now contains three Run instances and no unknown elements.
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;
            Assert.Equal(3, document.Descendants<Run>().Count());
            Assert.Empty(document.Descendants<OpenXmlUnknownElement>());
        }
    }

    /// <summary>
    /// Recursive, pure functional transform that removes all w:smartTag elements.
    /// </summary>
    /// <param name="node">The <see cref="XNode" /> to be transformed.</param>
    /// <returns>The transformed <see cref="XNode" />.</returns>
    private static object StripSmartTags(XNode node)
    {
        if (!(node is XElement element))
        {
            return node;
        }

        if (element.Name.LocalName == "smartTag")
        {
            return element.Elements();
        }

        return new XElement(element.Name, element.Attributes(),
            element.Nodes().Select(StripSmartTags));
    }

    private static Stream CreateTestWordprocessingDocument()
    {
        var stream = new MemoryStream();
        using var wordDocument = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document);
        MainDocumentPart part = wordDocument.AddMainDocumentPart();
        WriteString(part, Xml);
        return stream;
    }

    #region Generic Open XML Utilities

    private static string ReadString(OpenXmlPart part)
    {
        using Stream stream = part.GetStream(FileMode.Open, FileAccess.Read);
        using var streamReader = new StreamReader(stream);
        return streamReader.ReadToEnd();
    }

    private static void WriteString(OpenXmlPart part, string text)
    {
        using Stream stream = part.GetStream(FileMode.Create, FileAccess.Write);
        using var streamWriter = new StreamWriter(stream);
        streamWriter.Write(text);
    }

    #endregion
}
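Note that the transform above returns the children of a w:smartTag unchanged, so a smart tag nested inside another smart tag, or a w:smartTagPr properties child, would survive in the output. The test document has neither, but real documents might. The following variant is only a sketch of how that case could be handled with the same Linq to XML approach; the class and method names are made up for illustration, and it assumes w:smartTagPr can simply be dropped.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical variant of the transform shown above.
public static class SmartTagStripper
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static object Strip(XNode node)
    {
        // Text, comments, and other non-element nodes are passed through as-is.
        if (!(node is XElement element))
        {
            return node;
        }

        if (element.Name == W + "smartTag")
        {
            // Unwrap the smart tag: drop its w:smartTagPr (assumed to be disposable)
            // and keep transforming the remaining children, so that nested
            // w:smartTag elements are unwrapped as well.
            return element.Elements()
                .Where(child => child.Name != W + "smartTagPr")
                .Select(child => Strip(child));
        }

        // Rebuild every other element with its attributes and transformed children.
        return new XElement(element.Name, element.Attributes(),
            element.Nodes().Select(child => Strip(child)));
    }
}
```

It is used the same way as StripSmartTags: cast the result for the root element back to XElement, for example `var cleaned = (XElement) SmartTagStripper.Strip(document);`.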
You can also use PowerTools for Open XML, which provides a markup simplifier that directly supports removing w:smartTag elements.