Is it possible to extract plain text from a PDF file using PdfSharp? I don't want to use iTextSharp because of its license.
I took Sergio's answer and turned it into a pair of extension methods. I also changed the string accumulation into an iterator.
    using System.Collections.Generic;
    using PdfSharp.Pdf;
    using PdfSharp.Pdf.Content;
    using PdfSharp.Pdf.Content.Objects;

    public static class PdfSharpExtensions
    {
        // Reads the page's content stream and yields every string it shows.
        public static IEnumerable<string> ExtractText(this PdfPage page)
        {
            var content = ContentReader.ReadContent(page);
            var text = content.ExtractText();
            return text;
        }

        // Walks the content tree, yielding the operands of the Tj and TJ
        // (show text) operators.
        public static IEnumerable<string> ExtractText(this CObject cObject)
        {
            if (cObject is COperator)
            {
                var cOperator = cObject as COperator;
                if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                    cOperator.OpCode.Name == OpCodeName.TJ.ToString())
                {
                    foreach (var cOperand in cOperator.Operands)
                        foreach (var txt in ExtractText(cOperand))
                            yield return txt;
                }
            }
            else if (cObject is CSequence)
            {
                var cSequence = cObject as CSequence;
                foreach (var element in cSequence)
                    foreach (var txt in ExtractText(element))
                        yield return txt;
            }
            else if (cObject is CString)
            {
                var cString = cObject as CString;
                yield return cString.Value;
            }
        }
    }
I have implemented it in a way that is somewhat similar to David's. Here is my code:
    ...
    {
        // ....
        var page = document.Pages[1]; // Pages is zero-based, so this is the second page
        CObject content = ContentReader.ReadContent(page);
        var extractedText = ExtractText(content);
        // ...
    }
    private IEnumerable<string> ExtractText(CObject cObject)
    {
        var textList = new List<string>();
        if (cObject is COperator)
        {
            var cOperator = cObject as COperator;
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                {
                    textList.AddRange(ExtractText(cOperand));
                }
            }
        }
        else if (cObject is CSequence)
        {
            var cSequence = cObject as CSequence;
            foreach (var element in cSequence)
            {
                textList.AddRange(ExtractText(element));
            }
        }
        else if (cObject is CString)
        {
            var cString = cObject as CString;
            textList.Add(cString.Value);
        }
        return textList;
    }
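Using this approach, I recently worked out how to handle what you are all calling Unicode. It is not exactly Unicode, though; it is PdfEncoding. Embedded fonts cause the PDF to carry difference tables called CMaps. You have to store those tables and, for each PdfEncoding value in the extracted text, look it up in the CMap table and swap in the Unicode value you find there. That is how I turned the symbols into readable text, and it took me three weeks of learning the PDF file structure to get there. You also need SharpZipLib to inflate the compressed CMap tables.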
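For anyone who wants to see roughly what the CMap lookup could look like, here is a minimal sketch. It is not the code I described above and it is not part of PdfSharp: `ToUnicodeCMap`, `ParseBfChar` and `Decode` are names made up for illustration. It assumes the CMap stream has already been inflated to plain text, that it only contains `beginbfchar ... endbfchar` sections (ranges via `beginbfrange` are ignored), and that you already have the glyph codes of a string as integers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of PdfSharp.
public static class ToUnicodeCMap
{
    // Body of each "beginbfchar ... endbfchar" section.
    private static readonly Regex Section =
        new Regex(@"beginbfchar(.*?)endbfchar", RegexOptions.Singleline);

    // One "<sourceCode> <destinationUnicode>" pair, e.g. "<0003> <0048>".
    private static readonly Regex Pair =
        new Regex(@"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>");

    // Builds a glyph code -> Unicode string table from inflated CMap text.
    public static Dictionary<int, string> ParseBfChar(string cmapText)
    {
        var map = new Dictionary<int, string>();
        foreach (Match section in Section.Matches(cmapText))
        {
            foreach (Match pair in Pair.Matches(section.Groups[1].Value))
            {
                int code = int.Parse(pair.Groups[1].Value,
                    NumberStyles.HexNumber, CultureInfo.InvariantCulture);
                map[code] = HexToString(pair.Groups[2].Value);
            }
        }
        return map;
    }

    // The destination is UTF-16BE written as hex, 4 hex digits per code unit.
    private static string HexToString(string hex)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            sb.Append((char)int.Parse(hex.Substring(i, 4),
                NumberStyles.HexNumber, CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }

    // Swaps each glyph code for its mapped text. Assumption: codes that are
    // missing from the table are kept as-is (cast to char) instead of dropped.
    public static string Decode(Dictionary<int, string> map, IEnumerable<int> codes)
    {
        var sb = new StringBuilder();
        foreach (int code in codes)
        {
            string mapped;
            if (map.TryGetValue(code, out mapped))
                sb.Append(mapped);
            else
                sb.Append((char)code);
        }
        return sb.ToString();
    }
}
```

Used on a tiny hand-written table it would look something like this:

```csharp
var map = ToUnicodeCMap.ParseBfChar(
    "2 beginbfchar\n<0003> <0048>\n<0004> <0069>\nendbfchar");
string text = ToUnicodeCMap.Decode(map, new[] { 0x0003, 0x0004 }); // "Hi"
```

A real document has one table per font, so you would likely need to track which font is active when each `Tj`/`TJ` operator runs and pick the matching table; the sketch leaves that part out.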