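Unfortunately, I can't make any further progress on my "project" and can't find a solution that satisfies me, so I am turning to you in the hope of getting a good solution to my problem.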
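I want to implement the following in my Python project. It concerns a DataFrame column that holds "RTF" values and is stored with dtype "object". Some of the RTF values contain actual string content, for example "\viewkind4\uc1\pard\lang1031\f0\fs20 This text section, is to show us how string values are stored in the database.\par". For values like this one, I want my code to iterate over the column and, at the end of the iteration, keep only the "pure" string value and overwrite it in the same row/column.
So the whole thing looks like this:
Before the iteration: "\viewkind4\uc1\pard\lang1031\f0\fs20 This text section, is to show us how string values are stored in the database.\par"
After the iteration: This text section, is to show us how string values are stored in the database.
You will also find other sequences that can all be removed, such as {\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif, or \par, or \tab, and so on.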
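The number of rows in the DataFrame is never known in advance, because I fetch the data from an Oracle database. Unfortunately, the string values are already stored this way in the database. So the bottom line is that this is pure raw data.
I hope someone can and will help me.
Thanks in advance.
PS: I am open to all possible solutions. There must be another way to handle the RTF markup in Python.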
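Additional question (optional): After "cleaning" the RTF, I would like to merge the rows, i.e. reduce their number. Each RTF value in the column is referenced by an ID (column: TEXTID). As you can see in the attached screenshot, the first 19 rows form one complete "text" that has been split up by tabs and other things. After that, the numbering starts again from 1. The next step here would be to combine all rows that belong together.
I would be very grateful if I could get a possible solution for this from you as well.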
Code section: This is my current solution, which I am not satisfied with. The iteration starts at the comment "# Iterate over the DataFrames".

import glob
import os

import pandas as pd

# Create DataFrames with a numeric suffix (e.g. df_1, df_2, etc.)
# Read all CSV files in the current directory whose names start with "Auftrag_"
csv_files = glob.glob(os.path.join(".", "Auftrag_*.csv"))
dfs = {}
num_df = 0
for file in csv_files:
    num_df += 1
    df_name = f"df_{num_df}"
    dfs[df_name] = pd.read_csv(file)
    locals()[df_name] = dfs[df_name]

# Print the number of generated DataFrames and their names
df_names = list(dfs.keys())
num_df = len(df_names)
print(f"{num_df} DataFrames were generated.")
for name in df_names:
    print(f"DataFrame name: {name}")

# Iterate over the DataFrames
for df in dfs.values():
    # Iterate over the columns
    for col in df.columns:
        # Only process the column named "Text"
        if col == "Text":
            # Go through the values of the column one by one
            for i in range(len(df)):
                # Extract the RTF value and clean it
                rtf = str(df[col].iloc[i])
                if rtf.startswith("{\\rtf1\\"):
                    start = rtf.find("{", 1)
                    end = rtf.rfind("}")
                    rtf = rtf[start:end]
                    # Remove all formatting and keep only the text
                    text_parts = rtf.split("\\")
                    clean_text_parts = []
                    for part in text_parts:
                        if not part.startswith(("rtf", "k", "ansi", "deff", "fonttbl", "colortbl", "stylesheet", "pn", "par")):
                            clean_text_parts.append(part.strip())
                    clean_text = " ".join(clean_text_parts)
                    # Write the cleaned text back into the column
                    df.at[i, col] = clean_text

# for name, df in dfs.items(): df.to_csv(f"{name}.csv", index=False)
Output PNG:
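I recently also needed something to convert a large number of rtf files to plain text. I used the post "Regular Expression for extracting text from an RTF string" as a basis and extended it to the point where it can handle many different rtf files. It is certainly not complete, because the RTF standard is huge.
You can use and test my code below.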
import json
import logging
import re

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(threadName)s %(name)s %(message)s",
    level=logging.DEBUG,
)

# Control words that map directly to a plain-text character
specialchars: dict = {
    "par": "\n",
    "sect": "\n\n",
    "page": "\n\n",
    "line": "\n",
    "tab": "\t",
    "emdash": "\u2014",
    "endash": "\u2013",
    "emspace": "\u2003",
    "enspace": "\u2002",
    "qmspace": "\u2005",
    "bullet": "\u2022",
    "lquote": "\u2018",
    "rquote": "\u2019",
    "ldblquote": "\u201C",
    "rdblquote": "\u201D",
}

# Hex escapes (\'xx) that need a special mapping
specialHexes: dict = {
    "84": "\u201E",  # lower double quote left
}

# Tokenizer: control word with optional numeric argument, \'xx hex escape,
# control symbol, brace, line break, or any other single character
pattern: re.Pattern = re.compile(
    r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]|(.)",
    re.I,
)

# \fcharsetN number -> Python codec name
encodings: dict = {
    0: "utf8",
    2: "symbol",
    128: "cp932",  # shift_jis
    129: "cp949",  # euc_kr
    134: "cp936",  # gb2312
    135: "cp936",  # gb2312
    136: "cp950",  # big5
    161: "cp1253",  # greek
    162: "cp1254",  # turkish
    163: "cp1254",  # turkish
    204: "cp1251",  # russian
    238: "cp1250",  # east_eu
}

# Destination keywords: groups introduced by one of these are skipped
with open("rtf_keywords.json", "r") as f:
    destinations: frozenset = frozenset(json.load(f))

# Wingdings byte (as hex string) -> Unicode code point
with open("wingdings_to_utf.json", "r") as f:
    winddingsConvertTable = json.load(f)


class StripRtf:
    def __init__(self) -> None:
        self.hexValue: str = ""  # buffered hex digits of a pending double-byte character
        self.symbolsInWingdings: bool = False
        self.inGroup: bool = False

    def unknowCharsetUsed(self, text: str, charsetNum: int) -> bool:
        # True if the font number is referenced at least twice in the text
        return len(re.findall(r"\\f" + str(charsetNum), text)) >= 2

    def decodeHexValue(self, hex: str, encoding: str) -> str:
        decoded = bytes.fromhex(hex).decode(encoding)
        self.hexValue = ""
        return decoded

    def interpreteHexValue(self, hex: str, encoding: str) -> str:
        if encoding in ("cp932", "cp950", "cp949", "cp936"):
            # Double-byte code pages: a lead byte has to wait for its trail byte
            self.hexValue += hex
            if len(self.hexValue) == 2 and int(self.hexValue, 16) & 128:
                return ""
            try:
                return self.decodeHexValue(self.hexValue, encoding)
            except UnicodeDecodeError:
                if encoding == "cp936":  # fallback for gb2312
                    try:
                        return self.decodeHexValue(self.hexValue, "gb18030")
                    except UnicodeDecodeError:
                        pass
                logging.warning(
                    f"Can't decode Hex {self.hexValue} for encoding {encoding}"
                )
                # Drop the undecodable bytes instead of returning None
                self.hexValue = ""
                return ""
        elif encoding in ("cp1251", "cp1253", "cp1254", "cp1250"):
            return self.decodeHexValue(hex, encoding)
        elif encoding == "symbol":
            if hex.upper() in winddingsConvertTable and self.symbolsInWingdings:
                return chr(winddingsConvertTable[hex.upper()])
            return ""
        else:
            if hex in specialHexes:
                return specialHexes[hex]
            return chr(int(hex, 16))

    def stripRtf(self, text: str) -> str:
        # Reset per-call state so one instance can be reused for many values
        self.hexValue = ""
        self.inGroup = False

        charsets = {}
        # Try to find all charsets: font number -> \fcharset number
        for charset in re.findall(r"f\d+[0-9a-zA-Z\\]*fcharset\d+", text):
            relevantNums = re.findall(r"f(?:charset)?\d+", charset)
            charsetNum = int(re.search(r"\d+", relevantNums[0]).group())
            encodingNum = int(re.search(r"\d+", relevantNums[1]).group())
            charsets[charsetNum] = encodingNum
        if charsets:
            defaultCharset = list(charsets.keys())[0]
        else:
            defaultCharset = 0

        stack = []
        ignorable = False  # Whether this group (and all inside it) are "ignorable".
        ucskip = 1  # Number of ASCII characters to skip after a unicode character.
        curskip = 0  # Number of ASCII characters left to skip
        out = []  # Output buffer.
        self.symbolsInWingdings = re.search(r"[Ww]ingdings", text) is not None

        # Debug aid for missing charsets; you can comment out the exception,
        # but then the text may not be decoded properly
        for charsetNum, encodingNum in charsets.items():
            if encodingNum not in encodings:
                if self.unknowCharsetUsed(text, charsetNum):
                    raise Exception(
                        f"Found unknown charset-number while parsing RTF: {encodingNum}"
                    )

        encodingChange = False
        encodingHasChanged = False
        # Start with the encoding of the first font found in the font table
        encoding = encodings.get(charsets.get(defaultCharset, 0), encodings[0])
        encodingOld = encoding  # restored when a group that switched fonts closes

        for match in pattern.finditer(text):
            word, arg, hex, char, brace, tchar = match.groups()
            if brace:
                curskip = 0
                if brace == "{":
                    self.inGroup = True
                    stack.append((ucskip, ignorable))
                elif brace == "}":
                    if encodingHasChanged and self.inGroup:
                        encoding = encodingOld
                        encodingHasChanged = False
                    self.inGroup = False
                    if stack:
                        ucskip, ignorable = stack.pop()
            if char:  # \x (not a letter)
                curskip = 0
                if char == "~":
                    if not ignorable:
                        out.append("\xA0")
                elif char in "{}\\":
                    if not ignorable:
                        out.append(char)
                elif char == "*":
                    ignorable = True
            if word:  # \foo
                if word == "f":
                    encodingChange = True
                curskip = 0
                if word in destinations:
                    ignorable = True
                elif ignorable:
                    pass
                elif word in specialchars:
                    out.append(specialchars[word])
                elif word == "uc":
                    ucskip = int(arg)
                elif word == "u":
                    c = int(arg)
                    if c < 0:
                        c += 0x10000
                    if 55296 <= c <= 57343:  # replace surrogates
                        out.append("?")
                    else:
                        out.append(chr(c))
                    curskip = ucskip
            if hex:  # \'xx
                if curskip > 0:
                    curskip -= 1
                elif not ignorable:
                    out.append(self.interpreteHexValue(hex, encoding))
            if tchar:
                if curskip > 0:
                    curskip -= 1
                elif not ignorable:
                    out.append(tchar)
            if arg:
                if self.inGroup and encodingChange:
                    encodingOld = encoding
                if encodingChange:
                    encodingChange = False
                    # for reverting encoding when leaving closed group
                    encodingHasChanged = True
                    if int(arg) not in charsets:
                        continue
                    if charsets[int(arg)] not in encodings:
                        continue
                    encoding = encodings[charsets[int(arg)]]

        text = "".join(out).strip()
        # Optionally collapse consecutive line breaks:
        # text = re.sub(r"[\n]+", r"\n", text)
        return text

    def revertHexValues(self, text: str) -> str:
        result = ""
        for char in text:
            numChar: int = ord(char)
            if 127 < numChar < 256:
                # Single-byte character: emit a \'xx hex escape
                result += "\\'" + format(numChar, "02x")
            elif numChar >= 256:
                # "{\\rtlch\\fcs1\\f0\\fnil\\fcharset0 \\u" + str(numChar) + "\\'3f}"
                result += "\\u" + str(numChar) + "\\'3f"
            else:
                result += char
        return result

The "rtf_keywords.json" file used above:
[
"aftncn",
"aftnsep",
"aftnsepc",
"annotation",
"atnauthor",
"atndate",
"atnicn",
"atnid",
"atnparent",
"atnref",
"atntime",
"atrfend",
"atrfstart",
"author",
"background",
"bkmkend",
"bkmkstart",
"blipuid",
"buptim",
"category",
"colorschememapping",
"colortbl",
"comment",
"company",
"creatim",
"datafield",
"datastore",
"defchp",
"defpap",
"do",
"doccomm",
"docvar",
"dptxbxtext",
"ebcend",
"ebcstart",
"factoidname",
"falt",
"fchars",
"ffdeftext",
"ffentrymcr",
"ffexitmcr",
"ffformat",
"ffhelptext",
"ffl",
"ffname",
"ffstattext",
"field",
"file",
"filetbl",
"fldinst",
"fldrslt",
"fldtype",
"fname",
"fontemb",
"fontfile",
"fonttbl",
"footer",
"footerf",
"footerl",
"footerr",
"footnote",
"formfield",
"ftncn",
"ftnsep",
"ftnsepc",
"g",
"generator",
"gridtbl",
"header",
"headerf",
"headerl",
"headerr",
"hl",
"hlfr",
"hlinkbase",
"hlloc",
"hlsrc",
"hsv",
"htmltag",
"info",
"keycode",
"keywords",
"latentstyles",
"lchars",
"levelnumbers",
"leveltext",
"lfolevel",
"linkval",
"list",
"listlevel",
"listname",
"listoverride",
"listoverridetable",
"listpicture",
"liststylename",
"listtable",
"listtext",
"lsdlockedexcept",
"macc",
"maccPr",
"mailmerge",
"maln",
"malnScr",
"manager",
"margPr",
"mbar",
"mbarPr",
"mbaseJc",
"mbegChr",
"mborderBox",
"mborderBoxPr",
"mbox",
"mboxPr",
"mchr",
"mcount",
"mctrlPr",
"md",
"mdeg",
"mdegHide",
"mden",
"mdiff",
"mdPr",
"me",
"mendChr",
"meqArr",
"meqArrPr",
"mf",
"mfName",
"mfPr",
"mfunc",
"mfuncPr",
"mgroupChr",
"mgroupChrPr",
"mgrow",
"mhideBot",
"mhideLeft",
"mhideRight",
"mhideTop",
"mhtmltag",
"mlim",
"mlimloc",
"mlimlow",
"mlimlowPr",
"mlimupp",
"mlimuppPr",
"mm",
"mmaddfieldname",
"mmath",
"mmathPict",
"mmathPr",
"mmaxdist",
"mmc",
"mmcJc",
"mmconnectstr",
"mmconnectstrdata",
"mmcPr",
"mmcs",
"mmdatasource",
"mmheadersource",
"mmmailsubject",
"mmodso",
"mmodsofilter",
"mmodsofldmpdata",
"mmodsomappedname",
"mmodsoname",
"mmodsorecipdata",
"mmodsosort",
"mmodsosrc",
"mmodsotable",
"mmodsoudl",
"mmodsoudldata",
"mmodsouniquetag",
"mmPr",
"mmquery",
"mmr",
"mnary",
"mnaryPr",
"mnoBreak",
"mnum",
"mobjDist",
"moMath",
"moMathPara",
"moMathParaPr",
"mopEmu",
"mphant",
"mphantPr",
"mplcHide",
"mpos",
"mr",
"mrad",
"mradPr",
"mrPr",
"msepChr",
"mshow",
"mshp",
"msPre",
"msPrePr",
"msSub",
"msSubPr",
"msSubSup",
"msSubSupPr",
"msSup",
"msSupPr",
"mstrikeBLTR",
"mstrikeH",
"mstrikeTLBR",
"mstrikeV",
"msub",
"msubHide",
"msup",
"msupHide",
"mtransp",
"mtype",
"mvertJc",
"mvfmf",
"mvfml",
"mvtof",
"mvtol",
"mzeroAsc",
"mzeroDesc",
"mzeroWid",
"nesttableprops",
"nextfile",
"nonesttables",
"objalias",
"objclass",
"objdata",
"object",
"objname",
"objsect",
"objtime",
"oldcprops",
"oldpprops",
"oldsprops",
"oldtprops",
"oleclsid",
"operator",
"panose",
"password",
"passwordhash",
"pgp",
"pgptbl",
"picprop",
"pict",
"pn",
"pnseclvl",
"pntext",
"pntxta",
"pntxtb",
"printim",
"private",
"propname",
"protend",
"protstart",
"protusertbl",
"pxe",
"result",
"revtbl",
"revtim",
"rsidtbl",
"rxe",
"shp",
"shpgrp",
"shpinst",
"shppict",
"shprslt",
"shptxt",
"sn",
"sp",
"staticval",
"stylesheet",
"subject",
"sv",
"svb",
"tc",
"template",
"themedata",
"title",
"txe",
"ud",
"upr",
"userprops",
"wgrffmtfilter",
"windowcaption",
"writereservation",
"writereservhash",
"xe",
"xform",
"xmlattrname",
"xmlattrvalue",
"xmlclose",
"xmlname",
"xmlnstbl",
"xmlopen"
]
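To make the role of this keyword list a bit more concrete, here is a minimal sketch of the skipping mechanism, not the full parser from above. It assumes a tiny, hand-picked subset of the keywords (DESTINATIONS) and a simplified tokenizer (TOKEN); both names and the function visible_text are made up for this illustration. Hex escapes, \uN and control symbols are deliberately left out.

```python
import re

# Hypothetical, heavily reduced subset of rtf_keywords.json (illustration only).
DESTINATIONS = frozenset({"fonttbl", "colortbl", "stylesheet", "info"})

# Simplified tokenizer: control word, brace, line break, or any other character.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([{}])|[\r\n]|(.)", re.I)


def visible_text(rtf: str) -> str:
    stack = []         # "ignorable" flags of the enclosing groups
    ignorable = False  # True while inside a destination group
    out = []
    for m in TOKEN.finditer(rtf):
        word, _arg, brace, char = m.groups()
        if brace == "{":
            stack.append(ignorable)      # remember the state of the outer group
        elif brace == "}":
            if stack:
                ignorable = stack.pop()  # restore it when the group closes
        elif word:
            if word in DESTINATIONS:
                ignorable = True         # skip everything up to the closing brace
            elif word == "par" and not ignorable:
                out.append("\n")
        elif char and not ignorable:
            out.append(char)
    return "".join(out).strip()


sample = (
    r"{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}"
    r"\viewkind4\uc1\pard\f0\fs20 Hello world.\par}"
)
print(visible_text(sample))  # Hello world.
```

The font name "MS Sans Serif" disappears because \fonttbl marks its group as ignorable, and the flag is only reset when that group's closing brace is reached.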
The contents of "wingdings_to_utf.json":
{
"20": 32,
"21": 128393,
"22": 9986,
"23": 9985,
"24": 128083,
"25": 128365,
"26": 128366,
"27": 128367,
"28": 128383,
"29": 9990,
"2A": 128386,
"2B": 128387,
"2C": 128234,
"2D": 128235,
"2E": 128236,
"2F": 128237,
"30": 128193,
"31": 128194,
"32": 128196,
"33": 128463,
"34": 128464,
"35": 128452,
"36": 8987,
"37": 128430,
"38": 128432,
"39": 128434,
"3A": 128435,
"3B": 128436,
"3C": 128427,
"3D": 128428,
"3E": 9991,
"3F": 9997,
"40": 128398,
"41": 9996,
"42": 128076,
"43": 128077,
"44": 128078,
"45": 9756,
"46": 9758,
"47": 9757,
"48": 9759,
"49": 128400,
"4A": 9786,
"4B": 128528,
"4C": 9785,
"4D": 128163,
"4E": 9760,
"4F": 127987,
"50": 127985,
"51": 9992,
"52": 9788,
"53": 128167,
"54": 10052,
"55": 128326,
"56": 10014,
"57": 128328,
"58": 10016,
"59": 10017,
"5A": 9770,
"5B": 9775,
"5C": 2384,
"5D": 9784,
"5E": 9800,
"5F": 9801,
"60": 9802,
"61": 9803,
"62": 9804,
"63": 9805,
"64": 9806,
"65": 9807,
"66": 9808,
"67": 9809,
"68": 9810,
"69": 9811,
"6A": 128624,
"6B": 128629,
"6C": 9679,
"6D": 128318,
"6E": 9632,
"6F": 9633,
"70": 128912,
"71": 10065,
"72": 10066,
"73": 11047,
"74": 10731,
"75": 9670,
"76": 10070,
"77": 11045,
"78": 8999,
"79": 11193,
"7A": 8984,
"7B": 127989,
"7C": 127990,
"7D": 128630,
"7E": 128631,
"80": 9450,
"81": 9312,
"82": 9313,
"83": 9314,
"84": 9315,
"85": 9316,
"86": 9317,
"87": 9318,
"88": 9319,
"89": 9320,
"8A": 9321,
"8B": 9471,
"8C": 10102,
"8D": 10103,
"8E": 10104,
"8F": 10105,
"90": 10106,
"91": 10107,
"92": 10108,
"93": 10109,
"94": 10110,
"95": 10111,
"96": 128610,
"97": 128608,
"98": 128609,
"99": 128611,
"9A": 128606,
"9B": 128604,
"9C": 128605,
"9D": 128607,
"9E": 183,
"9F": 8226,
"A0": 9642,
"A1": 9898,
"A2": 128902,
"A3": 128904,
"A4": 9673,
"A5": 9678,
"A6": 128319,
"A7": 9642,
"A8": 9723,
"A9": 128962,
"AA": 10022,
"AB": 9733,
"AC": 10038,
"AD": 10036,
"AE": 10041,
"AF": 10037,
"B0": 11216,
"B1": 8982,
"B2": 10209,
"B3": 8977,
"B4": 11217,
"B5": 10026,
"B6": 10032,
"B7": 128336,
"B8": 128337,
"B9": 128338,
"BA": 128339,
"BB": 128340,
"BC": 128341,
"BD": 128342,
"BE": 128343,
"BF": 128344,
"C0": 128345,
"C1": 128346,
"C2": 128347,
"C3": 11184,
"C4": 11185,
"C5": 11186,
"C6": 11187,
"C7": 11188,
"C8": 11189,
"C9": 11190,
"CA": 11191,
"CB": 128618,
"CC": 128619,
"CD": 128597,
"CE": 128596,
"CF": 128599,
"D0": 128598,
"D1": 128592,
"D2": 128593,
"D3": 128594,
"D4": 128595,
"D5": 9003,
"D6": 8998,
"D7": 11160,
"D8": 11162,
"D9": 11161,
"DA": 11163,
"DB": 11144,
"DC": 11146,
"DD": 11145,
"DE": 11147,
"DF": 129128,
"E0": 129130,
"E1": 129129,
"E2": 129131,
"E3": 129132,
"E4": 129133,
"E5": 129135,
"E6": 129134,
"E7": 129144,
"E8": 129146,
"E9": 129145,
"EA": 129147,
"EB": 129148,
"EC": 129149,
"ED": 129151,
"EE": 129150,
"EF": 8678,
"F0": 8680,
"F1": 8679,
"F2": 8681,
"F3": 11012,
"F4": 8691,
"F5": 11008,
"F6": 11009,
"F7": 11011,
"F8": 11010,
"F9": 129196,
"FA": 129197,
"FB": 128502,
"FC": 10004,
"FD": 128503,
"FE": 128505,
"FF": 8862
}
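For the example from the question, this returns "This text section, is to show us how string values are stored in the database.", even though the input is not a valid RTF file.
revertHexValues goes the other way and converts the non-ASCII characters of a plain text back into RTF escape sequences.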
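As a rough illustration of that escaping step, here is a standalone sketch. It is not the method from the class above: the function name escape_non_ascii is made up, and it additionally assumes that \uN arguments above 32767 should be written as negative numbers, because the RTF specification treats N as a signed 16-bit value (stripRtf already undoes this by adding 0x10000 to negative values). Characters outside the Basic Multilingual Plane are not handled here.

```python
def escape_non_ascii(text: str) -> str:
    """Escape non-ASCII characters as RTF \\'xx or \\uN sequences (BMP only)."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                 # plain ASCII stays as it is
        elif code < 256:
            parts.append(f"\\'{code:02x}")   # single byte -> \'xx
        else:
            if code > 32767:
                code -= 65536                # \uN takes a signed 16-bit value
            parts.append(f"\\u{code}\\'3f")  # \'3f is the "?" fallback character
    return "".join(parts)


print(escape_non_ascii("Gr\u00f6\u00dfe \u2013 ok"))  # Gr\'f6\'dfe \u8211\'3f ok
```

The trailing \'3f is the replacement character that readers without Unicode support display; a parser that honours \uc1 skips it after decoding \uN.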
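To test it, simply use:
sr = StripRtf()
# Raw string (r"..."), so that Python does not interpret \v, \u or \f as escape sequences
text = sr.stripRtf(r"\viewkind4\uc1\pard\lang1031\f0\fs20 This text section, is to show us how string values are stored in the database.\par")
print(text)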
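For the DataFrame part of the question, including the optional one about merging rows, something along these lines might work. Treat it as a sketch under assumptions: the column names "Text" and "TEXTID" are taken from the question, rows that share a TEXTID are assumed to belong together and to be in the right order already, and the helper name clean_and_merge as well as the stand-in cleaner are made up so that the snippet runs without the JSON files.

```python
import pandas as pd


def clean_and_merge(df, clean, id_col="TEXTID", text_col="Text"):
    """Clean every value in text_col with `clean`, then join rows sharing an id."""
    cleaned = df.copy()
    # Missing values (NaN after read_csv) become empty strings instead of "nan".
    cleaned[text_col] = cleaned[text_col].map(
        lambda value: clean(value) if isinstance(value, str) else ""
    )
    merged = (
        cleaned.groupby(id_col, sort=False)[text_col]
        .agg(lambda parts: " ".join(p for p in parts if p))  # skip empty parts
        .reset_index()
    )
    return merged


def stand_in_clean(value: str) -> str:
    # Placeholder for a real RTF cleaner; only strips \pard and \par.
    return value.replace("\\pard ", "").replace("\\par", "").strip()


demo = pd.DataFrame(
    {
        "TEXTID": [7, 7, 8],
        "Text": [r"\pard first part\par", r"\pard second part\par", r"\pard other text\par"],
    }
)
result = clean_and_merge(demo, stand_in_clean)
print(result)
```

With the class from this answer, clean would be StripRtf().stripRtf instead of stand_in_clean. If the parts of one text have to be sorted by a line-number column first, sort the DataFrame by that column before calling the helper.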