How can I edit RTF values in a DataFrame in Python?


Unfortunately I cannot get any further with my "project" and cannot find a solution that satisfies me, so I am turning to you now in the hope of getting a good solution to my problem.

I would like to implement the following in my Python project. It concerns a column in a DataFrame that holds "RTF" values and is stored with dtype "object". Some of the RTF entries contain an actual string value, for example "\viewkind4\uc1\pard\lang1031\f0\fs20 This text section is to show us how string values are stored in the database.\par". For entries like this I would like my code to iterate over the column and, at the end of the iteration, keep only the "pure" string value and overwrite it in the same row/column.

So the whole thing would look like this:

Before the iteration: "\viewkind4\uc1\pard\lang1031\f0\fs20 This text section, is to show us how string values are stored in the database.\par"
After the iteration: This text section, is to show us how string values are stored in the database.

You will also find characters that can simply be deleted, for example {\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif, or \par, or \tab, and so on.

The number of rows in the DataFrame is always unknown here, because I pull the data from an Oracle database. Unfortunately the string values are already stored this way in the database, so the bottom line is that this is pure raw data.

I hope someone can or will help me.

Thanks in advance.


PS: I am open to all possible solutions. There must be another way to handle or display RTF content in Python.
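
To make the goal concrete, here is a rough sketch of the kind of call I have in mind (only a sketch; it assumes the third-party striprtf package and one of my DataFrames df with the "Text" column, and I have not verified it against my raw fragments):

from striprtf.striprtf import rtf_to_text  # third-party package: pip install striprtf

def clean_rtf(value) -> str:
    # Convert one RTF fragment to plain text; fall back to the raw value if it cannot be parsed.
    try:
        return rtf_to_text(str(value)).strip()
    except Exception:
        return str(value)

# applied to one of my DataFrames:
# df["Text"] = df["Text"].apply(clean_rtf)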

Additional question (optional): after "cleaning" the RTF I would like to merge (condense) rows, i.e. reduce the number of rows. Every RTF in the column is referenced by an ID (column: TEXTID). As you can see in the attached screenshot, the first 19 rows form one complete "text", but it is split up by tabs and the like; after that the counter starts again at 1. The next step would therefore be to combine all the lines that belong together, roughly as sketched below.

If I could get a possible solution for this from you as well, I would be very grateful.
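
For the merging step I picture something like this (again only a sketch; it assumes the cleaned plain-text column is called "Text", that TEXTID restarts at 1 for every new text as in the screenshot, and that the rows are already in order):

import pandas as pd

# Build a group key from the points where TEXTID restarts at 1,
# then join all line fragments of one group into a single string.
group_key = (df["TEXTID"] == 1).cumsum()
merged = (
    df.groupby(group_key, sort=False)["Text"]
      .apply(lambda parts: " ".join(str(p) for p in parts if pd.notna(p)))
      .reset_index(drop=True)
)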


Code section: this is my current solution, and I am not happy with it. The iteration starts at the comment "Iterate through the DataFrames".
# Create DataFrames with a numeric suffix (e.g. df_1, df_2, ...)
import os
import glob

import pandas as pd

# Read all CSV files in the current directory whose names start with "Auftrag_"
csv_files = glob.glob(os.path.join(".", "Auftrag_*.csv"))

dfs = {}
num_df = 0
for file in csv_files:
    num_df += 1
    df_name = f"df_{num_df}"
    dfs[df_name] = pd.read_csv(file)
    locals()[df_name] = dfs[df_name]  # also expose each DataFrame as df_1, df_2, ...

# Print the number of generated DataFrames and their names
df_names = list(dfs.keys())
num_df = len(df_names)
print(f"{num_df} DataFrames were generated.")
for name in df_names:
    print(f"DataFrame name: {name}")

# Iterate through the DataFrames
for df in dfs.values():

    # Iterate through the columns
    for col in df.columns:

        # Only process the column named "Text"
        if col == "Text":

            # Go through the values of the column one by one
            for i in range(len(df)):

                # Extract and clean the RTF value
                rtf = str(df[col].iloc[i])
                if rtf.startswith("{\\rtf1\\"):
                    start = rtf.find("{", 1)
                    end = rtf.rfind("}")
                    rtf = rtf[start:end]

                # Remove all formatting and keep only the text
                text_parts = rtf.split("\\")
                clean_text_parts = []
                for part in text_parts:
                    if not part.startswith(("rtf", "k", "ansi", "deff", "fonttbl", "colortbl", "stylesheet", "pn", "par")):
                        clean_text_parts.append(part.strip())
                clean_text = " ".join(clean_text_parts)

                # Write the cleaned text back into the column
                df.at[i, col] = clean_text

# for name, df in dfs.items(): df.to_csv(f"{name}.csv", index=False)
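
For completeness, the same splitting heuristic written as a single column-wise apply instead of nested loops (a sketch only, using the identical keyword filter as above):

def strip_rtf_heuristic(value) -> str:
    # Identical heuristic to the loop above: drop the outer {\rtf1...} wrapper and
    # discard backslash-separated parts that look like formatting keywords.
    rtf = str(value)
    if rtf.startswith("{\\rtf1\\"):
        rtf = rtf[rtf.find("{", 1):rtf.rfind("}")]
    keep = [part.strip() for part in rtf.split("\\")
            if not part.startswith(("rtf", "k", "ansi", "deff", "fonttbl",
                                    "colortbl", "stylesheet", "pn", "par"))]
    return " ".join(keep)

for df in dfs.values():
    if "Text" in df.columns:
        df["Text"] = df["Text"].apply(strip_rtf_heuristic)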

Output PNG: [screenshot omitted]

python dataframe iteration rtf
1 Answer

I also recently needed something to convert a large number of rtf files to plain text. I used the article on extracting text from an RTF string with a regex as a starting point and extended it until it could handle many different rtf files. It is certainly not complete, because the rtf standard is huge.
You can use and test my code below.

import re
import json
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(threadName)s %(name)s %(message)s",
    level=logging.DEBUG,
)

specialchars: dict = {
    "par": "\n",
    "sect": "\n\n",
    "page": "\n\n",
    "line": "\n",
    "tab": "\t",
    "emdash": "\u2014",
    "endash": "\u2013",
    "emspace": "\u2003",
    "enspace": "\u2002",
    "qmspace": "\u2005",
    "bullet": "\u2022",
    "lquote": "\u2018",
    "rquote": "\u2019",
    "ldblquote": "\u201C",
    "rdblquote": "\u201D",
}

specialHexes: dict = {
    "84": "\u201E", # lower double quote left
}

pattern: re.Pattern = re.compile(
    r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]|(.)",
    re.I,
)

encodings: dict = {
    0: "utf8",
    2: "symbol",
    128: "cp932",  # shift_jis
    129: "cp949",  # euc_kr
    134: "cp936",  # gb2312
    135: "cp936",  # gb2312
    136: "cp950",  # big5
    161: "cp1253",  # greece
    162: "cp1254",  # turkish
    163: "cp1254",  # turkish
    204: "cp1251",  # russian
    238: "cp1250",  # east_eu
}

with open("rtf_keywords.json", "r") as f:
    destinations: frozenset = frozenset(json.loads(f.read()))

with open("wingdings_to_utf.json", "r") as f:
    winddingsConvertTable = json.loads(f.read())


class StripRtf:
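    """Best-effort conversion of RTF-formatted strings to plain text via a token regex."""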
    def __init__(self) -> None:
        self.hexValue: str = ""
        self.symbolsInWingdings: bool = False
        self.inGroup: bool = False

    def unknowCharsetUsed(self, text: str, charsetNum: int) -> bool:
        return len(re.findall(r"\\f" + str(charsetNum), text)) >= 2

    def decodeHexValue(self, hex: str, encoding: str) -> str:
        byteStr = bytes.fromhex(hex).decode(encoding).encode("utf-8")
        self.hexValue = ""
        return str(byteStr, "utf-8")

    def interpreteHexValue(self, hex: str, encoding: str) -> str:
        if (
            encoding == "cp932"
            or encoding == "cp950"
            or encoding == "cp949"
            or encoding == "cp936"
        ):
            self.hexValue += hex

            if len(self.hexValue) == 2:
                if int(self.hexValue, 16) & 128: # detect leading bit
                    return ""
            try:
                return self.decodeHexValue(self.hexValue, encoding)
            except Exception:
                if encoding == "cp936":  # fallback for gb2312
                    try:
                        return self.decodeHexValue(self.hexValue, "gb18030")
                    except Exception:
                        logging.warning(
                            f"Can't decode Hex {self.hexValue} for encoding {encoding}"
                        )
        elif (
            encoding == "cp1251"
            or encoding == "cp1253"
            or encoding == "cp1254"
            or encoding == "cp1250"
        ):
            return self.decodeHexValue(hex, encoding)
        elif encoding == "symbol":
            if (
                hex.upper() in winddingsConvertTable.keys()
            ) and self.symbolsInWingdings:
                return chr(winddingsConvertTable[hex.upper()])
            else:
                return ""
        else:
            if hex in specialHexes.keys():
                c = specialHexes[hex]
            else:
                c = chr(int(hex, 16))
            return c

    def stripRtf(self, text: str) -> str:
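        """Strip RTF control words, groups and escapes from text and return plain text."""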
        charsets = {}

        # try to find all charsets
        for charset in re.findall(r"f\d+[0-9a-zA-Z\\]*fcharset\d+", text):
            relevantNums = re.findall(r"f(?:charset)?\d+", charset)

            charsetNum = int(re.search(r"\d+", relevantNums[0]).group())
            encodingNum = int(re.search(r"\d+", relevantNums[1]).group())

            charsets[charsetNum] = encodingNum
        if charsets:
            defaultCharset = list(charsets.keys())[0]
        else:
            defaultCharset = 0

        stack = []
        ignorable = False  # Whether this group (and all inside it) are "ignorable".
        ucskip = 1  # Number of ASCII characters to skip after a unicode character.
        curskip = 0  # Number of ASCII characters left to skip
        out = []  # Output buffer.

        self.symbolsInWingdings = len(re.findall(r"[Ww]ingdings", text)) > 0  # True only if a Wingdings font is declared

        # Debug aid for missing charsets; you can comment out the exception, but it may not work properly
        for charsetNum, encodingNum in charsets.items():
            if encodingNum not in encodings.keys():
                if self.unknowCharsetUsed(text, charsetNum):
                    raise Exception(
                        f"Found unknown charset-number while parsing RTF: {encodingNum}"
                    )

        encodingChange = False
        encodingHasChanged = False
        if defaultCharset in encodings.keys():
            encoding = encodings[defaultCharset]
        else:
            encoding = encodings[0]

        for match in pattern.finditer(text):
            word, arg, hex, char, brace, tchar = match.groups()
            if brace:
                curskip = 0
                if brace == "{":
                    self.inGroup = True
                    stack.append((ucskip, ignorable))
                elif brace == "}":
                    if encodingHasChanged and self.inGroup:
                        encoding = encodingOld
                        encodingHasChanged = False
                    self.inGroup = False
                    if stack:
                        ucskip, ignorable = stack.pop()
            if char:  # \x (an escaped non-letter character)
                curskip = 0
                if char == "~":
                    if not ignorable:
                        out.append("\xA0")
                elif char in "{}\\":
                    if not ignorable:
                        out.append(char)
                elif char == "*":
                    ignorable = True
            if word:  # \foo
                if word == "f":
                    encodingChange = True

                curskip = 0
                if word in destinations:
                    ignorable = True
                elif ignorable:
                    pass
                elif word in specialchars:
                    out.append(specialchars[word])
                elif word == "uc":
                    ucskip = int(arg)
                elif word == "u":
                    c = int(arg)
                    if c < 0:
                        c += 0x10000
                    if 55296 <= c <= 57343:  # replace surrogates
                        out.append("?")
                    elif c > 127:
                        out.append(chr(c))
                    curskip = ucskip
            if hex:  # \'xx
                if curskip > 0:
                    curskip -= 1
                elif not ignorable:
                    out.append(self.interpreteHexValue(hex, encoding))
            if tchar:
                if curskip > 0:
                    curskip -= 1
                elif not ignorable:
                    out.append(tchar)
            if arg is not None:  # compare against None: the string "0" (e.g. from \f0) is falsy
                if self.inGroup and encodingChange:
                    encodingOld = encoding
                if encodingChange:
                    encodingChange = False
                    encodingHasChanged = (
                        True  # for reverting encoding when leaving closed group
                    )

                    if int(arg) not in charsets.keys():
                        continue
                    if charsets[int(arg)] not in encodings.keys():
                        continue
                    encoding = encodings[charsets[int(arg)]]

        text = "".join(out).strip()
        # text = re.sub(r"[\n]+", r"\n", text)
        return text

    def revertHexValues(self, text: str) -> str:
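        """Re-escape non-ASCII characters so the plain text can be embedded into RTF again."""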
        result = ""
        for char in text:
            numChar: int = ord(char)
            if numChar > 127 and numChar < 256:
                hexValue = str(hex(numChar))[2:]
                result += "\\'" + hexValue
            elif numChar >= 256:
                # "{\\rtlch\\fcs1\\f0\\fnil\\fcharset0 \\u" + str(numChar) + "\\'3f}"
                unicodeChar: str = "\\u" + str(numChar) + "\\'3f"
                result += unicodeChar
            else:
                result += char
        return result

The "rtf_keywords.json" used above:

[
    "aftncn",
    "aftnsep",
    "aftnsepc",
    "annotation",
    "atnauthor",
    "atndate",
    "atnicn",
    "atnid",
    "atnparent",
    "atnref",
    "atntime",
    "atrfend",
    "atrfstart",
    "author",
    "background",
    "bkmkend",
    "bkmkstart",
    "blipuid",
    "buptim",
    "category",
    "colorschememapping",
    "colortbl",
    "comment",
    "company",
    "creatim",
    "datafield",
    "datastore",
    "defchp",
    "defpap",
    "do",
    "doccomm",
    "docvar",
    "dptxbxtext",
    "ebcend",
    "ebcstart",
    "factoidname",
    "falt",
    "fchars",
    "ffdeftext",
    "ffentrymcr",
    "ffexitmcr",
    "ffformat",
    "ffhelptext",
    "ffl",
    "ffname",
    "ffstattext",
    "field",
    "file",
    "filetbl",
    "fldinst",
    "fldrslt",
    "fldtype",
    "fname",
    "fontemb",
    "fontfile",
    "fonttbl",
    "footer",
    "footerf",
    "footerl",
    "footerr",
    "footnote",
    "formfield",
    "ftncn",
    "ftnsep",
    "ftnsepc",
    "g",
    "generator",
    "gridtbl",
    "header",
    "headerf",
    "headerl",
    "headerr",
    "hl",
    "hlfr",
    "hlinkbase",
    "hlloc",
    "hlsrc",
    "hsv",
    "htmltag",
    "info",
    "keycode",
    "keywords",
    "latentstyles",
    "lchars",
    "levelnumbers",
    "leveltext",
    "lfolevel",
    "linkval",
    "list",
    "listlevel",
    "listname",
    "listoverride",
    "listoverridetable",
    "listpicture",
    "liststylename",
    "listtable",
    "listtext",
    "lsdlockedexcept",
    "macc",
    "maccPr",
    "mailmerge",
    "maln",
    "malnScr",
    "manager",
    "margPr",
    "mbar",
    "mbarPr",
    "mbaseJc",
    "mbegChr",
    "mborderBox",
    "mborderBoxPr",
    "mbox",
    "mboxPr",
    "mchr",
    "mcount",
    "mctrlPr",
    "md",
    "mdeg",
    "mdegHide",
    "mden",
    "mdiff",
    "mdPr",
    "me",
    "mendChr",
    "meqArr",
    "meqArrPr",
    "mf",
    "mfName",
    "mfPr",
    "mfunc",
    "mfuncPr",
    "mgroupChr",
    "mgroupChrPr",
    "mgrow",
    "mhideBot",
    "mhideLeft",
    "mhideRight",
    "mhideTop",
    "mhtmltag",
    "mlim",
    "mlimloc",
    "mlimlow",
    "mlimlowPr",
    "mlimupp",
    "mlimuppPr",
    "mm",
    "mmaddfieldname",
    "mmath",
    "mmathPict",
    "mmathPr",
    "mmaxdist",
    "mmc",
    "mmcJc",
    "mmconnectstr",
    "mmconnectstrdata",
    "mmcPr",
    "mmcs",
    "mmdatasource",
    "mmheadersource",
    "mmmailsubject",
    "mmodso",
    "mmodsofilter",
    "mmodsofldmpdata",
    "mmodsomappedname",
    "mmodsoname",
    "mmodsorecipdata",
    "mmodsosort",
    "mmodsosrc",
    "mmodsotable",
    "mmodsoudl",
    "mmodsoudldata",
    "mmodsouniquetag",
    "mmPr",
    "mmquery",
    "mmr",
    "mnary",
    "mnaryPr",
    "mnoBreak",
    "mnum",
    "mobjDist",
    "moMath",
    "moMathPara",
    "moMathParaPr",
    "mopEmu",
    "mphant",
    "mphantPr",
    "mplcHide",
    "mpos",
    "mr",
    "mrad",
    "mradPr",
    "mrPr",
    "msepChr",
    "mshow",
    "mshp",
    "msPre",
    "msPrePr",
    "msSub",
    "msSubPr",
    "msSubSup",
    "msSubSupPr",
    "msSup",
    "msSupPr",
    "mstrikeBLTR",
    "mstrikeH",
    "mstrikeTLBR",
    "mstrikeV",
    "msub",
    "msubHide",
    "msup",
    "msupHide",
    "mtransp",
    "mtype",
    "mvertJc",
    "mvfmf",
    "mvfml",
    "mvtof",
    "mvtol",
    "mzeroAsc",
    "mzeroDesc",
    "mzeroWid",
    "nesttableprops",
    "nextfile",
    "nonesttables",
    "objalias",
    "objclass",
    "objdata",
    "object",
    "objname",
    "objsect",
    "objtime",
    "oldcprops",
    "oldpprops",
    "oldsprops",
    "oldtprops",
    "oleclsid",
    "operator",
    "panose",
    "password",
    "passwordhash",
    "pgp",
    "pgptbl",
    "picprop",
    "pict",
    "pn",
    "pnseclvl",
    "pntext",
    "pntxta",
    "pntxtb",
    "printim",
    "private",
    "propname",
    "protend",
    "protstart",
    "protusertbl",
    "pxe",
    "result",
    "revtbl",
    "revtim",
    "rsidtbl",
    "rxe",
    "shp",
    "shpgrp",
    "shpinst",
    "shppict",
    "shprslt",
    "shptxt",
    "sn",
    "sp",
    "staticval",
    "stylesheet",
    "subject",
    "sv",
    "svb",
    "tc",
    "template",
    "themedata",
    "title",
    "txe",
    "ud",
    "upr",
    "userprops",
    "wgrffmtfilter",
    "windowcaption",
    "writereservation",
    "writereservhash",
    "xe",
    "xform",
    "xmlattrname",
    "xmlattrvalue",
    "xmlclose",
    "xmlname",
    "xmlnstbl",
    "xmlopen"
]

And "wingdings_to_utf.json":

{
    "20": 32,
    "21": 128393,
    "22": 9986,
    "23": 9985,
    "24": 128083,
    "25": 128365,
    "26": 128366,
    "27": 128367,
    "28": 128383,
    "29": 9990,
    "2A": 128386,
    "2B": 128387,
    "2C": 128234,
    "2D": 128235,
    "2E": 128236,
    "2F": 128237,
    "30": 128193,
    "31": 128194,
    "32": 128196,
    "33": 128463,
    "34": 128464,
    "35": 128452,
    "36": 8987,
    "37": 128430,
    "38": 128432,
    "39": 128434,
    "3A": 128435,
    "3B": 128436,
    "3C": 128427,
    "3D": 128428,
    "3E": 9991,
    "3F": 9997,
    "40": 128398,
    "41": 9996,
    "42": 128076,
    "43": 128077,
    "44": 128078,
    "45": 9756,
    "46": 9758,
    "47": 9757,
    "48": 9759,
    "49": 128400,
    "4A": 9786,
    "4B": 128528,
    "4C": 9785,
    "4D": 128163,
    "4E": 9760,
    "4F": 127987,
    "50": 127985,
    "51": 9992,
    "52": 9788,
    "53": 128167,
    "54": 10052,
    "55": 128326,
    "56": 10014,
    "57": 128328,
    "58": 10016,
    "59": 10017,
    "5A": 9770,
    "5B": 9775,
    "5C": 2384,
    "5D": 9784,
    "5E": 9800,
    "5F": 9801,
    "60": 9802,
    "61": 9803,
    "62": 9804,
    "63": 9805,
    "64": 9806,
    "65": 9807,
    "66": 9808,
    "67": 9809,
    "68": 9810,
    "69": 9811,
    "6A": 128624,
    "6B": 128629,
    "6C": 9679,
    "6D": 128318,
    "6E": 9632,
    "6F": 9633,
    "70": 128912,
    "71": 10065,
    "72": 10066,
    "73": 11047,
    "74": 10731,
    "75": 9670,
    "76": 10070,
    "77": 11045,
    "78": 8999,
    "79": 11193,
    "7A": 8984,
    "7B": 127989,
    "7C": 127990,
    "7D": 128630,
    "7E": 128631,
    "80": 9450,
    "81": 9312,
    "82": 9313,
    "83": 9314,
    "84": 9315,
    "85": 9316,
    "86": 9317,
    "87": 9318,
    "88": 9319,
    "89": 9320,
    "8A": 9321,
    "8B": 9471,
    "8C": 10102,
    "8D": 10103,
    "8E": 10104,
    "8F": 10105,
    "90": 10106,
    "91": 10107,
    "92": 10108,
    "93": 10109,
    "94": 10110,
    "95": 10111,
    "96": 128610,
    "97": 128608,
    "98": 128609,
    "99": 128611,
    "9A": 128606,
    "9B": 128604,
    "9C": 128605,
    "9D": 128607,
    "9E": 183,
    "9F": 8226,
    "A0": 9642,
    "A1": 9898,
    "A2": 128902,
    "A3": 128904,
    "A4": 9673,
    "A5": 9678,
    "A6": 128319,
    "A7": 9642,
    "A8": 9723,
    "A9": 128962,
    "AA": 10022,
    "AB": 9733,
    "AC": 10038,
    "AD": 10036,
    "AE": 10041,
    "AF": 10037,
    "B0": 11216,
    "B1": 8982,
    "B2": 10209,
    "B3": 8977,
    "B4": 11217,
    "B5": 10026,
    "B6": 10032,
    "B7": 128336,
    "B8": 128337,
    "B9": 128338,
    "BA": 128339,
    "BB": 128340,
    "BC": 128341,
    "BD": 128342,
    "BE": 128343,
    "BF": 128344,
    "C0": 128345,
    "C1": 128346,
    "C2": 128347,
    "C3": 11184,
    "C4": 11185,
    "C5": 11186,
    "C6": 11187,
    "C7": 11188,
    "C8": 11189,
    "C9": 11190,
    "CA": 11191,
    "CB": 128618,
    "CC": 128619,
    "CD": 128597,
    "CE": 128596,
    "CF": 128599,
    "D0": 128598,
    "D1": 128592,
    "D2": 128593,
    "D3": 128594,
    "D4": 128595,
    "D5": 9003,
    "D6": 8998,
    "D7": 11160,
    "D8": 11162,
    "D9": 11161,
    "DA": 11163,
    "DB": 11144,
    "DC": 11146,
    "DD": 11145,
    "DE": 11147,
    "DF": 129128,
    "E0": 129130,
    "E1": 129129,
    "E2": 129131,
    "E3": 129132,
    "E4": 129133,
    "E5": 129135,
    "E6": 129134,
    "E7": 129144,
    "E8": 129146,
    "E9": 129145,
    "EA": 129147,
    "EB": 129148,
    "EC": 129149,
    "ED": 129151,
    "EE": 129150,
    "EF": 8678,
    "F0": 8680,
    "F1": 8679,
    "F2": 8681,
    "F3": 11012,
    "F4": 8691,
    "F5": 11008,
    "F6": 11009,
    "F7": 11011,
    "F8": 11010,
    "F9": 129196,
    "FA": 129197,
    "FB": 128502,
    "FC": 10004,
    "FD": 128503,
    "FE": 128505,
    "FF": 8862
}

The example above will give "This text section, is to show us how string values are stored in the database.", even though the input is not a valid rtf file.
revertHexValues converts plain text back into valid rtf escapes.
To test it, just use:

sr = StripRtf()
text = sr.stripRtf(r"\viewkind4\uc1\pard\lang1031\f0\fs20 This text section, is to show us how string values are stored in the database.\par")
print(text)
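
Applied to your DataFrame column, this would look roughly like the following (a sketch only, assuming the column is called "Text" as in the question):

sr = StripRtf()
df["Text"] = df["Text"].astype(str).apply(sr.stripRtf)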