How can I optimize this XML parsing loop for speed?


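I wrote some code that parses about a hundred XML files and builds a dataframe. The code works, but it can take quite a long time to run (a little under an hour). I'm sure there is a way to improve this loop by only using dataframe objects at the end of the loop, or perhaps the triple-nested loop isn't needed to parse all the information into the dataframe, but as a beginner this is the only way I managed to do it.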

My code looks like this:

from bs4 import BeautifulSoup
import pandas as pd
import lxml
import json
import os

os.chdir(r"path_to_output_file/output_file")
f_list = os.listdir()

df_list = []

output_files = []
# checking we only iterate over XML files containing "calc_output"
for calc_output in f_list:
    if "calc_output" in calc_output and calc_output.endswith(".xml"):
        output_files.append(calc_output)
        
for calc_output in output_files:
    with open(calc_output, "r") as datas:
        print(f"reading file {calc_output} ...")

        doc = BeautifulSoup(datas.read(), "lxml")

        rows = []
        timestamps = doc.time.find_all("timestamp")
        for timestamp in timestamps: # parsing through every timestamp element
            row= {}
            time = timestamp.get("time") # reading timestamp attributes
            temperature = timestamp.get("temperature")
            zone_id = doc.zone.get("zone_id")
            time_id = timestamp.get("time_id")
            row.update({"time":time, "temperature":temperature, "time_id":time_id, "zone_id":zone_id})
            row_copy = row.copy()
            rows.append(row_copy)

        # creating temporary dataframe to combine with other info
        df1 = pd.DataFrame(rows)

        rows= []
        surfacedatas = doc.surfacehistory.find_all("surfacedata")
        for surfacedata in surfacedatas:
            row= {}
            #parsing through every surfacedata element
            time_begin = surfacedata.get("time-begin")
            time_end = surfacedata.get("time-end")
            row={"time-begin":time_begin, "time-end":time_end}

            things = surfacedata.find_all("thing", recursive=False)
            #parsing through every thing in each surfacedata
            for thing in things:
                identity = id2name(thing.get("identity"))
                row.update({"identity":identity})

                locations = thing.find_all("location", recursive=False)
                for location in locations:
                    #parsing through every location for every thing for each surfacedata
                    l_identity = location.get("l_identity")
                    surface = location.getText()
                    row.update({"l_identity":l_identity, "surface":surface})
                    row_copy = row.copy()
                    rows.append(row_copy)
        df2 = pd.DataFrame(rows) # second dataframe containing the information needed

    #merging each dataframe on every loop
    df =pd.merge(df1,df2, left_on="time_id", right_on="time-begin") 
    # then appending it to a list
    df_list.append(df)

# final dataframe created by concatenating each dataframe from each output file
df = pd.concat(df_list)
df

Example XML files look like this:

File 1

<file filename="stack_example_1" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="10">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.256</location>
                    <location l_identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="5"> 2.5</location> 
                    <location l_identity="78"> 3.2</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="17"> 2.4</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>

File 2

<file filename="stack_example_2" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="11">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="45"> 2.6</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.4</location>
                    <location l_identity="8"> 2.7</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="9"> 2.8</location>
                    <location l_identity="17"> 1.2</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>

The output of this code for file 1 and file 2 would be:

zone_id  time   time_id  temperature  time-begin  time-end  identity  l_identity  surface
10       0      1        100          1           2         1         2           1.256
10       0      1        100          1           2         1         45          2.3
10       0      1        100          1           2         3         2           1.6
10       0      1        100          1           2         3         5           2.5
10       0      1        100          1           2         3         78          3.2
10       10.00  2        200          2           3         1         17          2.4
11       0      1        100          1           2         1         2           1.6
11       0      1        100          1           2         1         45          2.6
11       0      1        100          1           2         3         2           1.4
11       0      1        100          1           2         3         8           2.7
11       10.00  2        200          2           3         1         9           2.8
11       10.00  2        200          2           3         1         17          1.2

Here is the output I got from running cProfile:

      Ordered by: internal time
   List reduced from 6281 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   214204   95.337    0.000   95.340    0.000 C:\Users\anon\Anaconda3\lib\json\decoder.py:343(raw_decode)
   214389   20.685    0.000   21.386    0.000 {built-in method io.open}
   214288   17.945    0.000   17.945    0.000 {built-in method _codecs.charmap_decode}
        1   16.745   16.745  336.360  336.360 .\anon_programm.py:7(<module>)
       10   15.378    1.538  132.814   13.281 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:330(feed)
 10277616   12.975    0.000   44.266    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:555(endData)
   214228   12.504    0.000   30.575    0.000 {method 'read' of '_io.TextIOWrapper' objects}
  3425862   11.257    0.000   75.608    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:223(start)
  6851244   10.806    0.000   19.427    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:589(object_was_parsed)
 17128360    8.580    0.000    8.580    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:158(setup)
  3425862    8.389    0.000    8.694    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:527(popTag)
  5961888    7.170    0.000    7.170    0.000 {method 'keys' of 'dict' objects}
  3425872    7.072    0.000   23.054    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:1152(__init__)
   214200    5.978    0.000  146.468    0.001 .\anon_programm.py:18(id2name)
  3425862    5.913    0.000   61.118    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:691(handle_starttag)
  3425002    4.482    0.000   12.571    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\__init__.py:285(_replace_cdata_list_attribute_values)
  3425862    4.326    0.000   37.251    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:278(end)
  3425862    4.244    0.000   13.552    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:657(_popToTag)
  2751774    4.240    0.000    6.154    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:808(<genexpr>)
  6851244    3.869    0.000    8.629    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:932(__new__)

Here is the function that is called many times inside the loop:

import functools

@functools.lru_cache(maxsize=1000)
def id2name(id):
    name_Dict = json.loads( open(r"path_to_JSON_file\file.json","r").read() )
    name = ""
    if id.isnumeric():
        partial_id = id[:-1]  
        if partial_id not in name_Dict.keys():
            return id
        if id[-1] == "0":
            return  name_Dict[partial_id]
        else:
            return  name_Dict[partial_id]+"x"+id[-1]
    else:
        return ""
python pandas optimization beautifulsoup nested-loops
1 Answer

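As pointed out in the comments on your question, most of the time is spent decoding JSON inside the id2name function. Although the function's results are cached, the parsed JSON object is not, which means you load the JSON file from disk and parse it every time you look up a new ID.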
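
The effect described above can be reproduced on a small scale. The sketch below is not the asker's code: it uses a hypothetical two-entry JSON file written to a temporary directory, a simplified copy of the lookup logic named id2name_original, and a load_count counter added purely for illustration. It suggests that lru_cache only saves work for IDs it has already seen, while every new ID still triggers a full read and parse of the file.

```python
import functools
import json
import os
import tempfile

# Hypothetical stand-in for the real JSON file: a tiny id -> name mapping.
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "names.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump({"1": "alpha", "2": "beta"}, f)

load_count = 0  # illustrative counter: how often the file is read and parsed


@functools.lru_cache(maxsize=1000)
def id2name_original(id):
    """Same shape as the question's function: parses the JSON on every cache miss."""
    global load_count
    with open(json_path, "r", encoding="utf-8") as f:
        name_dict = json.load(f)
    load_count += 1
    if not id.isnumeric():
        return ""
    partial_id = id[:-1]
    if partial_id not in name_dict:
        return id
    if id[-1] == "0":
        return name_dict[partial_id]
    return name_dict[partial_id] + "x" + id[-1]


ids = ["10", "12", "20", "10", "12", "999"]
names = [id2name_original(i) for i in ids]

# Four distinct IDs cause four file loads; the two repeats come from the cache.
print(names, load_count)
```

With many distinct IDs spread over a hundred files, those per-miss loads add up, which would be consistent with the time spent in raw_decode in the profile.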

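Assuming you load the same JSON file every time, you should get an immediate speedup by caching the parsed JSON object. You can do that by refactoring the id2name function as follows.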

import functools
import json

@functools.lru_cache()
def load_name_dict():
    with open(r"path_to_JSON_file\file.json", "r", encoding="utf-8") as f: 
        return json.load(f)

@functools.lru_cache(maxsize=1000)
def id2name(thing_id):
    if not thing_id.isnumeric():
        return ""
    name_dict = load_name_dict()
    name = name_dict.get(thing_id[:-1])
    if name is None:
        return thing_id
    last_char = thing_id[-1]
    if last_char == "0":
        return name
    else:
        return name + "x" + last_char

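Note that I have refactored the id2name function so that the JSON object is not loaded when the ID is non-numeric. I also switched to the .get method instead of the in operator, to avoid an unnecessary dictionary lookup. In addition, I renamed id to thing_id, because id is a built-in function in Python.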
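
The difference between the two lookup styles can be seen in isolation. This is a minimal sketch with a hypothetical one-entry mapping; lookup_with_in and lookup_with_get are illustrative names, not part of the original code.

```python
# Hypothetical mapping; in the real code it comes from the JSON file.
name_dict = {"1": "alpha"}


def lookup_with_in(key):
    # On a hit this hashes the key twice: once for `in`, once for the subscript.
    if key not in name_dict:
        return None
    return name_dict[key]


def lookup_with_get(key):
    # A single lookup; returns None when the key is absent.
    return name_dict.get(key)


print(lookup_with_in("1"), lookup_with_get("1"))
print(lookup_with_in("7"), lookup_with_get("7"))
```

One caveat: .get also returns None when the stored value itself is None (a JSON null), so the two styles are only interchangeable if the mapping never contains null values.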

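In addition, since your input files appear to be valid XML, using lxml directly rather than going through BeautifulSoup may save even more time. Or better still, you could use pandas.read_xml to load the XML directly into a dataframe. One caveat, though: you should profile the resulting code to check that it actually runs faster rather than taking my word for it. Intuition about performance is notoriously unreliable. You should always measure it.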
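
As a rough illustration of parsing without BeautifulSoup, here is a sketch that uses the standard library's xml.etree.ElementTree, swapped in here in place of lxml so that the example has no third-party dependencies. It makes several assumptions: SAMPLE is file 1 from the question, parse_rows is a hypothetical helper, time_id values are unique within a zone, and the default id2name argument is an identity placeholder rather than the real lookup.

```python
import xml.etree.ElementTree as ET

# File 1 from the question, inlined so the sketch is self-contained.
SAMPLE = """<file filename="stack_example_1" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="10">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.256</location>
                    <location l_identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="5"> 2.5</location>
                    <location l_identity="78"> 3.2</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="17"> 2.4</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>"""


def parse_rows(xml_text, id2name=lambda s: s):
    """Return one dict per <location>, joined to its timestamp by time_id."""
    root = ET.fromstring(xml_text)
    rows = []
    for zone in root.iter("zone"):
        zone_id = zone.get("zone_id")
        # Index timestamps by time_id so no dataframe merge is needed later.
        stamps = {ts.get("time_id"): ts for ts in zone.iterfind("time/timestamp")}
        for sd in zone.iterfind("surfacehistory/surfacedata"):
            ts = stamps.get(sd.get("time-begin"))
            if ts is None:
                continue  # no matching timestamp: dropped, like an inner join
            for thing in sd.iterfind("thing"):
                identity = id2name(thing.get("identity"))
                for loc in thing.iterfind("location"):
                    rows.append({
                        "zone_id": zone_id,
                        "time": ts.get("time"),
                        "time_id": ts.get("time_id"),
                        "temperature": ts.get("temperature"),
                        "time-begin": sd.get("time-begin"),
                        "time-end": sd.get("time-end"),
                        "identity": identity,
                        "l_identity": loc.get("l_identity"),
                        # strip the leading space present in the sample files
                        "surface": (loc.text or "").strip(),
                    })
    return rows


rows = parse_rows(SAMPLE)
print(len(rows), rows[0])
```

Collecting the rows from every file in a single list and calling pd.DataFrame on it once at the end would also avoid the per-file merge and the final concat. lxml.etree offers a largely compatible API (fromstring, iterfind, get), so the same structure should carry over, but whether either variant is actually faster on the real files is something to confirm by profiling, for example with cProfile as in the question.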
