lxml iterparse fills up memory on a 4 GB XML file, even when using clear()


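The goal of this script is to count the number of articles/books published per year, taking this information from the year elements in the XML file dblp-2023-10-01.xml. The file can be found here: https://dblp.org/xml/release/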

from lxml import etree

xmlfile = 'dblp-2023-10-01.xml'

doc = etree.iterparse(xmlfile, tag='year', load_dtd=True)
_, root = next(doc)
counter_dict = {}
for event, element in doc:
    print(element, event)
    if element.text not in counter_dict:
        counter_dict[element.text] = 1
    else:
        counter_dict[element.text] += 1
    root.clear() 

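When I run the code on a small file, it runs smoothly. What puzzles me is that when I run it on the dblp file, memory usage exceeds 4 GB (the size of the file), which makes no sense to me.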

I also tried an alternative version of the cleanup step, to make sure it clears what has already been parsed:

    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]

It made no difference.

python xml lxml ram large-files
1 Answer

I don't know why lxml's iterparse needs all that memory, but I tried a simple SAX program:

import xml.sax
import xml.sax.handler

# Maps each publication year (as int) to the number of <year> elements seen.
counter_dict = {}


class YearHandler(xml.sax.ContentHandler):
    """Counts the text of every <year> element into counter_dict."""

    def __init__(self):
        super().__init__()
        self.year = ''
        self.isYear = False

    def startElement(self, tag, attributes):
        if tag == 'year':
            self.isYear = True
            self.year = ''

    def endElement(self, tag):
        if self.isYear and tag == 'year':
            self.isYear = False
            yearInt = int(self.year)
            if yearInt in counter_dict:
                counter_dict[yearInt] += 1
            else:
                counter_dict[yearInt] = 1

    def characters(self, content):
        # characters() may be called several times per element, so accumulate.
        if self.isYear:
            self.year += content


if __name__ == '__main__':
    parser = xml.sax.make_parser()
    # Turn off namespace processing; plain tag names are enough here.
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.setContentHandler(YearHandler())
    parser.parse('dblp-2023-10-01.xml')
    print(counter_dict)

On my Windows machine it uses less than 10 MB of memory and prints:

{2014: 292279, 2005: 158268, 2011: 250783, 2012: 263294, 2018: 374688, 2008: 203544, 1997: 57099, 2010: 228955, 2016: 314744, 2017: 339456, 2013: 280709, 2002: 97758, 2004: 135735, 2009: 222653, 2007: 189562, 2006: 176458, 1999: 71138, 2015: 302656, 2022: 470484, 2023: 305500, 2019: 417602, 2020: 433127, 1992: 34900, 2021: 456839, 1988: 21633, 1998: 64297, 1986: 16475, 1989: 24001, 1987: 17549, 2001: 86798, 1994: 45290, 1990: 28166, 2003: 116385, 1995: 47712, 2000: 80955, 1993: 40695, 1991: 31084, 1996: 52809, 1954: 225, 1971: 3120, 2024: 536, 1985: 13890, 1984: 12334, 1982: 9939, 1975: 5246, 1983: 10860, 1980: 7787, 1981: 8662, 1964: 1108, 1977: 5961, 1976: 5695, 1972: 3751, 1974: 5007, 1979: 6913, 1973: 4414, 1978: 6786, 1967: 1763, 1965: 1291, 1969: 2113, 1968: 2182, 1970: 2227, 1966: 1503, 1959: 715, 1961: 903, 1953: 173, 1960: 625, 1957: 343, 1955: 213, 1958: 464, 1956: 355, 1951: 46, 1962: 1186, 1952: 114, 1963: 1032, 1946: 31, 1947: 10, 1945: 9, 1939: 18, 1948: 41, 1942: 13, 1949: 52, 1941: 13, 1937: 16, 1940: 10, 1936: 12, 1950: 29, 1943: 8, 1944: 5, 1938: 11}
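
As a possible explanation for the lxml behaviour in the question: with tag='year' and the default 'end' events, the first next(doc) yields the first year element rather than the document root, so root.clear() only empties that one element and the parsed tree keeps growing. The following is a minimal sketch of one way to free parsed records while still using iterparse, not a verified fix for the dblp file. The function name count_years and the pass-through parser_options are hypothetical; it assumes every year element sits inside a record element (such as article) that is a direct child of the root, as in dblp.

```python
from collections import Counter

from lxml import etree


def count_years(source, **parser_options):
    """Count <year> texts while discarding records that are already parsed.

    `source` is a filename or a binary file-like object. `parser_options`
    (a hypothetical pass-through) is forwarded to etree.iterparse, e.g.
    load_dtd=True as in the question.
    """
    counts = Counter()
    context = etree.iterparse(source, events=("end",), tag="year",
                              **parser_options)
    for _, element in context:
        counts[element.text] += 1
        record = element.getparent()        # e.g. the enclosing <article>
        if record is None:
            continue
        container = record.getparent()      # e.g. the <dblp> root
        if container is None:
            continue
        # Delete every sibling that precedes the current record, so the
        # tree never holds more than the record being parsed.
        while record.getprevious() is not None:
            del container[0]
    return counts
```

Calling count_years('dblp-2023-10-01.xml', load_dtd=True) would mirror the parser options used in the question. Memory use should then stay roughly proportional to one record rather than to the whole file, though this has not been measured against the dblp release here. Unlike the SAX version, the keys are the year strings as they appear in the file, not integers.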