How can I use Python and multistream to parse a large Wikipedia dump in .xml.bz2 format, so that I don't have to open the whole file?

Problem description · Votes: 0 · Answers: 2

Here is a link to an article about Wikipedia dumps and how to use multistream so that I don't have to open the whole file to parse it. Here is the library it suggests using.

My problem is that I don't know how to use the index file or that library correctly to parse the file. When I try to decompress it, I just read a series of empty bytes, b''. What I want is to be able to parse the file a few thousand characters at a time so that I can feed them into my NLP application.

Thanks in advance.

python-3.x xml large-files file-read bz2
2 Answers
1 vote

I found some code from the wikidump link!

The code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream

If you read through the wikiarticles.py script patiently, you will find the snippet below:

def retrieve_text(self, title, offset):
    '''
    retrieve the page text for a given title from the xml file
    this does decompression of a bz2 stream so it's more expensive than
    other parts of this class
    arguments:
    title  -- the page title, with spaces and not underscores, case sensitive
    offset -- the offset in bytes to the bz2 stream in the xml file which contains
              the page text
    returns the page text or None if no such page was found
    '''
    self.xml_fd.seek(offset)
    unzipper = bz2.BZ2Decompressor()
    out = None
    found = False
    try:
        block = self.xml_fd.read(262144)
        out = unzipper.decompress(block)
    # hope we got enough back to have the page text
    except:
        raise
    # format of the contents (and there are multiple pages per stream):
    #   <page>
    #   <title>AccessibleComputing</title>
    #   <ns>0</ns>
    #   <id>10</id>
    # ...
    #   </page>

So for your problem, perhaps you should follow these steps:

  1. Read the multistream index file (wiki-multistream-index.txt.bz2) and retrieve the offset of the article you want
  2. Seek to that offset in the multistream dump file (wiki-multistream-pages.xml.bz2)
  3. Process the block of bytes you read in step 2
  4. Repeat steps 2 and 3 to process all the pages in the dump file

A note on step 2: first open the dump file with Python's open() function and keep the file object in a variable; you can then call fd.seek(offset) to jump to the offset, and fd.read(block_size_bytes) to read the page data. A minimal sketch of steps 1-3 follows.
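This sketch assumes English Wikipedia filenames (both are illustrative) and that the index is bz2-compressed text with one offset:page_id:title line per page:

import bz2

INDEX_FILE = "enwiki-latest-pages-articles-multistream-index.txt.bz2"  # assumed name
DUMP_FILE = "enwiki-latest-pages-articles-multistream.xml.bz2"         # assumed name

def find_offset(title):
    """Step 1: scan the index; each line looks like 600:12:Anarchism."""
    with bz2.open(INDEX_FILE, mode="rt", encoding="utf-8") as index:
        for line in index:
            # maxsplit=2 because titles can themselves contain colons
            offset, page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    return None

def read_stream_block(offset, block_size=256 * 1024):
    """Steps 2 and 3: seek to the bz2 stream and decompress one block of it."""
    unzipper = bz2.BZ2Decompressor()
    with open(DUMP_FILE, "rb") as fd:
        fd.seek(offset)              # jump to the start of a bz2 stream
        block = fd.read(block_size)  # read a chunk of compressed bytes
    return unzipper.decompress(block)  # raw <page>...</page> XML fragments

offset = find_offset("Anarchism")
if offset is not None:
    print(read_stream_block(offset)[:2000].decode("utf-8", errors="replace"))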

Read wikiarticles.py again and you will have your answer.


0 votes

The question was asked almost four years ago...

Here is an implementation based on the code mentioned in KevinLoveCherry's answer. The index file gives you an offset, a page ID, and a page title. This code reads the multistream dump file and extracts the wikitext of the article you want. It runs in about 0.1 seconds on my 2022 laptop.

Call the get_wikitext() function to get the article text. Pass in offset and page_id, or offset and title.

import xml.etree.ElementTree as ET
import bz2

def get_wikitext(dump_filename, offset, page_id=None, title=None, namespace_id=None, verbose=True, block_size=256*1024):
    """Extract wikitext from a multistream dump file.

    Requires the offset (in bytes) from the start of the dump file.
    This can be obtained from the index file.

    Pass in some of either the page_id, namespace_id, or title of the
    page you're looking for.

    """
    unzipper = bz2.BZ2Decompressor()

    # Read the compressed stream, decompress the data
    uncompressed_data = b""
    with open(dump_filename, "rb") as infile:
        infile.seek(int(offset))

        while True:
            compressed_data = infile.read(block_size)
            try:
                uncompressed_data += unzipper.decompress(compressed_data)
            except EOFError:
                # We've reached the end of the stream
                break
            # If there's no more data in the file
            if compressed_data == b"":
                # End if we've finished reading the stream
                if unzipper.eof:
                    break
                # Otherwise we've failed to correctly read all of the stream
                raise Exception("Failed to read a complete stream")

    # Extract out the page
    # Format of the contents (and there are multiple pages per stream):
    #   <page>
    #   <title>AccessibleComputing</title>
    #   <ns>0</ns>
    #   <id>10</id>
    # ...
    #   </page>

    uncompressed_text = uncompressed_data.decode("utf-8")
    xml_data = "<root>" + uncompressed_text + "</root>"
    root = ET.fromstring(xml_data)
    for page in root.findall("page"):
        if title is not None:
            if title != page.find("title").text:
                continue
        if namespace_id is not None:
            if namespace_id != int(page.find("ns").text):
                continue
        if page_id is not None:
            if page_id != int(page.find("id").text):
                continue
        # We've found what we're looking for
        revision = page.find("revision")
        wikitext = revision.find("text")
        return wikitext.text

    # We failed to find what we were looking for
    return None


def example():
    index_line = "600:12:Anarchism"
    # maxsplit=2 because page titles may themselves contain colons
    offset, page_id, title = index_line.split(":", 2)
    dump_file = "enwiki-dump/enwiki-20231101-pages-articles-multistream.xml.bz2"

    wikitext = get_wikitext(dump_file, int(offset), page_id=int(page_id))
    print(wikitext)
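
If you don't want to hard-code the index line, you can look it up in the index file itself. A sketch under the assumption that the index sits next to the dump and uses the offset:page_id:title line format (the index path here is illustrative):

def example_with_index_lookup(wanted_title="Anarchism"):
    index_file = "enwiki-dump/enwiki-20231101-pages-articles-multistream-index.txt.bz2"  # assumed path
    dump_file = "enwiki-dump/enwiki-20231101-pages-articles-multistream.xml.bz2"

    with bz2.open(index_file, mode="rt", encoding="utf-8") as index:
        for line in index:
            # maxsplit=2 again, since titles may contain colons
            offset, page_id, title = line.rstrip("\n").split(":", 2)
            if title == wanted_title:
                return get_wikitext(dump_file, int(offset), page_id=int(page_id))
    return None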