Parsing XML files and reading them into a pandas DataFrame


I have a folder that contains several other folders. Each of these folders contains TalkBank CHILDES XML files.

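I wrote code to extract the data into pandas DataFrames. The code currently asks the user to enter the directory that contains the XML files, but I would like it to work so that the user does not have to type in a directory every time they want to extract the XML files. Can I write it so that all the directories are specified in the code, and as soon as the .py file is run it extracts the XML from all of the folders?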

import nltk
import os
import pandas as pd
from lxml import etree
from nltk.corpus.reader import CHILDESCorpusReader

# Function to get user input for the directory

def get_input_directory():
    dir_childes_corpus = input("Enter the directory containing CHILDES XML files: ")
    return dir_childes_corpus

# Path containing CHILDES XML files

dir_childes_corpus = get_input_directory()

# Empty lists to store speaker and utterance data

speakers_data = [ ]
utterance_data = [ ]

# Define namespaces

namespaces = {
    'tb': 'http://www.talkbank.org/ns/talkbank',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}

for filename in os.listdir(dir_childes_corpus):
    if filename.endswith('.xml'):
        file_path = os.path.join(dir_childes_corpus, filename)

        # Parse XML data
        tree = etree.parse(file_path)
        
        # Extract dialogue_id from folder name and filename
        folder_name = os.path.basename(os.path.dirname(file_path))
        dialogue_id = f"{folder_name}/{os.path.splitext(filename)[0]}.xml"
    
    
        # Extract participant information
        speakers_info = []
        for participant in tree.xpath("//tb:Participants/tb:participant", namespaces=namespaces):
            speaker_info = {
                'dialogue_id': dialogue_id,
                'speaker_id': participant.get('id'),
                'speaker_name': participant.get('name'),
                'role': participant.get('role'),
                'age': participant.get('age'),
                'sex': participant.get('sex')
            }
            speakers_info.append(speaker_info)
        
        df_speakers = pd.DataFrame(speakers_info)
    
        # Extract utterance information including morphemes count
        utts_info = []
        for utt in tree.xpath("//tb:u", namespaces=namespaces):
            speaker = utt.get('who')
            uID = utt.get('uID')  # Extract uID
            utterance_text = ' '.join(utt.xpath(".//tb:w/text()", namespaces=namespaces))
            
            # Count morphemes
            utterance_length = len(utt.xpath(".//tb:w/tb:mor", namespaces=namespaces))
            
            utt_info = {
                'dialogue_id': dialogue_id,
                'uID': uID,
                'speaker': speaker,
                'utterance': utterance_text,
                'utterance_length': utterance_length
            }
            utts_info.append(utt_info)
        df_utts = pd.DataFrame(utts_info)
    
        # Append data to lists
        speakers_data.append(df_speakers)
        utterance_data.append(df_utts)

# Concatenate dataframes

speakers_data = pd.concat(speakers_data, ignore_index=True)
utterance_data = pd.concat(utterance_data, ignore_index=True)

python xml nlp

1 Answer

You can also use os.walk() to search a directory and its subdirectories for all .xml files:

import os

for root, dirs, files in os.walk(".", topdown=False):
    for name in files:
        if name.endswith('.xml'):
            # do something ...
            print(os.path.join(root, name))
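
To connect this to the question, here is a minimal sketch of how the folders could be listed once in the script and then walked recursively, so no input() prompt is needed. The names CORPUS_DIRS, find_xml_files and make_dialogue_id are hypothetical, and the two paths are placeholders rather than real locations; only the standard library is used.

```python
import os

# Hypothetical folder list: replace these placeholders with the real corpus directories.
CORPUS_DIRS = ["corpora/Brown", "corpora/Providence"]


def find_xml_files(root_dirs):
    """Yield the path of every .xml file found under each directory in root_dirs."""
    for root_dir in root_dirs:
        # os.walk descends into every subdirectory; a missing directory yields nothing.
        for current_dir, _subdirs, files in os.walk(root_dir):
            for name in sorted(files):
                if name.endswith(".xml"):
                    yield os.path.join(current_dir, name)


def make_dialogue_id(file_path):
    """Build the 'folder/file.xml' identifier the same way the question's code does."""
    folder_name = os.path.basename(os.path.dirname(file_path))
    stem = os.path.splitext(os.path.basename(file_path))[0]
    return f"{folder_name}/{stem}.xml"


if __name__ == "__main__":
    for path in find_xml_files(CORPUS_DIRS):
        print(make_dialogue_id(path), path)
```

With something like this in place, the loop in the question could iterate over find_xml_files(CORPUS_DIRS) instead of os.listdir(dir_childes_corpus), and the parsing code inside the loop could stay as it is. Note that pd.concat raises a ValueError when it is given an empty list, so it may be worth checking that at least one file was found before concatenating.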