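I have one folder alongside several other folders, all of which contain TalkBank CHILDES XML files.
I wrote code that extracts the data into pandas DataFrames. It currently works by asking the user to enter the directory of the folder containing the XML files, but I would like the user not to have to type a directory every time they want to extract the XML files. Can I write it so that all the directories are specified in the code and, as soon as the .py file is run, it extracts the XML from all the folders?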
import nltk
import os
import pandas as pd
from lxml import etree
from nltk.corpus.reader import CHILDESCorpusReader
# Function to get user input for the directory
def get_input_directory():
    dir_childes_corpus = input("Enter the directory containing CHILDES XML files: ")
    return dir_childes_corpus
# Path containing CHILDES XML files
dir_childes_corpus = get_input_directory()
# Empty lists to store speaker and utterance data
speakers_data = [ ]
utterance_data = [ ]
# Define namespaces
namespaces = {
    'tb': 'http://www.talkbank.org/ns/talkbank',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}
for filename in os.listdir(dir_childes_corpus):
    if filename.endswith('.xml'):
        file_path = os.path.join(dir_childes_corpus, filename)
        # Parse XML data
        tree = etree.parse(file_path)
        # Extract dialogue_id from folder name and filename
        folder_name = os.path.basename(os.path.dirname(file_path))
        dialogue_id = f"{folder_name}/{os.path.splitext(filename)[0]}.xml"
        # Extract participant information
        speakers_info = []
        for participant in tree.xpath("//tb:Participants/tb:participant", namespaces=namespaces):
            speaker_info = {
                'dialogue_id': dialogue_id,
                'speaker_id': participant.get('id'),
                'speaker_name': participant.get('name'),
                'role': participant.get('role'),
                'age': participant.get('age'),
                'sex': participant.get('sex')
            }
            speakers_info.append(speaker_info)
        df_speakers = pd.DataFrame(speakers_info)
        # Extract utterance information including morphemes count
        utts_info = []
        for utt in tree.xpath("//tb:u", namespaces=namespaces):
            speaker = utt.get('who')
            uID = utt.get('uID')  # Extract uID
            utterance_text = ' '.join(utt.xpath(".//tb:w/text()", namespaces=namespaces))
            # Count morphemes
            utterance_length = len(utt.xpath(".//tb:w/tb:mor", namespaces=namespaces))
            utt_info = {
                'dialogue_id': dialogue_id,
                'uID': uID,
                'speaker': speaker,
                'utterance': utterance_text,
                'utterance_length': utterance_length
            }
            utts_info.append(utt_info)
        df_utts = pd.DataFrame(utts_info)
        # Append data to lists
        speakers_data.append(df_speakers)
        utterance_data.append(df_utts)
# Concatenate dataframes
speakers_data = pd.concat(speakers_data, ignore_index=True)
utterance_data = pd.concat(utterance_data, ignore_index=True)
You can also use os.walk() to search for all .xml files in a directory and its subdirectories:
import os
for root, dirs, files in os.walk(".", topdown=False):
    for name in files:
        if name.endswith('.xml'):
            # do something ...
            print(os.path.join(root, name))
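To avoid the input() prompt entirely, one option is to list the folders in the script itself and walk each of them. The following is only a minimal sketch: the paths in CORPUS_DIRS are placeholders, and find_xml_files is a hypothetical helper name, neither of which comes from the question.

```python
import os

# Hypothetical folder paths; replace them with the real corpus directories.
CORPUS_DIRS = [
    "path/to/folder_a",
    "path/to/folder_b",
]


def find_xml_files(directories):
    """Return a sorted list of paths to every .xml file under the given directories."""
    xml_paths = []
    for directory in directories:
        # os.walk visits the directory itself and all of its subdirectories.
        # A directory that does not exist simply yields nothing.
        for root, _dirs, files in os.walk(directory):
            for name in files:
                if name.endswith(".xml"):
                    xml_paths.append(os.path.join(root, name))
    return sorted(xml_paths)


if __name__ == "__main__":
    for file_path in find_xml_files(CORPUS_DIRS):
        # The per-file parsing from the question would go here.
        print(file_path)
```

With this in place, the question's outer loop could probably become `for file_path in find_xml_files(CORPUS_DIRS):`, with `filename = os.path.basename(file_path)` and the rest of the loop body unchanged. Note that pd.concat raises a ValueError when given an empty list, so it may be worth checking that at least one file was found before concatenating.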