Python program to extract sections of a txt file based on a list of words


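I want a Python program that prints the sections of a text file. Each section is defined by a keyword from a list of words: it starts on the line where the keyword is found and ends on the line just before the next section starts. For example, consider the following text file: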

word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi
jkdkjd
word89
eyuiywiou299092    
word3
...
...
...
The required output of the program is:

Sections Found: [word1, word2, word3, word5, word89]

**********word1--SECTION**********
line 1: word1
line 2: abcdef
line 3: ghis jsd sjdhd jshj

**********word2--SECTION**********
line 4: word2
line 5: dgjgj dhkjhf
line 6: khkhkjd

**********word3--SECTION**********
line 14: word3
line 15: ....

''' Suppose word4 is not found in the txt file then it should continue and move to next word found''' 
**********word5--SECTION**********
line 9: word5
line 10: diow299 udhgbhdi
line 11: jkdkjd

...
...
...
...

'''Continue till the end of list of words '''

Approach:

list_of_words = ['word1','word2','word3','word4','word5','word6',....]

Find the starting line of each word in list_of_words and store the line numbers in a list.

Then find the end_line of each word by sorting that list, so that the closest larger line number is easy to find.

Then print each section found, with every line in the form line_number: line_in_text_file.

Code for getting the line numbers (how do I create a variable for each n in list_of_words?):

for n in list_of_words:
    with open(file_txt, 'r', encoding="utf8") as f:
        data_file = f.readlines()
    for num, lines in enumerate(data_file, 1):
        if n in lines:
            start_line = num
        else:
            continue

Code for finding the closest larger number to a given start line (val) in the list of start lines:

def closest(array_list, val):
    array_list1 = [j for j in array_list if j > val]
    array_list1.sort()
    return array_list1[0]
python parsing pyparsing data-extraction
1 Answer

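pyparsing has a generator method, scanString, that yields the matched tokens together with the start and end locations of each match. Pass the start location to pyparsing's lineno function to get the line number of the match.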

import pyparsing as pp

marker = pp.oneOf("word1 word2 word3 word4 word5 word23")

txt = """\
word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi word2
jkdkjd
word89
eyuiywiou299092    
word3
"""

previous = None
for t, s, e in (pp.LineStart() + marker | pp.StringEnd()).scanString(txt):
    current_line_number = pp.lineno(s, txt)
    if t:
        current = t[0]
        if previous is not None:
            print(previous, "ended on line", current_line_number - 1)
        print("found", current, "on line", current_line_number)
        previous = current
    else:
        if previous is not None:
            print(previous, "ended on line", current_line_number)

Prints:

found word1 on line 1
word1 ended on line 3
found word2 on line 4
word2 ended on line 6
found word23 on line 7
word23 ended on line 8
found word5 on line 9
word5 ended on line 13
found word3 on line 14
word3 ended on line 15

You should be able to take it from here.
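As a rough sketch of how start and end line numbers could be turned into the output format asked for in the question, the version below uses only the standard library re module instead of pyparsing. The function names find_sections and print_sections, the sample text and the keyword list are illustrative assumptions. It also assumes that a keyword opens a section only when it begins a line, and that only the first occurrence of each keyword counts.

```python
import re


def find_sections(lines, keywords):
    """Map each keyword found at the start of a line to (start, end), 1-based and inclusive."""
    # Assumption: a keyword only opens a section when it begins a line.
    # The (?!\w) lookahead stops "word2" from matching the line "word23".
    alternatives = "|".join(re.escape(k) for k in keywords)
    marker = re.compile(rf"(?:{alternatives})(?!\w)")

    starts = []  # (line number, keyword) for every marker line, in file order
    for num, line in enumerate(lines, 1):
        m = marker.match(line)  # match() only looks at the start of the line
        if m:
            starts.append((num, m.group()))

    sections = {}
    for i, (start, word) in enumerate(starts):
        # A section ends just before the next marker, or at the last line of the file
        end = starts[i + 1][0] - 1 if i + 1 < len(starts) else len(lines)
        sections.setdefault(word, (start, end))  # keep the first occurrence only
    return sections


def print_sections(lines, keywords):
    """Print the sections in keyword-list order, skipping keywords that were not found."""
    sections = find_sections(lines, keywords)
    found = [k for k in keywords if k in sections]
    print("Sections Found:", found)
    for word in found:
        start, end = sections[word]
        print(f"\n**********{word}--SECTION**********")
        for num in range(start, end + 1):
            print(f"line {num}: {lines[num - 1]}")


# Hypothetical sample data, shortened from the text in the question
sample = """\
word1
abcdef
word2
dgjgj dhkjhf word1
word23
dfjkg fjidkfh
word3
"""
print_sections(sample.splitlines(), ["word1", "word2", "word3", "word4", "word5", "word23"])
```

Because the sections are stored in a dictionary keyed by keyword, there is no need to create a separate variable per word, and keywords such as word4 that never appear are simply skipped. To read from a file instead, the lines could be obtained with something like open(file_txt, encoding="utf8").read().splitlines().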
