How to iterate over an unstructured txt file containing invoices and their item lists and create a pandas DataFrame

Question

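Guys, I have a txt file that contains a list of hundreds of invoices and their items, where the number of items ranges from 1 to 10 depending on the invoice. The TXT file looks roughly like this: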

-----Report--------

Invoice..: 2000

Product Barcode Price

ASPIRIN

001001

5.00

Total: 5.00

Invoice..: 2001

Product Barcode Price

LOSARTAN

005001

5.00

VITAMIN

002111

10.00

Total: 15.00

Most invoices have only one item, and for that case my code gets all the data with the following:

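import re

# `lines` holds the non-blank lines of the report, in file order
invoice, product, barcode, price = [], [], [], []

for idx, x in enumerate(lines):
    if 'Invoice..:' in x:
        invoice.append(re.search(r'(?<=Invoice\.\.:).*', x).group(0).strip())
    elif 'Product' in x:
        # Only the first item after the header row is captured here
        product.append(lines[idx+1])
        barcode.append(lines[idx+2])
        price.append(lines[idx+3])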

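When an invoice has two or more items, I don't know how to correctly capture the product, barcode and price of each one.

In the end, I want a DataFrame like this: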

Invoice	Product	Barcode	Price
2000	ASPIRIN	001001	5.00
2001	LOSARTAN	005001	5.00
2001	VITAMIN	002111	10.00

Answer 1

Try:

import pandas as pd

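# Read the txt file into a list of blank-line-separated entries
def read_txt(path):
    with open(path, "r") as text_file:
        text = text_file.read().split('\n\n')
    # Strip whitespace and drop empty entries (e.g. from a trailing newline)
    return [entry.strip() for entry in text if entry.strip()]

# Read txt file
lines = read_txt("file.txt")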
# Create a list of index where Invoice..: is present
inv_lst = []
for i, x in enumerate(lines):
    if 'Invoice..:' in x:
        inv_lst.append(i)

# Create a dictionary of Invoice, Product, Barcode, Price
data = {'Invoice': [], 'Product': [], 'Barcode': [], 'Price': []}
# Loop over the invoice positions and slice the list between consecutive invoices
for pos, start in enumerate(inv_lst):
    inv_num = lines[start].split(' ')[1]
    # The slice ends at the next invoice line, or at the end of the file
    end = inv_lst[pos + 1] if pos + 1 < len(inv_lst) else len(lines)
    invoice = lines[start:end]
    # Skip the invoice line and the header row, then fill the dictionary
    for item in invoice[2:]:
        if 'Total' not in item:
            if item.isupper():  # upper-case text is a product name
                data['Invoice'].append(inv_num)
                data['Product'].append(item)
            elif item.isnumeric():  # digits only is a barcode
                data['Barcode'].append(item)
            else:  # otherwise the item is a price
                data['Price'].append(item)

# Create a dataframe from the dictionary
df = pd.DataFrame(data)
print(df)

Output:

  Invoice   Product Barcode  Price
0    2000   ASPIRIN  001001   5.00
1    2001  LOSARTAN  005001   5.00
2    2001   VITAMIN  002111  10.00
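If product names are not guaranteed to be upper-case, a position-based variant may be more robust. The following is only a sketch under assumptions that are not confirmed by the question: the header row starts with "Product", every item occupies exactly three lines (product, barcode, price), and each invoice is closed by a line starting with "Total". The names `parse_invoices` and `sample` are hypothetical and exist only for this illustration.

```python
import pandas as pd


def parse_invoices(text):
    """Return one row per invoice item (hypothetical helper)."""
    # Drop blank lines and surrounding whitespace so positions are predictable.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    invoice = None
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("Invoice..:"):
            # Everything after the first colon is the invoice number.
            invoice = line.split(":", 1)[1].strip()
            i += 1
        elif line.startswith("Product"):
            # Assumed layout: items follow the header row as
            # (product, barcode, price) triples until a "Total" line.
            i += 1
            while i + 2 < len(lines) and not lines[i].startswith("Total"):
                product, barcode, price = lines[i:i + 3]
                rows.append((invoice, product, barcode, price))
                i += 3
        else:
            i += 1  # report banner, "Total" lines, anything unrecognised
    return pd.DataFrame(rows, columns=["Invoice", "Product", "Barcode", "Price"])


# Sample data copied from the question, one entry per line.
sample = "\n".join([
    "-----Report--------",
    "Invoice..: 2000",
    "Product Barcode Price",
    "ASPIRIN", "001001", "5.00",
    "Total: 5.00",
    "Invoice..: 2001",
    "Product Barcode Price",
    "LOSARTAN", "005001", "5.00",
    "VITAMIN", "002111", "10.00",
    "Total: 15.00",
])

df = parse_invoices(sample)
# Barcodes stay as strings so leading zeros survive; prices become floats.
df["Price"] = df["Price"].astype(float)
print(df)
```

Because blank lines are dropped before parsing, the same function should work whether or not the file separates entries with empty lines. A product whose name itself starts with "Total" or "Product" would break the assumptions above.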
