Parsing multiple text fields with regular expressions and compiling them into a Pandas DataFrame

Problem description (1 vote, 1 answer)

I am trying to parse a text file with Python and regex in order to build a specific pandas DataFrame. Below is a sample from the text file I am parsing, followed by the ideal pandas DataFrame I am looking for.

Sample Text

Washington, DC  November 27, 2019
USDA Truck Rate Report

WA_FV190 

FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019                                                                                   
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019                                                                                    

PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().                                                                     

In areas where rates are based on package rates, per-load rates were                                                                
derived by multiplying the package rate by the number of packages in                                                                
the most usual load in a 48-53 foot trailer.

CENTRAL AND WESTERN ARIZONA                                                                                                         
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE   SLIGHT SHORTAGE 
--                                                    

ATLANTA           5100   5500                                                                                                       
BALTIMORE         6300   6600                                                                                                       
BOSTON            7000   7300                                                                                                       
CHICAGO           4500   4900                                                                                                       
DALLAS            3400   3800                                                                                                       
MIAMI             6400   6700                                                                                                       
NEW YORK          6600   6900                                                                                                       
PHILADELPHIA      6400   6700 

                   2019           2018                                                                                              

              NOV 17-23      NOV 18-24                                                                                              

U.S.             25,701         22,956                                                                                              
IMPORTS          13,653         15,699                                                                                              
           ------------ --------------                                                                                              
sum              39,354         38,655 

The ideal output would look something like:

Region                        CommodityGroup          InboundCity  Low   High   
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   ATLANTA      5100  5500
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   BALTIMORE    6300  6600
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   BOSTON       7000  7300
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   CHICAGO      4500  4900
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   DALLAS       3400  3800
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   MIAMI        6400  6700
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   NEW YORK     6600  6900
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   PHILADELPHIA 6400  6700

Given my limited experience writing regular expressions, this is the closest I have come to isolating the text I need: regex tester for USDA data

I have been trying to adapt the solution from How to parse complex text files using Python?, but my lack of regex experience is holding me back. Any help you can provide would be greatly appreciated!

python regex pandas
1 Answer
0 votes

I came up with this regex (txt is the text from your question):

import re
import numpy as np
import pandas as pd

data = {'Region': [], 'CommodityGroup': [], 'InboundCity': [], 'Low': [], 'High': []}

# Each block in the report is: a REGION heading, a commodity group wrapped in "--",
# then the city/rate lines, terminated by a blank line.
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
    for val in values.strip().splitlines():
        # drop the padding (and anything else) that follows a rate by 8+ spaces
        val = re.sub(r'(\d)\s{8,}.*', r'\1', val)
        # city name, optional low rate, high rate
        inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
        data['Region'].append(region)
        data['CommodityGroup'].append(commodity_group)
        data['InboundCity'].append(inbound_city)
        data['Low'].append(np.nan if low == '' else int(low))
        data['High'].append(int(high))

df = pd.DataFrame(data)
print(df)

Prints:

                        Region                                     CommodityGroup   InboundCity   Low  High
0  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...       ATLANTA  5100  5500
1  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...     BALTIMORE  6300  6600
2  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...        BOSTON  7000  7300
3  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...       CHICAGO  4500  4900
4  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...        DALLAS  3400  3800
5  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...         MIAMI  6400  6700
6  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...      NEW YORK  6600  6900
7  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...  PHILADELPHIA  6400  6700
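A note on the inner re.sub step: as far as I can tell, r'(\d)\s{8,}.*' trims everything that follows a rate figure after a run of 8 or more spaces, so the heavy padding in the fixed-width report (and anything printed to the right of it in the full document) never reaches the city/rate regex. A quick check on one padded line from your sample:

import re

val = 'ATLANTA           5100   5500                                 '
print(repr(re.sub(r'(\d)\s{8,}.*', r'\1', val)))
# 'ATLANTA           5100   5500'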

Edit: This should now work even for the big document from regex101.
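If you want to run it against the full report rather than a pasted string, txt just needs to hold the whole file; for example (the filename here is only a placeholder for wherever you saved the report):

# hypothetical path to the saved USDA Truck Rate Report text file
with open('usda_truck_rate_report.txt', encoding='utf-8') as f:
    txt = f.read()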
