Python | Regex split into rows, not columns

Question    Votes: 2    Answers: 1

I have a dataframe containing 5 nested rows (all containing data like the following):

1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94

What I want to do is split this into new rows, not columns.

I have tried things like this:

df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True)
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).melt()
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).stack().to_frame()

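The regex splits at each new rank (e.g. 2The, 3Get, 4The). I just want the split to create new rows rather than columns. The regex needs some work, but I am happy to solve that part myself.

I can melt the dataframe to create rows, but the cleanup then becomes very time-consuming (I am happy to go down that route if there is no other way).

Stacking gets closer, but it splits into different rows (which is naturally down to my regex). This feels closest, but I have not found a regex that captures this [yet].

The ideal result looks like the following, although all I really need is Title and Gross: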

Rank      Title         Studio      Gross         Theatres       Date
1         IT            WB          $327,481,748  4,148          9/8/17
2         The Exorcist  WB          $232,906,145  NA             12/26/73

The following gets closer:

df["Box_Office"].str.split(r'(\$[0-9,/]*)', expand=True).stack().to_frame()


Can extract or split expand across rows rather than across columns?

python regex pandas text
1 Answer
0 votes

Here is what I would do:

(?P<title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<studio>WB|Par|Art|Uni)
[^$]*
(?P<gross>\$\d+(?:,\d{3})*)
(?P<theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})


Which in Python would be:
import re

import pandas as pd

junk = """
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94"""

rx = re.compile(r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})''', re.VERBOSE)

def replacer(d):
    d['Title'] = d['Title'].replace('\n', ' ')
    return d

records = (replacer(m.groupdict()) for m in rx.finditer(junk))
df = pd.DataFrame(records)

# reorder the columns if necessary
df = df[['Title', 'Studio', 'Gross', 'Theatres', 'Date']]
print(df)


This yields
                        Title Studio         Gross Theatres      Date
0                          It     WB  $327,481,748    4,148    9/8/17
1                The Exorcist     WB  $232,906,145    -n/a-  12/26/73
2                     Get Out    Uni  $176,040,665    3,143  12/24/17
3     The Blair Witch Project    Art  $140,539,099    2,538   7/16/99
4               The Conjuring     WB  $137,400,141    3,115   7/19/13
5         Paranormal Activity    Par  $107,918,810    2,712   9/25/09
6  Interview with the Vampire     WB  $105,264,608    2,604  11/11/94

See a demo for the expression on regex101.com.
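Since the question says only Title and Gross are really needed, the extracted strings can be turned into numbers afterwards. The following is only a sketch on a small hypothetical frame shaped like the output above; treating `-n/a-` as a missing value is an assumption, not something stated in the question.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the extracted result above.
df = pd.DataFrame({
    "Title": ["It", "The Exorcist"],
    "Gross": ["$327,481,748", "$232,906,145"],
    "Theatres": ["4,148", "-n/a-"],
})

# Strip the dollar sign and thousands separators, then cast to integer.
df["Gross"] = df["Gross"].str.replace(r"[$,]", "", regex=True).astype("int64")

# Assumption: "-n/a-" means "unknown", so it is coerced to NaN.
df["Theatres"] = pd.to_numeric(
    df["Theatres"].str.replace(",", "", regex=False), errors="coerce"
)

print(df[["Title", "Gross", "Theatres"]])
```

Because of the missing value, the Theatres column ends up as floats rather than integers.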


As for your original question: you could extract into columns and then transpose the dataframe (that is, swap its rows and columns). However, where does this data come from in the first place? Was it scraped from somewhere? You might want to rethink that step!
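If the data has to stay in pandas, one option that produces rows directly is `Series.str.extractall`, which returns one row per regex match. The sketch below assumes the column is called `Box_Office` as in the question and uses a hypothetical three-cell sample. Because titles run over cell boundaries (for example "The" / "Exorcist"), the cells are joined into one string first.

```python
import re

import pandas as pd

# Hypothetical stand-in for the asker's column: three cells of the sample data.
df = pd.DataFrame({"Box_Office": [
    "1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The",
    "ExorcistWB$232,906,145-n/a-12/26/733Get",
    "OutUni.$176,040,6653,143$33,377,0602,7812/24/17",
]})

# Same expression as in the answer, passed as a string plus flags.
pattern = r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})'''

# Titles span cell boundaries, so join the cells into one string first.
text = pd.Series(["\n".join(df["Box_Office"])])

# extractall yields one row per match, indexed by source row and match number.
rows = text.str.extractall(pattern, flags=re.VERBOSE).reset_index(drop=True)
rows["Title"] = rows["Title"].str.replace("\n", " ", regex=False)

print(rows[["Title", "Gross"]])
```

The named groups become the column names, so no melting or stacking is needed afterwards.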