Why does my regex return None when the pattern is actually present in the text?

Question (votes: 0, answers: 3)

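I am trying to extract user names from tweet text with a regular expression. Even though the pattern is clearly present in the text, every row comes back as None. There must be a mistake somewhere in my code, but I cannot find it. I am using JupyterLab.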

The sample text is a pd.Series named full_text:

0    RT @SeamusHughes: The Taliban Stamp of approva...
1    RT @WFaqiri: Taliban and Afghan groups find co...
2    RT @DavidCornDC: Imagine what Fox News would h...
3    RT @DavidCornDC: Imagine what Fox News would h...
4    RT @billroggio: Even if you are inclined to tr...
5    RT @billroggio: I am sure we will hear the arg...
6    RT @KFILE: This did happen and it went exactly...
Name: full_text, dtype: object

My function is defined as follows:

import re

def extract_user(text):
    m = re.search(r"RT\s@\w+:", text)
    return m

I then apply the function like this:

full_text.apply(extract_user)

But this is what I get back:

0        None
1        None
2        None
3        None
4        None
         ... 
21299    None
21300    None
21301    None
21302    None
21303    None
Name: full_text, Length: 21304, dtype: object
python regex pandas series
3 Answers
1 vote

How about using a lambda function for this?

>>> df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))
0    SeamusHughes
1         WFaqiri
2     DavidCornDC
3     DavidCornDC
4      billroggio
5      billroggio
6           KFILE

For completeness, here it is all put together:

import re
import pandas as pd
data = [['RT @SeamusHughes: The Taliban Stamp of approva...'],['RT @WFaqiri: Taliban and Afghan groups find co...'],['RT @DavidCornDC: Imagine what Fox News would h...'],['RT @DavidCornDC: Imagine what Fox News would h...'],['RT @billroggio: Even if you are inclined to tr...'],['RT @billroggio: I am sure we will hear the arg...'],['RT @KFILE: This did happen and it went exactly...']]
df=pd.DataFrame(data)
df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))

# 0    SeamusHughes
# 1         WFaqiri
# 2     DavidCornDC
# 3     DavidCornDC
# 4      billroggio
# 5      billroggio
# 6           KFILE
# Name: 0, dtype: object
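
Note that the lambda above raises AttributeError on any row where the pattern does not match, because re.search returns None there and None has no .group method. Below is a minimal sketch of a more defensive variant, assuming the real data may contain rows that are not retweets. The sample Series and the names tweets and RT_USER are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data; the last row deliberately has no "RT @user:" prefix.
tweets = pd.Series([
    "RT @SeamusHughes: The Taliban Stamp of approva...",
    "RT @KFILE: This did happen and it went exactly...",
    "Just a plain tweet with no retweet prefix",
])

# Compile once; the capture group isolates the user name without "RT @" and ":".
RT_USER = re.compile(r"RT\s@(\w+):")


def extract_user(text):
    """Return the retweeted user name, or None when the text does not match."""
    m = RT_USER.search(text)
    return m.group(1) if m else None


users = tweets.apply(extract_user)
print(users)
```

Rows without a match end up as missing values instead of aborting the whole apply call.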

1 vote

You can do this much more simply with the code below:

df.A.str.extract(r"(@\w+)") #A is the column name

Output

    0
0   @SeamusHughes
1   @WFaqiri
2   @DavidCornDC
3   @DavidCornDC
4   @billroggio
5   @billroggio
6   @KFILE

If you only need the name without the @ symbol, use df.A.str.extract(r"@(\w+)")

Output

    0
0   SeamusHughes
1   WFaqiri
2   DavidCornDC
3   DavidCornDC
4   billroggio
5   billroggio
6   KFILE
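
As a small variation on the same idea, here is a sketch that anchors the pattern to the retweet prefix and passes expand=False, so that a single capture group comes back as a Series rather than a one-column DataFrame. The sample Series below is made up for illustration and assumes the tweets live in a Series called full_text, as in the question.

```python
import pandas as pd

# Hypothetical sample data; the last row has no retweet prefix.
full_text = pd.Series(
    [
        "RT @SeamusHughes: The Taliban Stamp of approva...",
        "RT @WFaqiri: Taliban and Afghan groups find co...",
        "A tweet that mentions @someone but is not a retweet",
    ],
    name="full_text",
)

# expand=False with exactly one capture group returns a Series.
# Anchoring on "^RT\s@" avoids picking up mentions in the middle of a tweet.
users = full_text.str.extract(r"^RT\s@(\w+):", expand=False)
print(users)
```

Rows that do not match come back as missing values (NaN), so no explicit None check is needed.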

1 vote

This happens because your function (extract_user) returns:

0    <re.Match object; span=(5, 22), match='RT @Sea...
1    <re.Match object; span=(5, 17), match='RT @WFa...
2    <re.Match object; span=(5, 21), match='RT @Dav...
3    ...

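Now, I am no expert, so take this with a grain of salt, but my guess is that pandas has no dtype for the <re.Match> objects the function returns, so it treats them as None. If you want to dig deeper, have a look at this great answer on the dtypes pandas handles.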

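So, assuming you want to keep your approach as it is with a minimal change, here is an example that modifies the function to simply return the first item ([0]) of each <re.Match> object.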

def extract_user(text):
    m = re.search(r"RT\s@\w+:", text)
    return m[0] if m else None         # <-- here (None when there is no match)

stuff = df.iloc[:, 0].apply(extract_user)

print(stuff)

0    RT @SeamusHughes:
1         RT @WFaqiri:
2     RT @DavidCornDC:
3     RT @DavidCornDC:
4      RT @billroggio:
5      RT @billroggio:
6           RT @KFILE:
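
To see what re.search hands back on a single string, here is a small sketch that uses only the re module. The capture group around \w+ is an addition for illustration and is not part of the original pattern.

```python
import re

text = "RT @SeamusHughes: The Taliban Stamp of approva..."
pattern = r"RT\s@(\w+):"  # parentheses added so the bare name can be pulled out

m = re.search(pattern, text)

whole = m[0]        # same as m.group(0): the entire matched text
name = m.group(1)   # only what the first capture group matched
span = m.span()     # (start, end) offsets of the match within text

# When nothing matches, re.search returns None rather than a Match object.
no_match = re.search(pattern, "no retweet prefix here")

print(whole, name, span, no_match)
```

So m[0] gives the full "RT @user:" prefix, while a capture group is what you want if only the user name is needed.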

Hope that clears things up.
