Why does my regex return None when the pattern is actually present?

Question

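I am trying to extract user names from Twitter text with a regular expression. Although the pattern is clearly present in the text, the returned value is None, which should not be the case. There must be a mistake somewhere in my code, but I cannot find it. I am using JupyterLab.

The sample text is a pd.Series named full_text: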

0    RT @SeamusHughes: The Taliban Stamp of approva...
1    RT @WFaqiri: Taliban and Afghan groups find co...
2    RT @DavidCornDC: Imagine what Fox News would h...
3    RT @DavidCornDC: Imagine what Fox News would h...
4    RT @billroggio: Even if you are inclined to tr...
5    RT @billroggio: I am sure we will hear the arg...
6    RT @KFILE: This did happen and it went exactly...
Name: full_text, dtype: object

My function is defined as follows:

import re

def extract_user(text):
    # Search for the retweet prefix, e.g. "RT @KFILE:"
    m = re.search(r"RT\s@\w+:", text)
    return m

And I apply the function like this:

full_text.apply(extract_user)

But this is what I get back:

0        None
1        None
2        None
3        None
4        None
         ... 
21299    None
21300    None
21301    None
21302    None
21303    None
Name: full_text, Length: 21304, dtype: object

Answer 1

How about using a lambda function for this?

>>> df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))
0    SeamusHughes
1         WFaqiri
2     DavidCornDC
3     DavidCornDC
4      billroggio
5      billroggio
6           KFILE

For completeness, putting it all together:

import re
import pandas as pd
data = [['RT @SeamusHughes: The Taliban Stamp of approva...'],['RT @WFaqiri: Taliban and Afghan groups find co...'],['RT @DavidCornDC: Imagine what Fox News would h...'],['RT @DavidCornDC: Imagine what Fox News would h...'],['RT @billroggio: Even if you are inclined to tr...'],['RT @billroggio: I am sure we will hear the arg...'],['RT @KFILE: This did happen and it went exactly...']]
df=pd.DataFrame(data)
df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))

# 0    SeamusHughes
# 1         WFaqiri
# 2     DavidCornDC
# 3     DavidCornDC
# 4      billroggio
# 5      billroggio
# 6           KFILE
# Name: 0, dtype: object
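
One caveat with the lambda above: re.search returns None for a row that does not match, and calling .group(1) on None raises AttributeError. A minimal sketch of a guarded variant follows; the helper name first_group_or_none and the two-row sample are made up for illustration and are not part of the question's data.

```python
import re
import pandas as pd

# Hypothetical sample: one retweet, one row without the "RT @name:" prefix.
tweets = pd.Series([
    "RT @SeamusHughes: The Taliban Stamp of approva...",
    "No retweet prefix in this one",
])

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r"RT\s@([^:]+)")

def first_group_or_none(text):
    """Return the captured user name, or None when the pattern is absent."""
    m = pattern.search(text)
    return m.group(1) if m else None

names = tweets.apply(first_group_or_none)
print(names)
```

Rows that do not match come back as missing values instead of stopping the whole apply call with an exception.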

Answer 2

You can do this much more simply with the code below:

df.A.str.extract(r"(@\w+)") #A is the column name

Output

    0
0   @SeamusHughes
1   @WFaqiri
2   @DavidCornDC
3   @DavidCornDC
4   @billroggio
5   @billroggio
6   @KFILE

If you only need the name without the @ symbol, use df.A.str.extract(r"@(\w+)")

Output

    0
0   SeamusHughes
1   WFaqiri
2   DavidCornDC
3   DavidCornDC
4   billroggio
5   billroggio
6   KFILE
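
As a small variation, str.extract accepts expand=False, which returns a Series rather than a one-column DataFrame when the pattern has a single capture group. The sketch below assumes a column named A, as in the snippet above, and uses a made-up two-row frame; anchoring the pattern on the RT prefix is my own choice here, not something this answer requires.

```python
import pandas as pd

# Hypothetical frame with a column "A"; the second row has no retweet prefix.
df = pd.DataFrame({"A": [
    "RT @WFaqiri: Taliban and Afghan groups find co...",
    "Plain tweet, no mention",
]})

# One capture group + expand=False -> a Series of captured names.
# Rows without a match become missing values (NaN) rather than raising.
users = df["A"].str.extract(r"RT\s@(\w+):", expand=False)
print(users)
```

Because str.extract is vectorized and fills non-matching rows with missing values, no explicit None check is needed.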

Answer 3

This happens because your function (extract_user) returns:

0    <re.Match object; span=(5, 22), match='RT @Sea...
1    <re.Match object; span=(5, 17), match='RT @WFa...
2    <re.Match object; span=(5, 21), match='RT @Dav...
3    ...

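Now, I'm no expert, so take this with a grain of salt, but my guess is that pandas has no dtype to handle the <re.Match> objects returned by the function, so it treats them as None. If you want to dig deeper, check out this great answer on the dtypes pandas handles.

So, assuming you want to keep your approach intact with minimal changes, here is an example that modifies the function by simply returning the first item ([0]) of each <re.Match> object.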
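
One way to test this guess is to look at what apply actually returns on a tiny sample. The sketch below uses a hypothetical two-row Series (sample is not from the question). As far as I know, an object-dtype Series can hold re.Match objects as they are, so a column full of None may instead mean that the pattern did not match those rows.

```python
import re
import pandas as pd

# Hypothetical sample: the first row has the retweet prefix, the second does not.
sample = pd.Series([
    "RT @KFILE: This did happen and it went exactly...",
    "A tweet without the retweet prefix",
])

def extract_user(text):
    # Same function as in the question: returns a Match object or None.
    return re.search(r"RT\s@\w+:", text)

result = sample.apply(extract_user)

# Show the Python type stored in each row of the result.
print(result.map(type))

# True where a Match object was stored, False where re.search found nothing.
matched = result.notna()
print(matched.tolist())
```

Running the same check on a few rows of the real full_text would show whether the None values come from pandas or from re.search itself.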

def extract_user(text):
         m = re.search(r"RT\s@\w+:", text)
         return m[0]                        # <-- here

stuff = df.iloc[:, 0].apply(extract_user)

print(stuff)

0    RT @SeamusHughes:
1         RT @WFaqiri:
2     RT @DavidCornDC:
3     RT @DavidCornDC:
4      RT @billroggio:
5      RT @billroggio:
6           RT @KFILE:
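
Note that m[0] fails with a TypeError on any row where re.search finds nothing, and re.search itself raises on non-string values such as NaN. A minimal sketch of a more defensive version is shown below; the name extract_user_safe and the sample rows are assumptions for illustration only.

```python
import re
import pandas as pd

def extract_user_safe(text):
    """Return the matched 'RT @name:' prefix, or None if it cannot be found."""
    # Missing values (NaN, None) are not strings, so skip them up front.
    if not isinstance(text, str):
        return None
    m = re.search(r"RT\s@\w+:", text)
    # m is None when the pattern is absent; only index into a real match.
    return m[0] if m is not None else None

# Hypothetical rows: a retweet, a plain tweet, and a missing value.
rows = pd.Series(
    ["RT @billroggio: I am sure we will hear the arg...", "just a tweet", float("nan")],
    dtype=object,
)

stuff = rows.apply(extract_user_safe)
print(stuff)
```

This keeps the original approach but lets apply run to completion on messy data.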

Hope that clears things up.