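I'm trying to extract user names from Twitter text with a regular expression. The pattern should match, yet every returned value is None, which shouldn't be the case. There must be a mistake somewhere in my code, but I can't find it. I'm using JupyterLab.
The sample text is a pd.Series named full_text: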
0 RT @SeamusHughes: The Taliban Stamp of approva...
1 RT @WFaqiri: Taliban and Afghan groups find co...
2 RT @DavidCornDC: Imagine what Fox News would h...
3 RT @DavidCornDC: Imagine what Fox News would h...
4 RT @billroggio: Even if you are inclined to tr...
5 RT @billroggio: I am sure we will hear the arg...
6 RT @KFILE: This did happen and it went exactly...
Name: full_text, dtype: object
My function is defined as follows:
import re
def extract_user(text):
    m = re.search(r"RT\s@\w+:", text)
    return m
I then apply the function like this:
full_text.apply(extract_user)
But this is what I get back:
0 None
1 None
2 None
3 None
4 None
...
21299 None
21300 None
21301 None
21302 None
21303 None
Name: full_text, Length: 21304, dtype: object
How about using a lambda function here?
>>> df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))
0 SeamusHughes
1 WFaqiri
2 DavidCornDC
3 DavidCornDC
4 billroggio
5 billroggio
6 KFILE
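One caveat with the lambda above: calling .group(1) directly raises AttributeError on any row where the pattern does not match, because re.search returns None there. A minimal sketch of a guarded version follows; the helper name retweeted_user and the three sample tweets are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data; the last row is deliberately not a retweet.
tweets = pd.Series([
    "RT @SeamusHughes: The Taliban Stamp of approva...",
    "RT @KFILE: This did happen and it went exactly...",
    "Just a plain tweet with no retweet prefix",
])


def retweeted_user(text):
    """Return the retweeted handle without '@', or None when nothing matches."""
    m = re.search(r"RT\s@([^:]+)", text)
    # Only call .group() when a match object was actually returned.
    return m.group(1) if m else None


users = tweets.apply(retweeted_user)
print(users)
```

Rows without a match end up as missing values, so they can be dropped afterwards with users.dropna() if needed.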
For completeness, here it is all together:
import re
import pandas as pd
data = [['RT @SeamusHughes: The Taliban Stamp of approva...'],['RT @WFaqiri: Taliban and Afghan groups find co...'],['RT @DavidCornDC: Imagine what Fox News would h...'],['RT @DavidCornDC: Imagine what Fox News would h...'],['RT @billroggio: Even if you are inclined to tr...'],['RT @billroggio: I am sure we will hear the arg...'],['RT @KFILE: This did happen and it went exactly...']]
df=pd.DataFrame(data)
df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))
# 0 SeamusHughes
# 1 WFaqiri
# 2 DavidCornDC
# 3 DavidCornDC
# 4 billroggio
# 5 billroggio
# 6 KFILE
# Name: 0, dtype: object
You can do this much more simply with the code below:
df.A.str.extract(r"(@\w+)")  # A is the column name
Output:
0
0 @SeamusHughes
1 @WFaqiri
2 @DavidCornDC
3 @DavidCornDC
4 @billroggio
5 @billroggio
6 @KFILE
If you only want the name without the @ symbol, use df.A.str.extract(r"@(\w+)")
Output:
0
0 SeamusHughes
1 WFaqiri
2 DavidCornDC
3 DavidCornDC
4 billroggio
5 billroggio
6 KFILE
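Two variations on str.extract may be handy here. This is only a sketch: the column name A follows the snippet above, while the sample rows and the group name user are assumptions for illustration. Passing expand=False with a single capture group gives back a Series rather than a one-column DataFrame, and a named group sets the column name when a DataFrame is returned.

```python
import pandas as pd

# Hypothetical frame with a column named "A"; the last row has no mention.
df = pd.DataFrame({"A": [
    "RT @SeamusHughes: The Taliban Stamp of approva...",
    "RT @WFaqiri: Taliban and Afghan groups find co...",
    "No mention here",
]})

# expand=False with one capture group returns a Series instead of a DataFrame.
handles = df["A"].str.extract(r"@(\w+)", expand=False)

# With the default expand=True, a named group becomes the column name.
named = df["A"].str.extract(r"RT\s@(?P<user>\w+):")

print(handles)
print(named)
```

Rows where the pattern does not match come back as missing values instead of raising an error.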
This happens because your function (extract_user) returns:
0 <re.Match object; span=(5, 22), match='RT @Sea...
1 <re.Match object; span=(5, 17), match='RT @WFa...
2 <re.Match object; span=(5, 21), match='RT @Dav...
3 ...
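I'm no expert, so take this with a grain of salt, but my guess is that pandas has no dtype for the <re.Match> objects your function returns, so it handles them as None. If you want to dig deeper, see this great answer on how dtypes are handled.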
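One way to check what the Series actually holds, rather than relying on how the notebook displays it, is to pull out a single element and inspect it. The sketch below uses two of the sample tweets from the question; the variable names matches and first are only for illustration.

```python
import re

import pandas as pd

full_text = pd.Series([
    "RT @SeamusHughes: The Taliban Stamp of approva...",
    "RT @WFaqiri: Taliban and Afghan groups find co...",
])

# Same pattern as extract_user, returning the raw result of re.search.
matches = full_text.apply(lambda text: re.search(r"RT\s@\w+:", text))

first = matches.iloc[0]
print(matches.dtype)   # dtype of the resulting Series
print(type(first))     # type of one stored element
print(first.group(0))  # the matched text for the first row
```

If type(first) reports a match object, the regex itself is working and only the returned value needs to be converted to a string.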
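So, assuming you want to keep everything else as it is with minimal changes, here is an example that modifies the function to simply return the first item ([0]) of each <re.Match> object, which is the full matched string.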
def extract_user(text):
    m = re.search(r"RT\s@\w+:", text)
    return m[0]  # <-- here
stuff = df.iloc[:, 0].apply(extract_user)
print(stuff)
0 RT @SeamusHughes:
1 RT @WFaqiri:
2 RT @DavidCornDC:
3 RT @DavidCornDC:
4 RT @billroggio:
5 RT @billroggio:
6 RT @KFILE:
Hope that clears things up.