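I wrote a Python script using for loops to extract metadata from tweets, and it initially worked well. Now that I have replaced the for loops with list comprehensions, my code throws an error that I cannot decipher. Here is my code: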
def tweetFeatures(tweet):
    #Count the number of words in each tweet
    wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
    #Count the number of characters in each tweet
    chars = [len(tweet.loc[k]) for k in range(len(tweet))]
    #Extract the mentions in each tweet
    mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]
    #Counts the number of mentions in each tweet
    mention_count = [len(mentions[t]) for t in range(len(mentions))]
    #Extracts the hashtags in each tweet
    hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]
    #Counts the number of hashtags in each tweet
    hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]
    #Extracts the urls in each tweet
    url = [list(re.findall("(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]
    #Counts the number of urls in each tweet
    url_count = [len(url[c]) for c in range(len(url))]
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)
    return feats_df
This is the error I get after running this line of code:
tweetFeatures(tweet = text_df)
AttributeError Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)
<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
1 def tweetFeatures(tweet):
2 #Count the number of words in each tweet
----> 3 wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
4
5 #Count the number of characters in each tweet
<ipython-input-21-36def6dfde04> in <listcomp>(.0)
1 def tweetFeatures(tweet):
2 #Count the number of words in each tweet
----> 3 wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
4
5 #Count the number of characters in each tweet
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
5466
5467 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'split'
Here is the test data I created:
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
"Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
"@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
"I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
"Why can't the #Athletics be more exciting? #Tokyo2020",
"It is so much fun to see beautful colors at the #Olympics"]
I converted it to a pandas DataFrame with
text_df = pd.DataFrame(text)
and then printed it with print(text_df), which gives:
0
0 @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1 Gabrielle Thomas is my favourite sprinter @Too...
2 @Sports_head I wish the #Tokyo2020 @Olympics w...
3 I hate the attitude of officials at this olymp...
4 Why can't the #Athletics be more exciting? #To...
5 It is so much fun to see beautful colors at th...
The code was written in a Jupyter notebook.
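According to your error message
AttributeError: 'Series' object has no attribute 'split'
you are trying to call the string method split() on a pandas Series object. Because text_df is a DataFrame, tweet.loc[j] selects a whole row, and a row is returned as a Series rather than a string. The failing line is: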
wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
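To see where the Series comes from, the difference between indexing a DataFrame and indexing one of its columns can be checked directly. This is a minimal sketch; the two short strings in `tweets` are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up stand-in data, not the tweets from the question
tweets = ["first tweet here", "second one"]

df = pd.DataFrame(tweets)        # one column, labelled 0
row = df.loc[0]                  # selects a whole row, which is a Series
print(type(row).__name__)        # Series

column = df[0]                   # selects the single column, also a Series
cell = column.loc[0]             # selects one value, which is a str
print(type(cell).__name__)       # str
print(cell.split())              # split() works on the string
```

Only the second form yields a plain string, which is why split() is available there.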
Looking at the test data you provided, you can fix the error by naming the column and selecting it before calling split():
import pandas as pd
text_df = pd.DataFrame(text,columns=["tweet"])
text_df.tweet.loc[0].split()
This returns:
['@Harrison2Jennifer',
'Tokyo',
'2020',
'is',
'so',
'much',
'fun.',
'Loving',
'every',
'bit',
'of',
'it',
'just',
'as',
'@MeggyJane',
'&',
'@Tommy620',
'say',
'#Tokyo2020',
'https://www.corp.com']
Alternatively, there is a solution without pandas: pass the "raw" list of tweets and change the list comprehension to
wordcount = [len(t.split()) for t in tweet]
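Applied to all five features, the pandas-free approach might look like the sketch below. The function name `tweet_features_plain` and the choice to return a list of dicts are assumptions made for illustration; the mention and hashtag patterns are taken from the question, and the URL pattern is a simplified equivalent without the named group.

```python
import re

# Patterns based on the question; compiled once and reused for every tweet
MENTION_RE = re.compile(r"@([a-zA-Z0-9_]{1,50})")
HASHTAG_RE = re.compile(r"#([a-zA-Z0-9_]{1,50})")
URL_RE = re.compile(r"https?://\S+")

def tweet_features_plain(tweets):
    """Return one dict of counts per tweet; `tweets` is a plain list of str."""
    return [
        {
            "n_words": len(t.split()),
            "n_chars": len(t),
            "n_mentions": len(MENTION_RE.findall(t)),
            "n_hashtag": len(HASHTAG_RE.findall(t)),
            "n_url": len(URL_RE.findall(t)),
        }
        for t in tweets
    ]

sample = [
    "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
    "Why can't the #Athletics be more exciting? #Tokyo2020",
]
features = tweet_features_plain(sample)
print(features[0])
```

If a DataFrame is still wanted at the end, the list of dicts can be passed to pd.DataFrame.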
What you are doing is creating a pd.DataFrame, but it has only one column. In your case, this column is named 0.
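So you can fix your code in either of two ways. Pass the column instead of the whole DataFrame:
tweetFeatures(tweet = text_df[0])
or build a Series instead of a DataFrame:
text_df = pd.Series(text)
and call the function as you are doing now. In addition, in most cases you can speed up your function by using apply. Note that for small inputs (such as the example you provided) this will be slightly slower, but it becomes significantly faster with more tweets: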
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
"Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
"@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
"I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
"Why can't the #Athletics be more exciting? #Tokyo2020",
"It is so much fun to see beautful colors at the #Olympics"]*1000
import re
from functools import partial

import pandas as pd

def tweetFeatures_speedup(tweet):
    # `tweet` is expected to be a pandas Series of strings
    # Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))
    # Count the number of characters in each tweet
    chars = tweet.apply(len)
    # Count the number of mentions in each tweet
    mention_finder = partial(re.findall, r"@([a-zA-Z0-9_]{1,50})")
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))
    # Count the number of hashtags in each tweet
    hashtag_finder = partial(re.findall, r"#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))
    # Count the number of urls in each tweet
    url_finder = partial(re.findall, r"(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))
    # Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)
    return feats_df
Comparing the two versions with %%timeit gives:
193 ms ± 1.95 ms per loop
21.3 ms ± 85.1 µs
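A related option, not part of the timings above and not benchmarked here, is to use the vectorized string methods on the Series through its .str accessor instead of apply with Python lambdas. The sketch below assumes the input is a Series of strings; the function name `tweet_features_str` is made up for illustration, and the patterns drop the capture groups because only the counts are needed.

```python
import pandas as pd

def tweet_features_str(tweet: pd.Series) -> pd.DataFrame:
    """Count words, characters, mentions, hashtags and URLs per tweet."""
    return pd.DataFrame({
        "n_words": tweet.str.split().str.len(),       # length of each word list
        "n_chars": tweet.str.len(),                   # length of each string
        "n_mentions": tweet.str.count(r"@[a-zA-Z0-9_]{1,50}"),
        "n_hashtag": tweet.str.count(r"#[a-zA-Z0-9_]{1,50}"),
        "n_url": tweet.str.count(r"https?://\S+"),
    })

sample = pd.Series([
    "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
    "Why can't the #Athletics be more exciting? #Tokyo2020",
])
feats = tweet_features_str(sample)
print(feats)
```

Whether this is faster than the apply version depends on the data and the pandas version, so it is worth measuring with %%timeit on your own input before switching.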