如何让我的Python代码再次运行

问题描述 投票:0回答:2

我用

for loops
编写了一个Python脚本,旨在从推文中提取元数据,并且最初运行良好。现在,我已将
for loops
替换为
list comprehension
,我的代码抛出了一个错误,我无法真正破译。这是我的代码:

def tweetFeatures(tweet):
        #Count the number of words in each tweet
        wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
        
        #Count the number of characters in each tweet
        chars = [len(tweet.loc[k]) for k in range(len(tweet))]
        
        #Extract the mentions in each tweet
        mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]
        
        #Counts the number of mentions in each tweet 
        mention_count = [len(mentions[t]) for t in range(len(mentions))]
        
        #Extracts the hashtags in each tweet    
        hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]
        
        #Counts the number of hashtags in each tweet    
        hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]
        
        #Extracts the urls in each tweet
        url = [list(re.findall("(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]
        
        #Counts the number of urls in each tweet
        url_count = [len(url[c]) for c in range(len(url))]
        
        #Put everything into a dataframe
        feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
        feats_df = pd.DataFrame(feats)

        return feats_df

这是我运行这行代码后遇到的错误

tweetFeatures(tweet = text_df)

AttributeError                            Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)

<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

<ipython-input-21-36def6dfde04> in <listcomp>(.0)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'

这是我创建的测试数据:

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
           "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
           "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
           "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
           "Why can't the #Athletics be more exciting? #Tokyo2020",
           "It is so much fun to see beautful colors at the #Olympics"]

我使用

text_df = pd.DataFrame(text)
将其转换为 Pandas 数据框,然后使用
print(text_df)
进行打印,结果如下:

0
0   @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1   Gabrielle Thomas is my favourite sprinter @Too...
2   @Sports_head I wish the #Tokyo2020 @Olympics w...
3   I hate the attitude of officials at this olymp...
4   Why can't the #Athletics be more exciting? #To...
5   It is so much fun to see beautful colors at th...

代码是在 Jupyter 笔记本中编写的。

python pandas dataframe list-comprehension
2个回答
1
投票

根据您的错误消息

AttributeError: 'Series' object has no attribute 'split'
,您正在尝试在
split()
Series对象上调用String方法
pandas

wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]

通过查看您提供的测试数据,您可以执行以下操作来修复错误:

import pandas as pd

text_df = pd.DataFrame(text,columns=["tweet"])

text_df.tweet.loc[0].split()

会回来:

['@Harrison2Jennifer',
 'Tokyo',
 '2020',
 'is',
 'so',
 'much',
 'fun.',
 'Loving',
 'every',
 'bit',
 'of',
 'it',
 'just',
 'as',
 '@MeggyJane',
 '&',
 '@Tommy620',
 'say',
 '#Tokyo2020',
 'https://www.corp.com']

或者,有一个没有

pandas
的解决方案,通过传递“原始”推文列表并将列表理解更改为

wordcount = [len(t.split()) for t in tweet]

1
投票

您正在做的是创建一个

pd.DataFrame
,但您只有一个列。在您的情况下,此列称为
0

因此您可以通过以下任一方式修复代码:

  • tweetFeatures(tweet = text_df[0])
  • 创建一个系列而不是 DataFrame:
    text_df = pd.Series(text)
    并像您现在正在做的那样调用它。

此外,在大多数情况下,您可以通过使用 apply 来加速您的函数。请注意,对于小输入(例如您提供的示例),这会有点慢,但在使用更多推文时会显着加快速度:

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
       "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
       "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
       "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
       "Why can't the #Athletics be more exciting? #Tokyo2020",
       "It is so much fun to see beautful colors at the #Olympics"]*1000

from functools import partial
def tweetFeatures_speedup(tweet):
    #Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))
    
    #Count the number of characters in each tweet
    chars = tweet.apply(len)
    
    #Extract the mentions in each tweet
    mention_finder = partial(re.findall, "@([a-zA-Z0-9_]{1,50})")
    
    #Counts the number of mentions in each tweet
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))
    
    #Extracts the hashtags in each tweet    
    #Counts the number of hashtags in each tweet   
    hashtag_finder = partial(re.findall, "#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))
    
    #Extracts the urls in each tweet
    #Counts the number of urls in each tweet
    url_finder = partial(re.findall, "(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))
    
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)

    return feats_df

这导致

%%timeit
比较:

  • 您的版本:
    193 ms ± 1.95 ms per loop
  • 我的版本:
    21.3 ms ± 85.1 µs
© www.soinside.com 2019 - 2024. All rights reserved.