How can I get my Python code running again?

Question

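I wrote a Python script using for loops to extract metadata from tweets, and it worked fine at first. After I replaced the for loops with list comprehensions, the code throws an error that I can't decipher. Here is my code: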

def tweetFeatures(tweet):
        #Count the number of words in each tweet
        wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
        
        #Count the number of characters in each tweet
        chars = [len(tweet.loc[k]) for k in range(len(tweet))]
        
        #Extract the mentions in each tweet
        mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]
        
        #Counts the number of mentions in each tweet 
        mention_count = [len(mentions[t]) for t in range(len(mentions))]
        
        #Extracts the hashtags in each tweet    
        hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]
        
        #Counts the number of hashtags in each tweet    
        hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]
        
        #Extracts the urls in each tweet
        url = [list(re.findall("(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]
        
        #Counts the number of urls in each tweet
        url_count = [len(url[c]) for c in range(len(url))]
        
        #Put everything into a dataframe
        feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
        feats_df = pd.DataFrame(feats)

        return feats_df

This is the error I get after running this line:

tweetFeatures(tweet = text_df)

AttributeError                            Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)

<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

<ipython-input-21-36def6dfde04> in <listcomp>(.0)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'

Here is the test data I created:

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
           "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
           "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
           "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
           "Why can't the #Athletics be more exciting? #Tokyo2020",
           "It is so much fun to see beautful colors at the #Olympics"]

I converted it to a pandas DataFrame using

text_df = pd.DataFrame(text)

and printed it with

print(text_df)

which gives:

0
0   @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1   Gabrielle Thomas is my favourite sprinter @Too...
2   @Sports_head I wish the #Tokyo2020 @Olympics w...
3   I hate the attitude of officials at this olymp...
4   Why can't the #Athletics be more exciting? #To...
5   It is so much fun to see beautful colors at th...

The code was written in a Jupyter notebook.

Answer 1

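According to your error message

AttributeError: 'Series' object has no attribute 'split'

you are trying to call the string method split() on a pandas Series object. Because text_df is a DataFrame, tweet.loc[j] returns a whole row as a Series rather than a single string: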

wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
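To see where the Series comes from, here is a minimal sketch that compares what `.loc` returns on a one-column DataFrame with what it returns on the column itself. The names `sample` and `sample_df` and the shortened two-tweet list are made up for illustration; they are not the asker's data.

```python
import pandas as pd

# Shortened sample data, for illustration only (not the full test data)
sample = ["@Sports_head I wish the #Tokyo2020 @Olympics will never end",
          "Why can't the #Athletics be more exciting? #Tokyo2020"]

sample_df = pd.DataFrame(sample)   # one column, labelled 0

row = sample_df.loc[0]             # a whole row: a pandas Series
cell = sample_df[0].loc[0]         # one element of column 0: a plain str

print(type(row).__name__)          # Series
print(type(cell).__name__)         # str
print(len(cell.split()))           # 9, the word count of the first tweet
```

Only `cell` is a string, so only `cell.split()` works; calling `row.split()` raises the same AttributeError as in the traceback.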

Looking at the test data you provided, you can do the following to fix the error:

import pandas as pd

text_df = pd.DataFrame(text,columns=["tweet"])

text_df.tweet.loc[0].split()

This returns:

['@Harrison2Jennifer',
 'Tokyo',
 '2020',
 'is',
 'so',
 'much',
 'fun.',
 'Loving',
 'every',
 'bit',
 'of',
 'it',
 'just',
 'as',
 '@MeggyJane',
 '&',
 '@Tommy620',
 'say',
 '#Tokyo2020',
 'https://www.corp.com']

Alternatively, there is a solution without pandas: pass the "raw" list of tweets and change the list comprehension to

wordcount = [len(t.split()) for t in tweet]
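Extending that idea to the other features, a pandas-free version might look like the sketch below. The function name `tweet_features_from_list` and the two-tweet `sample` list are hypothetical; the regular expressions are taken from the question, written as raw strings and without the named group for URLs.

```python
import re

def tweet_features_from_list(tweets):
    """Hypothetical helper: per-tweet counts from a plain list of strings."""
    mention_re = re.compile(r"@([a-zA-Z0-9_]{1,50})")
    hashtag_re = re.compile(r"#([a-zA-Z0-9_]{1,50})")
    url_re = re.compile(r"https?://[^\s]+")
    # Iterate over the strings directly instead of indexing with range(len(...))
    return {
        "n_words": [len(t.split()) for t in tweets],
        "n_chars": [len(t) for t in tweets],
        "n_mentions": [len(mention_re.findall(t)) for t in tweets],
        "n_hashtag": [len(hashtag_re.findall(t)) for t in tweets],
        "n_url": [len(url_re.findall(t)) for t in tweets],
    }

# Two tweets from the test data, used here as a small sample
sample = ["Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
          "Why can't the #Athletics be more exciting? #Tokyo2020"]

features = tweet_features_from_list(sample)
print(features["n_words"])     # [9, 8]
print(features["n_mentions"])  # [1, 0]
print(features["n_hashtag"])   # [0, 2]
print(features["n_url"])       # [2, 0]
```

The resulting dictionary can still be passed to `pd.DataFrame(...)` afterwards if a DataFrame is needed.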

Answer 2

You are creating a pd.DataFrame, but it has only one column. In your case, this column is named 0.

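So you can fix your code in either of the following ways.

Pass only that column to the function:
```
tweetFeatures(tweet = text_df[0])
```
Or create a Series instead of a DataFrame:
```
text_df = pd.Series(text)
```
and call the function just as you do now.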

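Also, in most cases you can speed up your function by using apply. Note that for small inputs (such as the example you provided) this will be a bit slower, but it becomes significantly faster with more tweets: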

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
       "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
       "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
       "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
       "Why can't the #Athletics be more exciting? #Tokyo2020",
       "It is so much fun to see beautful colors at the #Olympics"]*1000

import re
from functools import partial

import pandas as pd

def tweetFeatures_speedup(tweet):
    #Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))
    
    #Count the number of characters in each tweet
    chars = tweet.apply(len)
    
    #Extract the mentions in each tweet
    mention_finder = partial(re.findall, "@([a-zA-Z0-9_]{1,50})")
    
    #Counts the number of mentions in each tweet
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))
    
    #Extracts the hashtags in each tweet    
    #Counts the number of hashtags in each tweet   
    hashtag_finder = partial(re.findall, "#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))
    
    #Extracts the urls in each tweet
    #Counts the number of urls in each tweet
    url_finder = partial(re.findall, r"(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))
    
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)

    return feats_df

This leads to the following %%timeit comparison:

Your version:
```
193 ms ± 1.95 ms per loop
```
My version:
```
21.3 ms ± 85.1 µs
```
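As a further variation that is not part of either answer above, the same counts can probably be written with the pandas `Series.str` accessor instead of `apply`. This is only a sketch: the function name `tweet_features_str` is hypothetical, the capture groups are dropped from the patterns because only the number of matches is needed, and it has not been timed, so whether it beats `apply` on your data is an open question.

```python
import pandas as pd

def tweet_features_str(tweet):
    """Hypothetical variant: per-tweet counts via the Series.str accessor."""
    return pd.DataFrame({
        "n_words": tweet.str.split().str.len(),      # whitespace-separated tokens
        "n_chars": tweet.str.len(),                  # characters per tweet
        "n_mentions": tweet.str.count(r"@[a-zA-Z0-9_]{1,50}"),
        "n_hashtag": tweet.str.count(r"#[a-zA-Z0-9_]{1,50}"),
        "n_url": tweet.str.count(r"https?://[^\s]+"),
    })

# Two tweets from the test data, passed as a Series (not a DataFrame)
sample = pd.Series([
    "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
    "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
])

feats = tweet_features_str(sample)
print(feats)
```

As with the other versions, the input has to be a Series (for example `text_df[0]`), because the `.str` accessor is not available on a DataFrame.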
