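I have a function that splits a string into words and looks each word up in a dataframe. When a word is found, it uses a for loop to search for the matching row. I don't want to do this, because it makes large datasets far too slow. I would like to read row['value'] directly instead of iterating over the whole df for every matched word.
I'm new to Python. I have searched a lot but couldn't find what I want. I found index.tolist(), but I don't want a list; I only need the index of the first matching value.
Any help or workaround would be appreciated.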
def cal_nega_mean(my_string):
    mean_tot = 0
    mean_sum = 0.00
    for word in my_string.split():
        if word in df.values:  # if the word is found here, I want its index so I don't need the for loop on the next line
            for index, row in df.iterrows():  # want to change
                if word == row.word:          # this part
                    if row['value'] < -0.40:
                        mean_tot += 1
                        mean_sum += row['value']
                    break  # stop scanning once the word's row is found
    if mean_tot == 0:
        return 0
    mean = mean_sum / mean_tot
    return round(mean, 2)
Sample string input (there are more than 300,000 strings):
my_string = "i have a problem with my python code"
cal_nega_mean(my_string)
# and I am using this to get the result for all records
df_tweets['intensity'] = df_tweets['tweets'].apply(cal_nega_mean)
The dataframe to search:
df
index   word      value   ...
1       python    -0.56
2       problem   -0.78
3       alpha     -0.91
.       .         .
9000    last      -0.41
You can try i = df[df.word == word].index[0] to get the index of the first row that satisfies the condition df.word == word. Once you have the index, you can use df.loc to pull out that row.
def cal_nega_mean(my_string):
    mean_tot = 0
    mean_sum = 0.00
    for word in my_string.split():
        try:
            # index of the first row whose word matches
            i = df[df.word == word].index[0]
        except IndexError:
            # the word is not in df
            continue
        row = df.loc[i]
        if row['value'] < -0.40:
            mean_tot += 1
            mean_sum += row['value']
    if mean_tot == 0:
        return 0
    mean = mean_sum / mean_tot
    return round(mean, 2)
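Note that the boolean mask df.word == word still compares against the whole word column on every lookup. One possible way around that, sketched here under the assumption that each word appears only once in df, is to build a Series indexed by word a single time and then call its get method, which returns None when the word is missing. The name value_by_word, the threshold parameter, and the small sample frame are made up for illustration.

```python
import pandas as pd

# Small stand-in for the lookup frame from the question (hypothetical sample).
df = pd.DataFrame({
    "word": ["python", "problem", "alpha", "last"],
    "value": [-0.56, -0.78, -0.91, -0.41],
})

# Build the word -> value Series once; assumes words are unique in df.
value_by_word = df.set_index("word")["value"]

def cal_nega_mean(my_string, threshold=-0.40):
    """Mean of the values below `threshold` for the words found in the lookup."""
    total = 0.0
    count = 0
    for word in my_string.split():
        value = value_by_word.get(word)  # None when the word is not indexed
        if value is not None and value < threshold:
            total += value
            count += 1
    if count == 0:
        return 0
    return round(total / count, 2)

result = cal_nega_mean("i have a problem with my python code")
```

If df can contain the same word more than once, get would return a Series rather than a single number, so duplicates would need to be dropped first.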
Here is an approach that uses a dictionary: you can store the word: value pairs as keys and values and use that as the lookup:
# build the word -> value lookup once, outside the function
word_look_up = dict(zip(df['word'], df['value']))

def cal_nega_mean(my_string):
    mean_tot = 0
    mean_sum = 0.00
    # keep only the words that exist in the lookup
    words = [word for word in my_string.split() if word in word_look_up]
    for word in words:
        value = word_look_up[word]
        if value < -0.40:
            mean_tot += 1
            mean_sum += value
    if mean_tot == 0:  # no word found, or none below the threshold
        return 0
    return round(mean_sum / mean_tot, 2)

df_tweets['intensity'] = df_tweets['tweets'].apply(cal_nega_mean)
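For a quick check of the dictionary approach without the real data, here is a minimal sketch that runs on the question's sample string. It uses a hand-written dict as a stand-in for dict(zip(df['word'], df['value'])); the threshold parameter and the tweets list are assumptions added for the example.

```python
# Hypothetical stand-in for dict(zip(df['word'], df['value'])).
word_look_up = {"python": -0.56, "problem": -0.78, "alpha": -0.91, "last": -0.41}

def cal_nega_mean(my_string, threshold=-0.40):
    # Collect the values of known words that fall below the threshold.
    values = [
        word_look_up[word]
        for word in my_string.split()
        if word in word_look_up and word_look_up[word] < threshold
    ]
    if not values:
        return 0
    return round(sum(values) / len(values), 2)

tweets = [
    "i have a problem with my python code",
    "nothing to see here",
]
intensities = [cal_nega_mean(tweet) for tweet in tweets]
```

Each lookup is a constant-time dictionary access, so the cost per tweet depends only on the number of words in it, not on the 9000 rows of the lookup table.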