How to get the frequency of specific words for each row in a DataFrame

Question

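I am trying to write a function that gets the frequency of specific words from a DataFrame. I use Pandas to load a CSV file into a DataFrame and NLTK to tokenize the text. I can get the counts for the entire column, but I am struggling to get the frequency for each row. Here is what I have done so far.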

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from collections import defaultdict

words = [
    "robot",
    "automation",
    "collaborative",
    "Artificial Intelligence",
    "technology",
    "Computing",
    "autonomous",
    "automobile",
    "cobots",
    "AI",
    "Integration",
    "robotics",
    "machine learning",
    "machine",
    "vision systems",
    "systems",
    "computerized",
    "programmed",
    "neural network",
    "tech",
]

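# Module-level tally so it can be printed after the function runs.
count = defaultdict(int)


def analyze(file):
    # Count every target word across the whole "Text" column.
    df = pd.read_csv(file)
    for text in df["Text"]:
        tokenize_text = word_tokenize(text)
        for w in tokenize_text:
            if w in words:
                count[w] += 1


analyze("Articles/AppleFilter.csv")
print(count)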

Output:

defaultdict(<class 'int'>, {'automation': 283, 'robot': 372, 'robotics': 194, 'machine': 220, 'tech': 41, 'systems': 187, 'technology': 246, 'autonomous': 60, 'collaborative': 18, 'automobile': 6, 'AI': 158, 'programmed': 12, 'cobots': 2, 'computerized': 3, 'Computing': 1})

Goal: get the frequency for each row

{'automation': 5, 'robot': 1, 'robotics': 1, ...
{'automobile': 1, 'systems': 1, 'technology': 1,...
{'AI': 1, 'cobots': 1, 'computerized': 3,....

CSV file format:

Title | Text | URL

What I tried:

count = defaultdict(int)
df = pd.read_csv("AppleFilterTest01.csv")
for text in df["Text"].iteritems():
    for row in text:
        print(row)
        if row in words:
            count[w] += 1
print(count)

Output:

defaultdict(<class 'int'>, {})

If anyone can provide any guidance, tips, or help, I would greatly appreciate it. Thank you.

Answer 1

Here is a simple solution using collections.Counter:

Sample data to copy/paste:

0,review_body
1,this is the first 8 issues of the series. this is the first 8 issues of the series.
2,I've always been partial to immutable laws. I've always been partial to immutable laws.
3,This is a book about first contact with aliens. This is a book about first contact with aliens.
4,This is quite possibly *the* funniest book. This is quite possibly *the* funniest book.
5,The story behind the book is almost better than your mom. The story behind the book is almost better than your mom.

Import the essentials:

import pandas as pd
from collections import Counter

df = pd.read_clipboard(header=0, index_col=0, sep=',')

Use .str.split(), then apply() Counter:

df1 = df.review_body.str.split().apply(lambda x: Counter(x))

print(df1)

0
1    {'this': 2, 'is': 2, 'the': 4, 'first': 2, '8'...
2    {'I've': 2, 'always': 2, 'been': 2, 'partial':...
3    {'This': 2, 'is': 2, 'a': 2, 'book': 2, 'about...
4    {'This': 2, 'is': 2, 'quite': 2, 'possibly': 2...
5    {'The': 2, 'story': 2, 'behind': 2, 'the': 2, ...
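
The per-row Counter above counts every token. If only a fixed word list matters, as in the question, one possible variation is to filter the tokens before counting. The sketch below is an assumption-laden illustration rather than part of the original answer: the `target_words` set, the helper `count_targets`, and the sample rows are all made up, and the column is named "Text" to match the question's CSV. Matching here is case-sensitive and splits on whitespace only, so multi-word entries such as "machine learning" or tokens with attached punctuation would not be counted.

```python
from collections import Counter

import pandas as pd

# Hypothetical target words, a small subset of the list in the question.
target_words = {"robot", "automation", "technology", "AI"}

# Made-up rows standing in for the "Text" column of the CSV.
df = pd.DataFrame(
    {
        "Text": [
            "automation and robot automation",
            "new technology for AI and more AI",
            "nothing relevant here",
        ]
    }
)


def count_targets(text):
    """Count only the whitespace-separated tokens found in target_words."""
    return dict(Counter(tok for tok in text.split() if tok in target_words))


# One dict of counts per row; rows with no matches get an empty dict.
row_counts = df["Text"].apply(count_targets)
print(row_counts.tolist())
```

Using a set for `target_words` keeps each membership check cheap, which may matter on long articles.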

Use dict(Counter(x)) inside apply(), add .to_dict() at the end, and so on, to get the output format you need.
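
As a rough sketch of that last step, assuming a tiny made-up frame in place of the clipboard sample (the two rows and their index values are invented for illustration), the conversion to plain dicts might look like this:

```python
from collections import Counter

import pandas as pd

# Made-up stand-in for the review_body column in the sample above.
df = pd.DataFrame(
    {"review_body": ["this is the first", "a book is a book"]},
    index=[1, 2],
)

# dict(Counter(x)) inside apply() yields plain dicts instead of Counter objects.
per_row = df["review_body"].str.split().apply(lambda x: dict(Counter(x)))

# .to_dict() at the end turns the Series into {row index: {word: count}}.
result = per_row.to_dict()
print(result)
```

The outer keys are the DataFrame's index values, so each row's counts can be looked up by its row label.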

Hope that helps.
