Checking the frequency of words from a DF within a specific date range

Question

I have a DF with "text" and "date" columns (the dates span one week). I also have a Counter of words, S.

I want to do the following:

I need an output array in the following format:

words in S     date1   date2   date3   ...   date7
word1          n11     n12     n13     ...   n17
word2          n21     n22     n23     ...   n27

where n_ij is the number of times word_i occurs on date_j.

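From this I need to determine whether the count S[word_i] is equally distributed across all dates. If so, I want to extract the subset of words whose counts are equally distributed across all dates.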

Answer 1

You can do this with pandas. Try the following steps:

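1. Import the necessary libraries and create a sample dataframe.
2. Create an output dataframe with the unique words in S as the index and the date range as the columns.
3. Iterate over the original dataframe and update the counts in the output dataframe.
4. Determine whether the counts are equally distributed across all dates and extract the subset of words.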

The code is as follows:

import pandas as pd
from collections import Counter

# Sample dataframe
data = {"text": ["word1 word2", "word1 word3", "word2 word3"],
        "date": ["2023-03-17", "2023-03-18", "2023-03-19"]}
df = pd.DataFrame(data)

# Counter of words S
S = Counter(["word1", "word2", "word3"])

# Create an output dataframe with words in S as index and date range as columns
date_range = pd.date_range(start="2023-03-17", end="2023-03-23")
# Initialise with integer zeros directly instead of filling an all-NaN object frame
output_df = pd.DataFrame(0, index=list(S.keys()), columns=date_range)

# Update the counts in the output dataframe
for index, row in df.iterrows():
    words = row['text'].split()
    date = pd.to_datetime(row['date'])
    for word in words:
        if word in output_df.index:
            output_df.at[word, date] += 1

# Determine if the counts are equally distributed and extract a subset of words
subset = []
for word in output_df.index:
    counts = output_df.loc[word].values
    # Per-date count this word would have if its total S[word] were spread evenly
    expected = S[word] / len(date_range)
    if all(count == expected for count in counts):
        subset.append(word)

print("Subset of words with counts distributed equally among all dates:")
print(subset)
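
The same check can be written without an explicit loop. The following is only a minimal sketch, assuming "equally distributed" means that a word has the same count on every date; the helper name `evenly_distributed_words` and the small `counts` table are made up for illustration and are not part of the answer above.

```python
import pandas as pd

def evenly_distributed_words(counts: pd.DataFrame) -> list:
    """Return the row labels whose counts are identical in every column."""
    # nunique(axis=1) counts distinct values per row; 1 means all dates agree
    same_everywhere = counts.nunique(axis=1) == 1
    return counts.index[same_everywhere].tolist()

# Hypothetical count table: rows are words, columns are dates
counts = pd.DataFrame(
    {"2023-03-17": [2, 1, 0], "2023-03-18": [2, 0, 0], "2023-03-19": [2, 3, 0]},
    index=["word1", "word2", "word3"],
)
result = evenly_distributed_words(counts)
print(result)  # ['word1', 'word3']
```

Note that a word that never occurs (all zeros) also passes this test; filter out rows whose total is zero first if that is not wanted.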

Answer 2

IIUC:

import pandas as pd

# Sample dataframe
data = {"text": ["words1 words2", "words1 words3", "words2 words3"],
        "date": ["2023-03-17", "2023-03-18", "2023-03-19"]}
df = pd.DataFrame(data)
# List of the only words you want counts for
l = ["words1", "words2", "words3"]

Try str.split() + explode():

df=df.assign(text=df["text"].str.split()).explode("text")
df=df[df["text"].isin(l)]  #filter out only those words
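
To make this step concrete, here is a small sketch of what split-then-explode does to a frame. The two-row `demo` frame and its words are invented for illustration only.

```python
import pandas as pd

demo = pd.DataFrame({"text": ["a b", "c"],
                     "date": ["2023-03-17", "2023-03-18"]})

# str.split() turns each text cell into a list of words
tokens = demo.assign(text=demo["text"].str.split())

# explode() emits one row per list element, repeating the date and the index label
long_demo = tokens.explode("text")
print(long_demo)
```

The result has one row per (word, date) pair, which is the shape a cross-tabulation needs.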

After that, use pd.crosstab() followed by fillna():

# fillna(0) replaces missing word/date pairs; astype(int) restores integer counts
df = pd.crosstab(df["text"], df["date"], df["text"], aggfunc="count").fillna(0).astype(int)

Output:

date    2023-03-17  2023-03-18  2023-03-19
text            
words1      1           1           0
words2      1           0           1
words3      0           1           1
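
The table above only has columns for dates that actually appear in the data, whereas the question asks for all seven days. One possible way to pad it is `reindex` with `fill_value=0`. This is a sketch under the assumption that the week runs from 2023-03-17 to 2023-03-23; the names `long_df`, `table`, `words` and `week` are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["words1 words2", "words1 words3", "words2 words3"],
    "date": ["2023-03-17", "2023-03-18", "2023-03-19"],
})
words = ["words1", "words2", "words3"]

# One row per word, with real datetimes so the columns line up with date_range
long_df = (
    df.assign(text=df["text"].str.split(), date=pd.to_datetime(df["date"]))
      .explode("text")
      .reset_index(drop=True)
)
long_df = long_df[long_df["text"].isin(words)]

# Without values/aggfunc, crosstab counts occurrences and returns integers
table = pd.crosstab(long_df["text"], long_df["date"])

# Assumed one-week window; dates (or words) with no matches become zeros
week = pd.date_range(start="2023-03-17", end="2023-03-23")
table = table.reindex(index=words, columns=week, fill_value=0)
print(table)
```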

Note: if you want to clear the index name and the column name from the header, use

df=df.rename_axis(index=None,columns=None)
