I have a DataFrame with "text" and "date" columns (the dates span one week). I also have a Counter of words, S.
I need an output array in the following format:
words in S    date1  date2  date3  ...  date7
word1         n11    n12    n13    ...  n17
word2         n21    n22    n23    ...  n27
where n_ij is the number of times word_i occurs on date_j.
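From this I need to determine whether the count S[word_i] is evenly distributed across all dates. If it is, I want to extract the subset of words whose counts are evenly distributed across all dates.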
You can use Pandas to do this. Try this:
import pandas as pd
from collections import Counter

# Sample dataframe
data = {"text": ["word1 word2", "word1 word3", "word2 word3"],
        "date": ["2023-03-17", "2023-03-18", "2023-03-19"]}
df = pd.DataFrame(data)

# Counter of words S
S = Counter(["word1", "word2", "word3"])

# Create an output dataframe with words in S as index and the date range as columns,
# initialised to integer zeros
date_range = pd.date_range(start="2023-03-17", end="2023-03-23")
output_df = pd.DataFrame(0, index=list(S), columns=date_range)

# Update the counts in the output dataframe
for _, row in df.iterrows():
    date = pd.to_datetime(row["date"])
    for word in row["text"].split():
        if word in output_df.index:
            output_df.at[word, date] += 1

# A word counts as evenly distributed if it occurs the same (non-zero)
# number of times on every date, so compare each row against its first value
subset = []
for word in output_df.index:
    counts = output_df.loc[word].values
    if counts[0] > 0 and all(count == counts[0] for count in counts):
        subset.append(word)

print("Subset of words with counts distributed equally among all dates:")
print(subset)
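If exactly equal counts on every date turns out to be too strict for real data, the same check can be done without a Python loop and with some slack. This is only a sketch: the function name `evenly_distributed_words` and the `tol` parameter are my own inventions, and the toy numbers are made up rather than taken from the question.

```python
import pandas as pd

def evenly_distributed_words(counts: pd.DataFrame, tol: float = 0.0) -> list:
    """Return row labels whose counts are (nearly) the same in every column.

    tol is a hypothetical relative tolerance: 0.0 demands identical counts,
    0.5 lets each date deviate from the word's mean by up to 50%.
    """
    mean = counts.mean(axis=1)
    # Largest absolute deviation from the row mean, per word
    max_dev = counts.sub(mean, axis=0).abs().max(axis=1)
    # Skip words that never occur (mean == 0), then apply the tolerance
    keep = (mean > 0) & (max_dev <= tol * mean)
    return keep[keep].index.tolist()

# Toy word-by-date counts (made-up numbers, not from the question)
counts = pd.DataFrame(
    {"d1": [2, 3, 0], "d2": [2, 1, 0], "d3": [2, 2, 0]},
    index=["word1", "word2", "word3"],
)
exact = evenly_distributed_words(counts)            # strict equality
loose = evenly_distributed_words(counts, tol=0.5)   # allow 50% deviation
print(exact, loose)
```

With `tol=0.0` this behaves like the strict comparison in the loop; what tolerance is sensible depends on what "evenly distributed" should mean for your data.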
IIUC:
# sample dataframe
data = {"text": ["words1 words2", "words1 words3", "words2 words3"],
        "date": ["2023-03-17", "2023-03-18", "2023-03-19"]}
df = pd.DataFrame(data)
# list of the only words you want counted
l = ["words1", "words2", "words3"]
Try str.split() + explode():
df = df.assign(text=df["text"].str.split()).explode("text")
df = df[df["text"].isin(l)]  # keep only the words of interest
After that, use pd.crosstab() followed by fillna():
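# word/date pairs that never occur come out as NaN, so fill them with 0 and cast back to int
df = pd.crosstab(df["text"], df["date"], values=df["text"], aggfunc="count").fillna(0).astype(int)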
Output:
date    2023-03-17  2023-03-18  2023-03-19
text
words1           1           1           0
words2           1           0           1
words3           0           1           1
Note: if you want to drop the "text" and "date" axis labels so the index and column headers look clean, use
df = df.rename_axis(index=None, columns=None)
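The crosstab only has columns for dates that actually appear in the data, while the question asks for all seven dates and for the evenly distributed subset. A minimal sketch of how that could be added, assuming the week runs from 2023-03-17 to 2023-03-23 (my assumption) and using the names `wanted`, `long_df`, `table`, `week` and `even`, which are mine:

```python
import pandas as pd

data = {"text": ["words1 words2", "words1 words3", "words2 words3"],
        "date": ["2023-03-17", "2023-03-18", "2023-03-19"]}
df = pd.DataFrame(data)
wanted = ["words1", "words2", "words3"]

# One row per (word, date) occurrence, restricted to the wanted words
long_df = df.assign(text=df["text"].str.split()).explode("text")
long_df = long_df[long_df["text"].isin(wanted)]
table = pd.crosstab(long_df["text"], long_df["date"])

# Assumed week; the date labels are strings here, so build string labels too
week = pd.date_range("2023-03-17", "2023-03-23").strftime("%Y-%m-%d")
# Add missing words/dates as all-zero rows/columns
table = table.reindex(index=wanted, columns=week, fill_value=0)

# Keep a word only if its count is identical (and non-zero) on every date
same_everywhere = table.nunique(axis=1) == 1
even = table.index[same_everywhere & (table.iloc[:, 0] > 0)].tolist()
print(table)
print(even)
```

For this sample no word occurs on all seven dates, so `even` is an empty list.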