Group nltk.FreqDist output by first word (python)

Question

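I'm a layman with basic Python coding skills, and I'm working with a DataFrame that has a column like the one shown below. My goal is to group the output of nltk.FreqDist by first word.

What I have so far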

t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)

# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

Sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like  replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1

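My output has more than 10,000 rows.

My expected output

I want to group the output by first word and extract it as a DataFrame.

Other solutions I have tried

I tried adapting the solutions given here and here, but without satisfactory results.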

Any help/guidance appreciated.

Answer 1

Try the following (the documentation is in the code comments).

import itertools

import nltk

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter mapping each string to its number of occurrences
input_frequencies = nltk.FreqDist(t_words)

# Taking inputs only if they appear more than 3 times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m), where m was the message. If you want to filter by the string
# length, you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this by a better implementation from nltk
    return value.split(' ')[0]

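# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby only merges consecutive items with equal keys, so we sort by the
# same key first. The lazy groups are materialized into lists.
first_word_to_inputs = {
    word: list(group)
    for word, group in itertools.groupby(
        # Take the strings from the above dictionary, sorted by first word
        sorted(frequent_inputs, key=first_word),
        # And key by the first word
        first_word)
}

# If you also want to keep the count of each string, we can map from
# first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = {
    word: list(group)
    for word, group in itertools.groupby(
        # Pairs of string and count, sorted by the first word of the string
        sorted(frequent_inputs.items(), key=lambda pair: first_word(pair[0])),
        # Extract the string from the pair, and then take the first word
        lambda pair: first_word(pair[0]))
}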
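
One caveat worth illustrating: itertools.groupby only merges consecutive items that share a key, so the input has to be sorted by that same key before grouping. Below is a minimal sketch of this, not a definitive implementation. It assumes made-up sample messages, uses collections.Counter as a stand-in for nltk.FreqDist (which is a Counter subclass), and the helper name group_by_first_word is hypothetical.

```python
import itertools
from collections import Counter

# Hypothetical sample messages; in the question these come from df_tech['message'].
messages = [
    "click refund",
    "open ticket",
    "click go",
    "click refund",
    "open ticket again",
]

# Counter stands in for nltk.FreqDist here (FreqDist is a Counter subclass).
frequencies = Counter(messages)


def first_word(value):
    # First whitespace-separated token; assumes the string is not empty.
    return value.split()[0]


def group_by_first_word(freqs):
    # Key function applied to (string, count) pairs.
    def pair_key(pair):
        return first_word(pair[0])

    # Sort by the grouping key first: groupby only merges adjacent items.
    pairs = sorted(freqs.items(), key=pair_key)
    # Materialize each lazy group into a list of (string, count) pairs.
    return {word: list(group) for word, group in itertools.groupby(pairs, key=pair_key)}


grouped = group_by_first_word(frequencies)
for word, pairs in grouped.items():
    print(word, pairs)
```

Skipping the sort would split "click" into two separate groups here, because "open ticket" sits between the two "click" messages in the input order.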

Answer 2

I managed to do it as shown below. There may be a simpler implementation, but for now this gives me the result I expected.

import pandas as pd

temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())

# Removing empty rows (.copy() avoids SettingWithCopyWarning on the assignments below)
non_empty = temp["word"] != ""
dfNew = temp[non_empty].copy()

# Splitting off the first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
# New column with the sentences split, without the first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
# Subsetting required columns
dfNew = dfNew[['first_word', 'rest_words']]
# Grouping by first word
dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
# Transpose
dfNew.T
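
A possibly shorter variant of the same idea is sketched below; it is one option under stated assumptions, not the only way to do it. The sample messages are made up, collections.Counter stands in for nltk.FreqDist (a Counter subclass), and the output column names phrases and total are hypothetical. It uses pandas named aggregation to collect the phrases and sum their frequencies per first word in a single groupby.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample messages standing in for df_tech['message'].
messages = ["click refund", "click go", " open ticket ", "click refund", ""]

# Counter stands in for nltk.FreqDist; strip whitespace before counting.
frequencies = Counter(m.strip() for m in messages)

df = pd.DataFrame(list(frequencies.items()), columns=["word", "frequency"])

# Drop empty strings; .copy() keeps the later column assignment warning-free.
df = df[df["word"] != ""].copy()

# First whitespace-separated token of each phrase.
df["first_word"] = df["word"].str.split().str[0]

# One row per first word: the list of phrases and their summed frequency.
grouped = (
    df.groupby("first_word")
    .agg(phrases=("word", list), total=("frequency", "sum"))
    .reset_index()
)
print(grouped)
```

Keeping the frequency column through the groupby means the per-group totals stay available, which the two-column subset in the code above discards.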
