Selecting IDs/rows in a DataFrame based on n-grams

Question

I have the following dataset:

ID       Text
12     Coolest fan we’ve ever seen.
12     SHARE this with anyone you know who can use this tip!
31     Time for a Royal Celebration! Save the date.
54     The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.
419    Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.
451    Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned

I have been using n-grams to analyse the texts and the frequency of words/sentences.

from nltk import ngrams

text = df.Text.tolist()

list_n = []

for i in text:
    # Build trigrams from the whitespace-separated tokens of each text.
    n_grams = ngrams(i.split(), 3)

    for grams in n_grams:
        list_n.append(grams)

list_n

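Since I am interested in finding which texts use a particular word or sequence of words, I need to create an association between the texts (i.e. the IDs) and the n-grams they contain. For example, I am interested in finding the texts that contain "Save the date", i.e. ID=31 and ID=451. To find the n-grams that contain a given word, I have been using: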

def ngram_filter(col, word, n):
    tokens = col.split()
    all_ngrams = ngrams(tokens, n)
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams
However, I do not know how to find the ID associated with each text, or how to look for more than one word in the function above.

How can I do this? Any ideas?

Feel free to change the tags if needed. Thanks.


Answer 1

I don't have much experience with n-grams, but you can get what you want with str.contains:
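
A minimal sketch of that idea, assuming the data is in a pandas DataFrame `df` with the `ID` and `Text` columns shown in the question. The sample rows are copied from the question. The helper `has_ngram` and the variable names are illustrative assumptions, not part of pandas or nltk. The first option is a plain substring match with `str.contains`. The second keeps the n-gram approach from the question and compares whole tokens instead.

```python
import re

import pandas as pd
from nltk import ngrams

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [12, 12, 31, 54, 419, 451],
    "Text": [
        "Coolest fan we’ve ever seen.",
        "SHARE this with anyone you know who can use this tip!",
        "Time for a Royal Celebration! Save the date.",
        "The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.",
        "Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.",
        "Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned",
    ],
})

phrase = "Save the date"

# Option 1: substring match with str.contains.
# regex=False treats the phrase literally; case=False ignores case.
mask = df["Text"].str.contains(phrase, case=False, regex=False)
ids_contains = df.loc[mask, "ID"].tolist()

# Several phrases at once: join the escaped phrases into one regex alternation.
phrases = ["Save the date", "sports fan"]
pattern = "|".join(re.escape(p) for p in phrases)
mask_any = df["Text"].str.contains(pattern, case=False, regex=True)
ids_any = df.loc[mask_any, "ID"].tolist()


# Option 2: token-level n-gram match (hypothetical helper).
def has_ngram(text, phrase):
    """Return True if the words of `phrase` occur consecutively in `text`."""
    # Lowercase and keep only word characters so "date." and "date," match "date".
    tokens = re.findall(r"\w+", text.lower())
    target = tuple(re.findall(r"\w+", phrase.lower()))
    if not target:
        return False
    return target in set(ngrams(tokens, len(target)))


mask_ngram = df["Text"].apply(lambda t: has_ngram(t, phrase))
ids_ngram = df.loc[mask_ngram, "ID"].tolist()

print(ids_contains, ids_any, ids_ngram)
```

Because the mask is computed on `df` itself, the matching IDs come straight from the same rows, so no separate lookup between n-grams and IDs is needed. The substring version also matches inside longer words, while the token version only matches whole words.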
