How to avoid re-lemmatizing duplicate sentences in a pandas DataFrame to improve speed

Problem description

Given

a simple, small pandas DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user_ip":      ["u7", "u3", "u1", "u9", "u4", "u8", "u1", "u2", "u5"],
        "raw_sentence": ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!", "I go to school everyday!", "This is 1 sample text!", "She likes chips!", "This is the thrid sentence.", "I go to school everyday!"],
    }
)

    user_ip    raw_sentence
0   u7         First sentence!
1   u3         NaN
2   u1         I go to school everyday! 
3   u9         She likes chips!
4   u4         I go to school everyday!     <<< duplicate >>>
5   u8         This is 1 sample text!
6   u1         She likes chips!             <<< duplicate >>>
7   u2         This is the thrid sentence.
8   u5         I go to school everyday!     <<< duplicate >>>

Goal

I would like to know whether I can avoid calling map on rows whose raw_sentence is an exact duplicate of one already processed, or use any other strategy to that effect. My aim is to speed up the implementation for a much larger DataFrame (~100K rows).

[Inefficient] solution:

Right now, I use .map() with a lambda to iterate over every row and call the get_lm() function, which retrieves the lemmas of the raw input sentence, as follows:

import nltk
nltk.download('all', quiet=True, raise_on_error=True)
STOPWORDS = nltk.corpus.stopwords.words('english')
wnl = nltk.stem.WordNetLemmatizer()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

def get_lm(input_sent: str = "my text!"):
    # Tokenize the lowercased sentence; drop stopwords, single characters, and pure numbers.
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    # Pass the POS tag's first letter to the lemmatizer when it is a valid WordNet
    # POS ('a', 's', 'r', 'n', 'v'); otherwise fall back to the default (noun).
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ('a', 's', 'r', 'n', 'v')
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')

    user_ip     raw_sentence                    lemma
0   u7          First sentence!                 [first, sentence]         <<< 1st occurrence => lemmatization OK! >>>
1   u3          NaN                             NaN                       <<< NaN skipped via na_action='ignore' >>>
2   u1          I go to school everyday!        [go, school, everyday]    <<< 1st occurrence => lemmatization OK! >>>
3   u9          She likes chips!                [like, chip]              <<< 1st occurrence => lemmatization OK! >>>
4   u4          I go to school everyday!        [go, school, everyday]    <<< already lemmatized, no need to do it again >>>
5   u8          This is 1 sample text!          [sample, text]            <<< 1st occurrence => lemmatization OK! >>>
6   u1          She likes chips!                [like, chip]              <<< already lemmatized, no need to do it again >>>
7   u2          This is the thrid sentence.     [thrid, sentence]         <<< 1st occurrence => lemmatization OK! >>>
8   u5          I go to school everyday!        [go, school, everyday]    <<< already lemmatized, no need to do it again >>>
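
To quantify the baseline, here is a minimal timing sketch; the 12_000 replication factor is an arbitrary choice to reach roughly 100K rows:

import time

# Tile the 9-row sample up to ~108K rows to approximate the real workload.
big = pd.concat([df] * 12_000, ignore_index=True)

t0 = time.perf_counter()
big["lemma"] = big["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
print(f"baseline: {time.perf_counter() - t0:.2f} s")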

Is there a more efficient way to solve this?

Cheers,

python pandas nlp nltk lemmatization
2 Answers

1 vote

One possibility is to run map only over the unique sentences in the column, and then use the result as a mapping dictionary:

mapper = dict(zip(df["raw_sentence"].drop_duplicates(),
                  df["raw_sentence"].drop_duplicates().map(lambda raw: get_lm(input_sent=raw), na_action='ignore')))

df["lemma"] = df["raw_sentence"].map(mapper)
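
A slight variant of the same idea (a sketch) computes the unique sentences only once and drops NaN up front; Series.map with a dict sends any value missing from the dictionary, including NaN, to NaN:

# Sketch: build the lookup from each unique non-NaN sentence to its lemmas exactly once.
uniques = df["raw_sentence"].dropna().drop_duplicates()
mapper = dict(zip(uniques, uniques.map(get_lm)))

# Sentences absent from `mapper` (e.g. NaN rows) come out as NaN.
df["lemma"] = df["raw_sentence"].map(mapper)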

1 vote

Don't reinvent the wheel; use functools.cache:

from functools import cache

@cache  # memoize by input string: each distinct sentence is lemmatized only once
def get_lm(input_sent: str = "my text!"):
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ('a', 's', 'r', 'n', 'v')
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')

Output:

  user_ip                 raw_sentence                   lemma
0      u7              First sentence!       [first, sentence]
1      u3                          NaN                     NaN
2      u1     I go to school everyday!  [go, school, everyday]
3      u9             She likes chips!            [like, chip]
4      u4     I go to school everyday!  [go, school, everyday]
5      u8       This is 1 sample text!          [sample, text]
6      u1             She likes chips!            [like, chip]
7      u2  This is the thrid sentence.       [thrid, sentence]
8      u5     I go to school everyday!  [go, school, everyday]
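
Since functools.cache is lru_cache(maxsize=None) under the hood, the wrapped function exposes cache_info() to confirm the cache is actually being hit. For the sample frame above it should report three hits (the three duplicate rows) and five misses (the five distinct non-NaN sentences; NaN never reaches get_lm thanks to na_action='ignore'):

print(get_lm.cache_info())
# CacheInfo(hits=3, misses=5, maxsize=None, currsize=5)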