How to avoid re-lemmatizing duplicate sentences in a pandas DataFrame to improve speed

Problem description

Given

a simple, small pandas DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user_ip":      ["u7", "u3", "u1", "u9", "u4", "u8", "u1", "u2", "u5"],
        "raw_sentence": ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!", "I go to school everyday!", "This is 1 sample text!", "She likes chips!", "This is the thrid sentence.", "I go to school everyday!"],
    }
)

    user_ip    raw_sentence
0   u7         First sentence!
1   u3         NaN
2   u1         I go to school everyday! 
3   u9         She likes chips!
4   u4         I go to school everyday!     <<< duplicate >>>
5   u8         This is 1 sample text!
6   u1         She likes chips!             <<< duplicate >>>
7   u2         This is the thrid sentence.
8   u5         I go to school everyday!     <<< duplicate >>>

Goal

I would like to know whether I can avoid calling map on rows whose raw_sentence is an exact duplicate of one already processed, or use any other strategy to that effect. My aim is to speed up the implementation for a much larger DataFrame (~100K rows).

[Inefficient] solution:

Right now, I use .map() with a lambda to iterate over every row and call the get_lm() function, which retrieves the lemmas of the raw input sentence, as follows:

import nltk
nltk.download('all', quiet=True, raise_on_error=True)
STOPWORDS = nltk.corpus.stopwords.words('english')
wnl = nltk.stem.WordNetLemmatizer()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

def get_lm(input_sent: str = "my text!"):
    # Tokenize the lowercased sentence; drop stopwords, single characters, and pure numbers.
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    # Pass the POS tag's first letter to the lemmatizer when it is a valid WordNet
    # POS ('a', 's', 'r', 'n', 'v'); otherwise fall back to the default (noun).
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ('a', 's', 'r', 'n', 'v')
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')

    user_ip     raw_sentence                    lemma
0   u7          First sentence!                 [first, sentence]         <<< 1st occurrence => lemmatization OK! >>>
1   u3          NaN                             NaN                       <<< NaN skipped via na_action='ignore' >>>
2   u1          I go to school everyday!        [go, school, everyday]    <<< 1st occurrence => lemmatization OK! >>>
3   u9          She likes chips!                [like, chip]              <<< 1st occurrence => lemmatization OK! >>>
4   u4          I go to school everyday!        [go, school, everyday]    <<< already lemmatized, no need to do it again >>>
5   u8          This is 1 sample text!          [sample, text]            <<< 1st occurrence => lemmatization OK! >>>
6   u1          She likes chips!                [like, chip]              <<< already lemmatized, no need to do it again >>>
7   u2          This is the thrid sentence.     [thrid, sentence]         <<< 1st occurrence => lemmatization OK! >>>
8   u5          I go to school everyday!        [go, school, everyday]    <<< already lemmatized, no need to do it again >>>
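
To quantify the baseline, here is a minimal timing sketch; the 12_000 replication factor is an arbitrary choice to reach roughly 100K rows:

import time

# Tile the 9-row sample up to ~108K rows to approximate the real workload.
big = pd.concat([df] * 12_000, ignore_index=True)

t0 = time.perf_counter()
big["lemma"] = big["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
print(f"baseline: {time.perf_counter() - t0:.2f} s")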

Is there a more efficient way to solve this?

Cheers,

python pandas nlp nltk lemmatization
2 Answers

1 vote

One possibility is to run map only over the unique sentences in the column, and then use the result as a mapping dictionary:

mapper = dict(zip(df["raw_sentence"].drop_duplicates(),
                  df["raw_sentence"].drop_duplicates().map(lambda raw: get_lm(input_sent=raw), na_action='ignore')))

df["lemma"] = df["raw_sentence"].map(mapper)
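
A slight variant of the same idea (a sketch) computes the unique sentences only once and drops NaN up front; Series.map with a dict sends any value missing from the dictionary, including NaN, to NaN:

# Sketch: build the lookup from each unique non-NaN sentence to its lemmas exactly once.
uniques = df["raw_sentence"].dropna().drop_duplicates()
mapper = dict(zip(uniques, uniques.map(get_lm)))

# Sentences absent from `mapper` (e.g. NaN rows) come out as NaN.
df["lemma"] = df["raw_sentence"].map(mapper)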

1 vote

Don't reinvent the wheel; use functools.cache:

from functools import cache

@cache  # memoize by input string: each distinct sentence is lemmatized only once
def get_lm(input_sent: str = "my text!"):
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ('a', 's', 'r', 'n', 'v')
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')

Output:

  user_ip                 raw_sentence                   lemma
0      u7              First sentence!       [first, sentence]
1      u3                          NaN                     NaN
2      u1     I go to school everyday!  [go, school, everyday]
3      u9             She likes chips!            [like, chip]
4      u4     I go to school everyday!  [go, school, everyday]
5      u8       This is 1 sample text!          [sample, text]
6      u1             She likes chips!            [like, chip]
7      u2  This is the thrid sentence.       [thrid, sentence]
8      u5     I go to school everyday!  [go, school, everyday]
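
Since functools.cache is lru_cache(maxsize=None) under the hood, the wrapped function exposes cache_info() to confirm the cache is actually being hit. For the sample frame above it should report three hits (the three duplicate rows) and five misses (the five distinct non-NaN sentences; NaN never reaches get_lm thanks to na_action='ignore'):

print(get_lm.cache_info())
# CacheInfo(hits=3, misses=5, maxsize=None, currsize=5)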