How can I optimize a function that loops over lists in a pandas DataFrame?

Question

I am applying a function to a pandas DataFrame:

import spacy
from collections import Counter


# Load English language model
nlp = spacy.load("en_core_web_sm")

# Return only the nouns found in a block of text
def filter_nouns(words):
    SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'

    # Preprocess the text by removing symbols and digits, then splitting into words
    words = [word.translate({ord(SYM): None for SYM in SYMBOLS}).strip() for word in words.split()]

    # Run the spaCy pipeline on the cleaned text and keep only tokens tagged as nouns
    filtered_nouns = [token.text for token in nlp(" ".join(words)) if token.pos_ == "NOUN"]

    return filtered_nouns



# Apply filtering logic to all rows in the 'NOTE' column
df['filtered_nouns'] = df['NOTE'].apply(filter_nouns)

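My dataset has 6,400 rows, and each value in df['NOTE'] is a very long paragraph converted from an Oracle CLOB data type.

The function runs quickly on 5-10 rows, but on all 6,400 rows it takes a very long time.

Is there any way to optimize this?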

Answer 1

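A simple approach is to use the built-in multiprocessing module: split the data into several chunks and process each chunk independently.
See the documentation for details and examples: https://docs.python.org/3/library/multiprocessing.html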
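
As a rough sketch of the chunk-and-process idea, the snippet below hands a list of texts to a pool of worker processes with `multiprocessing.Pool.map`. The names `clean_note` and `process_notes`, the sample text, and the values chosen for `processes` and `chunksize` are assumptions made for illustration; none of them come from the question. `clean_note` is a stand-in worker that only performs the symbol-stripping step, so the example runs without spaCy installed.

```python
from multiprocessing import Pool

# Same symbol set as in the question. Building the translation table once
# avoids rebuilding it for every word.
SYMBOLS = '{}()[].,:;+-*/&|<>=~$1234567890#_%'
STRIP_TABLE = str.maketrans('', '', SYMBOLS)


def clean_note(note):
    """Stand-in worker (hypothetical): strip symbols/digits and split into words.

    In the real case the worker would be filter_nouns from the question.
    """
    return note.translate(STRIP_TABLE).split()


def process_notes(notes, worker=clean_note, processes=4, chunksize=100):
    """Apply `worker` to every note using a pool of worker processes.

    Pool.map cuts `notes` into chunks of `chunksize` items, distributes the
    chunks across the workers, and returns the results in input order.
    """
    with Pool(processes=processes) as pool:
        return pool.map(worker, notes, chunksize=chunksize)


if __name__ == "__main__":
    # Hypothetical sample data standing in for df['NOTE'].tolist()
    notes = ["Pump #12 failed; valve (A-3) replaced."] * 1000
    results = process_notes(notes)
    print(len(results), results[0])
```

To apply this to the data in the question, one option would be `df['filtered_nouns'] = process_notes(df['NOTE'].tolist(), worker=filter_nouns)`, provided `filter_nouns` and `nlp` are defined at module level so that each worker process can find them. Whether this pays off depends on the number of CPU cores and on the cost of loading the model in each worker, so it is worth timing on a small sample first.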
