What data preparation steps or techniques should be followed when working with multilingual data?

Question

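I'm working on multilingual word embedding code where I need to train on my English data and test on Spanish. I'll be using Facebook's MUSE library for the word embeddings. I'm looking for a way to preprocess both datasets in the same way. I've looked into diacritics restoration to deal with the accents.

I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.

How can I uniformly preprocess both languages to create a vocabulary list that I can later use with the MUSE library?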

Answer 1

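Hi Chandana, I hope you're doing well. I would look into using the spaCy library (https://spacy.io/api/doc). Its creator has a YouTube video discussing how NLP is implemented for other languages. Below you will find code that lemmatizes and removes stopwords. As for punctuation, you can always specify particular characters, such as accent marks, to be ignored.

Personally, I use KNIME, which is free and open source, for preprocessing. You will have to install the NLP extensions, but the nice part is that there are different extensions for the different languages you can install: https://www.knime.com/knime-text-processing. The Stop Word Filter (since 2.9) and the Snowball Stemmer node can both be applied to Spanish. Make sure to select the correct language in the node's dialog. Unfortunately, there is no part-of-speech tagger node for Spanish so far.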
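On the punctuation and accent point, one way to treat both languages identically is to normalise the text with the Python standard library before tokenising. The sketch below is only an assumption about what that could look like; it is not taken from spaCy or KNIME, and the function name `normalize_text` and the `strip_accents` flag are hypothetical.

```python
import unicodedata

def normalize_text(text, strip_accents=False):
    """Lowercase text and replace punctuation with spaces, for any language."""
    text = unicodedata.normalize("NFC", text).lower()
    if strip_accents:
        # Decompose characters, then drop the combining marks (the accents).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Unicode categories starting with "P" cover ASCII punctuation as well as
    # Spanish marks such as the inverted question and exclamation marks.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join(cleaned.split())

print(normalize_text("¿Dónde está el niño?"))
print(normalize_text("¿Dónde está el niño?", strip_accents=True))
print(normalize_text("Where is the child?"))
```

Stripping accents merges distinct Spanish words such as "sí" and "si", so it may be safer to leave `strip_accents` off if the embeddings you plan to use keep accented forms.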

# Create functions to lemmatize, stem, and preprocess.
# Note: the stemmer, lemmatizer and stopword list used here are English-only.
import gensim.utils
import gensim.parsing.preprocessing
from nltk.stem import PorterStemmer, WordNetLemmatizer  # WordNet data: nltk.download('wordnet')

# extra stopwords of your own, checked in addition to gensim's list
newStopWords = ['your_stopword1', 'your_stop_word2']

# lemmatize as a verb, then reduce to a stem, e.g. 'beautiful' -> 'beauti'
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words of 3 letters or fewer
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic,
# then append the processed tokens to a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
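
The functions above rely on English-only resources (the Porter stemmer, WordNet and gensim's English STOPWORDS), so they will not treat Spanish the same way. As a rough sketch of running one pipeline over both languages and collecting a vocabulary for each, something like the following could work. It uses only the standard library; the tiny stopword sets are placeholders made up for illustration (full lists would come from a library such as spaCy or NLTK), and the names `STOPWORDS_BY_LANG`, `tokenize` and `build_vocab` are hypothetical.

```python
import re
from collections import Counter

# Placeholder stopword lists for illustration only; substitute full lists.
STOPWORDS_BY_LANG = {
    "en": {"the", "a", "is", "and", "of", "on"},
    "es": {"el", "la", "es", "y", "de", "en", "un"},
}

# [^\W\d_] matches Unicode letters only, so accented letters stay inside tokens
# while digits, underscores and punctuation act as separators.
TOKEN_RE = re.compile(r"[^\W\d_]+")

def tokenize(text, lang, min_len=2):
    """Lowercase, split into alphabetic tokens, drop stopwords and short tokens."""
    stopwords = STOPWORDS_BY_LANG[lang]
    return [
        tok for tok in TOKEN_RE.findall(text.lower())
        if tok not in stopwords and len(tok) >= min_len
    ]

def build_vocab(docs, lang, min_count=1):
    """Count tokens across documents; keep those seen at least min_count times."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc, lang))
    return {tok: n for tok, n in counts.items() if n >= min_count}

en_vocab = build_vocab(["The cat is on the mat.", "A cat and a dog."], "en")
es_vocab = build_vocab(["El gato está en la alfombra.", "Un gato y un perro."], "es")
print(sorted(en_vocab))
print(sorted(es_vocab))
```

The keys of each dictionary give a per-language vocabulary list produced by the same steps; only the stopword set differs between languages.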

I hope this helps. Let me know if you have any questions :)
