How do I apply HistWords to my own text corpus?

Question · votes: 1 · answers: 1

I recently came across this paper (https://arxiv.org/pdf/1605.09096.pdf) and have been reading the GitHub repo (https://github.com/williamleif/histwords), but it is still not clear to me how to apply it to my own data. My data is in the following format:

#### 2008
    text_2008 = pd.DataFrame({'dat1': ["I love machine learning in 2008. Its awesome.",
                                       "I love coding in Python in 2008",
                                       "I love building chatbots in 2008",
                                       "they chat amagingly well"]})
    ID_2008 = pd.DataFrame({'dat2': [1, 2, 3, 4]})

    my_actual_data_format_2008 = text_2008.join(ID_2008)

#### 2009
    text_2009 = pd.DataFrame({'dat1': ["I love machine learning. Its awesome.",
                                       "I love coding in Python",
                                       "I love building chatbots",
                                       "they chat amagingly well"]})
    ID_2009 = pd.DataFrame({'dat2': [1, 2, 3, 4]})

    my_actual_data_format_2009 = text_2009.join(ID_2009)


#### 2010
    text_2010 = pd.DataFrame({'dat1': ["I love machine learning more in 2010. Its awesome.",
                                       "I love coding in Python in 2010",
                                       "I love building chatbots in 2010",
                                       "they chat amagingly well"]})
    ID_2010 = pd.DataFrame({'dat2': [1, 2, 3, 4]})

    my_actual_data_format_2010 = text_2010.join(ID_2010)

So I have multiple pandas DataFrames, one per year, each with an ID column and a text column.

From what I understand, sgns works with .txt files rather than pandas DataFrames (https://github.com/williamleif/histwords/tree/master/sgns).

The main page says: "If you want to learn historical embeddings for new data, we recommend using the code in the sgns directory."

It would be great if someone could point me in the right direction! Should I save each DataFrame's text column as a .txt file?
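For concreteness, here is a minimal sketch of the kind of export I have in mind: one plain-text corpus file per year, one document per line. The file names and the lowercasing step are my own guesses, not something the repo prescribes.

```python
import pandas as pd

# Toy yearly DataFrames in the same shape as above (hypothetical data).
frames = {
    2008: pd.DataFrame({'dat1': ["I love machine learning in 2008. Its awesome.",
                                 "I love coding in Python in 2008"],
                        'dat2': [1, 2]}),
    2009: pd.DataFrame({'dat1': ["I love machine learning. Its awesome.",
                                 "I love coding in Python"],
                        'dat2': [1, 2]}),
}

# Write one raw-corpus .txt file per year: one document per line,
# lowercased so that tokens are uniform before the cleaning scripts run.
for year, df in frames.items():
    with open(f"corpus_{year}.txt", "w", encoding="utf-8") as f:
        for doc in df['dat1']:
            f.write(doc.lower() + "\n")
```

Is something along these lines what the pipeline expects as its "raw corpus" input?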

python nlp nltk word2vec
1 Answer

Score: 0

Check out the pipeline described in the README:

**DATA:**  raw corpus  =>  corpus  =>  pairs  =>  counts  =>  vocab  
**TRADITIONAL:**  counts + vocab  =>  pmi  =>  svd  
**EMBEDDINGS:**  pairs  + vocab  =>  sgns  

**raw corpus  =>  corpus**  
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.

**corpus  =>  pairs**  
- *corpus2pairs.py*  
- Extracts a collection of word-context pairs from the corpus.

**pairs  =>  counts**  
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.

**counts  =>  vocab**  
- *counts2vocab.py*  
- Creates vocabularies with the words' and contexts' unigram distributions.

**counts + vocab  =>  pmi**  
- *counts2pmi.py*  
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.

**pmi  =>  svd**  
- *pmi2svd.py*  
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.

**pairs  + vocab  =>  sgns**  
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:  
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**

An example pipeline is demonstrated in: *example_test.sh*

From here.
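As a rough illustration of what the **counts + vocab => pmi** step computes (my own sketch, not the repo's `counts2pmi.py` code): PMI of a word-context pair is the log of the pair count times the total number of pairs, divided by the product of the marginal word and context counts, stored in a `scipy.sparse.csr_matrix`.

```python
import math
from collections import Counter
from scipy.sparse import csr_matrix

# Toy aggregated word-context counts, as pairs2counts would produce.
counts = Counter({("a", "x"): 2, ("b", "x"): 1, ("a", "y"): 1})

total = sum(counts.values())   # |D|, total number of observed pairs
w_counts = Counter()           # marginal word counts #(w)
c_counts = Counter()           # marginal context counts #(c)
for (w, c), n in counts.items():
    w_counts[w] += n
    c_counts[c] += n

# Index words and contexts (this is the role of the vocab files).
wi = {w: i for i, w in enumerate(sorted(w_counts))}
ci = {c: i for i, c in enumerate(sorted(c_counts))}

# PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) )
rows, cols, vals = [], [], []
for (w, c), n in counts.items():
    rows.append(wi[w])
    cols.append(ci[c])
    vals.append(math.log(n * total / (w_counts[w] * c_counts[c])))

pmi = csr_matrix((vals, (rows, cols)), shape=(len(wi), len(ci)))
```

The SVD step then factorizes this sparse matrix to get dense word vectors.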
