I recently came across this paper (https://arxiv.org/pdf/1605.09096.pdf) and have been reading through the GitHub repository (https://github.com/williamleif/histwords), but it is still unclear to me how to apply it to my own data. My data is in the following format:
#### 2008
import pandas as pd

text_2008 = pd.DataFrame({'dat1': ["I love machine learning in 2008. It's awesome.",
                                   "I love coding in Python in 2008",
                                   "I love building chatbots in 2008",
                                   "they chat amazingly well"]})
ID_2008 = pd.DataFrame({'dat2': [1, 2, 3, 4]})
my_actual_data_format_2008 = text_2008.join(ID_2008)
#### 2009
text_2009 = pd.DataFrame({'dat1': ["I love machine learning. It's awesome.",
                                   "I love coding in Python",
                                   "I love building chatbots",
                                   "they chat amazingly well"]})
ID_2009 = pd.DataFrame({'dat2': [1, 2, 3, 4]})
my_actual_data_format_2009 = text_2009.join(ID_2009)
#### 2010
text_2010 = pd.DataFrame({'dat1': ["I love machine learning more in 2010. It's awesome.",
                                   "I love coding in Python in 2010",
                                   "I love building chatbots in 2010",
                                   "they chat amazingly well"]})
ID_2010 = pd.DataFrame({'dat2': [1, 2, 3, 4]})
my_actual_data_format_2010 = text_2010.join(ID_2010)
So I have several pandas DataFrames, one per year, each with an `ID` column and a `text` column.
From what I understand, sgns works with .txt files rather than pandas DataFrames (https://github.com/williamleif/histwords/tree/master/sgns).
The main page says: "If you want to learn historical embeddings for new data, we recommend using the code in the sgns directory."
It would be great if someone could point me in the right direction! Should I save each DataFrame's `text` column as a .txt file?
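Here is roughly what I mean — a minimal sketch that writes each year's text column out as a plain-text "raw corpus" file, one row per line. The file names (`corpus_<year>.txt`) and the two-year toy data are my own choices, not anything prescribed by histwords:

```python
import pandas as pd

# Toy versions of my per-year DataFrames (assumed names, for illustration only)
frames = {
    2008: pd.DataFrame({'dat1': ["I love machine learning in 2008. It's awesome.",
                                 "I love coding in Python in 2008"],
                        'dat2': [1, 2]}),
    2009: pd.DataFrame({'dat1': ["I love machine learning. It's awesome.",
                                 "I love coding in Python"],
                        'dat2': [1, 2]}),
}

# Write one .txt file per year, one document per line
for year, df in frames.items():
    with open(f"corpus_{year}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(df['dat1'].astype(str)))
```

Is a per-year file like this the right input for the "raw corpus" stage of the pipeline below?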
I have looked at the pipeline described in the README:
**DATA:** raw corpus => corpus => pairs => counts => vocab
**TRADITIONAL:** counts + vocab => pmi => svd
**EMBEDDINGS:** pairs + vocab => sgns
**raw corpus => corpus**
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.
**corpus => pairs**
- *corpus2pairs.py*
- Extracts a collection of word-context pairs from the corpus.
**pairs => counts**
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.
**counts => vocab**
- *counts2vocab.py*
- Creates vocabularies with the words' and contexts' unigram distributions.
**counts + vocab => pmi**
- *counts2pmi.py*
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.
**pmi => svd**
- *pmi2svd.py*
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.
**pairs + vocab => sgns**
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**
An example pipeline is demonstrated in: *example_test.sh*
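To make sure I understand the DATA steps, here is a toy illustration of corpus => pairs => counts => vocab (my own sketch with a symmetric context window of 2, not the histwords code):

```python
from collections import Counter
from itertools import chain

# corpus: already-cleaned, tokenized sentences
corpus = [["i", "love", "coding", "in", "python"],
          ["i", "love", "building", "chatbots"]]
window = 2  # assumed window size, for illustration

# corpus => pairs: every (word, context) within the window
pairs = []
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs.append((word, sent[j]))

counts = Counter(pairs)          # pairs => counts: aggregate identical pairs
vocab = Counter(chain(*corpus))  # counts => vocab: unigram distribution
```

Is this the right mental model for what `corpus2pairs.py`, `pairs2counts.sh`, and `counts2vocab.py` produce before the PMI/SGNS stages?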
(Pipeline description taken from here.)