Word2Vec Subsampling: Implementation


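I am implementing the Skipgram model in both PyTorch and TensorFlow 2. I have a question about how the subsampling of frequent words is implemented. Quoting the paper verbatim, the probability of subsampling (discarding) word w_i is computed as: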

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

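where t is a user-chosen threshold (usually a small value, e.g. 0.0001) and f is the frequency of the word in the document. Although the authors implemented it in a different but almost equivalent way, let's stick with this definition.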

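When computing P(w_i), we can get negative values; this happens whenever f(w_i) < t. For example, suppose we have 100 words, and one of them is far more frequent than the others (as is the case in my dataset).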

import numpy as np
import seaborn as sns

np.random.seed(12345)

# generate counts in [1, 19] (randint's `high` is exclusive)
counts = np.random.randint(low=1, high=20, size=99)

# add one much larger count
counts = np.insert(counts, 0, 100000)

# compute frequencies
f = counts/counts.sum()

# define threshold as in paper
t = 0.0001

# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
sns.histplot(probs);  # distplot is deprecated in recent seaborn releases
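
To make the negative values concrete, here is a minimal sketch on an analogous toy setup. It assumes one possible reading of the formula, namely that a negative value means "never discard", so the values are clipped to [0, 1] before being used as per-token discard probabilities. That clipping, the variable names, and the synthetic `corpus` are assumptions made for illustration, not something the paper prescribes.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Toy setup analogous to the one above: 99 rare words plus one dominant word.
counts = rng.integers(low=1, high=20, size=99)
counts = np.insert(counts, 0, 100000)
f = counts / counts.sum()
t = 1e-4

# Formula from the paper: goes negative wherever f < t.
p_discard_raw = 1 - np.sqrt(t / f)

# Assumption: treat negative values as "never discard" by clipping to [0, 1].
p_discard = np.clip(p_discard_raw, 0.0, 1.0)

# Synthetic corpus of word ids drawn according to f (illustration only).
corpus = rng.choice(len(f), size=10_000, p=f)

# One Bernoulli draw per token: keep it when the draw clears its discard probability.
keep_mask = rng.random(corpus.shape[0]) >= p_discard[corpus]
subsampled = corpus[keep_mask]

print("words with a negative raw value:", int((p_discard_raw < 0).sum()))
print("tokens before / after:", corpus.shape[0], subsampled.shape[0])
```

With this reading, rare words are always kept and the dominant word is dropped most of the time, which matches the stated purpose of subsampling.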

Q: What is the correct way to perform subsampling using this "probability"?

As additional information, I have seen that in Keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:

def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.
    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).
    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:
    ```
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    ```
    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
    where `gamma` is the Euler-Mascheroni constant.
    # Arguments
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.
    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq

    return np.minimum(1., f / np.sqrt(f))
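
For comparison, the formula in this docstring simplifies to min(1, sqrt(sampling_factor / word_frequency)), which is a probability of keeping a word rather than discarding it. The sketch below checks numerically that it equals one minus the paper's formula once the latter is clipped to [0, 1]. The frequencies in `f` are made-up values, and `t` stands in for `sampling_factor`; note that the Keras function itself derives frequencies from ranks via the Zipf approximation instead of using real counts, and defaults to 1e-5.

```python
import numpy as np

t = 1e-4  # plays the role of sampling_factor
f = np.array([1e-6, 1e-4, 1e-3, 1e-1])  # hypothetical word frequencies

# Keras docstring formula: a keep probability.
p_keep_keras = np.minimum(1.0, np.sqrt(f / t) / (f / t))

# Paper formula, clipped to [0, 1]: a discard probability.
p_discard_paper = np.clip(1 - np.sqrt(t / f), 0.0, 1.0)

# The two agree up to floating-point error: p_keep = 1 - p_discard.
same = np.allclose(p_keep_keras, 1 - p_discard_paper)
print(same)
```

Under this view the `min(1, ...)` in Keras plays the same role as clipping the negative values of the paper's formula to zero.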


keras word2vec tf.keras subsampling
1 Answer

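I tend to trust deployed code more than the write-up in a paper, especially in a case like word2vec, where the original word2vec.c code released by the paper's authors has been widely used and has served as the template for other implementations. If we look at its subsampling mechanism...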
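
As a point of reference, here is a minimal Python sketch of the keep probability that the publicly released word2vec.c appears to use, as far as I can tell from reading it: sqrt(t/f) + t/f, with a token kept when that value is at least a uniform random draw. The function names `keep_probability` and `subsample` are hypothetical, and the default t = 1e-3 is assumed to mirror that program's `sample` default; treat this as an illustration, not a port of the C code.

```python
import numpy as np

def keep_probability(freq, t=1e-3):
    """Keep probability sqrt(t/f) + t/f, capped at 1 (assumed word2vec.c style)."""
    freq = np.asarray(freq, dtype=float)
    ratio = t / freq
    return np.minimum(1.0, np.sqrt(ratio) + ratio)

def subsample(token_ids, freq, t=1e-3, rng=None):
    """Drop each token independently according to its word's keep probability."""
    rng = np.random.default_rng() if rng is None else rng
    token_ids = np.asarray(token_ids)
    p_keep = keep_probability(freq, t)[token_ids]
    return token_ids[rng.random(token_ids.shape[0]) < p_keep]

# Word 0 is very frequent, word 1 is rare (made-up frequencies).
freq = np.array([0.99, 1e-5])
tokens = np.array([0] * 1000 + [1] * 10)
kept = subsample(tokens, freq, t=1e-4, rng=np.random.default_rng(0))
print(keep_probability(freq, t=1e-4))
print(len(tokens), len(kept))
```

Because the value is used as a keep probability and compared against a draw in [0, 1), anything at or above 1 simply means "always keep", so no negative probabilities arise in this formulation.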
