One-hot encoding characters

Question  Votes: 2  Answers: 2

Is it possible to one-hot encode the characters of a text in TensorFlow or Keras?

  • tf.one_hot seems to accept only integers.
  • tf.keras.preprocessing.text.one_hot seems to encode words, not characters.

Besides that, tf.keras.preprocessing.text.one_hot behaves strangely, because its output does not really look one-hot encoded. Take the following code:

import tensorflow as tf

text = "ab bba bbd"
res = tf.keras.preprocessing.text.one_hot(text=text, n=3)
print(res)

It produces this result:

[1,2,2]

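Every time I run this program the output is a different three-element list; sometimes it is [1,1,1], sometimes [2,1,1]. The documentation says that uniqueness is not guaranteed, but that makes the function seem pointless to me.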
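The behaviour described above is consistent with one_hot being a hashing encoder rather than a true one-hot encoder: as far as I can tell, each word is hashed and folded into the range 1 to n-1, and by default Python's built-in hash() for strings is salted per process, which would explain why the numbers change between runs. Below is a minimal pure-Python sketch of that idea, not the library's actual code. The names stable_hash and hashing_encode are hypothetical, the tokenisation is simplified to a whitespace split, and md5 is swapped in to make the result repeatable.

```python
import hashlib


def stable_hash(word):
    # md5 yields the same digest in every process, unlike the
    # built-in hash() for str, which is salted per interpreter run.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashing_encode(text, n, hash_function=stable_hash):
    # Simplified tokenisation: lower-case and split on whitespace.
    words = text.lower().split()
    # Index 0 is kept free, so every word lands in 1 .. n-1.
    return [hash_function(w) % (n - 1) + 1 for w in words]


text = "ab bba bbd"
codes = hashing_encode(text, n=3)
print(codes)  # three integers, each either 1 or 2
```

With n=3 only the values 1 and 2 are available, so three distinct words must collide at least once. A larger n makes collisions less likely but never impossible.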

python tensorflow keras
2 Answers
2
votes

You can use Keras' to_categorical:

import tensorflow as tf
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(tf.keras.preprocessing.text.text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = tf.keras.utils.to_categorical(tf.keras.preprocessing
                                         .text.one_hot(text, round(vocab_size*1.3)))
print(result)

Result

[[1, 2, 3, 4, 5, 6, 1, 7, 8]]
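The snippet above works on words. For characters, one option is to build the integer indices yourself and hand them to to_categorical. This is a minimal sketch under assumptions: the alphabet string and the char_to_int mapping are illustrative and not part of the answer, and any character outside the alphabet raises a KeyError.

```python
import tensorflow as tf

# Assumed alphabet: lowercase letters plus a space.
alphabet = "abcdefghijklmnopqrstuvwxyz "
# Position in the alphabet is the integer code of each character.
char_to_int = {c: i for i, c in enumerate(alphabet)}

text = "ab bba bbd"
indices = [char_to_int[c] for c in text]

# Each index becomes one row containing a single 1.
one_hot_chars = tf.keras.utils.to_categorical(indices, num_classes=len(alphabet))
print(one_hot_chars.shape)  # (10, 27)
```

Passing num_classes explicitly keeps the row width fixed at the alphabet size even when the text does not contain the last character of the alphabet.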

2
votes

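I found a nice answer based on pure Python; unfortunately I can no longer find the source. It first converts every character to an int and then replaces each int with a one-hot array. The encoding is stable within a program, and even across programs, as long as the alphabet has the same length and the same order.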

    # The alphabet of all characters you want to encode.
    # A space is included so that "hello world" below does not raise a KeyError.
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "

    def convert_to_onehot(data):
        # Map every character of the alphabet to a unique int based on its position
        char_to_int = {c: i for i, c in enumerate(alphabet)}
        # Replace every character in data with its mapped int
        encoded_data = [char_to_int[char] for char in data]
        print(encoded_data)  # Prints the int-encoded list

        # Replace each int with a one-hot list the size of the alphabet
        one_hot = []
        for value in encoded_data:
            # Start with all zeros
            letter = [0 for _ in range(len(alphabet))]
            # Write a 1 only at the index given by the int
            letter[value] = 1
            one_hot.append(letter)
        return one_hot

    print(convert_to_onehot("hello world"))
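Since tf.one_hot accepts integer indices, the same character-to-int mapping can also feed it directly, which answers the TensorFlow part of the question. This is a minimal sketch, assuming eager execution (the TensorFlow 2 default) and an alphabet that includes a space; chars_to_onehot is a hypothetical helper name.

```python
import tensorflow as tf

# Assumed alphabet: letters, digits and a space.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "
char_to_int = {c: i for i, c in enumerate(alphabet)}


def chars_to_onehot(text):
    # Step 1: characters to integer indices (KeyError on unknown characters).
    indices = [char_to_int[c] for c in text]
    # Step 2: integer indices to one-hot rows of width len(alphabet).
    return tf.one_hot(indices, depth=len(alphabet))


encoded = chars_to_onehot("hello world")
print(encoded.shape)  # (11, 37)
```

The result is a float32 tensor with one row per character, so it can be passed to a model without converting from nested Python lists.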