Counting occurrences of each of the 26 letters per word in NumPy

Question

I have this NumPy array stored in wordlist_arr:

[b'aabi\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
 b'aabinomin\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
 b'aaji\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
 ...
 b'zus\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
 b'zuzumo\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
 b'zuzuni\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00abcdefghijklmnopqrstuvwxyz']
(71729,)
|S64

I would like to count the 26 letters for each word in the list, so the output shape would be (71729, 26):

[[3,2,1,1,1,1,1,1,2,1,...,1],
...
]

'''
In [3,2,1,1,1,1,1,1,2,1,...,1],
index 0 is the number of "a" counted, index 1 is the number of "b" counted, and so on...
'''

I changed the view and reshaped the array so that it can be inspected character by character:

wordlist.view('S1').reshape((wordlist.size, -1))

'''
[[b'a' b'a' b'b' ... b'x' b'y' b'z']
 [b'a' b'a' b'b' ... b'x' b'y' b'z']
 [b'a' b'a' b'j' ... b'x' b'y' b'z']
 ...
 [b'z' b'u' b's' ... b'x' b'y' b'z']
 [b'z' b'u' b'z' ... b'x' b'y' b'z']
 [b'z' b'u' b'z' ... b'x' b'y' b'z']]
(71729, 64)
|S1
'''

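As you can see, I appended abcdefghijklmnopqrstuvwxyz to the end of every entry so that np.unique would report a count of at least 1 for each letter. But that does not seem to work; instead it returns every character used anywhere in the array: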

[b'' b'a' b'b' b'c' b'd' b'e' b'f' b'g' b'h' b'i' b'j' b'k' b'l' b'm' b'n'
 b'o' b'p' b'q' b'r' b's' b't' b'u' b'v' b'w' b'x' b'y' b'z']
(27,)
|S1
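A likely explanation for the result above: when no axis is given, np.unique flattens its input, so it reports the distinct characters of the whole array rather than one set per word. The sketch below uses a small hypothetical two-word array (not the original data) to show the flattened result next to a single-row call with return_counts=True.

```python
import numpy as np

# Hypothetical sample; "S8" pads each word with null bytes, like "S64" in the question.
words = np.array([b"aabi", b"zus"], dtype="S8")
chars = words.view("S1").reshape(words.size, -1)  # shape (2, 8)

# Without an axis argument, np.unique works on the flattened array,
# so the values and counts describe all words together.
values, counts = np.unique(chars, return_counts=True)
print(values)
print(counts)

# Calling it on one row gives the counts for that word only.
# The empty bytes value b'' stands for the null padding.
row_values, row_counts = np.unique(chars[0], return_counts=True)
print(row_values)
print(row_counts)
```

Calling np.unique once per row would work, but it loops in Python and returns a different number of values for each word, which is why a fixed-width counting approach is a better fit for a (71729, 26) result.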

I tried another approach using np.char.count, but that does not seem to work either:

np.char.count(wordlist.view('S1').reshape(wordlist.size, -1), alphabet, axis=1)
# ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (71729, 64) and arg 1 with shape (26,).
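The shape mismatch arises because a (71729, 64) array and a (26,) array cannot be broadcast together. np.char.count counts substring occurrences element by element, so one possible workaround is to skip the per-character view and broadcast the original words, as a column, against the alphabet. This is a minimal sketch on hypothetical sample words; the names words and alphabet are assumptions, not the asker's variables.

```python
import numpy as np

# Hypothetical sample data with the same fixed-width bytes dtype as the question.
words = np.array([b"aabi", b"aabinomin", b"zuzumo"], dtype="S64")

# One single-byte string per letter: b'a', b'b', ..., b'z' (dtype S1).
alphabet = np.array([bytes([c]) for c in range(ord("a"), ord("z") + 1)])

# (3, 1) words broadcast against (1, 26) letters -> (3, 26) counts.
counts = np.char.count(words[:, None], alphabet[None, :])
print(counts.shape)
print(counts)
```

This scans every word once per letter, so it may not be the fastest option on 71729 words, but it produces the (n, 26) layout directly.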

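I could of course do this in plain Python, but I am looking for a NumPy-idiomatic way so that I can take advantage of NumPy's performance, since it is implemented in C.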

Answer 1

IIUC, you can do this:

Consider this array:

import numpy as np

s = np.array([b"aabbdd", b"aabbcc", b"eeffgg", b"xxyyzz"], dtype="S6")
# View the raw bytes as ASCII codes, one row per word
print(s.view("uint8").reshape(s.shape[0], -1))

Prints:

[[ 97  97  98  98 100 100]
 [ 97  97  98  98  99  99]
 [101 101 102 102 103 103]
 [120 120 121 121 122 122]]

Then:

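x = np.apply_along_axis(
    # minlength=128 zero-pads the counts; np.resize would instead fill the
    # tail with repeated copies, which miscounts when null padding is present
    lambda row: np.bincount(row, minlength=128)[97:123],  # ASCII 97 = 'a', 122 = 'z'
    axis=1,
    arr=s.view("uint8").reshape(s.shape[0], -1),
)
print(x)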

Prints:

[[2 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2]]
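Because np.apply_along_axis calls the Python function once per row, it may be slow on 71729 rows. One possible alternative, sketched here as an assumption rather than as part of the answer above, is to shift each row's byte codes into its own block of 256 bins so that a single np.bincount call handles all rows at once. The names codes, offsets, and letter_counts are illustrative.

```python
import numpy as np

s = np.array([b"aabbdd", b"aabbcc", b"eeffgg", b"xxyyzz"], dtype="S6")
codes = s.view(np.uint8).reshape(s.shape[0], -1)
n_rows = codes.shape[0]

# Row i is shifted by 256 * i, so its byte values land in bins
# [256 * i, 256 * i + 255] and never collide with another row.
offsets = 256 * np.arange(n_rows, dtype=np.intp)[:, None]
flat = (codes + offsets).ravel()

# One bincount over everything, then fold back to one row of 256 bins per word.
counts = np.bincount(flat, minlength=256 * n_rows).reshape(n_rows, 256)

# Keep only the bins for 'a'..'z'; bin 0 (null padding) is dropped here.
letter_counts = counts[:, ord("a"):ord("z") + 1]
print(letter_counts)
```

The intermediate counts array holds 256 integers per word, so this trades memory for speed. If the alphabet suffix from the question is still attached to every word, each count comes out one higher, and subtracting 1 gives the counts for the word alone.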
