I want to write a numpy array to an avro file. Here is a small example of the numpy array:
import numpy as np
import random
np_array = np.zeros((4,3), dtype=np.float32)
for i in range(4):
    for j in range(3):
        np_array[i, j] = random.gauss(0, 1)
print(np_array)
Output:
[[ 0.6490377 0.29544145 -1.109375 ]
[ 1.0881975 -0.39123887 -0.36691198]
[-1.2226632 0.8332004 0.2686829 ]
[ 1.5417658 0.4520132 -0.03081623]]
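For my use case, the numpy array has 5 million rows and 128 columns, so if possible I would like to write the array directly to avro, without spending memory on converting it to a dictionary and/or a Pandas DataFrame.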
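To give a sense of scale, here is a rough estimate of the raw size of that array. It assumes the full array uses float32 like the small example above; that dtype is an assumption for the real data, and the variable names are only illustrative.

```python
# Raw size of a 5,000,000 x 128 array, assuming float32 (4 bytes per value).
rows, cols = 5_000_000, 128
bytes_per_value = 4  # float32, as in the small example (assumed for the real data)

total_bytes = rows * cols * bytes_per_value
print(f"{total_bytes / 1e9:.2f} GB")  # prints "2.56 GB"
```

Any intermediate copy made of Python objects would come on top of this, which is why I want to avoid one.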
I was able to answer my own question.
import numpy as np
import random
np_array = np.zeros((4,3), dtype=np.float32)
for i in range(4):
    for j in range(3):
        np_array[i, j] = random.gauss(0, 1)
print(np_array)
Output:
[[ 0.6490377 0.29544145 -1.109375 ]
[ 1.0881975 -0.39123887 -0.36691198]
[-1.2226632 0.8332004 0.2686829 ]
[ 1.5417658 0.4520132 -0.03081623]]
import fastavro

# Each Avro record is one row of the array, stored as an array of floats.
schema_dict = {
    "doc": "test",
    "name": "test",
    "namespace": "test",
    "type": "array",
    "items": "float"
}
schema = fastavro.parse_schema(schema_dict)

# Write the rows of np_array as records.
with open(<filepath>, "wb") as f:
    fastavro.writer(f, schema, np_array)

# Read the file back, one record (row) at a time.
with open(<filepath>, "rb") as f:
    reader = fastavro.reader(f)
    for record in reader:
        print(record)
Output:
[ 0.6490377 0.29544145 -1.109375 ]
[ 1.0881975 -0.39123887 -0.36691198]
[-1.2226632 0.8332004 0.2686829 ]
[ 1.5417658 0.4520132 -0.03081623]