How do I convert a list of dictionaries into a TensorFlow dataset?


I have a .jsonl dataset that I am trying to convert into a TensorFlow dataset.

Each line of the .jsonl has the form

{"text": "some text", "meta": "irrelevant"}

I need to get this into a TensorFlow dataset where each element has a key "text" associated with a tf.string value.
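In other words, the goal is a dataset whose elements look roughly like the minimal sketch below (the sample strings are made up; from_tensor_slices is only used here to illustrate the target element type):

import tensorflow as tf

# each element is a dict with a "text" key mapping to a scalar tf.string tensor
target = tf.data.Dataset.from_tensor_slices({"text": ["some text", "other text"]})
for elem in target.take(1):
    print(elem)  # {'text': <tf.Tensor: shape=(), dtype=string, numpy=b'some text'>}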

The closest I have gotten is the following:

import tensorflow as tf

ds = tf.data.TextLineDataset('train_mini.jsonl')

def f(tnsr):
    text = eval(tnsr.numpy())['text']
    return tf.constant(text)
    #return {'text':text}

ds = ds.map(lambda x: tf.py_function(func=f,inp=[x], Tout=tf.string))

ds = tf.data.Dataset({"text": list(ds.as_numpy_iterator())})

which throws the following error:

InvalidArgumentError: ValueError: Error converting unicode string while converting Python sequence to Tensor.
Traceback (most recent call last):

  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 241, in __call__
    return func(device, token, args)

  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 130, in __call__
    ret = self._func(*args)

  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 309, in wrapper
    return func(*args, **kwargs)

  File "/home/crytting/persuasion/json_to_tfds.py", line 7, in f
    return tf.constant(text)

  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 262, in constant
    allow_broadcast=True)

  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 270, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)

  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)

ValueError: Error converting unicode string while converting Python sequence to Tensor.


         [[{{node EagerPyFunc}}]]


I have tried a lot of ways to do this and nothing works. It seems like it should not be this hard, and I am wondering whether I am missing something really simple.

python json tensorflow tensorflow-datasets
1 Answer

0 votes

As of now (02/2024) you can use pandas together with TensorFlow to make this simple.

import tensorflow as tf
import pandas as pd

# assume `data` is the list of dicts coming from your .jsonl file,
# e.g. from calling json.loads() on each line
df = pd.DataFrame.from_records(data)
dataset = tf.data.Dataset.from_tensor_slices(df.to_dict(orient='list'))
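
If you would rather stay entirely inside tf.data and stick closer to the original TextLineDataset attempt, a minimal sketch using json.loads instead of eval could look like the following (the file name train_mini.jsonl comes from the question; the helper name parse_text is illustrative):

import json
import tensorflow as tf

ds = tf.data.TextLineDataset('train_mini.jsonl')

def parse_text(line):
    # `line` is a scalar tf.string tensor holding one JSON object;
    # decode it, parse the JSON, and return the "text" field as UTF-8 bytes
    return json.loads(line.numpy().decode('utf-8'))['text'].encode('utf-8')

# wrap the Python parser in tf.py_function and package the result as a dict
# so each element has the form {'text': <tf.string tensor>}
ds = ds.map(lambda line: {
    'text': tf.py_function(func=parse_text, inp=[line], Tout=tf.string)
})

for elem in ds.take(1):
    print(elem['text'])

Note that tensors produced by tf.py_function carry no static shape, so downstream ops that need a known shape may require setting it explicitly.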