Cannot parse JSON from a Google Cloud Storage bucket with BigQuery

Problem description (votes: 0, answers: 2)

I upload the attached JSON from my backend to a Google Cloud Storage bucket. Now I am trying to connect this JSON to a BigQuery table, but I get the error below. What changes do I need to make?

Error while reading table: XXXXX, error message: Failed to parse JSON: No object found when new array is started.; BeginArray returned false; Parser terminated before end of string

[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"],["video_screen","click_on_screen","false","202011230213","1","4","0"],["video_screen","click_on_screen","false","202011230633","1","4","0"],["video_screen","click_on_screen","false","202011230709","1","4","0"],["video_screen","click_on_screen","false","202011230712","1","4","0"],["video_screen","click_on_screen","false","202011230723","1","4","0"],["video_screen","click_on_screen","false","202011230725","1","4","0"],["video_screen","click_on_screen","false","202011231739","1","4","0"],["category","select","MTV","202011232228","1","3","0"],["sign_in","enter","user_details","202011230108","2","3","0"],["sign_in","enter","user_details","202011230442","2","3","0"],["video","select","youtube","202011230108","1","3","0"],["video","select","youtube","202011230633","1","3","0"],["video_screen","click_on_screen","false","202011230458","1","3","0"],["video_screen","click_on_screen","false","202011230552","1","3","0"],["video_screen","click_on_screen","false","202011230612","1","3","0"],["video_screen","click_on_screen","false","202011231740","1","3","0"],["category","select","Disney Karaoke","202011232228","1","2","0"],["category","select","Duet","202011232228","1","2","0"],["category","select","Free","202011230726","1","2","0"],["category","select","Free","202011231830","2","2","0"],["category","select","Free","202011232228","1","2","0"],["category","select","Love","202011232228","1","2","0"],["category","select","New","202011232228","1","2","0"],["category","select","Pitch Perfect 2","202011232228","1","2","0"],["developer","click","hithub","202011230749","1","2","0"],["sign_in","enter","user_details","202011230134","1","2","0"],["sign_in","enter","user_details","202011230211","1","2","0"],["sign_in","enter","user_details","202011230219","1","2","0"]]
json google-bigquery google-cloud-storage
2 Answers

2 votes

BigQuery reads JSONL (newline-delimited JSON) files. The example is not in that format (see the conversion sketch after this list).

  1. JSONL uses \n as the separator between records. The example is all on one line, with the records separated by commas.
  2. Each JSONL line is a JSON object, so it starts with { and ends with }. The example contains JSON arrays, which are not supported.
  3. JSONL is based on JSON, so every data element needs a name. The first record might therefore look like:
     { "field1_name": "video_screen", "field2_name": "click_on_screen", "field3_name": false, "field4_name": 202011231958, "field5_name": 1, "field6_name": 43, "field7_name": 0}
  4. JSONL has no outermost pair of brackets []. The first line starts with {, not [{, and the last line ends with }, not }].
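For illustration, here is a minimal sketch that rewrites the posted array-of-arrays as newline-delimited JSON. The field names field1_name … field7_name and the file names events.json / events.jsonl are placeholders, since the real column names and paths are not given in the question:

import json

# Placeholder field names -- replace them with the real column names.
FIELDS = ["field1_name", "field2_name", "field3_name", "field4_name",
          "field5_name", "field6_name", "field7_name"]

def rows_to_jsonl(rows):
    # Emit one JSON object per line, which is the format BigQuery expects.
    return "\n".join(json.dumps(dict(zip(FIELDS, row))) for row in rows)

with open("events.json") as src, open("events.jsonl", "w") as dst:
    dst.write(rows_to_jsonl(json.load(src)))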

0 votes

Here is a Python solution, based on Steven's answer, that converts the JSON file in GCS into a file BigQuery can import. bucket, gcp_prefix, source_file and modified_name are variables specific to your GCS project; gcp_prefix is not needed if the file sits in the bucket root:

import json

from google.cloud import storage
from io import BytesIO


def get_storage_client(bucket_name):
    # Create a GCS client and return a handle to the bucket.
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    return bucket


def download_from_gcs(bucket_name, file_name):
    # Download the source object into an in-memory buffer.
    file_io = BytesIO()
    bucket = get_storage_client(bucket_name)
    blob = bucket.blob(file_name)
    blob.download_to_file(file_io)

    return file_io


def upload_to_gcs(bucket_name, file_name, file_io):
    # Upload the converted file and return its gs:// URI.
    bucket = get_storage_client(bucket_name)
    blob = bucket.blob(file_name)
    blob.upload_from_file(file_io, rewind=True)
    blob_id = f'gs://{bucket_name}/{file_name}'

    return blob_id


def generate_json_file(gcs_object):
    # Decode the downloaded JSON array and re-emit one JSON value per line.
    gcs_object.seek(0)
    decoded_json = json.loads(gcs_object.read().decode('utf-8'))
    content_string = [json.dumps(row) for row in decoded_json]
    json_content = '\n'.join(content_string)

    return json_content


file_object = download_from_gcs(bucket, source_file)
modified_json = generate_json_file(file_object)
binary_json = BytesIO(modified_json.encode('utf-8'))
blob_id = upload_to_gcs(bucket, f'{gcp_prefix}/{modified_name}', binary_json)
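Once the converted file is back in the bucket, it can be loaded into BigQuery with the Python client. A minimal sketch, assuming schema autodetection and a placeholder table ID my_dataset.my_table (blob_id is the gs:// URI returned by upload_to_gcs above):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSONL records
)

load_job = client.load_table_from_uri(blob_id, "my_dataset.my_table", job_config=job_config)
load_job.result()  # wait for the load job to finish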