Renaming the output file with the PySpark XML connector

Problem description

I am writing XML with the Spark XML connector. It works, but the output ends up as a part file (e.g. part-00000). In the code below, I need that file renamed to currentdatetime.xml.

import os
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

data = [
    (
        "xxx", "VN", "Goods", "bIKE", "Ignition", "20100", "13000",
        "1.0", "IT", "2024-01-23T13:15:30.45+01:00"
    ),
]

schema = [
    "PartNumber", "PartName", "Transport", "Vehicle", "Engine", "MaxWeight", "CylinderCapacity",
    "MsgVersion", "SenderID", "SendTime"
]

spark = SparkSession.builder.appName("example").getOrCreate()

df_nested = spark.createDataFrame(data, schema=schema) \
    .withColumn("GmfHeader", struct("MsgVersion", "SenderID", "SendTime")) \
    .withColumn("Product", struct("Transport", "Vehicle", "Engine", "MaxWeight", "CylinderCapacity"))

df_nested = df_nested.drop("MsgVersion", "SenderID", "SendTime", "Transport", "Vehicle", "Engine", "MaxWeight", "CylinderCapacity")

output_path = "/mnt/test102/xml_output/"

df_nested.coalesce(1).write \
    .format("xml") \
    .option("rootTag", "n1:Part") \
    .option("rowTag", "n1:PartMastInf") \
    .mode("overwrite") \
    .save(output_path)

print(f"XML files generated successfully at: {output_path}")

The code above generates a file in my storage account as shown in the attached screenshot. I want the XML file to be named currentdatetime.xml, and the _SUCCESS file should be removed. Please let me know if there is a way to do this in PySpark.

python xml pyspark azure-blob-storage azure-databricks
1 Answer

The _SUCCESS file is always created. You can delete it with

dbutils.fs.rm('<data lake url & container>/test102/xml_output/_SUCCESS')

Side note: since you are running this in Databricks, there is no need to create a Spark session yourself; one is already available as spark.
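The rename part of the question can be handled with the same post-processing idea: after the write finishes, locate the single part file in the output directory, move it to a timestamped .xml name, and delete the marker files. Below is a minimal sketch of that pattern using local filesystem calls; the helper name finalize_xml_output is my own, not a Spark or Databricks API. On Databricks you would either use dbutils.fs.ls / dbutils.fs.mv / dbutils.fs.rm with the DBFS path, or keep these os calls and address the mount through the /dbfs/ prefix (e.g. /dbfs/mnt/test102/xml_output/).

```python
import glob
import os
from datetime import datetime


def finalize_xml_output(output_path: str) -> str:
    """Rename the single part-* file to <currentdatetime>.xml and drop _SUCCESS.

    Assumes the DataFrame was written with coalesce(1), so exactly one
    part file exists in output_path.
    """
    part_files = glob.glob(os.path.join(output_path, "part-*"))
    if len(part_files) != 1:
        raise RuntimeError(f"Expected exactly one part file, found: {part_files}")

    # Build a timestamped target name, e.g. 20240123131530.xml
    target = os.path.join(
        output_path, datetime.now().strftime("%Y%m%d%H%M%S") + ".xml"
    )
    os.rename(part_files[0], target)

    # Remove the _SUCCESS marker if Spark created one
    success_marker = os.path.join(output_path, "_SUCCESS")
    if os.path.exists(success_marker):
        os.remove(success_marker)

    return target
```

The dbutils equivalent of the rename would be dbutils.fs.mv(part_file, target); the overall flow (list, move, remove marker) is the same.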
