I want to write XML using the Spark connector. I got it working, but it produces a part file (e.g. part-00000) that I need to rename to currentdatetime.xml in the code below.
import os
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
data = [
    (
        "xxx", "VN", "Goods", "bIKE", "Ignition", "20100", "13000",
        "1.0", "IT", "2024-01-23T13:15:30.45+01:00"
    ),
]
schema = [
    "PartNumber", "PartName", "Transport", "Vehicle", "Engine", "MaxWeight", "CylinderCapacity",
    "MsgVersion", "SenderID", "SendTime"
]
spark = SparkSession.builder.appName("example").getOrCreate()
df_nested = spark.createDataFrame(data, schema=schema) \
    .withColumn("GmfHeader", struct("MsgVersion", "SenderID", "SendTime")) \
    .withColumn("Product", struct("Transport", "Vehicle", "Engine", "MaxWeight", "CylinderCapacity"))
df_nested = df_nested.drop("MsgVersion", "SenderID", "SendTime", "Transport", "Vehicle", "Engine", "MaxWeight", "CylinderCapacity")
output_path = "/mnt/test102/xml_output/"
df_nested.coalesce(1).write \
    .format("xml") \
    .option("rootTag", "n1:Part") \
    .option("rowTag", "n1:PartMastInf") \
    .mode("overwrite") \
    .save(output_path)
print(f"XML files generated successfully at: {output_path}")
The code above generates a file in my storage account as shown in the image. I want the output to be a single XML file named currentdatetime.xml, and the _SUCCESS file should be deleted. Any ideas for doing this in PySpark would be appreciated.
The _SUCCESS file is always created by the Spark writer. You can delete it afterwards with:
dbutils.fs.rm('<data lake url & container>/test102/xml_output/_SUCCESS')
Side note: since you are running this in Databricks, there is no need to create a Spark session explicitly; one is already available as spark.
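For the rename itself, one option is a small post-processing step after the write: find the single part-* file (coalesce(1) guarantees exactly one), rename it to a timestamp-based name, and drop the _SUCCESS marker. The sketch below uses plain-Python filesystem calls and an illustrative helper name (finalize_xml_output), assuming the output path is accessible as a local/mounted filesystem; on Databricks the same three steps map to dbutils.fs.ls, dbutils.fs.mv, and dbutils.fs.rm.

```python
import glob
import os
from datetime import datetime

def finalize_xml_output(output_dir: str) -> str:
    """Rename the single part-* file to <timestamp>.xml and remove _SUCCESS.

    Assumes output_dir is reachable via the local filesystem (e.g. a
    mounted path); on Databricks use dbutils.fs.ls / mv / rm instead.
    """
    # Find the part file Spark wrote (coalesce(1) produces exactly one).
    part_file = glob.glob(os.path.join(output_dir, "part-*"))[0]
    # Build the target name from the current timestamp, e.g. 20240123131530.xml
    target = os.path.join(output_dir, datetime.now().strftime("%Y%m%d%H%M%S") + ".xml")
    os.rename(part_file, target)
    # Remove the _SUCCESS marker if present.
    success = os.path.join(output_dir, "_SUCCESS")
    if os.path.exists(success):
        os.remove(success)
    return target
```

You would call finalize_xml_output(output_path) right after df_nested.coalesce(1).write...save(output_path); doing the rename in the driver is the usual workaround, since Spark itself does not let you control the part-file name.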