Error when saving a Parquet file as a table in PySpark

Question (votes: 0, answers: 1)

I am reading a CSV file like this:

df = spark.read.csv(path = '/mycsv.csv', header = True)

Then I save it to a database:

%sql
CREATE DATABASE IF NOT EXISTS MY_DB

df.write.saveAsTable("MY_DB.mycsv")

This works fine.

Now I do the same thing with a Parquet file:

df = spark.read.format("parquet").load(path = '/sample.parquet', header = True)

Then:

df.write.saveAsTable("MY_DB.sample")

It gives me this error:


AnalysisException: 
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your
schema. 
Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'.

For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping
Or you can use alias to rename it.

What does this mean?

Update

Calling printSchema() on the DataFrame read from the Parquet file shows:

root
 |-- Region: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Item Type: string (nullable = true)
 |-- Sales Channel/test: string (nullable = true)
 |-- Order Priority: string (nullable = true)
 |-- Order Date: date (nullable = true)
 |-- Order ID: integer (nullable = true)
 |-- Ship Date: date (nullable = true)
 |-- Units Sold: integer (nullable = true)
 |-- Unit Price: double (nullable = true)
 |-- Unit Cost: double (nullable = true)
 |-- Total Revenue: double (nullable = true)
 |-- Total Cost: double (nullable = true)
 |-- Total Profit: double (nullable = true)

For the CSV file it shows the following (the CSV and the Parquet file are two different files):

root
 |-- HashKey: string (nullable = true)
 |-- GLKey: string (nullable = true)
 |-- AccountingDateKey: string (nullable = true)
 |-- MainAccountKey: string (nullable = true)
 |-- LocationKey: string (nullable = true)
 |-- BusinessUnitKey: string (nullable = true)
 |-- DepartmentKey: string (nullable = true)
 |-- CompanyKey: string (nullable = true)
 |-- FinancialHierarchyKey: string (nullable = true)
 |-- FinancialSLIDKey: string (nullable = true)
 |-- FinancialTaxKey: string (nullable = true)
 |-- FinancialPayrollKey: string (nullable = true)
 |-- FinancialCustomerKey: string (nullable = true)
 |-- FinancialVendorKey: string (nullable = true)
 |-- FinancialBankKey: string (nullable = true)
 |-- FinancialInventoryKey: string (nullable = true)
 |-- FinancialIntangiblesKey: string (nullable = true)
 |-- FinancialBankSubKey: string (nullable = true)
 |-- DimGLKey: string (nullable = true)
 |-- DWCreatedDateTime: string (nullable = true)
python pyspark parquet
1 Answer

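The column names are what cause the problem. Several columns in the Parquet schema contain a space (for example, Item Type), and a space is one of the invalid characters listed in the error message. Renaming the columns with select and a list comprehension replaces the spaces with underscores, so the column should now come back as Item_Type.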

from pyspark.sql import functions as F

# Replace spaces in every column name with underscores, e.g. "Item Type" -> "Item_Type"
renamed_df = df.select([F.col(c).alias(c.replace(' ', '_')) for c in df.columns])
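
If other files might contain more of the characters named in the error message (not only spaces), the renaming could be generalized. The following is only a sketch: the helper name sanitize_column_names and the choice of underscore as the replacement are assumptions for illustration, and the character set is copied from the error text in the question. The name-cleaning part is plain Python, so it can be checked without a Spark session; the DataFrame calls are left as comments.

```python
import re

# Characters rejected according to the error message: space , ; { } ( ) newline tab =
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_names(names):
    """Return new column names with every invalid character replaced by '_'."""
    return [INVALID_CHARS.sub("_", name) for name in names]


# A few column names taken from the Parquet schema in the question.
parquet_columns = ["Region", "Item Type", "Sales Channel/test", "Order ID"]
clean_columns = sanitize_column_names(parquet_columns)
print(clean_columns)

# With a live DataFrame (not executed here), the new names could be applied as:
# renamed_df = df.toDF(*sanitize_column_names(df.columns))
# renamed_df.write.saveAsTable("MY_DB.sample")
```

Note that the slash in Sales Channel/test is left untouched, because "/" is not among the characters the error message lists. If two cleaned names happened to collide, they would need to be disambiguated by hand.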