I am processing a CSV file like this:
df = spark.read.csv(path = '/mycsv.csv', header = True)
and then saving it to a database:
%sql
CREATE DATABASE IF NOT EXISTS MY_DB
and
df.write.saveAsTable("MY_DB.mycsv")
This works fine.
Now for a Parquet file, I do the same thing:
df = spark.read.format("parquet").load(path = '/sample.parquet', header = True)
and then
df.write.saveAsTable("MY_DB.sample")
but it gives me this error:
AnalysisException:
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your
schema.
Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'.
For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping
Or you can use alias to rename it.
What does this mean?
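The error means the Parquet file's column names contain characters that Delta tables (the default table format on Databricks) reject; the list in the message is `" ,;{}()\n\t="`, and here the offenders are spaces. Besides renaming the columns, the message suggests enabling column mapping so Delta stores the names as-is. A sketch of that route, under the assumption that the `spark.databricks.delta.properties.defaults.*` Spark confs apply the corresponding table properties at table creation (verify against your Delta/Databricks version; column mapping also requires the protocol upgrade set below):

```python
# Assumption: these confs set default Delta table properties for new tables,
# per the Delta/Databricks docs. Column mapping mode "name" keeps the original
# column names (spaces included) and needs reader v2 / writer v5.
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode", "name")
spark.conf.set("spark.databricks.delta.properties.defaults.minReaderVersion", "2")
spark.conf.set("spark.databricks.delta.properties.defaults.minWriterVersion", "5")

df.write.saveAsTable("MY_DB.sample")
```

Note that once column mapping is enabled, older readers cannot open the table, so renaming the columns (as below) is often the simpler fix.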
Update
printSchema() on the Parquet file shows:
root
|-- Region: string (nullable = true)
|-- Country: string (nullable = true)
|-- Item Type: string (nullable = true)
|-- Sales Channel/test: string (nullable = true)
|-- Order Priority: string (nullable = true)
|-- Order Date: date (nullable = true)
|-- Order ID: integer (nullable = true)
|-- Ship Date: date (nullable = true)
|-- Units Sold: integer (nullable = true)
|-- Unit Price: double (nullable = true)
|-- Unit Cost: double (nullable = true)
|-- Total Revenue: double (nullable = true)
|-- Total Cost: double (nullable = true)
|-- Total Profit: double (nullable = true)
For the CSV file it shows (the CSV and the Parquet are different files):
root
|-- HashKey: string (nullable = true)
|-- GLKey: string (nullable = true)
|-- AccountingDateKey: string (nullable = true)
|-- MainAccountKey: string (nullable = true)
|-- LocationKey: string (nullable = true)
|-- BusinessUnitKey: string (nullable = true)
|-- DepartmentKey: string (nullable = true)
|-- CompanyKey: string (nullable = true)
|-- FinancialHierarchyKey: string (nullable = true)
|-- FinancialSLIDKey: string (nullable = true)
|-- FinancialTaxKey: string (nullable = true)
|-- FinancialPayrollKey: string (nullable = true)
|-- FinancialCustomerKey: string (nullable = true)
|-- FinancialVendorKey: string (nullable = true)
|-- FinancialBankKey: string (nullable = true)
|-- FinancialInventoryKey: string (nullable = true)
|-- FinancialIntangiblesKey: string (nullable = true)
|-- FinancialBankSubKey: string (nullable = true)
|-- DimGLKey: string (nullable = true)
|-- DWCreatedDateTime: string (nullable = true)
The problem was the column names: several of them contain spaces, which Delta does not allow. After renaming them with select and a list comprehension, a column such as Item Type comes back as Item_Type:
from pyspark.sql import functions as F
renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
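Note that the error message lists more forbidden characters than just the space (`,;{}()\n\t=` as well). A more general sanitizer, as a plain-Python sketch (the helper name `sanitize` is hypothetical; apply it with the same `select`/`alias` pattern as above):

```python
import re

# Characters Delta rejects in column names, per the error message:
# space , ; { } ( ) newline tab =
INVALID = r'[ ,;{}()\n\t=]'

def sanitize(name: str) -> str:
    """Replace every character Delta disallows with an underscore."""
    return re.sub(INVALID, '_', name)

print(sanitize('Item Type'))   # Item_Type
print(sanitize('Order Date'))  # Order_Date

# With Spark, same pattern as above (hypothetical usage):
# renamed_df = df.select([F.col(c).alias(sanitize(c)) for c in df.columns])
```

This keeps the rename logic in one place, so if a future file has, say, parentheses in a header, the same write still succeeds.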