SQL statement:
SHOW TABLE EXTENDED LIKE 'employe*';
database  tableName  isTemporary  information
--------  ---------  -----------  --------------------------------------------------------------
default   employee   false        Database: default
Table: employee
Owner: root
Created Time: Fri Aug 30 15:10:21 IST 2019
Last Access: Thu Jan 01 05:30:00 IST 1970
Created By: Spark 3.0.0
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1567158021]
Location: file:/opt/spark1/spark/spark-warehouse/employee
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`grade`]
Schema: root
 |-- name: string (nullable = true)
 |-- grade: integer (nullable = true)
from pyspark.sql import functions as F

spark.sql("SHOW TABLE EXTENDED LIKE '*'").withColumn(
    'info1',
    F.from_json(
        F.col('information'),  # the output column is named 'information'
        schema=schema_json,    # schema_json defined elsewhere
        options={
            "allowUnquotedFieldNames": "true",
            "primitiveAsString": "true",
            "linesep": "\n",
        },
    ),
)
This results in every field of the parsed object being null:
object
Catalog: null
Database: null
Table: null
Owner: null
Created_Time: null
Last_Access: null
Created_By: null
Type: null
Provider: null
Table_Properties: null
Location: null
Serde_Library: null
InputFormat: null
OutputFormat: null
Parition_Provider: null
Schema: null
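The nulls follow from the shape of the data: the `information` column is plain `Key: Value` text, not JSON, so `from_json` cannot match any field of the schema. A minimal pure-Python sketch of line-based parsing is shown below (the helper name `parse_information` and the sample string are illustrative; the function could be wrapped in a UDF to apply it to the DataFrame column):

```python
# The `information` column is plain "Key: Value" text, not JSON.
# Sketch: split into lines and build a dict; continuation lines such as
# the schema's "|--" entries are appended to the last seen key.
def parse_information(info: str) -> dict:
    fields = {}
    last_key = None
    for line in info.splitlines():
        if ": " in line and not line.lstrip().startswith("|--"):
            key, _, value = line.partition(": ")
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # continuation of a multi-line value (e.g. the schema tree)
            fields[last_key] += "\n" + line.strip()
    return fields

# Sample fragment taken from the SHOW TABLE EXTENDED output above.
sample = (
    "Database: default\n"
    "Table: employee\n"
    "Owner: root\n"
    "Type: MANAGED\n"
    "Schema: root\n"
    " |-- name: string (nullable = true)\n"
    " |-- grade: integer (nullable = true)"
)
parsed = parse_information(sample)
print(parsed["Database"])  # default
print(parsed["Type"])      # MANAGED
```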
To parse the `information` column, I also tried the following approach:
from pyspark.sql.functions import regexp_extract
pattern = r"Database:\s*(\w+)\s*Table:\s*(\w+)\s*Owner:\s*(\w+)\s*Created Time:\s*(.*?)\s*Last Access:\s*(.*?)\s*Created By:\s*(.*?)\s*Type:\s*(.*?)\s*Provider:\s*(.*?)\s*Location:\s*(.*?)\s*Serde Library:\s*(.*?)\s*InputFormat:\s*(.*?)\s*OutputFormat:\s*(.*?)\s*Partition Provider:\s*(.*?)\s*Schema:\s*(.*)"
df = spark.sql("SHOW TABLE EXTENDED LIKE 'employe*'").select("information")
df_parsed = df.withColumn('database', regexp_extract(df['information'], pattern, 1)) \
.withColumn('tableName', regexp_extract(df['information'], pattern, 2)) \
.withColumn('owner', regexp_extract(df['information'], pattern, 3)) \
.withColumn('createdTime', regexp_extract(df['information'], pattern, 4)) \
.withColumn('lastAccess', regexp_extract(df['information'], pattern, 5)) \
.withColumn('createdBy', regexp_extract(df['information'], pattern, 6)) \
.withColumn('type', regexp_extract(df['information'], pattern, 7)) \
.withColumn('provider', regexp_extract(df['information'], pattern, 8)) \
.withColumn('location', regexp_extract(df['information'], pattern, 9)) \
.withColumn('serdeLibrary', regexp_extract(df['information'], pattern, 10)) \
.withColumn('inputFormat', regexp_extract(df['information'], pattern, 11)) \
.withColumn('outputFormat', regexp_extract(df['information'], pattern, 12)) \
.withColumn('partitionProvider', regexp_extract(df['information'], pattern, 13)) \
.withColumn('schema', regexp_extract(df['information'], pattern, 14))
df_parsed = df_parsed.drop('information')
df_parsed.show(truncate=False)
Result:
tableName:          employee
database:           default
owner:              root
createdTime:        Fri Aug 30 15:10:21 IST 2019
lastAccess:         Thu Jan 01 05:30:00 IST 1970
createdBy:          Spark 3.0.0
type:               MANAGED
provider:           hive
                    Table Properties: [transient_lastDdlTime=1567158021]
location:           file:/opt/spark1/spark/spark-warehouse/employee
serdeLibrary:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
inputFormat:        org.apache.hadoop.mapred.TextInputFormat
outputFormat:       org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    Storage Properties: [serialization.format=1]
partitionProvider:  Catalog
                    Partition Columns: [`grade`]
schema:             root
                    |-- name: string (nullable = true)
                    |-- grade: integer (nullable = true)
In the code above, a DataFrame containing only the `information` column is created, and a single regular expression pattern is defined to match the structure of that column in the `SHOW TABLE EXTENDED` output. `regexp_extract` then pulls out each field, provided the data follows a consistent layout. As the result shows, however, lines the pattern does not account for (such as `Table Properties` and `Storage Properties`) bleed into the neighbouring captures.