How can I convert the "information" column from the `SHOW TABLE EXTENDED LIKE 'employe*'` example in Azure Databricks? I'd appreciate some input

Question (votes: 0, answers: 1)

SQL statement

SHOW TABLE EXTENDED LIKE 'employe*';

  • Output
database tableName isTemporary                          information
 -------- --------- ----------- --------------------------------------------------------------
 default  employee  false       Database: default
                                Table: employee
                                Owner: root
                                Created Time: Fri Aug 30 15:10:21 IST 2019
                                Last Access: Thu Jan 01 05:30:00 IST 1970
                                Created By: Spark 3.0.0
                                Type: MANAGED
                                Provider: hive
                                Table Properties: [transient_lastDdlTime=1567158021]
                                Location: file:/opt/spark1/spark/spark-warehouse/employee
                                Serde Library: org.apache.hadoop.hive.serde2.lazy
                                .LazySimpleSerDe
                                InputFormat: org.apache.hadoop.mapred.TextInputFormat
                                OutputFormat: org.apache.hadoop.hive.ql.io
                                .HiveIgnoreKeyTextOutputFormat
                                Storage Properties: [serialization.format=1]
                                Partition Provider: Catalog
                                Partition Columns: [`grade`]
                                Schema: root
                                  -- name: string (nullable = true)
                                  -- grade: integer (nullable = true)

  • What I tried in PySpark: I set "allowUnquotedFieldNames" to "true" because the output of the SQL statement is not in JSON/dictionary format. I built a schema based on the output. I also thought that setting "primitiveAsString" to "true" might help parse all the other non-Latin characters.
spark.sql("SHOW TABLE EXTENDED LIKE '*'") \
    .withColumn('info1', F.from_json(F.col('info'), schema=schema_json,
        options={"allowUnquotedFieldNames": "true", "primitiveAsString": "true", "linesep": "\n"}))
This leaves every field of the resulting object null.

object Catalog: null Database: null Table: null Owner: null Created_Time: null Last_Access: null Created_By: null Type: null Provider: null Table_Properties: null Location: null Serde_Library: null InputFormat: null OutputFormat: null Parition_Provider: null Schema: null
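For context, the reason from_json returns all nulls is that the "information" column is plain "Key: Value" text, one pair per line, not JSON. A minimal pure-Python sketch (the sample text is abridged from the output above) shows the column can be split into a dict with no JSON parsing at all:

```python
# The "information" column is newline-separated "Key: Value" text, not JSON,
# so from_json has nothing to parse. Splitting each line on the first ": " works:
info = """Database: default
Table: employee
Owner: root
Created Time: Fri Aug 30 15:10:21 IST 2019
Type: MANAGED
Provider: hive
Location: file:/opt/spark1/spark/spark-warehouse/employee"""

parsed = {}
for line in info.splitlines():
    key, sep, value = line.partition(": ")
    if sep:  # skip lines that do not have a "Key: Value" shape
        parsed[key.strip()] = value.strip()

print(parsed["Type"])      # MANAGED
print(parsed["Database"])  # default
```

The same per-line split could be applied inside Spark (e.g. via a UDF), but that is a separate design choice from the regex approach in the accepted answer below.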
    
azure parsing pyspark databricks azure-databricks
1 Answer (votes: 0)
Parsing the "information" column from the output of SHOW TABLE EXTENDED LIKE 'employee*'

I tried the following approach:

from pyspark.sql.functions import regexp_extract

pattern = r"Database:\s*(\w+)\s*Table:\s*(\w+)\s*Owner:\s*(\w+)\s*Created Time:\s*(.*?)\s*Last Access:\s*(.*?)\s*Created By:\s*(.*?)\s*Type:\s*(.*?)\s*Provider:\s*(.*?)\s*Location:\s*(.*?)\s*Serde Library:\s*(.*?)\s*InputFormat:\s*(.*?)\s*OutputFormat:\s*(.*?)\s*Partition Provider:\s*(.*?)\s*Schema:\s*(.*)"

df = spark.sql("SHOW TABLE EXTENDED LIKE 'employe*'").select("information")

df_parsed = df.withColumn('database', regexp_extract(df['information'], pattern, 1)) \
    .withColumn('tableName', regexp_extract(df['information'], pattern, 2)) \
    .withColumn('owner', regexp_extract(df['information'], pattern, 3)) \
    .withColumn('createdTime', regexp_extract(df['information'], pattern, 4)) \
    .withColumn('lastAccess', regexp_extract(df['information'], pattern, 5)) \
    .withColumn('createdBy', regexp_extract(df['information'], pattern, 6)) \
    .withColumn('type', regexp_extract(df['information'], pattern, 7)) \
    .withColumn('provider', regexp_extract(df['information'], pattern, 8)) \
    .withColumn('location', regexp_extract(df['information'], pattern, 9)) \
    .withColumn('serdeLibrary', regexp_extract(df['information'], pattern, 10)) \
    .withColumn('inputFormat', regexp_extract(df['information'], pattern, 11)) \
    .withColumn('outputFormat', regexp_extract(df['information'], pattern, 12)) \
    .withColumn('partitionProvider', regexp_extract(df['information'], pattern, 13)) \
    .withColumn('schema', regexp_extract(df['information'], pattern, 14))

df_parsed = df_parsed.drop('information')
df_parsed.show(truncate=False)

Result:

tableName:         employee
database:          default
owner:             root
createdTime:       Fri Aug 30 15:10:21 IST 2019
lastAccess:        Thu Jan 01 05:30:00 IST 1970
createdBy:         Spark 3.0.0
type:              MANAGED
provider:          hive Table Properties: [transient_lastDdlTime=1567158021]
location:          file:/opt/spark1/spark/spark-warehouse/employee
serdeLibrary:      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
inputFormat:       org.apache.hadoop.mapred.TextInputFormat
outputFormat:      org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties: [serialization.format=1]
partitionProvider: Catalog Partition Columns: ['grade']
schema:            root -- name: string (nullable = true) -- grade: integer (nullable = true)
In the code above, a regular expression pattern is defined to match the structure of the "information" column in the output of the SHOW TABLE EXTENDED command.

The "information" column is parsed with regexp_extract, provided the data follows a consistent format. A DataFrame containing the "information" column is created first.
