SQL statement:
SHOW TABLE EXTENDED LIKE 'employe*';
database  tableName  isTemporary  information
--------  ---------  -----------  --------------------------------------------------------------
default   employee   false        Database: default
Table: employee
Owner: root
Created Time: Fri Aug 30 15:10:21 IST 2019
Last Access: Thu Jan 01 05:30:00 IST 1970
Created By: Spark 3.0.0
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1567158021]
Location: file:/opt/spark1/spark/spark-warehouse/employee
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`grade`]
Schema: root
 |-- name: string (nullable = true)
 |-- grade: integer (nullable = true)
from pyspark.sql import functions as F

spark.sql("SHOW TABLE EXTENDED LIKE '*'").withColumn(
    'info1',
    F.from_json(
        F.col('information'),  # the output column is named 'information'
        schema=schema_json,    # schema_json defined elsewhere
        options={
            "allowUnquotedFieldNames": "true",
            "primitiveAsString": "true",
            "linesep": "\n",
        },
    ),
)
This results in every field of the parsed object being null:
object
Catalog: null
Database: null
Table: null
Owner: null
Created_Time: null
Last_Access: null
Created_By: null
Type: null
Provider: null
Table_Properties: null
Location: null
Serde_Library: null
InputFormat: null
OutputFormat: null
Parition_Provider: null
Schema: null
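The nulls follow from the shape of the data: the `information` column is plain `Key: Value` text, not JSON, so `from_json` cannot match any field of the schema. A minimal pure-Python sketch of line-based parsing is shown below (the helper name `parse_information` and the sample string are illustrative; the function could be wrapped in a UDF to apply it to the DataFrame column):

```python
# The `information` column is plain "Key: Value" text, not JSON.
# Sketch: split into lines and build a dict; continuation lines such as
# the schema's "|--" entries are appended to the last seen key.
def parse_information(info: str) -> dict:
    fields = {}
    last_key = None
    for line in info.splitlines():
        if ": " in line and not line.lstrip().startswith("|--"):
            key, _, value = line.partition(": ")
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # continuation of a multi-line value (e.g. the schema tree)
            fields[last_key] += "\n" + line.strip()
    return fields

# Sample fragment taken from the SHOW TABLE EXTENDED output above.
sample = (
    "Database: default\n"
    "Table: employee\n"
    "Owner: root\n"
    "Type: MANAGED\n"
    "Schema: root\n"
    " |-- name: string (nullable = true)\n"
    " |-- grade: integer (nullable = true)"
)
parsed = parse_information(sample)
print(parsed["Database"])  # default
print(parsed["Type"])      # MANAGED
```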
To parse the `information` column, I also tried the following approach:
from pyspark.sql.functions import regexp_extract
pattern = r"Database:\s*(\w+)\s*Table:\s*(\w+)\s*Owner:\s*(\w+)\s*Created Time:\s*(.*?)\s*Last Access:\s*(.*?)\s*Created By:\s*(.*?)\s*Type:\s*(.*?)\s*Provider:\s*(.*?)\s*Location:\s*(.*?)\s*Serde Library:\s*(.*?)\s*InputFormat:\s*(.*?)\s*OutputFormat:\s*(.*?)\s*Partition Provider:\s*(.*?)\s*Schema:\s*(.*)"
df = spark.sql("SHOW TABLE EXTENDED LIKE 'employe*'").select("information")
df_parsed = df.withColumn('database', regexp_extract(df['information'], pattern, 1)) \
.withColumn('tableName', regexp_extract(df['information'], pattern, 2)) \
.withColumn('owner', regexp_extract(df['information'], pattern, 3)) \
.withColumn('createdTime', regexp_extract(df['information'], pattern, 4)) \
.withColumn('lastAccess', regexp_extract(df['information'], pattern, 5)) \
.withColumn('createdBy', regexp_extract(df['information'], pattern, 6)) \
.withColumn('type', regexp_extract(df['information'], pattern, 7)) \
.withColumn('provider', regexp_extract(df['information'], pattern, 8)) \
.withColumn('location', regexp_extract(df['information'], pattern, 9)) \
.withColumn('serdeLibrary', regexp_extract(df['information'], pattern, 10)) \
.withColumn('inputFormat', regexp_extract(df['information'], pattern, 11)) \
.withColumn('outputFormat', regexp_extract(df['information'], pattern, 12)) \
.withColumn('partitionProvider', regexp_extract(df['information'], pattern, 13)) \
.withColumn('schema', regexp_extract(df['information'], pattern, 14))
df_parsed = df_parsed.drop('information')
df_parsed.show(truncate=False)
Result:
tableName:          employee
database:           default
owner:              root
createdTime:        Fri Aug 30 15:10:21 IST 2019
lastAccess:         Thu Jan 01 05:30:00 IST 1970
createdBy:          Spark 3.0.0
type:               MANAGED
provider:           hive
                    Table Properties: [transient_lastDdlTime=1567158021]
location:           file:/opt/spark1/spark/spark-warehouse/employee
serdeLibrary:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
inputFormat:        org.apache.hadoop.mapred.TextInputFormat
outputFormat:       org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    Storage Properties: [serialization.format=1]
partitionProvider:  Catalog
                    Partition Columns: [`grade`]
schema:             root
                    |-- name: string (nullable = true)
                    |-- grade: integer (nullable = true)
In the code above, a DataFrame containing only the `information` column is created, and a single regular expression pattern is defined to match the structure of that column in the `SHOW TABLE EXTENDED` output. `regexp_extract` then pulls out each field, provided the data follows a consistent layout. As the result shows, however, lines the pattern does not account for (such as `Table Properties` and `Storage Properties`) bleed into the neighbouring captures.