Usually the data is laid out in a folder structure like
2000-01-01/john/smith
rather than in the Hive partition convention
date=2000-01-01/first_name=john/last_name=smith
With the Hive folder layout, Spark (and PySpark) can read the partitioned data easily, but with the "bad" layout it becomes difficult and involves regular expressions and the like.
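For context, that regex-based workaround typically looks something like the following sketch (the /data root, the glob pattern, and the column names are assumptions for illustration):

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
import spark.implicits._

// Read every leaf directory, then recover the partition values from each file's path.
val raw = spark.read.parquet("/data/*/*/*")
val withPartitions = raw
  .withColumn("path", input_file_name())
  .withColumn("date",       regexp_extract($"path", "/data/([^/]+)/[^/]+/[^/]+/", 1))
  .withColumn("first_name", regexp_extract($"path", "/data/[^/]+/([^/]+)/[^/]+/", 1))
  .withColumn("last_name",  regexp_extract($"path", "/data/[^/]+/[^/]+/([^/]+)/", 1))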
Is there a simpler way to handle non-Hive folder structures for partitioned data in Spark?
The generic convention partition_column=partition_value
is a convenience provided by Hive, not a requirement. You can define your own (almost) arbitrary directory layout by specifying the LOCATION at the partition level in the Spark/Hive DDL:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'] [, PARTITION partition_spec [LOCATION 'location'], ...];
In either case, the mapping between partition directories and column values is kept in the Hive Metastore.
A simple example:
[user@gateway ~]# spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.x-SNAPSHOT
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
scala> spark.sql("create table t1 (c1 string) partitioned by (p1 string) stored as parquet");
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("describe table t1").show(false);
+-----------------------+---------+-------+
|col_name |data_type|comment|
+-----------------------+---------+-------+
|c1 |string |null |
|p1 |string |null |
|# Partition Information| | |
|# col_name |data_type|comment|
|p1 |string |null |
+-----------------------+---------+-------+
scala> spark.sql("alter table t1 add partition (p1='partition_1') location '/user/hive/warehouse/t1/part1'");
res4: org.apache.spark.sql.DataFrame = []
scala> spark.sql("describe extended t1 partition (p1='partition_1')").show(100,false);
+--------------------------------+----------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+--------------------------------+----------------------------------------------------------------------------------------------+-------+
|c1 |string |null |
|p1 |string |null |
|# Partition Information | | |
|# col_name |data_type |comment|
|p1 |string |null |
| | | |
|# Detailed Partition Information| | |
|Database |default | |
|Table |t1 | |
|Partition Values |[p1=partition_1] | |
|Location |hdfs://host:8020/user/hive/warehouse/t1/part1| |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
|Storage Properties |[serialization.format=1] | |
|Partition Parameters |{transient_lastDdlTime=1581634061} | |
|Created Time |Thu Feb 13 22:47:41 UTC 2020 | |
|Last Access |Thu Jan 01 00:00:00 UTC 1970 | |
| | | |
|# Storage Information | | |
|Location |hdfs://host:8020/user/hive/warehouse/t1 | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
|Storage Properties |[serialization.format=1] | |
+--------------------------------+----------------------------------------------------------------------------------------------+-------+
scala>
scala> spark.sql("insert into t1 values ('A','partition_1')");
scala> spark.sql("select * from t1").show(false);
+---+-----------+
|c1 |p1 |
+---+-----------+
|A |partition_1|
+---+-----------+
Note the | Partition Values | [p1=partition_1] | row in the extended partition description: it establishes that the value partition_1 of column p1 is "attached" to every data row stored under the subdirectory .../part1/.
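Applying the same idea to the layout from the question, a sketch might look like the following (the table name people, the payload column, and the /data paths are illustrative assumptions, not from the original post):

spark.sql("""
  CREATE TABLE people (payload STRING)
  PARTITIONED BY (`date` STRING, first_name STRING, last_name STRING)
  STORED AS PARQUET
""")

// One ADD PARTITION per leaf directory; in practice these statements would be
// generated from a directory listing rather than written by hand.
spark.sql("""
  ALTER TABLE people ADD IF NOT EXISTS
  PARTITION (`date`='2000-01-01', first_name='john', last_name='smith')
  LOCATION '/data/2000-01-01/john/smith'
""")

// Spark resolves the partition columns from the Hive Metastore, not from the path:
spark.sql("SELECT * FROM people WHERE first_name = 'john'").show(false)

Assuming the Parquet files under each directory match the table's non-partition schema, the partition values are attached to every row read from that directory, exactly as with p1='partition_1' above.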