Simple way to handle a "bad" folder structure for partitioned data in Apache Spark

Question (1 vote, 1 answer)

Often, data is made available with a folder structure like

2000-01-01/john/smith

rather than the Hive partitioning convention,

date=2000-01-01/first_name=john/last_name=smith

Spark (and PySpark) can read partitioned data easily when the Hive folder structure is used, but with the "bad" folder structure it becomes difficult and involves regexes and other workarounds.
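
For illustration, the workaround ends up looking roughly like the sketch below (the /data root, the Parquet format and the column names are placeholders, not part of the actual setup):

// Rough sketch of the path-parsing workaround: derive the partition columns
// from the file path with input_file_name() and a regex. Paths are placeholders.
import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

val pattern = "/data/([^/]+)/([^/]+)/([^/]+)/"   // date / first_name / last_name
val df = spark.read.parquet("/data/*/*/*")
  .withColumn("path",       input_file_name())
  .withColumn("date",       regexp_extract(col("path"), pattern, 1))
  .withColumn("first_name", regexp_extract(col("path"), pattern, 2))
  .withColumn("last_name",  regexp_extract(col("path"), pattern, 3))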

Is there a simpler way to handle non-Hive folder structures for partitioned data in Spark?

apache-spark pyspark hive hadoop-partitioning
1 Answer

0 votes

The common partition_column=partition_value notation is a convenience provided by Hive, not a requirement. You can define your own (almost) arbitrary directory layout by specifying a LOCATION at the partition level in Spark/Hive DDL:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'] [, PARTITION partition_spec [LOCATION 'location'], ...];
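
Applied to the layout from the question, this could look roughly like the sketch below (the table name, the data column and the /data root are made-up for illustration):

// Sketch only: "people", its single data column and the /data root are assumptions.
spark.sql("""
  CREATE TABLE people (payload STRING)
  PARTITIONED BY (`date` STRING, first_name STRING, last_name STRING)
  STORED AS PARQUET
""")

spark.sql("""
  ALTER TABLE people ADD IF NOT EXISTS
  PARTITION (`date`='2000-01-01', first_name='john', last_name='smith')
  LOCATION '/data/2000-01-01/john/smith'
""")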

Either way, the mapping between partition directories and column values is maintained in the Hive Metastore.

A simple example:

[user@gateway ~]# spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.x-SNAPSHOT
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)

scala> spark.sql("create table t1 (c1 string) partitioned by (p1 string) stored as parquet");
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("describe table t1").show(false);
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|c1                     |string   |null   |
|p1                     |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|p1                     |string   |null   |
+-----------------------+---------+-------+

scala> spark.sql("alter table t1 add partition (p1='partition_1') location '/user/hive/warehouse/t1/part1'");
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("describe extended t1 partition (p1='partition_1')").show(100,false);
+--------------------------------+--------------------------------------------------------------+-------+
|col_name                        |data_type                                                     |comment|
+--------------------------------+--------------------------------------------------------------+-------+
|c1                              |string                                                        |null   |
|p1                              |string                                                        |null   |
|# Partition Information         |                                                              |       |
|# col_name                      |data_type                                                     |comment|
|p1                              |string                                                        |null   |
|                                |                                                              |       |
|# Detailed Partition Information|                                                              |       |
|Database                        |default                                                       |       |
|Table                           |t1                                                            |       |
|Partition Values                |[p1=partition_1]                                              |       |
|Location                        |hdfs://host:8020/user/hive/warehouse/t1/part1                 |       |
|Serde Library                   |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
|InputFormat                     |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
|OutputFormat                    |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
|Storage Properties              |[serialization.format=1]                                      |       |
|Partition Parameters            |{transient_lastDdlTime=1581634061}                            |       |
|Created Time                    |Thu Feb 13 22:47:41 UTC 2020                                  |       |
|Last Access                     |Thu Jan 01 00:00:00 UTC 1970                                  |       |
|                                |                                                              |       |
|# Storage Information           |                                                              |       |
|Location                        |hdfs://host:8020/user/hive/warehouse/t1                       |       |
|Serde Library                   |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
|InputFormat                     |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
|OutputFormat                    |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
|Storage Properties              |[serialization.format=1]                                      |       |
+--------------------------------+--------------------------------------------------------------+-------+

scala> spark.sql("insert into t1 values ('A','partition_1')");

scala> spark.sql("select * from t1").show(false);
+---+-----------+
|c1 |p1         |
+---+-----------+
|A  |partition_1|
+---+-----------+

Note the line | Partition Values | [p1=partition_1] | in the extended partition description: it establishes that the value partition_1 of column p1 is "attached" to every data row read from the subdirectory .../part1/.
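
If there are many such directories, the partitions can also be registered programmatically. A rough sketch, assuming the /data root and the hypothetical people table from above:

// Sketch: walk the existing directory tree and register each leaf directory
// as a partition of the (hypothetical) people table.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs   = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val root = new Path("/data")

for {
  dateDir  <- fs.listStatus(root)             if dateDir.isDirectory
  firstDir <- fs.listStatus(dateDir.getPath)  if firstDir.isDirectory
  lastDir  <- fs.listStatus(firstDir.getPath) if lastDir.isDirectory
} {
  spark.sql(
    s"ALTER TABLE people ADD IF NOT EXISTS " +
    s"PARTITION (`date`='${dateDir.getPath.getName}', " +
    s"first_name='${firstDir.getPath.getName}', " +
    s"last_name='${lastDir.getPath.getName}') " +
    s"LOCATION '${lastDir.getPath}'")
}

After that, a query such as SELECT * FROM people WHERE `date` = '2000-01-01' prunes to the matching directories just as it would with a regular Hive-style layout.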