Usually the data is laid out in a folder structure like
2000-01-01/john/smith
rather than in the Hive partition convention
date=2000-01-01/first_name=john/last_name=smith
With the Hive folder layout, Spark (and PySpark) can read the partitioned data easily, but with the "bad" layout it becomes difficult and involves regular expressions and the like.
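For context, that regex-based workaround typically looks something like the following sketch (the /data root, the glob pattern, and the column names are assumptions for illustration):

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
import spark.implicits._

// Read every leaf directory, then recover the partition values from each file's path.
val raw = spark.read.parquet("/data/*/*/*")
val withPartitions = raw
  .withColumn("path", input_file_name())
  .withColumn("date",       regexp_extract($"path", "/data/([^/]+)/[^/]+/[^/]+/", 1))
  .withColumn("first_name", regexp_extract($"path", "/data/[^/]+/([^/]+)/[^/]+/", 1))
  .withColumn("last_name",  regexp_extract($"path", "/data/[^/]+/[^/]+/([^/]+)/", 1))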
Is there a simpler way to handle non-Hive folder structures for partitioned data in Spark?
The generic convention partition_column=partition_value
is a convenience provided by Hive, not a requirement. You can define your own (almost) arbitrary directory layout by specifying the LOCATION at the partition level in the Spark/Hive DDL:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'] [, PARTITION partition_spec [LOCATION 'location'], ...];
In either case, the mapping between partition directories and column values is kept in the Hive Metastore.
A simple example:
[user@gateway ~]# spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.x-SNAPSHOT
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
scala> spark.sql("create table t1 (c1 string) partitioned by (p1 string) stored as parquet");
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("describe table t1").show(false);
+-----------------------+---------+-------+
|col_name |data_type|comment|
+-----------------------+---------+-------+
|c1 |string |null |
|p1 |string |null |
|# Partition Information| | |
|# col_name |data_type|comment|
|p1 |string |null |
+-----------------------+---------+-------+
scala> spark.sql("alter table t1 add partition (p1='partition_1') location '/user/hive/warehouse/t1/part1'");
res4: org.apache.spark.sql.DataFrame = []
scala> spark.sql("describe extended t1 partition (p1='partition_1')").show(100,false);
+--------------------------------+----------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+--------------------------------+----------------------------------------------------------------------------------------------+-------+
|c1 |string |null |
|p1 |string |null |
|# Partition Information | | |
|# col_name |data_type |comment|
|p1 |string |null |
| | | |
|# Detailed Partition Information| | |
|Database |default | |
|Table |t1 | |
|Partition Values |[p1=partition_1] | |
|Location |hdfs://host:8020/user/hive/warehouse/t1/part1| |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
|Storage Properties |[serialization.format=1] | |
|Partition Parameters |{transient_lastDdlTime=1581634061} | |
|Created Time |Thu Feb 13 22:47:41 UTC 2020 | |
|Last Access |Thu Jan 01 00:00:00 UTC 1970 | |
| | | |
|# Storage Information | | |
|Location |hdfs://host:8020/user/hive/warehouse/t1 | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
|Storage Properties |[serialization.format=1] | |
+--------------------------------+----------------------------------------------------------------------------------------------+-------+
scala>
scala> spark.sql("insert into t1 values ('A','partition_1')");
scala> spark.sql("select * from t1").show(false);
+---+-----------+
|c1 |p1 |
+---+-----------+
|A |partition_1|
+---+-----------+
Note the | Partition Values | [p1=partition_1] | row in the extended partition description: it establishes that the value partition_1 of column p1 is "attached" to every data row stored under the subdirectory .../part1/.
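Applying the same idea to the layout from the question, a sketch might look like the following (the table name people, the payload column, and the /data paths are illustrative assumptions, not from the original post):

spark.sql("""
  CREATE TABLE people (payload STRING)
  PARTITIONED BY (`date` STRING, first_name STRING, last_name STRING)
  STORED AS PARQUET
""")

// One ADD PARTITION per leaf directory; in practice these statements would be
// generated from a directory listing rather than written by hand.
spark.sql("""
  ALTER TABLE people ADD IF NOT EXISTS
  PARTITION (`date`='2000-01-01', first_name='john', last_name='smith')
  LOCATION '/data/2000-01-01/john/smith'
""")

// Spark resolves the partition columns from the Hive Metastore, not from the path:
spark.sql("SELECT * FROM people WHERE first_name = 'john'").show(false)

Assuming the Parquet files under each directory match the table's non-partition schema, the partition values are attached to every row read from that directory, exactly as with p1='partition_1' above.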