HIVE_METASTORE_ERROR: expected 'STRING' but found 'STRING'

Question · Votes: 0 · Answers: 5

I am unable to run any queries against my partitioned AWS Glue table. The error I get is:

HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: type expected at the position 0 of 'STRING' but 'STRING' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null)

I found another thread pointing out that database and table names cannot contain characters other than alphanumerics and underscores, so I made sure the database name, table name, and all column names respect that restriction. The only object that does not is my S3 bucket name, which would be very hard to change.

Below are the table definition and a parquet-tools dump of the data.

AWS Glue table definition

{
    "Table": {
        "UpdateTime": 1545845064.0, 
        "PartitionKeys": [
            {
                "Comment": "call_time year", 
                "Type": "INT", 
                "Name": "date_year"
            }, 
            {
                "Comment": "call_time month", 
                "Type": "INT", 
                "Name": "date_month"
            }, 
            {
                "Comment": "call_time day", 
                "Type": "INT", 
                "Name": "date_day"
            }
        ], 
        "StorageDescriptor": {
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat", 
            "SortColumns": [], 
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat", 
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", 
                "Name": "ser_de_info_system_admin_created", 
                "Parameters": {
                    "serialization.format": "1"
                }
            }, 
            "BucketColumns": [], 
            "Parameters": {}, 
            "Location": "s3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/", 
            "NumberOfBuckets": 0, 
            "StoredAsSubDirectories": false, 
            "Columns": [
                {
                    "Comment": "Unique user ID", 
                    "Type": "STRING", 
                    "Name": "user_id"
                }, 
                {
                    "Comment": "Unique group ID", 
                    "Type": "STRING", 
                    "Name": "group_id"
                }, 
                {
                    "Comment": "Date and time the message was published", 
                    "Type": "TIMESTAMP", 
                    "Name": "call_time"
                }, 
                {
                    "Comment": "call_time year", 
                    "Type": "INT", 
                    "Name": "date_year"
                }, 
                {
                    "Comment": "call_time month", 
                    "Type": "INT", 
                    "Name": "date_month"
                }, 
                {
                    "Comment": "call_time day", 
                    "Type": "INT", 
                    "Name": "date_day"
                }, 
                {
                    "Comment": "Given name for user", 
                    "Type": "STRING", 
                    "Name": "given_name"
                }, 
                {
                    "Comment": "IANA time zone for user", 
                    "Type": "STRING", 
                    "Name": "time_zone"
                }, 
                {
                    "Comment": "Name that links to geneaology", 
                    "Type": "STRING", 
                    "Name": "family_name"
                }, 
                {
                    "Comment": "Email address for user", 
                    "Type": "STRING", 
                    "Name": "email"
                }, 
                {
                    "Comment": "RFC BCP 47 code set in this user's profile language and region", 
                    "Type": "STRING", 
                    "Name": "language"
                }, 
                {
                    "Comment": "Phone number including ITU-T ITU-T E.164 country codes", 
                    "Type": "STRING", 
                    "Name": "phone"
                }, 
                {
                    "Comment": "Date user was created", 
                    "Type": "TIMESTAMP", 
                    "Name": "date_created"
                }, 
                {
                    "Comment": "User role", 
                    "Type": "STRING", 
                    "Name": "role"
                }, 
                {
                    "Comment": "Provider dashboard preferences", 
                    "Type": "STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING>", 
                    "Name": "preferences"
                }, 
                {
                    "Comment": "Provider notification settings", 
                    "Type": "STRUCT<digest_email:BOOLEAN>", 
                    "Name": "notifications"
                }
            ], 
            "Compressed": true
        }, 
        "Parameters": {
            "classification": "parquet", 
            "parquet.compress": "SNAPPY"
        }, 
        "Description": "System wide admin_created messages", 
        "Name": "system_admin_created", 
        "TableType": "EXTERNAL_TABLE", 
        "Retention": 0
    }
}

AWS Athena schema

CREATE EXTERNAL TABLE `system_admin_created`(
  `user_id` STRING COMMENT 'Unique user ID', 
  `group_id` STRING COMMENT 'Unique group ID', 
  `call_time` TIMESTAMP COMMENT 'Date and time the message was published', 
  `date_year` INT COMMENT 'call_time year', 
  `date_month` INT COMMENT 'call_time month', 
  `date_day` INT COMMENT 'call_time day', 
  `given_name` STRING COMMENT 'Given name for user', 
  `time_zone` STRING COMMENT 'IANA time zone for user', 
  `family_name` STRING COMMENT 'Name that links to geneaology', 
  `email` STRING COMMENT 'Email address for user', 
  `language` STRING COMMENT 'RFC BCP 47 code set in this user\'s profile language and region', 
  `phone` STRING COMMENT 'Phone number including ITU-T ITU-T E.164 country codes', 
  `date_created` TIMESTAMP COMMENT 'Date user was created', 
  `role` STRING COMMENT 'User role', 
  `preferences` STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING> COMMENT 'Provider dashboard preferences', 
  `notifications` STRUCT<digest_email:BOOLEAN> COMMENT 'Provider notification settings')
PARTITIONED BY ( 
  `date_year` INT COMMENT 'call_time year', 
  `date_month` INT COMMENT 'call_time month', 
  `date_day` INT COMMENT 'call_time day')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/'
TBLPROPERTIES (
  'classification'='parquet', 
  'parquet.compress'='SNAPPY')

parquet-tools cat

role = admin
date_created = 2018-01-11T14:40:23.142Z
preferences:
.patients_hidden = false
.weekend_digests = true
.portal_welcome_done = true
email = [email protected]
notifications:
.digest_email = true
group_id = 5a5399df23a804001aa25227
given_name = foo
call_time = 2018-01-11T14:40:23.000Z
time_zone = US/Pacific
family_name = bar
language = en-US
user_id = 5a5777572060a700170240c3

parquet-tools schema

message spark_schema {
  optional binary role (UTF8);
  optional binary date_created (UTF8);
  optional group preferences {
    optional boolean patients_hidden;
    optional boolean weekend_digests;
    optional boolean portal_welcome_done;
    optional binary last_announcement (UTF8);
  }
  optional binary email (UTF8);
  optional group notifications {
    optional boolean digest_email;
  }
  optional binary group_id (UTF8);
  optional binary given_name (UTF8);
  optional binary call_time (UTF8);
  optional binary time_zone (UTF8);
  optional binary family_name (UTF8);
  optional binary language (UTF8);
  optional binary user_id (UTF8);
  optional binary phone (UTF8);
}
amazon-athena presto
5 Answers

8 votes

I ran into a similar PrestoException, and the cause was uppercase letters in the column type. Once I changed 'VARCHAR(10)' to 'varchar(10)', it worked.
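A minimal, untested sketch of how the types on an existing Glue table could be lowercased with boto3; the database and table names below are placeholders, and only the fields I know update_table accepts are copied back into TableInput:

import boto3

# Placeholders -- substitute your own database and table names.
DATABASE = "my_database"
TABLE = "system_admin_created"

glue = boto3.client("glue", region_name="us-east-1")

# Fetch the current definition from the Glue Data Catalog.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Lowercase every column and partition-key type, e.g. "STRING" -> "string".
for col in table["StorageDescriptor"]["Columns"]:
    col["Type"] = col["Type"].lower()
for key in table.get("PartitionKeys", []):
    key["Type"] = key["Type"].lower()

# update_table only accepts TableInput fields, so keep just those.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
table_input = {k: v for k, v in table.items() if k in allowed}

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)

Note that partitions registered before the change carry their own column types, so they may need the same treatment (for example via update_partition).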


1 vote

I had declared the partition keys as regular fields in the table as well. I also ran into the Parquet vs. Hive discrepancy around TIMESTAMP and switched it to an ISO 8601 string. Since then I have pretty much given up, because Athena throws a schema error whenever a Parquet file in the S3 bucket does not have exactly the schema Athena expects, and with optional fields and sparsely populated columns that is bound to happen.
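A minimal sketch of the TIMESTAMP-to-string workaround, assuming the Parquet files are written with pandas (pyarrow installed); the frame and output file name are only illustrative:

import pandas as pd

# Illustrative data; in practice this is the frame being exported.
df = pd.DataFrame({
    "user_id": ["5a5777572060a700170240c3"],
    "call_time": pd.to_datetime(["2018-01-11T14:40:23.000Z"]),
})

# Store the timestamp as an ISO 8601 string so the Parquet physical type
# becomes a UTF8 binary, matching a `string` column in the Glue/Athena schema.
df["call_time"] = df["call_time"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f%z")

df.to_parquet("admin_created.snappy.parquet", compression="snappy")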


1 vote

I also hit this error and, of course, the error message ended up telling me nothing about the actual problem. My error was exactly the same as the original poster's.

I was creating the Glue table through the Python boto3 API, feeding it the column names and types, the partition columns, and a few other things. The problem:

Here is the code I used to create the table:

import boto3

glue_clt = boto3.client("glue", region_name="us-east-1")

# database, table, table_location, table_cols, and partition_cols are defined
# elsewhere; table_cols and partition_cols are lists of {"Name": ..., "Type": ...} dicts.
glue_clt.create_table(
    DatabaseName=database,
    TableInput={
        "Name": table,
        "StorageDescriptor": {
            "Columns": table_cols,
            "Location": table_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            }
        },
        "PartitionKeys": partition_cols,
        "TableType": "EXTERNAL_TABLE"
    }
)

So I ended up defining every column name and type in the Columns input to the API, and I also gave the partition columns' names and types to the API through the PartitionKeys input. When I browsed to the AWS console, I realized that because I had defined the partition columns in both Columns and PartitionKeys, they were defined twice on the table.

Interestingly, if you try to do this through the console, it throws a much more descriptive error letting you know the column already exists (when you try to add a partition column that is already present on the table).

To resolve: I removed the partition columns and their types from the Columns input and supplied them only through the PartitionKeys input, so they are not put on the table twice. So frustrating that this ended up causing the same error message as the OP's when querying through Athena.


0 votes

This can also be related to how you created the database (whether through CloudFormation, the UI, or the CLI) or to any prohibited characters such as '-'. We had hyphens in our database and table names, and it rendered a lot of functionality useless.


0 votes

Please check the view in Glue; the data type in Glue may be wrong, and it needs to be fixed after deployment. In my case integers were deployed, but the Glue column only accepted float.
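A quick way to inspect the types Glue actually has, assuming boto3; the database and view names are placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholders -- substitute the database/view you are checking.
tbl = glue.get_table(DatabaseName="my_database", Name="my_view")["Table"]

# Print each column with the type recorded in the Glue Data Catalog.
for col in tbl["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])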
