I am unable to run any queries against my AWS Glue partitioned table. The error I get is
HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: type expected at the position 0 of 'STRING' but 'STRING' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null)
I found another thread that brought up the fact that database and table names may contain only alphanumeric characters and underscores, so I made sure the database name, the table name, and all column names follow this restriction. The only object that does not is my S3 bucket name, which would be hard to change.
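For reference, one quick way to sanity-check identifiers against that restriction is a small regex helper. This is a minimal sketch; the function name and the exact rule are my own, based only on the alphanumeric-plus-underscore convention mentioned above:

import re

def is_glue_safe(name: str) -> bool:
    # Accept only letters, digits, and underscores -- the naming
    # restriction described above (bucket names are exempt, since they
    # only appear inside the Location URI).
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None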
Below are the table definition and a parquet-tools dump of the data.
{
    "Table": {
        "UpdateTime": 1545845064.0,
        "PartitionKeys": [
            {
                "Comment": "call_time year",
                "Type": "INT",
                "Name": "date_year"
            },
            {
                "Comment": "call_time month",
                "Type": "INT",
                "Name": "date_month"
            },
            {
                "Comment": "call_time day",
                "Type": "INT",
                "Name": "date_day"
            }
        ],
        "StorageDescriptor": {
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SortColumns": [],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Name": "ser_de_info_system_admin_created",
                "Parameters": {
                    "serialization.format": "1"
                }
            },
            "BucketColumns": [],
            "Parameters": {},
            "Location": "s3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/",
            "NumberOfBuckets": 0,
            "StoredAsSubDirectories": false,
            "Columns": [
                {
                    "Comment": "Unique user ID",
                    "Type": "STRING",
                    "Name": "user_id"
                },
                {
                    "Comment": "Unique group ID",
                    "Type": "STRING",
                    "Name": "group_id"
                },
                {
                    "Comment": "Date and time the message was published",
                    "Type": "TIMESTAMP",
                    "Name": "call_time"
                },
                {
                    "Comment": "call_time year",
                    "Type": "INT",
                    "Name": "date_year"
                },
                {
                    "Comment": "call_time month",
                    "Type": "INT",
                    "Name": "date_month"
                },
                {
                    "Comment": "call_time day",
                    "Type": "INT",
                    "Name": "date_day"
                },
                {
                    "Comment": "Given name for user",
                    "Type": "STRING",
                    "Name": "given_name"
                },
                {
                    "Comment": "IANA time zone for user",
                    "Type": "STRING",
                    "Name": "time_zone"
                },
                {
                    "Comment": "Name that links to geneaology",
                    "Type": "STRING",
                    "Name": "family_name"
                },
                {
                    "Comment": "Email address for user",
                    "Type": "STRING",
                    "Name": "email"
                },
                {
                    "Comment": "RFC BCP 47 code set in this user's profile language and region",
                    "Type": "STRING",
                    "Name": "language"
                },
                {
                    "Comment": "Phone number including ITU-T ITU-T E.164 country codes",
                    "Type": "STRING",
                    "Name": "phone"
                },
                {
                    "Comment": "Date user was created",
                    "Type": "TIMESTAMP",
                    "Name": "date_created"
                },
                {
                    "Comment": "User role",
                    "Type": "STRING",
                    "Name": "role"
                },
                {
                    "Comment": "Provider dashboard preferences",
                    "Type": "STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING>",
                    "Name": "preferences"
                },
                {
                    "Comment": "Provider notification settings",
                    "Type": "STRUCT<digest_email:BOOLEAN>",
                    "Name": "notifications"
                }
            ],
            "Compressed": true
        },
        "Parameters": {
            "classification": "parquet",
            "parquet.compress": "SNAPPY"
        },
        "Description": "System wide admin_created messages",
        "Name": "system_admin_created",
        "TableType": "EXTERNAL_TABLE",
        "Retention": 0
    }
}
CREATE EXTERNAL TABLE `system_admin_created`(
  `user_id` STRING COMMENT 'Unique user ID',
  `group_id` STRING COMMENT 'Unique group ID',
  `call_time` TIMESTAMP COMMENT 'Date and time the message was published',
  `date_year` INT COMMENT 'call_time year',
  `date_month` INT COMMENT 'call_time month',
  `date_day` INT COMMENT 'call_time day',
  `given_name` STRING COMMENT 'Given name for user',
  `time_zone` STRING COMMENT 'IANA time zone for user',
  `family_name` STRING COMMENT 'Name that links to geneaology',
  `email` STRING COMMENT 'Email address for user',
  `language` STRING COMMENT 'RFC BCP 47 code set in this user\'s profile language and region',
  `phone` STRING COMMENT 'Phone number including ITU-T ITU-T E.164 country codes',
  `date_created` TIMESTAMP COMMENT 'Date user was created',
  `role` STRING COMMENT 'User role',
  `preferences` STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING> COMMENT 'Provider dashboard preferences',
  `notifications` STRUCT<digest_email:BOOLEAN> COMMENT 'Provider notification settings')
PARTITIONED BY (
  `date_year` INT COMMENT 'call_time year',
  `date_month` INT COMMENT 'call_time month',
  `date_day` INT COMMENT 'call_time day')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/'
TBLPROPERTIES (
  'classification'='parquet',
  'parquet.compress'='SNAPPY')
role = admin
date_created = 2018-01-11T14:40:23.142Z
preferences:
.patients_hidden = false
.weekend_digests = true
.portal_welcome_done = true
email = [email protected]
notifications:
.digest_email = true
group_id = 5a5399df23a804001aa25227
given_name = foo
call_time = 2018-01-11T14:40:23.000Z
time_zone = US/Pacific
family_name = bar
language = en-US
user_id = 5a5777572060a700170240c3
message spark_schema {
  optional binary role (UTF8);
  optional binary date_created (UTF8);
  optional group preferences {
    optional boolean patients_hidden;
    optional boolean weekend_digests;
    optional boolean portal_welcome_done;
    optional binary last_announcement (UTF8);
  }
  optional binary email (UTF8);
  optional group notifications {
    optional boolean digest_email;
  }
  optional binary group_id (UTF8);
  optional binary given_name (UTF8);
  optional binary call_time (UTF8);
  optional binary time_zone (UTF8);
  optional binary family_name (UTF8);
  optional binary language (UTF8);
  optional binary user_id (UTF8);
  optional binary phone (UTF8);
}
I ran into a similar PrestoException, and the cause was uppercase letters in the column types. Once I changed 'VARCHAR(10)' to 'varchar(10)', it worked.
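If you are creating the table programmatically (as in the boto3 example further down), one way to guard against this is to lowercase every type name before the call. A minimal sketch; the helper name and the abbreviated column list are my own, not from the original post:

def lowercase_types(columns):
    # Return a copy of a Glue column list with each Hive type name
    # lowercased, e.g. "STRING" -> "string", "TIMESTAMP" -> "timestamp".
    return [dict(col, Type=col["Type"].lower()) for col in columns]

table_cols = lowercase_types([
    {"Name": "user_id", "Type": "STRING", "Comment": "Unique user ID"},
    {"Name": "call_time", "Type": "TIMESTAMP", "Comment": "Date and time the message was published"},
])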
I had also declared the partition keys as fields in the table. In addition, I ran into the Parquet vs. Hive discrepancy around TIMESTAMP and switched that column to an ISO 8601 string. After that I mostly gave up, because Athena throws schema errors whenever the Parquet files in the S3 bucket do not all share the table's schema, and with optional fields and sparse columns that is bound to happen.
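For the TIMESTAMP-to-string workaround, the parquet-tools dump above shows call_time stored as 2018-01-11T14:40:23.000Z. A minimal sketch of producing that format before writing the Parquet files (my own helper, assuming timezone-aware datetimes):

from datetime import datetime, timezone

def to_iso8601(ts: datetime) -> str:
    # Format an aware datetime as an ISO 8601 string with millisecond
    # precision and a trailing "Z", e.g. 2018-01-11T14:40:23.000Z.
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

print(to_iso8601(datetime(2018, 1, 11, 14, 40, 23, tzinfo=timezone.utc)))
# -> 2018-01-11T14:40:23.000Z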
I also ran into this error, and of course the error message ended up telling me nothing about the actual problem. My error was exactly the same as the original poster's.
I was creating the Glue table through the Python boto3 API, feeding it the column names, types, partition columns, and a few other things.
The problem is in the code I used to create the table:
import boto3

# database, table, table_cols, table_location, and partition_cols
# are defined elsewhere.
glue_clt = boto3.client("glue", region_name="us-east-1")

glue_clt.create_table(
    DatabaseName=database,
    TableInput={
        "Name": table,
        "StorageDescriptor": {
            "Columns": table_cols,
            "Location": table_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            }
        },
        "PartitionKeys": partition_cols,
        "TableType": "EXTERNAL_TABLE"
    }
)
So I ended up defining every column name and type in the API's Columns input. Then I also gave it the names and types of the partition columns in the PartitionKeys input. When I browsed to the AWS console, I realized that because I had defined the partition columns in both Columns and PartitionKeys, they were defined twice on the table.
Interestingly, if you try to do this through the console, it throws a more descriptive error telling you the column already exists when you try to add a partition column that is already defined on the table.
The fix: I removed the partition columns and their types from the Columns input and supplied them only through the PartitionKeys input, so they were no longer defined on the table twice. So frustrating that this ends up producing the same error message as the OP's when querying through Athena.
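A minimal sketch of that fix, filtering the partition columns out of the Columns input before the create_table call (variable names follow the snippet above; the filtering itself is my own):

# Partition columns belong only in "PartitionKeys", never in "Columns".
partition_names = {col["Name"] for col in partition_cols}
table_cols = [col for col in table_cols if col["Name"] not in partition_names]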
This can also be related to how you created the database (whether through CloudFormation, the UI, or the CLI) or to forbidden characters such as '-'. We have hyphens in our database and table names, and it renders a lot of functionality useless.
Also check any views in Glue: the data types recorded in Glue can be wrong and may need to be fixed after deployment. In my case, integer columns were deployed, but the Glue columns only came through as float.
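One quick way to double-check the types Glue actually recorded is to read the table back through boto3. A minimal sketch: "my_database" is a placeholder, and the table name is the one from this question.

import boto3

glue_clt = boto3.client("glue", region_name="us-east-1")

# Print every column name and type exactly as Glue stores them.
resp = glue_clt.get_table(DatabaseName="my_database", Name="system_admin_created")
for col in resp["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])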