Loading a JSON file with a SerDe in Cloudera

Question

I am trying to work with a JSON file that has this bag structure:

{
   "user_id": "kim95",
   "type": "Book",
   "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
   "year": "1995",
   "publisher": "ACM Press and Addison-Wesley",
   "authors": [
      {
         "name": "null"
      }
   ],
   "source": "DBLP"
}
{
   "user_id": "marshallo79",
   "type": "Book",
   "title": "Inequalities: Theory of Majorization and Its Application.",
   "year": "1979",
   "publisher": "Academic Press",
   "authors": [
      {
         "name": "Albert W. Marshall" 
      },
      {
         "name": "Ingram Olkin"
      }
   ],
   "source": "DBLP"
}

I tried to use a SerDe to load the JSON data into Hive. I followed both approaches I found here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

With this code:

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
           user_id:string, 
           type:string, 
           title:string,
           year:string,
           publisher:string,
           authors:array<struct<name:string>>,
           source:string)       
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';

I get this error:

error while compiling statement: failed: parseexception line 2:17 cannot recognize input near ':' 'string' ',' in column type

I have also tried this version: https://github.com/rcongiu/Hive-JSON-Serde

which gives a different error:

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde

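Any ideas?

I would also like to know what the alternatives are for working with JSON like this in order to query the "name" field inside "authors". Is it Pig or Hive?

I have already converted the data into a "tsv" file. However, since my authors column is a tuple, I don't know how to query "name" with Hive if I build a table from this file. Should I change my "tsv" conversion script or keep it? Or are there alternatives with Hive or Pig?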

Answer 1

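Hive does not have built-in support for JSON. So, to use JSON with Hive, we need a third-party jar such as: https://github.com/rcongiu/Hive-JSON-Serde

Your CREATE TABLE statement has a couple of issues. In a column definition the name and the type are separated by a space, not a colon. It should look like this: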

CREATE EXTERNAL TABLE IF NOT EXISTS serd ( 
user_id string,type string,title string,year string,publisher string,authors array<string>,source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION...
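
On the part of the question about querying "name" inside "authors": here is a minimal sketch, assuming the table is declared with authors as array<struct<name:string>> (the type used in the question) rather than array<string>. The aliases s, authors_view and a are just illustrative names. LATERAL VIEW with explode() turns each element of the array into its own row, so the struct field can then be read as a.name.

```sql
-- Assumption: serd was created with authors array<struct<name:string>>.
-- Produces one output row per (book, author) pair.
SELECT s.user_id, a.name AS author_name
FROM serd s
LATERAL VIEW explode(s.authors) authors_view AS a;

-- Books that list a particular author.
SELECT s.user_id, s.title
FROM serd s
LATERAL VIEW explode(s.authors) authors_view AS a
WHERE a.name = 'Ingram Olkin';
```

If only the first author is needed, authors[0].name should also work on that struct-typed column without exploding the array.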

Keep each JSON record on a single line, like this:

{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"} 
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press","authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}

After downloading the project from Git, you need to build it, which produces a jar. Add that jar to your Hive session before running the CREATE TABLE statement.

Hope it helps!
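
As a rough sketch of that order of operations, a session could look like the following. The jar path and the LOCATION directory are hypothetical placeholders, not values from the question; the jar file name is the one mentioned elsewhere in this thread and will depend on the version you build. Note that Hive expects LOCATION to be a directory rather than a single file, so this sketch assumes the one-record-per-line JSON file has been placed in its own directory. The authors column uses the struct type from the question.

```sql
-- Hypothetical path: point this at the jar produced by your own build.
ADD JAR /path/to/json-serde-1.3.6-jar-with-dependencies.jar;

-- Hypothetical LOCATION: a directory holding the one-record-per-line JSON file(s).
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
    user_id string,
    type string,
    title string,
    year string,
    publisher string,
    authors array<struct<name:string>>,
    source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hdfs/data/books';

-- Quick sanity check that rows are being parsed.
SELECT user_id, title FROM serd LIMIT 5;
```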

Answer 2

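ADD JAR only adds the jar to the current session; it is not available beyond that session, and you eventually hit the error. Copy the JAR onto every node, under both the Hive and the MapReduce library paths as shown below, so that both the Hive and MapReduce components pick it up whenever it is needed.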

/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependency.jar
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar

Note: this path varies from cluster to cluster.
