Apache Hive returns NULL for a datasource ingested by Druid

Question

I created my datasource in Druid by ingesting a CSV file.

Sample data:

"2015-09-12T00:47:00.496Z",100134,33,21,30505,43285,U,67c38115-1a68-45bb-858d-dd6cdeaab5cb,
"2015-09-12T00:47:00.496Z",100082,6,26,31548,43202,U,a4f8708a-30ac-4637-910c-e8f9386d6353,

The data was ingested into Druid with the following JSON spec, indexcsv.json:

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/opt/druid-0.12.3/npmData/example.csv"
      }
    },
    "dataSchema" : {
      "dataSource" : "example",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2010-09-12/2018-09-13"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : {
            "column" : "timestamp"
          },
          "columns" : ["timestamp","IId","QId","Score","StartOffsetInMs","EndOffsetInMs","SpeakerRole","QueryIdentity","SId"],
          "dimensionsSpec" : {
            "dimensions" : ["IId","QId","SpeakerRole","QueryIdentity","SId"]
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  }
}

I can see this data in Druid. For example:

[root@ENT-CL-015243 druid-0.12.3]# curl -X 'POST' -H 'Content-Type:application/json' -d @customJsons/groupby-sql.json http://localhost:8082/druid/v2/sql
[{"IId":"1","QId":"26","QueryIdentity":"c5b7d739-a531-409e-afd1-fb294846560a","SpeakerRole":"U","__time":"2015-09-12T00:47:00.496Z","count":1},
{"IId":"1","QId":"30","QueryIdentity":"ba8bb5f5-36e4-41ee-b74c-536b50aa979a","SpeakerRole":"U","__time":"2015-09-12T00:47:00.496Z","count":1},

To query this data from Hive, I followed the steps described here:

https://cwiki.apache.org/confluence/display/Hive/Druid+Integration#DruidIntegration-QueriescompletelyexecutedinDruid

I opened the Hive shell and ran the following queries:

hive>CREATE EXTERNAL TABLE wikipedia
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" ="example");

hive>  DESCRIBE FORMATTED example;
OK
# col_name              data_type               comment
iid           string                  from deserializer
qid                 string                  from deserializer
queryidentity           string                  from deserializer
speakerrole             string                  from deserializer
__time                  timestamp with local time zone  from deserializer
count                   bigint                  from deserializer

# Detailed Table Information
Database:               default
OwnerType:              USER
Owner:                  root
CreateTime:             Thu Nov 08 13:18:14 IST 2018
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://localhost:9000/user/hive/warehouse/example
Table Type:             EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"__time\":\"true\",\"count\":\"true\",\"iid\":\"true\",\"qid\":\"true\",\"queryidentity\":\"true\",\"speakerrole\":\"true\"}}
EXTERNAL                TRUE
bucketing_version       2
druid.datasource        example
numFiles                0
numRows                 0
rawDataSize             0
storage_handler         org.apache.hadoop.hive.druid.DruidStorageHandler
totalSize               0
transient_lastDdlTime   1541675894

# Storage Information
SerDe Library:          org.apache.hadoop.hive.druid.serde.DruidSerDe
InputFormat:            null
OutputFormat:           null
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.288 seconds, Fetched: 39 row(s)

hive> SELECT * FROM example LIMIT 10;
OK
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
NULL    NULL    NULL    NULL    2015-09-12 03:47:00.496 Asia/Jerusalem  1
Time taken: 0.104 seconds, Fetched: 10 row(s)

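As you can see, every dimension column comes back as NULL; only __time and count are populated. My guess is that something related to the input format is wrong somewhere. Could someone please help?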

Answer 1

Check whether the following properties are configured:

hive.druid.broker.address.default: MyIP:8082

hive.druid.coordinator.address.default: MyIP:8081

hive.druid.http.numConnection: 20

hive.druid.http.read.timeout: PT10M

hive.druid.indexer.memory.rownum.max: 75000

hive.druid.indexer.partition.size.max: 1000000

hive.druid.indexer.segments.granularity: DAY

hive.druid.metadata.base: druid

hive.druid.metadata.db.type: mysql

hive.druid.metadata.password: druid

hive.druid.metadata.uri: jdbc:mysql://MyIP:3306/druid

hive.druid.metadata.username: druid

hive.druid.storage.storageDirectory: /apps/hive/warehouse

hive.druid.working.directory: /tmp/druid-indexing

Answer 2

Column names in Druid are case-sensitive, whereas column names in Hive are case-insensitive. Rename your Druid columns to lowercase and it will work.
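
As a rough sketch of that suggestion, the parseSpec from the question's indexcsv.json could be rewritten with all-lowercase column names, as below. Only the names change; the lowercase spellings are simply the original names lowercased. This assumes the datasource is re-ingested with the updated spec. The Hive external table would likely also need to be dropped and recreated afterwards so that it picks up the new column names.

```json
{
  "format" : "csv",
  "timestampSpec" : {
    "column" : "timestamp"
  },
  "columns" : ["timestamp","iid","qid","score","startoffsetinms","endoffsetinms","speakerrole","queryidentity","sid"],
  "dimensionsSpec" : {
    "dimensions" : ["iid","qid","speakerrole","queryidentity","sid"]
  }
}
```

With this change the Druid dimension names match the lowercase names that DESCRIBE FORMATTED already shows on the Hive side (iid, qid, queryidentity, speakerrole).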
