How can I write Parquet files with int64 timestamps (instead of int96) from AWS Kinesis Firehose?


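Why don't int96 timestamps work for me?

I want to read Parquet files using S3 Select. According to the documentation, S3 Select does not support timestamps stored as int96. In addition, storing timestamps in Parquet as int96 is deprecated.

What have I tried?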

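Firehose uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe to serialize to Parquet. (The exact Hive version used by AWS is unknown.) While reading the Hive code, I came across the following configuration switch: hive.parquet.write.int64.timestamp. I tried to apply this switch by changing the SerDe parameters in the AWS Glue table configuration. Unfortunately, this made no difference: my timestamp column is still stored as int96 (verified by downloading the file from S3 and inspecting it with parq my-file.parquet --schema).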

hive aws-glue parquet amazon-kinesis-firehose amazon-s3-select
2 Answers

0 votes

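Although I could not get Firehose to write int64 timestamps, I found a workaround that converts the int96 timestamps returned in S3 Select query results into something useful.

I used the method described in [link] to write the following conversion function in JavaScript: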

const hideTimePart = BigInt(64);
const maskToHideJulianDayPart = BigInt('0xffffffffffffffff');
const unixEpochInJulianDay = 2_440_588;
const nanoSecsInOneSec = BigInt(1_000_000_000);
const secsInOneDay = 86_400;
const milliSecsInOneSec = 1_000;

export const parseS3SelectParquetTimeStamp = (ts: string) => {
  const tsBigInt = BigInt(ts);

  const julianDay = Number(tsBigInt >> hideTimePart);
  const secsSinceUnixEpochToStartOfJulianDay = (julianDay - unixEpochInJulianDay) * secsInOneDay;

  const nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;
  const secsSinceStartOfJulianDay = Number(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);

  return new Date(
    (secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOfJulianDay) * milliSecsInOneSec,
  );
};

parseS3SelectParquetTimeStamp('45377606915595481758988800'); // Result: '2022-12-11T20:58:33.000Z'

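Note that, contrary to what one might expect, the timestamp returned by S3 Select stores the Julian day part at the beginning rather than in the last 4 bytes. The nanosecond time-of-day part is stored in the last 8 bytes. Also, the byte order is not reversed.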
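
The layout described in the note can be sketched as a round trip. This is a minimal illustration, assuming the decimal string is exactly julianDay * 2^64 + nanosecondsOfDay as observed above; the function names int96StringToEpochMillis and epochMillisToInt96String are hypothetical and not part of any AWS API. Unlike the function above, it keeps millisecond precision instead of truncating to whole seconds.

```typescript
// Assumed layout (from the note above): value = julianDay * 2^64 + nanosOfDay.
const JULIAN_DAY_SHIFT = BigInt(64);
const NANOS_MASK = (BigInt(1) << JULIAN_DAY_SHIFT) - BigInt(1);
const UNIX_EPOCH_JULIAN_DAY = BigInt(2440588);
const MILLIS_PER_DAY = BigInt(86400000);
const NANOS_PER_MILLI = BigInt(1000000);

// Decode the decimal string into milliseconds since the Unix epoch.
function int96StringToEpochMillis(ts: string): number {
  const value = BigInt(ts);
  const julianDay = value >> JULIAN_DAY_SHIFT; // upper bits: day number
  const nanosOfDay = value & NANOS_MASK;       // lower 64 bits: time of day
  const millis =
    (julianDay - UNIX_EPOCH_JULIAN_DAY) * MILLIS_PER_DAY +
    nanosOfDay / NANOS_PER_MILLI;              // sub-millisecond digits are dropped
  return Number(millis);
}

// Encode epoch milliseconds back into the same decimal string form.
// epochMillis must be an integer (as returned by Date.getTime()).
function epochMillisToInt96String(epochMillis: number): string {
  const millis = BigInt(epochMillis);
  let days = millis / MILLIS_PER_DAY;
  let millisOfDay = millis % MILLIS_PER_DAY;
  if (millisOfDay < BigInt(0)) {
    // BigInt division truncates toward zero; adjust to floor for pre-1970 values.
    days -= BigInt(1);
    millisOfDay += MILLIS_PER_DAY;
  }
  const julianDay = days + UNIX_EPOCH_JULIAN_DAY;
  return ((julianDay << JULIAN_DAY_SHIFT) | (millisOfDay * NANOS_PER_MILLI)).toString();
}

const decoded = new Date(int96StringToEpochMillis('45377606915595481758988800'));
console.log(decoded.toISOString()); // 2022-12-11T20:58:33.000Z
```

Encoding a known instant and comparing it with the string S3 Select returns is a quick way to check that the layout assumption holds for your data.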

(Regarding the Julian day constant 2440588: according to https://docs.oracle.com/javase/8/docs/api/java/time/temporal/JulianFields.html, using 2440587.5 would be wrong in this case.)
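
As a sanity check on that constant: the astronomical Julian Date of 1970-01-01T00:00 UTC is 2440587.5 because astronomical Julian days start at noon, whereas a whole-day count that starts at midnight gives 2440588. The sketch below uses the widely published integer formula for converting a proleptic Gregorian date to a Julian Day Number; the helper name julianDayNumber is hypothetical, and nothing here comes from Hive or S3 Select.

```typescript
// Julian Day Number for a proleptic Gregorian calendar date.
// month is 1-12, day is 1-31. Integer arithmetic only.
function julianDayNumber(year: number, month: number, day: number): number {
  const a = Math.floor((14 - month) / 12); // 1 for January and February, else 0
  const y = year + 4800 - a;               // shifted year so the count starts in March
  const m = month + 12 * a - 3;            // March = 0, ..., February = 11
  return (
    day +
    Math.floor((153 * m + 2) / 5) +        // days in the preceding months since March
    365 * y +
    Math.floor(y / 4) -
    Math.floor(y / 100) +
    Math.floor(y / 400) -
    32045
  );
}

console.log(julianDayNumber(1970, 1, 1));   // 2440588
console.log(julianDayNumber(2022, 12, 11)); // 2459925
```

The second value, 2459925, is the day number stored in the upper bits of the example timestamp '45377606915595481758988800'.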


0 votes

Thanks for the details. Here is my version in C# (.NET 6) for reading timestamps from a Parquet file (INT96) via S3 Select:

/// <summary>
/// Parses an INT96 timestamp from a string and returns a DateTime.
/// </summary>
/// <param name="timestamp">The INT96 timestamp as a string.</param>
/// <returns>The DateTime represented by the INT96 timestamp.</returns>
public DateTime ParseS3SelectParquetTimeStamp(string timestamp)
{
    // Define constants
    BigInteger hideTimePart = BigInteger.Pow(2, 64); // Divisor (2^64) that strips the lower 64 bits (the time part), leaving the Julian day
    long unixEpochInJulianDay = 2_440_588; // Julian day of the Unix epoch
    BigInteger nanoSecsInOneSec = BigInteger.Pow(10, 9); // Nanoseconds in one second
    long secsInOneDay = 86_400; // Seconds in one day
    long milliSecsInOneSec = 1_000; // Milliseconds in one second

    // Mask that keeps the lower 64 bits, i.e. the nanoseconds part (2^64 - 1, written in decimal because C# has no BigInteger literals)
    BigInteger maskToHideJulianDayPart = BigInteger.Parse("18446744073709551615"); 

    // Parse the timestamp string to a BigInteger
    BigInteger tsBigInt = BigInteger.Parse(timestamp);

    // Extract the Julian day part from the timestamp
    BigInteger julianDay = BigInteger.Divide(tsBigInt, hideTimePart);

    // Calculate the seconds since Unix epoch to start of the Julian day
    long secsSinceUnixEpochToStartOfJulianDay = (long)((julianDay - unixEpochInJulianDay) * secsInOneDay);

    // Extract the nanoseconds since the start of the Julian day
    BigInteger nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;

    // Calculate the seconds since the start of the Julian day
    long secsSinceStartOfJulianDay = (long)(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);

    // Calculate the final DateTime by adding the seconds since Unix epoch to start of Julian day 
    // and seconds since start of Julian day to the Unix epoch
    return new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)
        .AddSeconds(secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOfJulianDay)
        .ToUniversalTime();
}