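I want to read Parquet files with S3 Select. According to the documentation, S3 Select does not support timestamps stored as int96. In addition, storing timestamps as int96 in Parquet is deprecated.
Firehose uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe to serialize to Parquet. (The exact Hive version AWS uses is unknown.) While reading the Hive code, I came across the configuration switch hive.parquet.write.int64.timestamp. I tried to apply it by changing the Serde parameters in the AWS Glue table configuration.
Unfortunately, this made no difference: my timestamp column is still stored as int96 (verified by downloading the file from S3 and inspecting it with parq my-file.parquet --schema).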
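Although I could not get Firehose to write int64 timestamps, I found a workaround that converts the int96 timestamps returned in S3 Select query results into something useful.
Following an approach described elsewhere, I wrote the following conversion function in JavaScript (with TypeScript type annotations):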
// Layout of the value returned by S3 Select:
// Julian day in the high bits, nanoseconds of the day in the low 64 bits.
const hideTimePart = BigInt(64);
const maskToHideJulianDayPart = BigInt('0xffffffffffffffff');
const unixEpochInJulianDay = 2_440_588; // Julian day of 1970-01-01
const nanoSecsInOneSec = BigInt(1_000_000_000);
const secsInOneDay = 86_400;
const milliSecsInOneSec = 1_000;

export const parseS3SelectParquetTimeStamp = (ts: string) => {
  const tsBigInt = BigInt(ts);
  // Shift away the 64-bit time part to get the Julian day.
  const julianDay = Number(tsBigInt >> hideTimePart);
  const secsSinceUnixEpochToStartOfJulianDay = (julianDay - unixEpochInJulianDay) * secsInOneDay;
  // Keep only the low 64 bits: nanoseconds since the start of the day.
  const nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;
  // BigInt division truncates, so sub-second precision is dropped here.
  const secsSinceStartOfJulianDay = Number(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);
  return new Date(
    (secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOfJulianDay) * milliSecsInOneSec,
  );
};

parseS3SelectParquetTimeStamp('45377606915595481758988800'); // Date for 2022-12-11T20:58:33.000Z
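The function above truncates to whole seconds. If millisecond precision matters, a variant along the following lines might work. This is only a sketch under the same layout assumption (Julian day above the low 64 bits, nanoseconds of the day in the low 64 bits); the name parseInt96ToDateWithMillis and the constant names are hypothetical and not part of any AWS API.

```typescript
// Hypothetical variant that keeps milliseconds; all names are illustrative.
const JULIAN_DAY_SHIFT = BigInt(64);
const NANOS_MASK = BigInt('0xffffffffffffffff');
const UNIX_EPOCH_JULIAN_DAY = BigInt(2_440_588);
const MILLIS_PER_DAY = BigInt(86_400_000);
const NANOS_PER_MILLI = BigInt(1_000_000);

const parseInt96ToDateWithMillis = (ts: string): Date => {
  const raw = BigInt(ts);
  const julianDay = raw >> JULIAN_DAY_SHIFT; // high bits: Julian day
  const nanosOfDay = raw & NANOS_MASK; // low 64 bits: nanoseconds since midnight
  // Do all arithmetic in BigInt and convert once at the end.
  const millis =
    (julianDay - UNIX_EPOCH_JULIAN_DAY) * MILLIS_PER_DAY + nanosOfDay / NANOS_PER_MILLI;
  return new Date(Number(millis));
};
```

JavaScript Date objects only hold milliseconds, so anything finer than that is still lost.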
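Note that, contrary to what you might expect, the timestamp returned by S3 Select stores the Julian day part at the beginning rather than in the last 4 bytes. The nanosecond time part is stored in the last 8 bytes. Also, the byte order is not reversed.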
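To make that layout concrete, here is a small TypeScript sketch that splits a returned value into its two parts and composes it back. The helper names composeInt96 and splitInt96 are hypothetical, and the sketch assumes the decimal string equals julianDay * 2^64 + nanosOfDay, which is what the conversion function relies on.

```typescript
// Hypothetical helpers illustrating the layout; not part of any SDK.
const NANOS_BITS = BigInt(64);
const LOW_64_BITS = (BigInt(1) << NANOS_BITS) - BigInt(1);

// Julian day goes into the high bits, nanoseconds of the day into the low 64 bits.
const composeInt96 = (julianDay: number, nanosOfDay: bigint): string =>
  ((BigInt(julianDay) << NANOS_BITS) | nanosOfDay).toString();

const splitInt96 = (ts: string): { julianDay: number; nanosOfDay: bigint } => {
  const raw = BigInt(ts);
  return {
    julianDay: Number(raw >> NANOS_BITS),
    nanosOfDay: raw & LOW_64_BITS,
  };
};

// 20:58:33 is 75_513 seconds after midnight.
const nanos = BigInt(75_513) * BigInt(1_000_000_000);
console.log(composeInt96(2_459_925, nanos)); // 45377606915595481758988800
console.log(splitInt96('45377606915595481758988800').julianDay); // 2459925
```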
(Regarding the Julian day constant 2440588: according to https://docs.oracle.com/javase/8/docs/api/java/time/temporal/JulianFields.html, using 2440587.5 would be wrong in this context.)
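As a sanity check on that constant, the whole-day Julian day number of a Gregorian calendar date can be computed with a commonly used integer formula. The sketch below is illustrative only; the function name julianDayNumber is hypothetical. The astronomical Julian Date counts from noon, which is where 2440587.5 for midnight comes from, whereas the value here is a count of whole days combined with nanoseconds since midnight.

```typescript
// Hypothetical helper: Julian day number of a proleptic Gregorian date,
// using a widely used integer formula.
const julianDayNumber = (year: number, month: number, day: number): number => {
  const a = Math.floor((14 - month) / 12); // 1 for January and February, else 0
  const y = year + 4800 - a;
  const m = month + 12 * a - 3; // March = 0, ..., February = 11
  return (
    day +
    Math.floor((153 * m + 2) / 5) +
    365 * y +
    Math.floor(y / 4) -
    Math.floor(y / 100) +
    Math.floor(y / 400) -
    32045
  );
};

console.log(julianDayNumber(1970, 1, 1)); // 2440588
console.log(julianDayNumber(2022, 12, 11)); // 2459925
```

The second value matches the Julian day part of the example timestamp 45377606915595481758988800.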
Thanks for the details. Here is my version in C# (.NET 6) for reading timestamps from Parquet files (INT96) via S3 Select:
using System;
using System.Numerics;

/// <summary>
/// Parses an INT96 timestamp from a string and returns a DateTime.
/// </summary>
/// <param name="timestamp">The INT96 timestamp as a decimal string, as returned by S3 Select.</param>
/// <returns>The UTC DateTime represented by the INT96 timestamp (whole seconds only).</returns>
public DateTime ParseS3SelectParquetTimeStamp(string timestamp)
{
    // 2^64: dividing by this drops the 64-bit time part and leaves the Julian day
    BigInteger hideTimePart = BigInteger.Pow(2, 64);
    long unixEpochInJulianDay = 2_440_588; // Julian day of the Unix epoch
    BigInteger nanoSecsInOneSec = BigInteger.Pow(10, 9); // Nanoseconds in one second
    long secsInOneDay = 86_400; // Seconds in one day

    // Mask that keeps the low 64 bits (2^64 - 1) and so hides the Julian day part
    BigInteger maskToHideJulianDayPart = ulong.MaxValue;

    // Parse the timestamp string to a BigInteger
    BigInteger tsBigInt = BigInteger.Parse(timestamp);

    // Extract the Julian day part from the timestamp
    BigInteger julianDay = BigInteger.Divide(tsBigInt, hideTimePart);

    // Seconds from the Unix epoch to the start of that Julian day
    long secsSinceUnixEpochToStartOfJulianDay = (long)((julianDay - unixEpochInJulianDay) * secsInOneDay);

    // Nanoseconds since the start of the Julian day
    BigInteger nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;

    // Whole seconds since the start of the Julian day (integer division truncates)
    long secsSinceStartOfJulianDay = (long)(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);

    // Add both offsets to the Unix epoch; the result keeps DateTimeKind.Utc
    return new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)
        .AddSeconds(secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOfJulianDay);
}