Generating a unique hash for an entire row while accounting for null values

Problem description

I have a sample dataset and a code snippet in PySpark in which I try to generate a hash value for each row of a DataFrame. This is the code I am using:

from pyspark.sql.functions import concat_ws, sha2, concat, hash
from datetime import date
df = spark.createDataFrame([
    ("user1", "Bangalore", 'Grade1', date(2000, 8, 1)),
    ("user2", "Delhi", 'Grade2', date(2000, 6, 2)),
    ("user3", "Delhi", 'Grade2', date(2000, 6, 2)),
    ("user4", "Chennai", 'Grade3', date(2000, 5, 3)),
    ("user5", None, 'Grade3', date(2000, 5, 3)),
    ("user5", "Grade3", None, date(2000, 5, 3)),
    ("user6", "", 'Grade4', date(2000, 5, 3)),
    ("user6", "Grade4", "", date(2000, 5, 3))
], schema='userId string, city string, grade string, graduationDate date')

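df_hash = (
    df.withColumn(
        # concat_ws skips null columns, so rows whose nulls sit in different positions can collide
        "_delta_hash_sha2_concat_ws", sha2(concat_ws("-", *["userId", "city", "grade", "graduationDate"]), 384)
    )
    # concat returns null if any input column is null, so sha2 is null for those rows
    .withColumn("_delta_hash_sha2_concat", sha2(concat("userId", "city", "grade", "graduationDate"), 256))
    # hash returns a 32-bit integer hash of the given columns
    .withColumn("_delta_hash_hash", hash(*["userId", "city", "grade", "graduationDate"]))
)
display(df_hash)  # display() is a Databricks notebook helper; elsewhere use df_hash.show(truncate=False)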

Resulting dataset

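Some rows in my dataset contain null values, and I would prefer to keep those nulls rather than replace them with empty strings or any other placeholder.

However, when generating hashes with sha2, I noticed that rows containing nulls produce identical hash values. For example, in the case of "user5", both rows produce the same hash even though the null is in a different position.

On further inspection, sha2 does not seem to take the position of the nulls into account when generating the hash. I am therefore looking for advice on how to generate a unique hash for each row while keeping the nulls rather than replacing them. I also want to avoid using row_number to assign row IDs.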
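The collision most likely comes from concat_ws rather than from sha2 itself: concat_ws drops null inputs before joining, so both "user5" rows collapse to the same string before any hashing happens. The sketch below emulates that behaviour in plain Python, without Spark. The helper names `concat_ws_like` and `sha256_hex` are made up for illustration, and SHA-256 stands in for the 384-bit variant used above, since the digest length does not affect the collision.

```python
import hashlib
from datetime import date


def concat_ws_like(sep, values):
    """Emulate concat_ws: join the non-null values, silently dropping None."""
    return sep.join(str(v) for v in values if v is not None)


def sha256_hex(text):
    """Hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The two "user5" rows from the question: same values, null in a different column.
row_a = ("user5", None, "Grade3", date(2000, 5, 3))
row_b = ("user5", "Grade3", None, date(2000, 5, 3))

joined_a = concat_ws_like("-", row_a)
joined_b = concat_ws_like("-", row_b)

print(joined_a)  # user5-Grade3-2000-05-03
print(joined_b)  # user5-Grade3-2000-05-03
print(sha256_hex(joined_a) == sha256_hex(joined_b))  # True
```

Because the inputs to the hash function are byte-for-byte identical, no choice of hash algorithm can separate the two rows; the string handed to the hash has to encode the column positions.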

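Important: I do not want to replace nulls with an empty string or "NULL". My goal is to keep the null values intact.

I would appreciate any insights or alternative approaches for achieving this. Thanks!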

pyspark hash null databricks sha2
1 answer

You can build a CSV string from the four columns and call sha2 on it:

from datetime import date
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("user1", "Bangalore", 'Grade1', date(2000, 8, 1)),
    ("user2", "Delhi", 'Grade2', date(2000, 6, 2)),
    ("user3", "Delhi", 'Grade2', date(2000, 6, 2)),
    ("user4", "Chennai", 'Grade3', date(2000, 5, 3)),
    ("user5", None, 'Grade3', date(2000, 5, 3)),
    ("user5", "Grade3", None, date(2000, 5, 3)),
    ("user6", "", 'Grade4', date(2000, 5, 3)),
    ("user6", "Grade4", "", date(2000, 5, 3))
], schema='userId string, city string, grade string, graduationDate date')

csv_row = F.to_csv(F.struct(*df.columns))
df_hash = df.withColumn('new_hash', F.sha2(csv_row, 256))
df_hash.show(10, False)

# +------+---------+------+--------------+----------------------------------------------------------------+
# |userId|city     |grade |graduationDate|new_hash                                                        |
# +------+---------+------+--------------+----------------------------------------------------------------+
# |user1 |Bangalore|Grade1|2000-08-01    |0fbb6b79b59625c4456ce977a10ba0f0733458a9b2a92a4279d47938bf1aa80a|
# |user2 |Delhi    |Grade2|2000-06-02    |63304ebd30ac4e23288f656087525585ea5815b5ec4b44ee70c0ffbf98cac71d|
# |user3 |Delhi    |Grade2|2000-06-02    |e3f6d69ebda4c96439fb06fc2673ce40455ff07e6e2bf16a5cb209b82d82c739|
# |user4 |Chennai  |Grade3|2000-05-03    |2ca896ea1fcfc8abee6b6505cbe443751498f37dc0420779deec70c3bdf9d8a0|
# |user5 |null     |Grade3|2000-05-03    |f7bd9377bb3e1bbf854e1696584d721045edf0f3cb7da4597aec871f556a81ee|
# |user5 |Grade3   |null  |2000-05-03    |132fb03e23a680c7b8d2c64e1c917826c4918761c1338149874ce2de0cccf1fb|
# |user6 |         |Grade4|2000-05-03    |f57d3844dfc0f1a3ba9084f1d77d927d446e9ca55a88a627ec98ec89ee960d94|
# |user6 |Grade4   |      |2000-05-03    |ec09ecec83543c0a76f9b568698496bcdd378589d7bcc9ad69a669cb99041001|
# +------+---------+------+--------------+----------------------------------------------------------------+
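The hashes above differ because a delimited serialization keeps one field per column, so a null still occupies its slot in the string. As a rough, Spark-free illustration of that idea, the sketch below serializes a row with a hypothetical `encode_row` helper that writes null as an empty field and quotes every non-null value. This is an assumption-laden stand-in, not the exact output format of to_csv, which may quote fields differently.

```python
import hashlib
from datetime import date


def encode_row(values):
    """Serialize a row so nulls, empty strings and column positions stay distinguishable."""
    fields = []
    for v in values:
        if v is None:
            # Null: an empty, unquoted field that still takes up its position.
            fields.append("")
        else:
            # Non-null: always quoted, with embedded double quotes doubled.
            fields.append('"' + str(v).replace('"', '""') + '"')
    return ",".join(fields)


def row_hash(values):
    """SHA-256 hex digest of the position-preserving encoding."""
    return hashlib.sha256(encode_row(values).encode("utf-8")).hexdigest()


null_city = ("user5", None, "Grade3", date(2000, 5, 3))
null_grade = ("user5", "Grade3", None, date(2000, 5, 3))
empty_city = ("user5", "", "Grade3", date(2000, 5, 3))

print(encode_row(null_city))   # "user5",,"Grade3","2000-05-03"
print(encode_row(null_grade))  # "user5","Grade3",,"2000-05-03"
print(encode_row(empty_city))  # "user5","","Grade3","2000-05-03"
print(len({row_hash(null_city), row_hash(null_grade), row_hash(empty_city)}))  # 3
```

The original nulls are never replaced in the data; only the temporary string passed to the hash function carries the extra delimiters and quotes.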
