sparklyr - 包括在Apache星火加入空值

问题描述 投票:2回答:1

这个问题Including null values in an Apache Spark Join有斯卡拉,PySpark和SparkR答案,但不为sparklyr。我一直无法弄清楚如何在sparklyr对待空值inner_join在连接列相等。有谁知道这是如何在sparklyr做什么?

r apache-spark join dplyr sparklyr
1个回答
1
投票

您可以调用一个隐含的交叉连接:

#' Return a Cartesian product of Spark tables
#'
#' @param df1 tbl_spark
#' @param df2 tbl_spark
#' @param explicit logical If TRUE use crossJoin otherwise 
#'   join without expression
#' @param suffix character suffixes to be used on duplicate names
cross_join <- function(df1, df2, 
    explicit = FALSE, suffix = c("_x", "_y")) {

  common_cols <- intersect(colnames(df1), colnames(df2))

  if(length(common_cols) > 0) {
    df1 <- df1 %>% rename_at(common_cols, funs(paste0(., suffix[1])))
    df2 <- df2 %>% rename_at(common_cols, funs(paste0(., suffix[2])))
  }

  sparklyr::invoke(
    spark_dataframe(df1), 
    if(explicit) "crossJoin" else "join", 
    spark_dataframe(df2)) %>% sdf_register()
}

和过滤器与IS NOT DISTINCT FROM结果

# Enable Cross joins
sc %>% 
  spark_session() %>% 
  sparklyr::invoke("conf") %>%
  sparklyr::invoke("set", "spark.sql.crossJoin.enabled", "true")

df1 <- copy_to(sc, tibble(id1 = c(NA, "foo", "bar"), val = 1:3))
df2 <- copy_to(sc, tibble(id2 = c(NA, "foo", "baz"), val = 4:6))

df1 %>%
  cross_join(df2) %>% 
  filter(id1 %IS NOT DISTINCT FROM% id2)
# Source: spark<?> [?? x 4]
  id1   val_x id2   val_y
* <chr> <int> <chr> <int>
1 NA        1 NA        4
2 foo       2 foo       5

optimized execution plan

<jobj[62]>
  org.apache.spark.sql.catalyst.plans.logical.Join
  Join Inner, (id1#10 <=> id2#76)
:- Project [id1#10, val#11 AS val_x#129]
:  +- InMemoryRelation [id1#10, val#11], StorageLevel(disk, memory, deserialized, 1 replicas)
:        +- Scan ExistingRDD[id1#10,val#11]
+- Project [id2#76, val#77 AS val_y#132]
   +- InMemoryRelation [id2#76, val#77], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- Scan ExistingRDD[id2#76,val#77]

<=>操作者应工作方式相同:

df1 %>%
  cross_join(df2) %>% 
  filter(id1 %<=>% id2)

请注意:

  • 隐式交叉连接会失败,如果后面没有选择这促进了结果哈希连接/分类合并连接,或交叉连接is enabled
  • 明确交叉联接不应该在这种情况下使用,因为它会优先于后续选择。
  • 它可以使用dplyr风格交叉连接: mutate(df1, `_const` = TRUE) %>% inner_join(mutate(df2, `_const` = TRUE), by = c("_const")) %>% select(-`_const`) %>% filter(id1 %IS NOT DISTINCT FROM% id2) 但我建议不要说,因为它是不太可靠(根据上下文优化程序可能无法识别该_const是恒定的)。
© www.soinside.com 2019 - 2024. All rights reserved.