Count the number of words in each row

Question

I am trying to move R code to Spark using sparklyr, and I am having trouble finding functions that do the following:

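- Count the total number of words in a row, e.g. word = "Hello how are you", number of words: 4

- Count the total number of characters in the first word, e.g. word = "Hello how are you", characters in the first word: 5

- Count the total number of characters in the second word, e.g. word = "Hello how are you", characters in the second word: 3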

I tried the dplyr and stringr packages, but I could not get what I need.

I connect to a Spark session:

install.packages("DBI")
install.packages("ngram")

require(DBI)
require(sparklyr)
require(dplyr)
require(stringr)
require(stringi)
require(base)
require(ngram)

# Spark Config 

config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"

spark <- spark_connect(master = "yarn-client",version = "2.3.0",app_name = "Test", config=config)

Then I retrieve some data with a SQL statement:

test_query<-sdf_sql(spark,"SELECT ID, NAME  FROM table.name LIMIT 10")

NAME <- c('John Doe','Peter Gynn','Jolie Hope')
ID<-c(1,2,3)

test_query <- data.frame(NAME, ID) # example data, shown here as an R data frame; my real data is a Spark DataFrame

When I try to do the feature engineering, I get an error on the last line:

test_query <- test_query %>%
  mutate(Total_char = nchar(NAME)) %>%                            # this works
  mutate(Name_has_numbers = str_detect(NAME, "[[:digit:]]")) %>%  # this works
  mutate(Total_words = str_count(NAME, '\\w+'))                   # this line raises the error

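The error message I get is: Error: org.apache.spark.sql.AnalysisException: Undefined function: 'STR_COUNT'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.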
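The error comes from how sparklyr handles the pipeline: it translates the dplyr verbs to Spark SQL, and any R function it has no translation for is passed through by name, so str_count reaches Spark as STR_COUNT, which Spark does not define. One possible workaround, sketched below, is to call Spark SQL's own built-in functions (split, size, substring_index) directly inside mutate. This is a sketch under assumptions: a local Spark installation stands in for the YARN cluster, the names `sc`, `First_word_chars` and `Second_word_chars` are made up for illustration, and words are assumed to be separated by exactly one space.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Copy the example data from the question into Spark.
local_df <- data.frame(
  NAME = c("John Doe", "Peter Gynn", "Jolie Hope"),
  ID = c(1, 2, 3),
  stringsAsFactors = FALSE
)
test_query <- copy_to(sc, local_df, name = "test_query", overwrite = TRUE)

# split(), size() and substring_index() are not R functions here: they are
# passed through untranslated and resolved by Spark SQL.
result <- test_query %>%
  mutate(
    # split on a single space, then count the array elements
    Total_words = size(split(NAME, " ")),
    # everything before the first space
    First_word_chars = nchar(substring_index(NAME, " ", 1L)),
    # keep the first two words, then take the last of those
    Second_word_chars = nchar(
      substring_index(substring_index(NAME, " ", 2L), " ", -1L)
    )
  ) %>%
  arrange(ID) %>%
  collect()

spark_disconnect(sc)
```

Note that with this approach a name containing only one word reports the length of that word as Second_word_chars, and repeated spaces inflate Total_words, so real data may need trimming or an extra condition first.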

Answer 1

> library(tidyverse)
> test_query %>% 
      mutate(NAME = as.character(NAME),
        word_count = str_count(NAME, "\\w+"),     # Count the total number of words in a row
           N_char_first_word = nchar((gsub("(\\w+).*", "\\1", NAME)))  #Count the total number of character in the first word
                    )
        NAME ID word_count N_char_first_word
1   John Doe  1          2                 4
2 Peter Gynn  2          2                 5
3 Jolie Hope  3          2                 5
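The answer above covers the word count and the first word, but not the second word from the question. A minimal base R sketch of all three counts on a local data frame follows; the column name `N_char_second_word` is made up here to match the answer's naming, and like the answer this runs on an R data frame, not on a Spark DataFrame.

```r
# Example data from the question, as a local R data frame.
test_query <- data.frame(
  NAME = c("John Doe", "Peter Gynn", "Jolie Hope"),
  ID = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Trim the ends, then split each name on runs of whitespace.
# The result is a list with one character vector of words per row.
words <- strsplit(trimws(test_query$NAME), "\\s+")

# Number of words per row.
test_query$word_count <- lengths(words)

# Characters in the first and second word; w[2] is NA when a row has
# fewer than two words, so nchar() returns NA for that row.
test_query$N_char_first_word <- vapply(words, function(w) nchar(w[1]), integer(1))
test_query$N_char_second_word <- vapply(words, function(w) nchar(w[2]), integer(1))

print(test_query)
```

Splitting once and indexing into the result keeps the three counts consistent with each other, which is harder to guarantee with three separate regular expressions.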
