How to return all possible categories separated by | in a single column

Question

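I have a dataset "movie" with a column named "genre" whose values look like "Action", "Action | Animation", "Animation | Fantasy". A movie can have multiple genres. I want to output a list of all possible individual categories (e.g. Adventure, Fantasy) together with their frequencies. In other words, I want to know how many movies have the genre "Action", how many have "Fantasy", and so on. I don't care about the combinations. Any suggestions?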

Answer 1

Here is a simple way to do it in base R using sapply:

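# sample data frame
df <- data.frame(genre = c("Action", "Action|Animation", "Animation|Fantasy"), stringsAsFactors = FALSE)

# get the unique genres by splitting each value on the literal "|"
uniq.genre <- unique(unlist(strsplit(df$genre, split = '\\|')))

# get the frequency: for each genre, count the rows that contain it
# (fixed = TRUE treats the genre name as plain text, not as a regex)
sapply(uniq.genre, function(genre) {
  sum(grepl(genre, df$genre, fixed = TRUE))
})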
#>    Action Animation   Fantasy 
#>         2         2         1
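
One caveat with the approach above: grepl() matches substrings, so a genre whose name is contained in another genre's name would be counted for both. A minimal sketch of an alternative that counts exact genre names instead, assuming no genre is repeated within a single row (the names `all.genres` and `genre.freq` are illustrative only):

```r
# Same sample data as above
df <- data.frame(
  genre = c("Action", "Action|Animation", "Animation|Fantasy"),
  stringsAsFactors = FALSE
)

# Split every row on the literal "|" and flatten into one vector of genres
all.genres <- unlist(strsplit(df$genre, split = "|", fixed = TRUE))

# table() counts exact occurrences of each genre name
genre.freq <- table(all.genres)
genre.freq
```

On the sample data this should give the same counts as the sapply version (Action 2, Animation 2, Fantasy 1).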

Answer 2

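One option, if there aren't too many genres, is to use grepl(), which tells you whether a particular string (such as 'Action') appears in a character value (such as 'Animation|Fantasy'):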

library(dplyr)
library(stringr)

data.frame(
  genre = c('Action', 'Fantasy|Action', 'Animation|Fantasy')
) %>% 
  mutate(
    isAction    = grepl('Action', genre),
    isAdventure = grepl('Adventure', genre),
    isAnimation = grepl('Animation', genre),
    isComedy    = grepl('Comedy', genre),
    isFantasy   = grepl('Fantasy', genre)
  )

#               genre isAction isAdventure isAnimation isComedy isFantasy
# 1            Action     TRUE       FALSE       FALSE    FALSE     FALSE
# 2    Fantasy|Action     TRUE       FALSE       FALSE    FALSE      TRUE
# 3 Animation|Fantasy    FALSE       FALSE        TRUE    FALSE      TRUE
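
The indicator columns above show which movie has which genre, but the question asks for frequencies. A rough base R sketch of how the same idea could be extended to totals, assuming the list of genres is known in advance (`movies`, `genres`, `flags` and `genre.counts` are illustrative names, not part of the original answer):

```r
# Same sample data as in the dplyr example
movies <- data.frame(
  genre = c("Action", "Fantasy|Action", "Animation|Fantasy"),
  stringsAsFactors = FALSE
)

# Assumed list of genres to look for
genres <- c("Action", "Adventure", "Animation", "Comedy", "Fantasy")

# One logical column per genre; fixed = TRUE matches the name literally
flags <- sapply(genres, function(g) grepl(g, movies$genre, fixed = TRUE))

# TRUE counts as 1, so the column sums are the number of movies per genre
genre.counts <- colSums(flags)
genre.counts
```

Genres that never occur (here Adventure and Comedy) show up with a count of 0, which may or may not be what you want.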

Answer 3

If the goal is to find the frequency of each genre, we can split the 'genre' column on the delimiter | and use mtabulate:

library(qdapTools)
mtabulate(strsplit(as.character(df1$genre), "|", fixed = TRUE))

Or use table from base R:

dat <- stack(setNames(strsplit(as.character(df1$genre), "|", 
           fixed = TRUE), seq_len(nrow(df1))))
lvls <- c('Action', 'Adventure', 'Animation', 'Comedy', 'Fantasy')
dat$values <- factor(dat$values, levels = lvls)
table(dat[2:1])

Note: this assumes that all the categories are found in the dataset.
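
The base R snippet refers to a data frame `df1` that is not defined in the answer, and its table has one row per movie. As a hedged illustration, the sketch below assumes `df1` looks like the sample data from Answer 1 and adds a colSums() step to collapse the table into overall frequencies (`parts`, `tab` and `totals` are illustrative names):

```r
# Assumed sample data; the original answer does not show df1
df1 <- data.frame(
  genre = c("Action", "Action|Animation", "Animation|Fantasy"),
  stringsAsFactors = FALSE
)

# Split each row on the literal "|" and name the pieces by row number
parts <- strsplit(as.character(df1$genre), "|", fixed = TRUE)
dat <- stack(setNames(parts, seq_len(nrow(df1))))

# Fix the factor levels so genres absent from the data still get a column
lvls <- c("Action", "Adventure", "Animation", "Comedy", "Fantasy")
dat$values <- factor(dat$values, levels = lvls)

# Rows are movies (ind), columns are genres (values)
tab <- table(dat[2:1])

# Sum over movies to get one frequency per genre
totals <- colSums(tab)
totals
```

Fixing the levels up front is what lets genres with no movies appear with a count of 0.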
