How to bin numeric data when it is mixed with characters


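Suppose I have 3 columns of data in R:

  1. Type: 'A' 'B' 'C' 'D' 'E' 'F' 'G'
  2. Value: UT 30 45 50 62 70 72
  3. Efficiency: 70 72 80 88 90 92 98

I want to bin only the numeric data in the "Value" column in increments of 20 and show the bins on the x-axis, while keeping the text value [so the x-axis would show: UT 30-50 50-70 70-90], with "Efficiency" on the y-axis and color = Type.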

Binning purely numeric data seems simple: bins <- seq(30, 80, by = 20) then plotting, but having that 'UT' is giving me a real challenge.

I'm a newbie; this is what I have been playing with so far:

#create bins
bins <- seq(30, 90, by = 20) 

#create plot 
ggplot(df, aes(x = cut(`Value`, breaks = bins, labels = sprintf("%d-%d", bins[-length(bins)], bins[-1])), y = 'Efficiency', color = `Type`)) 
r ggplot2 binning
1 Answer
df <- data.frame(
  Type = c("A", "B", "C", "D", "E", "F", "G"),
  Value = c("UT", "30", "45", "50", "62", "70", "72"),
  Efficiency = c(70, 72, 80, 88, 90, 92, 98)
)

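Take a step back and think about how the dataset is organized. Each column of a data frame is a vector of a single type, so including "UT" in that column makes the whole column character: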

dplyr::glimpse(df)
Rows: 7
Columns: 3
$ Type       <chr> "A", "B", "C", "D", "E", "F", "G"
$ Value      <chr> "UT", "30", "45", "50", "62", "70", "72"
$ Efficiency <dbl> 70, 72, 80, 88, 90, 92, 98
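
The coercion is easy to reproduce in base R. As a minimal sketch (the names mixed and as_num are only illustrative and are not used elsewhere in this answer):

```r
# Mixing a string with numbers in one vector coerces every element to character
mixed <- c("UT", 30, 45)
class(mixed)

# Converting back to numeric turns the non-numeric entry into NA.
# R normally warns about this; the warning is suppressed here on purpose.
as_num <- suppressWarnings(as.numeric(mixed))
as_num
```

This is why the text marker has to be stored somewhere else before the column can be treated as numbers.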

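Your Value column looks like it came from numeric data, but you have that pesky "UT" in it. In this case I think it is best to add more detail to the data frame so that it describes the situation more precisely: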

# New column to give context to the value column
df$Value_type <- ifelse(df$Value == "UT", "UT", NA)
df$Value[df$Value == "UT"] <- NA
df$Value <- as.numeric(df$Value)

# Categories for the Value column
df <- df |>
  dplyr::mutate(Value_cat = dplyr::case_when(
    30 <= Value & Value < 50 ~ "30-50",
    50 <= Value & Value < 70 ~ "50-70",
    70 <= Value & Value < 90 ~ "70-90",
    Value_type == "UT" ~ "UT",
    .default = NA
  ))

# Set factor levels so that any plots have desired order 
df$Value_cat <- factor(df$Value_cat, levels = c(
  "UT", "30-50", "50-70", "70-90"
))

dplyr::glimpse(df)
Rows: 7
Columns: 5
$ Type       <chr> "A", "B", "C", "D", "E", "F", "G"
$ Value      <dbl> NA, 30, 45, 50, 62, 70, 72
$ Efficiency <dbl> 70, 72, 80, 88, 90, 92, 98
$ Value_type <chr> "UT", NA, NA, NA, NA, NA, NA
$ Value_cat  <fct> UT, 30-50, 30-50, 50-70, 50-70, 70-90, 70-90

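Now we are closer to the fabled "tidy" dataset: each row is a single observation, and each observation has its characteristics clearly defined in separate variables/columns. You also have convenient access to the Value measurements as a numeric vector. Whenever you need to write any kind of logic about a numeric property, it is usually best to avoid storing the numbers as strings.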
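
If you would rather keep the cut() approach from the question than spell out each range in case_when(), something along these lines may work. This is only a sketch under assumptions: it rebuilds the example data so that it runs on its own (which overwrites df), and the helper names value_num, value_cat and bin_labels are made up for illustration.

```r
# Rebuild the example data (same values as in the question)
df <- data.frame(
  Type = c("A", "B", "C", "D", "E", "F", "G"),
  Value = c("UT", "30", "45", "50", "62", "70", "72"),
  Efficiency = c(70, 72, 80, 88, 90, 92, 98)
)

bins <- seq(30, 90, by = 20)
bin_labels <- sprintf("%d-%d", bins[-length(bins)], bins[-1])

# Non-numeric entries such as "UT" become NA here (warning suppressed on purpose)
value_num <- suppressWarnings(as.numeric(df$Value))

# right = FALSE makes the bins left-closed: [30,50), [50,70), [70,90)
value_cat <- as.character(
  cut(value_num, breaks = bins, labels = bin_labels, right = FALSE)
)

# Put the text label back wherever the original value was "UT"
value_cat[df$Value == "UT"] <- "UT"

# Order the levels so that "UT" comes first on a plot axis
df$Value_cat <- factor(value_cat, levels = c("UT", bin_labels))
df$Value_cat
```

Using right = FALSE matches the 30 <= Value & Value < 50 style of the ranges above; with the default right = TRUE, a value of exactly 30 would fall outside the first bin.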

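Because you have the category "UT", your x-axis is really categorical rather than numeric. I would suggest a bar chart built from these category "bins" instead of a histogram with a set bin width, since histograms are generally meant for continuous numeric data.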

library(ggplot2)
# Since your `fill` is not 1:1 with your x-axis, you can set
# the position of the bars in the position argument
ggplot(data = df, mapping = aes(x = Value_cat, y = Efficiency)) +
  geom_col(aes(fill = Type), position = position_dodge())
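
The question asked for color = Type rather than filled bars, so a point plot over the same categories may be closer to what you had in mind. The following is a sketch, not the only way to do it; plot_df is a hypothetical stand-in that repeats the relevant columns so the block runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for the data frame built above
plot_df <- data.frame(
  Type = c("A", "B", "C", "D", "E", "F", "G"),
  Efficiency = c(70, 72, 80, 88, 90, 92, 98),
  Value_cat = factor(
    c("UT", "30-50", "30-50", "50-70", "50-70", "70-90", "70-90"),
    levels = c("UT", "30-50", "50-70", "70-90")
  )
)

# One point per row: binned Value on x, Efficiency on y, colored by Type
p <- ggplot(plot_df, aes(x = Value_cat, y = Efficiency, color = Type)) +
  geom_point(size = 3) +
  labs(x = "Value", y = "Efficiency")
p
```

Because Value_cat is a factor with explicit levels, the axis keeps the order UT, 30-50, 50-70, 70-90 instead of sorting the labels alphabetically.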
