ggplot2 boxplot shows a different median than the calculated median

Question

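I am plotting a simple boxplot of weights for two groups by year, based on a large dataset (2.15 million cases). According to my calculation, all groups have the same median except the last group in the last year, yet in the boxplot that group is drawn with the same median as all the others.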

# boxplot
library(ggplot2)
ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA)+
  ylim(0,850)


# median by group
library(dplyr)
pivot <- dataset %>%
  select(SUM_MME_mg,GenderPerson,Year )%>%
  group_by(Year, GenderPerson) %>%
  summarise(MedianValues = median(SUM_MME_mg,na.rm=TRUE))

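I can't figure out what I am doing wrong, or which result is more accurate: the one computed by the boxplot or the one from the median function. R does not return any errors or warnings.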

 #my data:
> dput(head(dataset[,c(1,7,10)]))
structure(list(GenderPerson = c(2L, 1L, 2L, 2L, 2L, 2L), Year = c("2015", 
"2014", "2013", "2012", "2011", "2015"), SUM_MME_mg = c(416.16, 
131.76, 790.56, 878.4, 878.4, 878.4)), row.names = c(NA, 6L), class = "data.frame")

Answer 1

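The reason for this behavior lies in how ylim() works. ylim() is a convenience wrapper for scale_y_continuous(limits = ...). If you look into the documentation for the scale_continuous functions, you will see that setting limits does not just zoom in on a region: by default, values outside the limits are replaced with NA and then dropped. This happens before any statistics are computed, which is why the median differs when you use ylim(). Your calculation "outside" of ggplot() uses the whole dataset, whereas with ylim() the out-of-range data points have already been removed by the time the boxplot statistics are calculated.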
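The effect can be reproduced without drawing anything. The sketch below is only an illustration: it rebuilds the six rows from the question's dput() output and computes the grouped median twice, once on all rows and once after discarding values outside [0, 850], which approximates what ylim(0, 850) does before the boxplot statistics are computed. The names full_median and clipped_median are made up for this example.

```r
library(dplyr)

# First six rows of the data, copied from the question's dput() output
dataset <- data.frame(
  GenderPerson = c(2L, 1L, 2L, 2L, 2L, 2L),
  Year = c("2015", "2014", "2013", "2012", "2011", "2015"),
  SUM_MME_mg = c(416.16, 131.76, 790.56, 878.4, 878.4, 878.4)
)

# Median over all rows, as in the question's summarise() call
full_median <- dataset %>%
  group_by(Year, GenderPerson) %>%
  summarise(MedianValues = median(SUM_MME_mg, na.rm = TRUE), .groups = "drop")

# Median after dropping values outside [0, 850]; this mimics the
# filtering that ylim(0, 850) applies before the statistics are computed
clipped_median <- dataset %>%
  filter(SUM_MME_mg >= 0, SUM_MME_mg <= 850) %>%
  group_by(Year, GenderPerson) %>%
  summarise(MedianValues = median(SUM_MME_mg, na.rm = TRUE), .groups = "drop")

full_median
clipped_median
```

In this sample the three rows equal to 878.4 fall outside the limits, so the 2011 and 2012 groups disappear from clipped_median and the 2015 median changes.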

Fortunately, there is a simple fix: use coord_cartesian(ylim=...) instead of ylim(), because coord_cartesian() only zooms in on the plot and does not remove any data points. Compare the two:

ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA) + ylim(0,850)

ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim=c(0,850))

There is also a hint at this behavior: the first code block, which uses ylim(), should give you a warning message:

Warning message:
Removed 3 rows containing non-finite values (stat_boxplot).

whereas the second one, which uses coord_cartesian(ylim=...), does not.
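To check which median each plot actually draws, one option is to read the computed statistics back from ggplot2 with layer_data(), whose middle column holds the median line of each box. The following is a minimal sketch that assumes only the two 2015 rows from the question's dput() sample; d2015, med_ylim and med_coord are names made up for this example.

```r
library(ggplot2)

# The two 2015 rows from the question's dput() sample
d2015 <- data.frame(
  Year = c("2015", "2015"),
  SUM_MME_mg = c(416.16, 878.4)
)

p <- ggplot(d2015, aes(x = Year, y = SUM_MME_mg)) +
  geom_boxplot(outlier.shape = NA)

# layer_data() returns the statistics ggplot2 computed for a layer;
# `middle` is the median drawn inside the box.
# suppressWarnings() hides the "Removed ... rows" warning raised by ylim().
med_ylim  <- suppressWarnings(layer_data(p + ylim(0, 850))$middle)
med_coord <- layer_data(p + coord_cartesian(ylim = c(0, 850)))$middle

med_ylim                   # median of the rows left after ylim() drops 878.4
med_coord                  # median of both rows
median(d2015$SUM_MME_mg)   # matches med_coord
```

Only the coord_cartesian() version agrees with median() computed on the full data, which is consistent with the explanation above.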
