Sampling from a data frame by group based on a specific distribution

Question

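I am trying to sample from a data frame so that the sample reflects the distribution of a particular variable. The data frame is structured as follows: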

df <- data.frame(Location = c('A', 'B', 'B', 'B', 'C', 'C', ...),
                 Veg_Species = c('X', 'Y', 'Z', 'Z', 'Z', 'Z', ...),
                 Date_Diff = c(2, 5, 2, 0, 4, 4, ...))

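It is important to know that each Veg_Species occurs a different number of times: for example, X occurs 25 times, Y 45 times and Z 78 times. I want to sample from the different Veg_Species according to the Date_Diff distribution of the smallest group. In this case, that means sampling 25 rows from each species according to the Date_Diff distribution of X.

I thought I could do this with dplyr: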

sample.species <- df %>%
  filter(Veg_Species == 'Z') %>%
  sample_n(25, replace = TRUE)

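But this obviously just samples at random from all rows whose Veg_Species is Z.

How can I take the distribution into account as well?

For a more detailed example, click here.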

Answer 1

Perhaps you could try a kernel density estimate of the distribution of Date_Diff.

1. Data and packages

df <- read.csv("http://www.sharecsv.com/dl/2a26bf2c69bfd76e8ddcecd1c3739a31/ex.csv", row.names = 1)
library(dplyr)

2. Find the smallest species

df %>% count(Species)

#                   Species  n
# 1 Adenostoma fasciculatum 95
# 2     Artemisia filifolia 26
# 3  Eriogonum fasciculatum 41
# 4              Tamarix L. 27
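The smallest group can also be picked programmatically instead of being read off the printed counts. The following is a minimal base R sketch on made-up data; the data frame `toy`, its group labels, and the names `smallest` and `n_min` are hypothetical and only illustrate the idea.

```r
# Hypothetical stand-in data with three groups of different sizes
toy <- data.frame(Species = rep(c("a", "b", "c"), times = c(95, 26, 41)))

# Count rows per group
counts <- table(toy$Species)

# Name and size of the smallest group (the first one wins if there is a tie)
smallest <- names(counts)[which.min(counts)]
n_min    <- min(counts)

smallest
n_min
```

With values like these in hand, the species name and the sample size used in the later steps would not need to be hard-coded.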

3. Kernel density estimation and linear interpolation of the distribution

(Reference: https://stats.stackexchange.com/a/78775/218516)

val <- df$Date_Diff[df$Species == "Artemisia filifolia"]
dist.fun <- approxfun(density(val))

4. Sampling

(As of dplyr 1.0.0, sample_n() has been superseded by slice_sample().)

df2 <- df %>%
  group_by(Species) %>% 
  slice_sample(n = 26, weight_by = dist.fun(Date_Diff)) %>%
  ungroup()
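One thing to watch for: by default, approxfun() returns NA for inputs outside the grid on which density() was evaluated, and NA weights would not be usable for sampling. The sketch below, on made-up values, shows one possible way around this by treating out-of-range values as having zero density. The vectors `ref_vals` and `other_vals` are hypothetical, and using `yleft = 0, yright = 0` is an assumption about how such rows should be handled, not part of the answer above.

```r
# Hypothetical values: a small reference group and values from another group
ref_vals   <- c(1, 2, 2, 3, 3, 3, 4, 5)
other_vals <- c(0, 2, 3, 5, 9, 30)

# Default interpolation: NA outside the range covered by density()
dist_na <- approxfun(density(ref_vals))

# Alternative: values outside that range get a weight of zero
dist_0 <- approxfun(density(ref_vals), yleft = 0, yright = 0)

w_na <- dist_na(other_vals)
w_0  <- dist_0(other_vals)

anyNA(w_na)  # far-away values such as 30 have no interpolated density
anyNA(w_0)   # no missing weights with the zero fallback
```

A zero weight means such rows can never be drawn, which seems reasonable when the goal is to mimic the reference group, but each group then needs at least n rows with a positive weight when sampling without replacement.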

5. Check

df2 %>% count(Species)

#   Species                     n
#   <chr>                   <int>
# 1 Adenostoma fasciculatum    26
# 2 Artemisia filifolia        26
# 3 Eriogonum fasciculatum     26
# 4 Tamarix L.                 26
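The counts confirm the group sizes but say nothing about whether the sampled Date_Diff values resemble the reference group. A rough way to look at that is to compare quantiles. This is a base R sketch on simulated data; the group labels, the Poisson parameters, and the zero-weight fallback outside the density range are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical stand-in data: a small reference group and a larger group
# whose Date_Diff values are shifted upward
toy <- data.frame(
  Species   = rep(c("small", "large"), times = c(30, 300)),
  Date_Diff = c(rpois(30, lambda = 2), rpois(300, lambda = 6))
)

ref   <- toy$Date_Diff[toy$Species == "small"]
large <- toy[toy$Species == "large", ]

# Density of the reference group, with zero weight outside its range
dens <- approxfun(density(ref), yleft = 0, yright = 0)

# Weighted draw of 30 rows from the larger group, analogous to
# slice_sample(n = 30, weight_by = dens(Date_Diff))
picked <- large[sample(nrow(large), size = 30, prob = dens(large$Date_Diff)), ]

# Side-by-side quantiles of the reference, the weighted sample, and the full group
rbind(
  reference = quantile(ref),
  sampled   = quantile(picked$Date_Diff),
  full      = quantile(large$Date_Diff)
)
```

The weighted sample will generally not match the reference exactly, since it is limited to the rows that exist in the larger group and is drawn without replacement.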

Answer 2

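It seems to me that you want to sample the data set while keeping the distribution of Date_Diff found in the X subset.

First you need to determine what is present in the X subset. I made some fake data that looks like yours: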

set.seed(123)
df <- data.frame(Location = sample(LETTERS[1:3], 148, replace = TRUE),
                 Veg_Species = c(rep("X", 25), rep("Y", 45), rep("Z", 78)),
                 Date_Diff = trunc(runif(148, 0, 10)))

Now we need the distribution of Date_Diff for Veg_Species == "X". We can get it with dplyr:

library(dplyr)
x_dist <- df %>%
  filter(Veg_Species == "X") %>%
  group_by(Date_Diff) %>%
  summarize(count = n())
x_dist
# A tibble: 8 x 2
  Date_Diff count
      <dbl> <int>
1         1     2
2         2     6
3         3     5
4         4     3
5         5     3
6         6     2
7         7     2
8         8     2

Now we filter the original data by x_dist, nest it with nest_by(Date_Diff), and sample count rows from each group's data:

set.seed(345)
df_sample <- df %>%
  semi_join(x_dist, by = "Date_Diff") %>%  # Remove all rows with Date_Diff not in x_dist
  nest_by(Date_Diff) %>%
  inner_join(x_dist, by = "Date_Diff") %>%
  mutate(data = list(data[sample(1:nrow(data), # sampling the data
                                 size = count, 
                                 replace = TRUE),])) %>%
  summarize(data) %>%    # unnesting the data
  select(Location, Veg_Species, 
         Date_Diff, -count) # reordering columns and removing count
df_sample
# A tibble: 25 x 3
# Groups:   Date_Diff [8]
   Location Veg_Species Date_Diff
   <chr>    <chr>           <dbl>
 1 C        Z                   1
 2 A        Z                   1
 3 A        Y                   2
 4 C        Z                   2
 5 B        X                   2
 6 B        Z                   2
 7 B        X                   2
 8 B        X                   2
 9 A        Y                   3
10 A        X                   3
# ... with 15 more rows

Answer 3

The prob= argument of sample() is a vector of weights, one for each element being sampled. My idea is to sample row indices using such a weight vector. This will preserve the distribution.

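A function that samples with the distribution weights for your case:

sample_by_distribution <- function(df, dist_weights_col, n, replace = FALSE) {
  # Draw row indices, weighting each row by its value in dist_weights_col
  sampled_indexes <- sample(x = 1:nrow(df),
                            size = n,
                            replace = replace,
                            prob = df[[dist_weights_col]])
  df[sampled_indexes, ]
}

sample_df <- sample_by_distribution(df, "Date_Diff", 25, replace = FALSE)

This samples 25 rows of df, with each row's selection probability proportional to its value in the "Date_Diff" column. Therefore the distribution of "Veg_Species" should also be preserved.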
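Note that passing the raw Date_Diff values as prob makes larger values more likely and gives rows with Date_Diff of 0 no chance of being drawn. If the aim is to reproduce the Date_Diff distribution of the X subset, one possible variation is to derive the weights from X's relative frequencies instead. The following is a sketch under that assumption, using fake data shaped like the question's; the names `tgt_p`, `pool_n` and `w` are hypothetical.

```r
set.seed(123)

# Fake data shaped like the question's: 25 X rows, 45 Y rows, 78 Z rows
df <- data.frame(
  Veg_Species = c(rep("X", 25), rep("Y", 45), rep("Z", 78)),
  Date_Diff   = trunc(runif(148, 0, 10))
)

x_vals   <- df$Date_Diff[df$Veg_Species == "X"]
tgt_tab  <- table(x_vals)        # how often each value occurs in X
pool_tab <- table(df$Date_Diff)  # how often each value occurs overall

# Per row: target share of its value, divided by how many rows share that value
tgt_p  <- as.numeric(tgt_tab)[match(df$Date_Diff, as.numeric(names(tgt_tab)))] /
  length(x_vals)
pool_n <- as.numeric(pool_tab)[match(df$Date_Diff, as.numeric(names(pool_tab)))]
w <- tgt_p / pool_n
w[is.na(w)] <- 0  # values never seen in X get zero weight

# Draw 25 rows with replacement using these weights
idx <- sample(nrow(df), size = 25, replace = TRUE, prob = w)
sample_df <- df[idx, ]
```

With these weights the chance of drawing any particular Date_Diff value equals its share in the X subset, whatever the composition of the rest of the data. The same idea could be applied within each species separately to obtain 25 rows per species.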
