I've been working through a data analysis exercise in R that involves sales data. The data frame looks like this:
Order_ID Product
<dbl> <chr>
1 319631 34in Ultrawide Monitor
2 319631 Lightning Charging Cable
3 319596 iPhone
4 319596 Lightning Charging Cable
5 319584 iPhone
6 319584 Wired Headphones
7 319556 Google Phone
8 319556 Wired Headphones
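I have to find out which products are most often bought together. A repeated Order_ID means the products were bought in the same order, by the same person.
I did this exercise in Python, but I can't manage it in R. My Python code is: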
pares_compras[['Order ID', 'Product']]
> Order ID Product
2 176560 Google Phone
3 176560 Wired Headphones
17 176574 Google Phone
18 176574 USB-C Charging Cable
29 176585 Bose SoundSport Headphones
pares_compras['Grouped Products'] = pares_compras.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
pares_compras['Grouped Products']
>
2 Google Phone,Wired Headphones
3 Google Phone,Wired Headphones
17 Google Phone,USB-C Charging Cable
18 Google Phone,USB-C Charging Cable
pares_compras = pares_compras[['Order ID', 'Grouped Products']].drop_duplicates()
pares_compras
> Order ID Grouped Products
2 176560 Google Phone,Wired Headphones
17 176574 Google Phone,USB-C Charging Cable
29 176585 Bose SoundSport Headphones,Bose SoundSport Hea...
31 176586 AAA Batteries (4-pack),Google Phone
118 176672 Lightning Charging Cable,USB-C Charging Cable
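from collections import Counter
from itertools import combinations
# count every pair of products that appears within the same order
count = Counter()
for row in pares_compras['Grouped Products']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))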
count
> Counter({('Google Phone', 'Wired Headphones'): 414,
('Google Phone', 'USB-C Charging Cable'): 987,
('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27, ... )}
for key, num in count.most_common(5):
    print(key, num)
>
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
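So that is how I solved the exercise in Python, but, as I said, I can't do the same thing in R and I can't find a way; I have only just started using R. If anyone can help me I would be very grateful, thanks.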
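For reference, the Python steps above can be collected into one self-contained sketch. It is a simplification under a few assumptions: it uses only the standard library instead of pandas, it takes the eight sample rows from the R data frame as input (the names `rows`, `products_by_order` and `pair_counts` are just for illustration), and it sorts the products within each order so that (A, B) and (B, A) count as the same pair, which the original code does not do.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sample rows copied from the R data frame in the question: (Order_ID, Product)
rows = [
    (319631, "34in Ultrawide Monitor"),
    (319631, "Lightning Charging Cable"),
    (319596, "iPhone"),
    (319596, "Lightning Charging Cable"),
    (319584, "iPhone"),
    (319584, "Wired Headphones"),
    (319556, "Google Phone"),
    (319556, "Wired Headphones"),
]

# Group the products by order ID (the role groupby + join plays in the pandas version)
products_by_order = defaultdict(list)
for order_id, product in rows:
    products_by_order[order_id].append(product)

# Count every unordered pair within each order.
# Sorting first is an addition of this sketch: it makes (A, B) and (B, A) one key.
pair_counts = Counter()
for products in products_by_order.values():
    pair_counts.update(combinations(sorted(products), 2))

for pair, n in pair_counts.most_common(5):
    print(pair, n)
```

On these eight rows every pair occurs exactly once, so the ranking only becomes meaningful on the full dataset.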
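Well, I think a co-occurrence matrix is actually a good solution.
Another approach is to look at how similar or different the product profiles are.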
orders <- read.csv(header = TRUE, text ='
"row", "order", "product"
1, 319631, "34in Ultrawide Monitor"
2, 319631, "Lightning Charging Cable"
3, 319596, "iPhone"
4, 319596, "Lightning Charging Cable"
5, 319584, "iPhone"
6, 319584, "Wired Headphones"
7, 319556, "Google Phone"
8, 319556, "Wired Headphones"') |>
dplyr::mutate(product = trimws(product))
# one row per order, one logical column per product (TRUE if the order contains it)
df <- tidyr::pivot_wider(orders,
                         values_from = product,
                         names_from = product,
                         id_cols = order) |>
  dplyr::mutate(dplyr::across(`34in Ultrawide Monitor`:`Google Phone`,
                              ~ !is.na(.x))) |>
  dplyr::select(-order)
cor(df)
dist(t(df))
dist(t(df), method = "binary")
I'll leave this here as an alternative for you.
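To make the last line a bit more concrete: as far as I understand it, dist(..., method = "binary") treats each product column as on/off flags and returns, for two products, the share of orders containing exactly one of them among the orders containing at least one of them (the Jaccard distance). Below is a small Python sketch of that calculation on the sample orders. The names `orders_by_product` and `binary_distance` are made up for illustration, and the return value for two empty sets is my own convention.

```python
# Orders containing each product, taken from the sample rows in the question
orders_by_product = {
    "34in Ultrawide Monitor": {319631},
    "Lightning Charging Cable": {319631, 319596},
    "iPhone": {319596, 319584},
    "Wired Headphones": {319584, 319556},
    "Google Phone": {319556},
}


def binary_distance(a, b):
    """Share of orders holding exactly one of the two products,
    among orders holding at least one of them (Jaccard distance)."""
    union = a | b
    if not union:
        return 0.0  # assumption of this sketch: two empty sets are at distance 0
    return len(a ^ b) / len(union)


# Products that share an order end up closer than products that never do
shared = binary_distance(orders_by_product["iPhone"],
                         orders_by_product["Lightning Charging Cable"])
disjoint = binary_distance(orders_by_product["iPhone"],
                           orders_by_product["Google Phone"])
print(shared, disjoint)
```

A distance of 1 means the two products never appear in the same order, so small distances point at candidates for "bought together".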
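Here I list the unique product combinations in a data.frame and use nested apply functions to check whether each combination is contained in each order, computing the result with rowSums after cbind-ing.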
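library(dplyr)
# all unordered pairs of distinct products
a <- expand.grid(a = df$Product, b = df$Product) |>
  rowwise() |>
  mutate(c = list(sort(c(a, b))), a = c[[1]], b = c[[2]]) |>
  distinct() |>
  filter(a != b)
# the products of each order, as a list column
b <- df |>
  group_by(Order_ID) |>
  summarise(Product = list(c(Product)))
# for each pair, count the orders that contain both products
a$count <- rowSums(do.call(cbind,
  lapply(b$Product, \(one) sapply(a$c, \(two) +(all(two %in% one))))))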
a b count
<chr> <chr> <dbl>
1 34inUltrawideMonitor LightningChargingCable 1
2 34inUltrawideMonitor iPhone 0
3 34inUltrawideMonitor WiredHeadphones 0
4 34inUltrawideMonitor GooglePhone 0
5 iPhone LightningChargingCable 1
6 LightningChargingCable WiredHeadphones 0
7 GooglePhone LightningChargingCable 0
8 iPhone WiredHeadphones 1
9 GooglePhone iPhone 0
10 GooglePhone WiredHeadphones 1
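For readers coming from the Python side of the question, the same idea (enumerate every pair of distinct products first, then count the orders that contain both) can be sketched roughly like this. It is only an illustration on the sample rows from the question; the names `orders`, `pairs` and `counts` are my own, and unlike a Counter over observed pairs it also keeps pairs that never occur together, with a count of 0.

```python
from itertools import combinations

# Products per order, from the sample rows in the question
orders = {
    319631: {"34in Ultrawide Monitor", "Lightning Charging Cable"},
    319596: {"iPhone", "Lightning Charging Cable"},
    319584: {"iPhone", "Wired Headphones"},
    319556: {"Google Phone", "Wired Headphones"},
}

# Every unordered pair of distinct products, whether or not it ever co-occurs
all_products = sorted(set().union(*orders.values()))
pairs = list(combinations(all_products, 2))

# For each pair, count the orders whose product set contains both members
counts = {
    pair: sum(set(pair) <= items for items in orders.values())
    for pair in pairs
}

for pair, n in counts.items():
    print(pair, n)
```

With five products this gives ten pairs, which matches the ten rows of the table above.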
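A solution using a data.table join instead of a co-occurrence matrix. On a larger dataset (about 3M rows) it is almost twice as fast on my machine as the crossprod(xtabs( approach from another answer.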
library(data.table)
library(Matrix) # for comparison with a co-occurrence matrix solution
# Example dataset
n <- 1e6L
orderID <- rep.int(sample.int(n), rpois(n, 1) + 2L)
dt <- unique(data.table(orderID, product = stringi::stri_rand_strings(length(orderID), 2, pattern = "[a-z]")))
# solution using a data.table join
f1 <- function(dt) {
  dt2 <- dt[
    , x := .I
  ][
    dt,
    on = .(orderID = orderID, x > x),
    nomatch = 0
  ][
    product > i.product, c("product", "i.product") := list(i.product, product)
  ][
    , .(count = .N), .(product, i.product)
  ]
  dt[, x := NULL]
  setnames(dt2, c("product1", "product2", "count"))
  setorder(dt2, -count, product1, product2)
}
# co-occurrence matrix solution (slightly modified so the output of the two
# functions is the same)
f2 <- function(dt) {
  dt$product <- as.factor(dt$product)
  dt4 <- setDT(
    summary(
      crossprod(
        xtabs(~ orderID + product, dt, sparse = TRUE)
      )
    )
  )[
    i < j
  ][
    , `:=`(
      i = as.character(levels(dt$product)[i]),
      j = as.character(levels(dt$product)[j]),
      x = as.integer(x)
    )
  ]
  dt[, product := as.character(product)]
  attr(dt4, "header") <- NULL
  setnames(dt4, c("product1", "product2", "count"))
  setorder(dt4, -count, product1, product2)
}
Benchmarking:
#> Unit: seconds
#> expr min lq mean median uq max neval
#> f1 1.026762 1.128890 1.224291 1.261732 1.278617 1.362014 10
#> f2 1.984068 2.295159 2.355434 2.403337 2.465651 2.589428 10
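I built a simple program, using Python and Pandas, that finds the products most often sold together. You can use it without installing anything or customizing the code. Details are at https://nguyenvanthu.com/cach-tim-san-pham-ban-kem-cung-nhau-nhieu-nhat-tu-file-bao-cao-excel/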