Which products are most often sold together? - Analysis exercise

Question  Votes: 0  Answers: 4

I've been working on a data analysis exercise in R that uses sales data. The data frame looks like this:

   Order_ID Product                 
      <dbl> <chr>                   
 1   319631 34in Ultrawide Monitor  
 2   319631 Lightning Charging Cable
 3   319596 iPhone                  
 4   319596 Lightning Charging Cable
 5   319584 iPhone                  
 6   319584 Wired Headphones        
 7   319556 Google Phone            
 8   319556 Wired Headphones

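I need to find out which products are most often bought together. Order_ID values repeat: rows that share an Order_ID were bought by the same person in the same order.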

I solved this exercise in Python, but I can't work out how to do it in R. My Python code is:

pares_compras[['Order ID', 'Product']]

>  Order ID Product
2   176560  Google Phone
3   176560  Wired Headphones
17  176574  Google Phone
18  176574  USB-C Charging Cable
29  176585  Bose SoundSport Headphones

pares_compras['Grouped Products'] = pares_compras.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
pares_compras['Grouped Products']

>
2                             Google Phone,Wired Headphones
3                             Google Phone,Wired Headphones
17                        Google Phone,USB-C Charging Cable
18                        Google Phone,USB-C Charging Cable

pares_compras = pares_compras[['Order ID', 'Grouped Products']].drop_duplicates()
pares_compras

>   Order ID    Grouped Products
2   176560  Google Phone,Wired Headphones
17  176574  Google Phone,USB-C Charging Cable
29  176585  Bose SoundSport Headphones,Bose SoundSport Hea...
31  176586  AAA Batteries (4-pack),Google Phone
118 176672  Lightning Charging Cable,USB-C Charging Cable

from collections import Counter
from itertools import combinations

count = Counter()

for row in pares_compras['Grouped Products']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))
count

> Counter({('Google Phone', 'Wired Headphones'): 414,
         ('Google Phone', 'USB-C Charging Cable'): 987,
         ('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27, ... )}

for key, num in count.most_common(5):
    print(key, num)
>
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
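
The counting step above can also be written without pandas. The following is only a minimal sketch, assuming the orders are available as (order ID, product) tuples; the sample rows are the eight rows of the data frame shown at the top, and count_pairs is a hypothetical helper name. It sorts the distinct products of each order, so ('A', 'B') and ('B', 'A') are counted as one pair and a product repeated within an order does not pair with itself.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sample rows copied from the data frame shown in the question.
rows = [
    (319631, "34in Ultrawide Monitor"),
    (319631, "Lightning Charging Cable"),
    (319596, "iPhone"),
    (319596, "Lightning Charging Cable"),
    (319584, "iPhone"),
    (319584, "Wired Headphones"),
    (319556, "Google Phone"),
    (319556, "Wired Headphones"),
]


def count_pairs(rows):
    """Count how many orders contain each unordered product pair."""
    # Collect the distinct products of every order.
    baskets = defaultdict(set)
    for order_id, product in rows:
        baskets[order_id].add(product)

    counts = Counter()
    for products in baskets.values():
        # Sorting makes the pair order deterministic within each order.
        counts.update(combinations(sorted(products), 2))
    return counts


pair_counts = count_pairs(rows)
for pair, n in pair_counts.most_common(5):
    print(pair, n)
```

Whether repeated products in one order should count as a pair is a modelling decision; this sketch assumes they should not.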

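So I can solve the exercise this way, but as I said, I can't do the same thing in R. I can't find a way, and I've only just started using R. If anyone can help me I'd be very grateful, thanks.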

r dplyr
4 Answers
0
votes

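Well, I think a co-occurrence matrix is actually a good solution here.

Another approach is to look at how different or similar the product profiles are.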

orders <- read.csv(header = TRUE, text ='
"row", "order", "product"
1,   319631, "34in Ultrawide Monitor"
2,   319631, "Lightning Charging Cable"
3,   319596, "iPhone"
4,   319596, "Lightning Charging Cable"
5,   319584, "iPhone"
6,   319584, "Wired Headphones"
7,   319556, "Google Phone"
8,   319556, "Wired Headphones"')  |>
  dplyr::mutate(product = trimws(product))

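# One row per order, one column per product; TRUE where the order contains it
df <- tidyr::pivot_wider(orders,
                   values_from = product, 
                   names_from = product, 
                   id_cols = order) |>
  dplyr::mutate(dplyr::across(`34in Ultrawide Monitor`:`Google Phone`,
                              ~ !is.na(.x))) |>
  dplyr::select(-order)

# Correlation between the product columns
cor(df)

# Distances between product profiles: Euclidean (the default), then binary
dist(t(df))
dist(t(df), method = "binary")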
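
To make the two ideas in this answer concrete: the co-occurrence count of two products is the number of orders containing both, and R's "binary" distance is the share of orders containing exactly one of the two products among the orders containing at least one (the Jaccard distance). The Python sketch below is only an illustration under the assumption that the data are (order, product) tuples; co_occurrence and binary_distance are hypothetical helper names, not part of any library.

```python
# Sample rows copied from the data frame shown in the question.
rows = [
    (319631, "34in Ultrawide Monitor"),
    (319631, "Lightning Charging Cable"),
    (319596, "iPhone"),
    (319596, "Lightning Charging Cable"),
    (319584, "iPhone"),
    (319584, "Wired Headphones"),
    (319556, "Google Phone"),
    (319556, "Wired Headphones"),
]

# Product profile: the set of orders in which each product appears.
orders_by_product = {}
for order_id, product in rows:
    orders_by_product.setdefault(product, set()).add(order_id)


def co_occurrence(p, q):
    """Number of orders that contain both products."""
    return len(orders_by_product[p] & orders_by_product[q])


def binary_distance(p, q):
    """Orders with exactly one of the products / orders with at least one."""
    a, b = orders_by_product[p], orders_by_product[q]
    return len(a ^ b) / len(a | b)


print(co_occurrence("iPhone", "Lightning Charging Cable"))
print(binary_distance("Google Phone", "Wired Headphones"))
```

A smaller distance means the two products tend to appear in the same orders; a distance of 1 means they never share an order.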





0
votes

I'll leave this here as an alternative for you.

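Here I build two data.frames: one listing the unique product pairs and one listing the products of each order. Nested apply-family calls (lapply and sapply) then check whether each pair is contained in each order, and the counts are the rowSums of the cbind-ed results.
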
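library(dplyr)

# All unordered pairs of distinct products (kept as character, not factor)
a <- expand.grid(a = df$Product, b = df$Product, stringsAsFactors = FALSE) |>
  rowwise() |>
  mutate(c = list(sort(c(a, b))), a = c[[1]], b = c[[2]]) |>
  distinct() |>
  filter(a != b)

# Products of each order collected into a list column
b <- df |>
  group_by(Order_ID) |>
  summarise(Product = list(c(Product)))

# For each pair, count the orders that contain both products
a$count <- rowSums(do.call(cbind,
  lapply(b$Product, \(one) sapply(a$c, \(two) +(all(two %in% one))))))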

   a                      b                      count
   <chr>                  <chr>                  <dbl>
 1 34inUltrawideMonitor   LightningChargingCable     1
 2 34inUltrawideMonitor   iPhone                     0
 3 34inUltrawideMonitor   WiredHeadphones            0
 4 34inUltrawideMonitor   GooglePhone                0
 5 iPhone                 LightningChargingCable     1
 6 LightningChargingCable WiredHeadphones            0
 7 GooglePhone            LightningChargingCable     0
 8 iPhone                 WiredHeadphones            1
 9 GooglePhone            iPhone                     0
10 GooglePhone            WiredHeadphones            1
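
For readers coming from the Python side of the question, the same idea (enumerate every candidate pair first, then count the orders that contain both products, keeping the zero counts) might look like the sketch below. It assumes the data are (order, product) tuples and uses only the standard library; the variable names are illustrative, and the case-insensitive sort is an assumption chosen to keep the pair order readable.

```python
from itertools import combinations

# Sample rows copied from the data frame shown in the question.
rows = [
    (319631, "34in Ultrawide Monitor"),
    (319631, "Lightning Charging Cable"),
    (319596, "iPhone"),
    (319596, "Lightning Charging Cable"),
    (319584, "iPhone"),
    (319584, "Wired Headphones"),
    (319556, "Google Phone"),
    (319556, "Wired Headphones"),
]

# Products of each order (the counterpart of the list column in `b`).
baskets = {}
for order_id, product in rows:
    baskets.setdefault(order_id, set()).add(product)

# Every unordered pair of distinct products (the counterpart of `a`).
products = sorted({product for _, product in rows}, key=str.lower)

# For each pair, count the orders whose basket contains both products.
pair_counts = {
    pair: sum(1 for items in baskets.values() if set(pair) <= items)
    for pair in combinations(products, 2)
}

for pair, n in pair_counts.items():
    print(pair, n)
```

Because every pair is checked against every order, this costs roughly (number of pairs) x (number of orders) and is best suited to small product catalogues.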


0
votes

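A solution that uses a data.table join instead of a co-occurrence matrix. On a larger dataset (about 3M rows) it is almost twice as fast on my machine as the crossprod(xtabs( approach from this answer.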

library(data.table)
library(Matrix) # for comparison with a co-occurrence matrix solution

# Example dataset
n <- 1e6L
orderID <- rep.int(sample.int(n), rpois(n, 1) + 2L)
dt <- unique(data.table(orderID, product = stringi::stri_rand_strings(length(orderID), 2, pattern = "[a-z]")))

# solution using a data.table join
f1 <- function(dt) {
  dt2 <- dt[
    , x := .I
  ][
    dt,
    on = .(orderID = orderID, x > x),
    nomatch = 0
  ][
    product > i.product, c("product", "i.product") := list(i.product, product)
  ][
    , .(count = .N), .(product, i.product)
  ]
  dt[, x := NULL]
  setnames(dt2, c("product1", "product2", "count"))
  setorder(dt2, -count, product1, product2)
}

# co-occurrence matrix solution (slightly modified so the output of the two
# functions is the same)
f2 <- function(dt) {
  dt$product <- as.factor(dt$product)
  dt4 <- setDT(
    summary(
      crossprod(
        xtabs(~ orderID + product, dt, sparse = TRUE)
      )
    )
  )[
    i < j
  ][
    , `:=`(
      i = as.character(levels(dt$product)[i]),
      j = as.character(levels(dt$product)[j]),
      x = as.integer(x)
    )
  ]
  dt[, product := as.character(product)]
  attr(dt4, "header") <- NULL
  setnames(dt4, c("product1", "product2", "count"))
  setorder(dt4, -count, product1, product2)
}

Benchmarking:
    
#> Unit: seconds
#>  expr      min       lq     mean   median       uq      max neval
#>    f1 1.026762 1.128890 1.224291 1.261732 1.278617 1.362014    10
#>    f2 1.984068 2.295159 2.355434 2.403337 2.465651 2.589428    10
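
The join in f1 pairs every row with the later rows of the same order (same orderID, larger row index x), puts each pair into a fixed product order, and counts the pairs. A rough Python equivalent of that logic, shown on the small sample from the question rather than the benchmark data, could look like this. It is a sketch that assumes unique (order, product) rows, as f1 does; all names are illustrative.

```python
from collections import Counter, defaultdict

# Sample rows copied from the data frame shown in the question.
rows = [
    (319631, "34in Ultrawide Monitor"),
    (319631, "Lightning Charging Cable"),
    (319596, "iPhone"),
    (319596, "Lightning Charging Cable"),
    (319584, "iPhone"),
    (319584, "Wired Headphones"),
    (319556, "Google Phone"),
    (319556, "Wired Headphones"),
]

# Index row numbers by order ID (the lookup side of the join).
rows_by_order = defaultdict(list)
for x, (order_id, _) in enumerate(rows):
    rows_by_order[order_id].append(x)

counts = Counter()
for i_x, (order_id, i_product) in enumerate(rows):
    for x in rows_by_order[order_id]:
        if x > i_x:  # non-equi condition: each row pair is matched once
            product = rows[x][1]
            # Put the two products in a fixed order so (A, B) == (B, A).
            pair = (min(product, i_product), max(product, i_product))
            counts[pair] += 1

# Sort by descending count, then by product names, like setorder() in f1.
result = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for (product1, product2), n in result:
    print(product1, product2, n)
```

The x > i_x condition is what prevents a row from being paired with itself or counted twice.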

-2
votes

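I built a simple program that uses Python and Pandas to find the products most often sold together. You can use it without installing anything or customizing the code. Details at https://nguyenvanthu.com/cach-tim-san-pham-ban-kem-cung-nhau-nhieu-nhat-tu-file-bao-cao-excel/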
