Removing duplicates based on 3-4 columns (dplyr)

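I realize this may have been asked before, but I am struggling to correctly remove the duplicates in df. I have used the method recommended here, but it does not remove all of the duplicates.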

# Install packages

# Load packages
library(tidyverse)
library(readxl)
library(writexl)
library(stringr)
library(textclean)
library(lubridate)

Here is my data:

dput(df[1:10,c(1,2,3,4,5,6,7)])

Output:

structure(list(username = c("Engineeer", "ftpofmpo", "sagood",
"ishtarsg", "Ohayo!", "Engineeer"), post = c("Engineers are si ginnas who recently graduated from Universities. No one stays as an Engineer like forever.\nEngineering is harder than Business but more fulfilling in the long run.\nEngineer > Manager > Director > Chief Technology Officer > Chief Executive Officer\n\tzero to sixty times",
"\n\t\n\t\t\n\t\t\t\n\t\t\t\tEngineeer said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\tThen pick up Engineering. Its harder but more fulfilling in the long run. No one stays as an Engineer like forever.\nEngineer > Manager > Director > Chief Technical Officer > Chief Executive\n\t\t\n\t\tClick to expand...\n\t\n\nhave you seen the past list of president scholars?\nif minister salary pegg to engineer pay jialat liao... check out lky statement on y salary must be high",
"i thought engineering ish dominated by ceca?????", "Always opt to be a priest.",
"after CEO beome mayor then minister?", "\n\t\n\t\t\n\t\t\t\n\t\t\t\tsagood said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\ti thought engineering ish dominated by ceca?????\n\t\t\n\t\tClick to expand...\n\t\nIf you fret Engineering its fine. Donate these good paying jobs to CECAs."
), date = structure(c(1622851200, 1622851200, 1622851200, 1622851200,
1622851200, 1622851200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), user_status = c("Supremacy Member", "Banned", "Member",
"Arch-Supremacy Member", "Great Supremacy Member", "Supremacy Member"
), treatment_implementation = c(0, 0, 0, 0, 0, 0), month_year = c(2021.41666666667,
2021.41666666667, 2021.41666666667, 2021.41666666667, 2021.41666666667,
2021.41666666667), id = c(255, 296, 747, 389, 634, 255)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

To remove the duplicate rows, I drop them based on the following three columns:

# Drop duplicate observations
df <-
  df %>%
  filter(duplicated(cbind(username, post, date)))

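After running the code above, I still see duplicate rows when I inspect the data manually. Moreover, when I run the same code again after the first attempt, it keeps removing more rows. This is confusing, because I expected all duplicate rows to be removed in a single pass (that is, by running the code only once).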

r dplyr tidyr lubridate stringr
1 Answer

You can use the distinct function from the dplyr package to drop duplicates based on specific columns.

df <- df %>%
  distinct(username, post, date, .keep_all = TRUE)
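As for why the code in the question behaves the way it does: duplicated() returns TRUE for every row that repeats an earlier row, and filter() keeps the rows where the condition is TRUE. So filter(duplicated(...)) keeps only the repeated copies and throws away everything else, and each further run shrinks the data again. Below is a minimal sketch on a small made-up tibble (toy and its values are hypothetical, not taken from the question's data). It builds the key with data.frame() rather than cbind(), but the logic is the same.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 match on username, post and date
toy <- tibble(
  username = c("a", "b", "a", "a"),
  post     = c("x", "y", "x", "z"),
  date     = as.Date(c("2021-06-05", "2021-06-05", "2021-06-05", "2021-06-05")),
  id       = c(1, 2, 1, 1)
)

# Same logic as the question: keeps ONLY the repeated copy (row 3)
only_dups <- toy %>%
  filter(duplicated(data.frame(username, post, date)))

# Running it a second time removes even more rows (nothing is left)
rerun <- only_dups %>%
  filter(duplicated(data.frame(username, post, date)))

# Negating the condition drops the repeated copies instead
no_dups_base <- toy %>%
  filter(!duplicated(data.frame(username, post, date)))

# distinct() keeps the first row of each username/post/date combination
no_dups <- toy %>%
  distinct(username, post, date, .keep_all = TRUE)

nrow(only_dups)     # 1
nrow(rerun)         # 0
nrow(no_dups_base)  # 3
nrow(no_dups)       # 3
```

Both filter(!duplicated(...)) and distinct(..., .keep_all = TRUE) keep the first occurrence of each combination, so either should work here. distinct() is simply shorter to write.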