Data manipulation: selecting users based on variables

Question

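I am currently working on a machine learning project. I have a large dataset scraped from the forum www.stormfront.com. The dataset has 7 columns: stormfront_self_content (the forum post), stormfront_lang_id, stormfront_publication_date, stormfront_topic, stormfront_docid, stormfront_category, and stormfront_user.

I would like to select the group of users who have been registered on the forum for more than a year and have written more than 500 posts, but I am not sure how to do this.

Any help would be greatly appreciated.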

Answer 1

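Assuming you have an id column that identifies each user (in the dataset described, this is presumably stormfront_user), we can group_by each id and keep the groups that have more than 500 rows and in which the number of days between the max and min publication dates is greater than 365.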

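library(dplyr)
library(lubridate)

# Assumes the dates parse as year-month-day hour:minute:second;
# use a different lubridate parser if the format differs.
df %>%
  mutate(stormfront_publication_date = ymd_hms(stormfront_publication_date)) %>%
  group_by(id) %>%
  # Keep users with more than 500 posts whose first and last posts
  # are more than 365 days apart.
  filter(n() > 500 & difftime(max(stormfront_publication_date),
                              min(stormfront_publication_date),
                              units = 'days') > 365)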
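
The filter above returns the matching posts, still grouped by user. If what is needed is the list of users itself, one possible sketch is to summarise per user first and then filter. Everything in the example is an assumption for illustration: the toy data frame posts, the names min_posts, min_days and active_users, and the use of stormfront_user as the user column. Since the dataset has no registration date, the time between a user's first and last post is used as a stand-in for how long they have been registered.

```r
library(dplyr)

# Hypothetical toy data: user "a" posts over roughly two years,
# user "b" posts within a single month, user "c" posts once.
posts <- data.frame(
  stormfront_user = c("a", "a", "a", "b", "b", "b", "c"),
  stormfront_publication_date = as.POSIXct(
    c("2015-01-01 10:00:00", "2016-03-01 10:00:00", "2017-01-05 10:00:00",
      "2016-01-01 10:00:00", "2016-01-10 10:00:00", "2016-01-20 10:00:00",
      "2016-06-01 10:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

min_posts <- 2    # small threshold for the toy data; use 500 on the real data
min_days  <- 365  # first and last post must be more than a year apart

active_users <- posts %>%
  group_by(stormfront_user) %>%
  summarise(
    n_posts   = n(),
    span_days = as.numeric(difftime(max(stormfront_publication_date),
                                    min(stormfront_publication_date),
                                    units = "days")),
    .groups = "drop"
  ) %>%
  filter(n_posts > min_posts, span_days > min_days)

# One row per qualifying user, with post count and activity span
active_users
```

The resulting stormfront_user column can then be used to subset the full dataset, for example with semi_join(posts, active_users, by = "stormfront_user").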
