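Data cleaning - can't get FindReplace to work as expected

Question (votes: 1, answers: 1)

I have a large data frame with a column containing thousands of different location (city) names, and I need to simplify/clean it.

After quite a bit of struggling with regular expressions and loops, I found the DataCombine package and its FindReplace function, which is meant to do what I want, but I can't get it to work.

So I have: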

   UserId          Location
1   USR_1             Paris
2   USR_2            London
3   USR_3           Londres
4   USR_4           Neuilly
5   USR_5            Berlin
6   USR_6    London Chelsea
7   USR_7 Berlin Schoenfeld
8   USR_8          Paris-20
9   USR_9           Neuilly
10 USR_10     Friedrischain

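The cleaning is just a replacement: for example, "London Chelsea" should become "London", "Brooklyn" should become "New York City", and "Paris 20e" and "Paris 14" should become "Paris". To go further, I would like every entry that matches the "Paris" pattern to be replaced by "Paris" (similar to "Paris%" in SQL).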
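For the prefix case, a minimal base R sketch (the vector `locations` is only illustrative): `startsWith()` plays roughly the role of SQL's `LIKE 'Paris%'`, and each matching entry is overwritten as a whole.

```r
# Illustrative vector; in the question this would be user_test$Location
locations <- c("Paris", "Paris-20", "Paris 14", "London Chelsea")

# TRUE where the entry begins with "Paris" (similar to LIKE 'Paris%' in SQL)
is_paris <- startsWith(locations, "Paris")

# Overwrite the whole entry, not just the matched part
locations[is_paris] <- "Paris"
locations
```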

# Data for testing
library(DataCombine)
# Use data.frame() with `=`: data_frame() returns a tibble, and `<-` inside the call does not name the columns
user_test <- data.frame(UserId = paste("USR", 1:10, sep = "_"),
                        Location = c("Paris", "London", "Londres", "Neuilly", " Berlin", "London Chelsea", "Berlin Schoenfeld", "Paris-20", "Neuilly", "Friedrischain"),
                        stringsAsFactors = FALSE)
should_be <- data.frame(is = c("Paris", "London", "Berlin", "Neuilly", "Friedr"),
                        should_be = c("Paris", "London", "Berlin", "Paris", "Berlin"),
                        stringsAsFactors = FALSE)

# Calling the function
FindReplace(data = user_test, Var = "Location", replaceData = should_be, from = "is", to = "should_be", exact = FALSE, vector = FALSE)

The function returns:

   UserId          Location
1   USR_1             Paris
2   USR_2            London
3   USR_3           Londres
4   USR_4             Paris
5   USR_5            Berlin
6   USR_6    London Chelsea
7   USR_7 Berlin Schoenfeld
8   USR_8          Paris-20
9   USR_9             Paris
10 USR_10     Berlinischain

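The cleaning is only partial: the matched string has been replaced, but not the whole entry.

Any ideas on how I could do this? A loop with grep? match? Or do I really have to build a clean data frame containing absolutely every entry I need?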
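One possible shape for the grep loop, as a hedged base R sketch rather than a DataCombine feature: the "Berlinischain" result is what a substring substitution such as `gsub()` produces, whereas `grepl()` only reports which entries contain the pattern, so the whole entry can then be overwritten. The helper `replace_whole_entry` and the names `locations` and `lookup` are hypothetical.

```r
locations <- c("Paris", "London", "Londres", "Neuilly", " Berlin",
               "London Chelsea", "Berlin Schoenfeld", "Paris-20",
               "Neuilly", "Friedrischain")
lookup <- data.frame(
  is = c("Paris", "London", "Berlin", "Neuilly", "Friedr"),
  should_be = c("Paris", "London", "Berlin", "Paris", "Berlin"),
  stringsAsFactors = FALSE
)

# Substring substitution keeps the rest of the entry:
gsub("Friedr", "Berlin", "Friedrischain", fixed = TRUE)  # "Berlinischain"

# Hypothetical helper: overwrite every entry that contains a pattern
replace_whole_entry <- function(x, patterns, replacements) {
  cleaned <- x
  for (i in seq_along(patterns)) {
    # Match against the original values so earlier replacements do not cascade
    hit <- grepl(patterns[i], x, fixed = TRUE)
    cleaned[hit] <- replacements[i]
  }
  cleaned
}

cleaned <- replace_whole_entry(locations, lookup$is, lookup$should_be)
cleaned
```

Under these assumptions "Londres" is left untouched, because it does not contain "London"; variants like that would still need their own row in the lookup table. When several patterns match one entry, the last matching row wins.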

r find-replace
1 Answer
0
votes

Use a join.

# Exact-match join on the lookup table, falling back to the original Location
library(tidyverse)

left_join(user_test, should_be, by = c("Location"="is")) %>% 
  mutate(final = coalesce(should_be, Location))

#> # A tibble: 10 x 4
#>    UserId Location          should_be final            
#>    <chr>  <chr>             <chr>     <chr>            
#>  1 USR_1  Paris             Paris     Paris            
#>  2 USR_2  London            London    London           
#>  3 USR_3  Londres           <NA>      Londres          
#>  4 USR_4  Neuilly           Paris     Paris            
#>  5 USR_5  " Berlin"         <NA>      " Berlin"        
#>  6 USR_6  London Chelsea    <NA>      London Chelsea   
#>  7 USR_7  Berlin Schoenfeld <NA>      Berlin Schoenfeld
#>  8 USR_8  Paris-20          <NA>      Paris-20         
#>  9 USR_9  Neuilly           Paris     Paris            
#> 10 USR_10 Friedrischain     <NA>      Friedrischain
Created on 2018-03-03 by the reprex package (v0.2.0).
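The `" Berlin"` row above stays unmatched only because of its leading space. A hedged base R sketch of the same exact-match lookup, assuming it is acceptable to trim surrounding whitespace first; `match()` stands in for the join here, and the vector names are illustrative:

```r
locations <- c("Paris", "London", "Londres", "Neuilly", " Berlin",
               "London Chelsea", "Berlin Schoenfeld", "Paris-20",
               "Neuilly", "Friedrischain")
from_values <- c("Paris", "London", "Berlin", "Neuilly", "Friedr")
to_values <- c("Paris", "London", "Berlin", "Paris", "Berlin")

# Assumption: leading/trailing spaces carry no meaning and can be dropped
key <- trimws(locations)

# Exact lookup: position of each key in from_values, NA when nothing matches
idx <- match(key, from_values)

# Use the mapped value when found, otherwise keep the trimmed original
final <- ifelse(is.na(idx), key, to_values[idx])
final
```

As with the join, entries such as "London Chelsea" or "Paris-20" are still left as they are, since only exact matches are mapped.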