This question may be related to this question.
Unfortunately, the solution given there does not work for my data.
I have the following example vector:
example<-c("ChildrenChildren", "Clothing and shoesClothing and shoes","Education, health and beautyEducation, health and beauty", "Leisure activities, travelingLeisure activities, traveling","LoansLoans","Loans and financial servicesLoans and financial services" ,"Personal transfersPersonal transfers" ,"Savings and investmentsSavings and investments","TransportationTransportation","Utility servicesUtility services")
I want the same strings without the duplication, i.e.:
> result
[1] "Children" "Clothing and shoes" "Education, health and beauty"
Is that possible?
You can use sub and capture the part you want directly in the pattern argument:
sub("(.+)\\1", "\\1", example)
#[1] "Children" "Clothing and shoes" "Education, health and beauty" "Leisure activities, traveling" "Loans"
#[6] "Loans and financial services" "Personal transfers" "Savings and investments" "Transportation" "Utility services"
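(.+) captures some pattern and \\1 refers back to whatever was just captured. So the pattern looks for "anything, twice in a row" and replaces it with that same "anything", just once.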
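The pattern above is unanchored, so it also rewrites a string that merely contains a doubled run somewhere inside it. As a minimal sketch, assuming you only want to touch strings that consist entirely of two identical halves, the pattern can be anchored with ^ and $. The vector x is a hypothetical input, not part of the original data:

```r
# Hypothetical input: one fully doubled string, one that only contains a doubled letter
x <- c("LoansLoans", "Bookkeeping")

# Unanchored: collapses the first doubled run found anywhere in the string
unanchored <- sub("(.+)\\1", "\\1", x)
print(unanchored)
# [1] "Loans"      "Bokkeeping"

# Anchored: only matches when the whole string is one piece repeated twice
anchored <- sub("^(.+)\\1$", "\\1", x)
print(anchored)
# [1] "Loans"       "Bookkeeping"
```

For the example vector in the question both variants give the same result. The anchored form is just more conservative on messier data.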
If every string is doubled, each one is exactly twice as long as it should be, so take the first half of each string:
> substr(example, 1, nchar(example)/2)
[1] "Children" "Clothing and shoes"
[3] "Education, health and beauty" "Leisure activities, traveling"
[5] "Loans" "Loans and financial services"
[7] "Personal transfers" "Savings and investments"
[9] "Transportation" "Utility services"
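Taking the first half assumes that every element really is doubled. A minimal sketch of a more defensive variant, which halves a string only when its two halves are identical and otherwise leaves it alone, could look like this. The helper name dedupe_half is hypothetical and not from the original answers:

```r
# Hypothetical helper: return the first half of a string only if the
# string is exactly that half repeated twice; otherwise return it unchanged.
dedupe_half <- function(x) {
  n <- nchar(x)
  half <- n %/% 2                      # integer division
  first <- substr(x, 1, half)
  second <- substr(x, half + 1, n)
  is_doubled <- n > 0 & n %% 2 == 0 & first == second
  ifelse(is_doubled, first, x)
}

dedupe_half(c("LoansLoans", "Loans", "TransportationTransportation"))
# [1] "Loans"          "Loans"          "Transportation"
```

The check costs one extra substr call per element, and it avoids silently truncating strings that were never duplicated.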
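We can try the following, which works here because each original string starts with its only capital letter (the match runs from the first lowercase letter to the last capital letter):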
stringr::str_remove_all(example,"[a-z].*[A-Z]")
Result:
[1] "Children" "Clothing and shoes" "Education, health and beauty"
[4] "Leisure activities, traveling" "Loans" "Loans and financial services"
[7] "Personal transfers" "Savings and investments" "Transportation"
[10] "Utility services"