I have abbreviated names of educational institutions. Here is a reproducible sample:
data <- structure(list(Affiliations = c("UNIV MELBOURNE", "UNIV NEWCASTLE",
"FORDHAM UNIV", "PRINCETON UNIV",
"CITY UNIV LONDON", "UNIV CONNECTICUT",
"EMORY UNIV", "NATL BUR ECON RES",
"NATL CHENGCHI UNIV", "OHIO STATE UNIV")),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
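I am trying to get the full names of the institutions in this list.
For example, "University of Melbourne" for "UNIV MELBOURNE", "City, University of London" for "CITY UNIV LONDON", and "National Chengchi University" for "NATL CHENGCHI UNIV".
Currently, I am using the `searcher` package to search for each string manually in the browser, and the `readline` function to record the full name.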
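library(searcher) # for the function search_startpage
data$new <- NA
for (i in seq_along(data$Affiliations)) {
  # open a browser search for the abbreviation
  search_startpage(data$Affiliations[i], rlang = FALSE)
  # type the full name found in the browser into the console
  data$new[i] <- readline()
}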
This is time-consuming because I have more than 1000 affiliations. Is there an efficient way to do this with rvest or any other package?
Use a dictionary that you create as follows. First, print the `unique` affiliations as a data.frame on the console and paste the result into your script.
data.frame(x=sprintf("'%s'", sort(unique(data$Affiliations))))
Fill in the `y` column once, then wrap everything in `read.table(header=TRUE, text="...")` to get the dictionary.
k <- read.table(header=TRUE, text="
                       x                                      y
1     'CITY UNIV LONDON'            'City University of London'
2           'EMORY UNIV'                     'Emory University'
3         'FORDHAM UNIV'                   'Fordham University'
4    'NATL BUR ECON RES' 'National Bureau of Economic Research'
5   'NATL CHENGCHI UNIV'         'National Chengchi University'
6      'OHIO STATE UNIV'                'Ohio State University'
7       'PRINCETON UNIV'                 'Princeton University'
8     'UNIV CONNECTICUT'            'University of Connecticut'
9       'UNIV MELBOURNE'              'University of Melbourne'
10      'UNIV NEWCASTLE'              'University of Newcastle'
")
Finally, use `match` and assign the matches to your data frame.
data$full <- k[match(data$Affiliations, k$x), 'y']
data
# # A tibble: 10 × 2
#    Affiliations       full
#    <chr>              <chr>
#  1 UNIV MELBOURNE     University of Melbourne
#  2 UNIV NEWCASTLE     University of Newcastle
#  3 FORDHAM UNIV       Fordham University
#  4 PRINCETON UNIV     Princeton University
#  5 CITY UNIV LONDON   City University of London
#  6 UNIV CONNECTICUT   University of Connecticut
#  7 EMORY UNIV         Emory University
#  8 NATL BUR ECON RES  National Bureau of Economic Research
#  9 NATL CHENGCHI UNIV National Chengchi University
# 10 OHIO STATE UNIV    Ohio State University
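With more than 1000 affiliations, it may help to see which abbreviations still have no dictionary entry. The following is only a minimal sketch of the same lookup done through a named vector; the names `lookup` and `missing_affils` and the extra "YALE UNIV" row are hypothetical and were added purely for illustration.

```r
# Toy data; "YALE UNIV" is a made-up row with no dictionary entry.
data <- data.frame(
  Affiliations = c("UNIV MELBOURNE", "EMORY UNIV", "YALE UNIV"),
  stringsAsFactors = FALSE
)

# Small dictionary in the same x/y layout as above.
k <- data.frame(
  x = c("EMORY UNIV", "UNIV MELBOURNE"),
  y = c("Emory University", "University of Melbourne"),
  stringsAsFactors = FALSE
)

# Named character vector: names are abbreviations, values are full names.
lookup <- setNames(k$y, k$x)

# Indexing by name yields NA for abbreviations missing from the dictionary.
data$full <- unname(lookup[data$Affiliations])

# Abbreviations that still need a dictionary entry.
missing_affils <- unique(data$Affiliations[is.na(data$full)])
print(missing_affils)
```

Anything listed in `missing_affils` can then be added to the dictionary text and the lookup rerun.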
@ronak-shah
I have got what I wanted.
Here is the code:
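library(rvest) # read_html, html_nodes, html_text, and the %>% pipe
# replace spaces with "+" so the string can be used in the query URL
data$Affiliations <- gsub(" ", "+", data$Affiliations)
data$New <- NA
for (i in seq_len(nrow(data))) {
  url <- paste0("https://www.google.com/search?q=", data$Affiliations[i])
  # collect the result titles from the search page
  x <- read_html(url) %>% html_nodes("h3") %>% html_text()
  print(x)
  # type the index of the title to keep
  data$New[i] <- x[as.numeric(readline())]
}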
I can then pick the appropriate name from the search results by typing its index.
[1] "Melbourne (City in Australia)"
[2] "Melbourne City"
[3] "University of Melbourne"
[4] "Edwise International - Study Abroad Consultants - Chennai"
[5] "University of Melbourne"
[6] "The University of Melbourne"
[7] "The University of Melbourne (Unimelb) - Ranking, Fees"
[8] "The University of Melbourne : Rankings, Fees & Courses Details"
[9] "University of Melbourne - Wikipedia"
[10] "Monash University - one of the top universities in Australia"
[11] "The University of Melbourne | Study Options"
3
[1] "Newcastle University: The things we do here make a difference out ..."
[2] "Newcastle University"
[3] "University of Newcastle (Public university in Callaghan, Australia)"
[4] "Postgraduate - Newcastle University"
[5] "The University of Newcastle, Australia"
[6] "International - The University of Newcastle, Australia"
[7] "Newcastle University, Exams: Rankings, Fees, Courses"
[8] "The University of Newcastle - Ranking, Courses, Fees, Entry criteria ..."
[9] "Newcastle University - Wikipedia"
[10] "Newcastle University : Rankings, Fees & Courses Details"
[11] "Newcastle University courses and application information - SI-UK"
[12] "Newcastle University | Apply Now for 2021 | INTO"
5
[1] "Fordham University"
[2] "COVID-19 Guidelines - Fordham University"
[3] "Fordham University School of Law"
[4] "Academics | Fordham"
[5] "Fordham University - Wikipedia"
[6] "Fordham University - Profile, Rankings and Data | US News Best"
[7] "Fordham University (Gabelli) - Best Business Schools - US News"
[8] "Fordham University: Rankings, Fees, Courses, Admission 2021 ..."
[9] "Fordham University - Niche"
[10] "Fordham University Athletics - Official Athletics Website"
1
[1] "Princeton University"
[2] "Princeton University"
[3] "Princeton"
[4] "Princeton University Graduate School"
[5] "Princeton University - Wikipedia"
[6] "Princeton University : Rankings, Fees & Courses Details | Top"
[7] "Princeton University - Profile, Rankings and Data | US News Best"
[8] "Princeton University: Rankings, Fees, Courses, Admission 2021 ..."
1
[1] "City, University of London"
[2] "City, University of London"
[3] "CITY, University of London: Rankings, Fees, Courses"
[4] "City, University of London - Wikipedia"
[5] "City, University of London | Apply Now for 2021 | INTO"
[6] "City, University of London"
1
[1] "University of Connecticut"
[2] "University of Connecticut"
[3] "University of Connecticut - Wikipedia"
[4] "University of Connecticut (UCONN) - Shiksha Study Abroad"
[5] "University of Connecticut (Uconn) - Profile, Rankings - U.S. News"
[6] "University of Connecticut : Rankings, Fees & Courses Details"
[7] "University of Connecticut - Niche"
[8] "University of Bridgeport: A Leading University in Connecticut"
[9] "University of Connecticut | LinkedIn"
[10] "Southern Connecticut State University"
[11] "Eastern Connecticut State University"
1
[1] "Home | Emory University | Atlanta GA"
[2] "Emory University"
[3] "Emory University (Medical school in Atlanta, Georgia)"
[4] "Emory University School of Law (Independent school in Atlanta, Georgia)"
[5] "Home | Emory University | Atlanta GA"
[6] "Emory School of Medicine - Emory University"
[7] "Degrees and Programs - Academics - Emory University"
[8] "Explore Emory | Emory University | Atlanta GA"
[9] "Emory University - Wikipedia"
[10] "Emory University - Profile, Rankings and Data | US News Best"
[11] "Emory Healthcare: Atlanta Hospitals, Clinics and Healthcare ..."
[12] "Emory University (EU) - Shiksha Study Abroad"
2
[1] "National Bureau of Economic Research | NBER"
[2] "National Bureau of Economic Research bulletin on aging and health"
[3] "PubMed Central, Figure 1: Natl Bur Econ Res Bull Aging Health. 2011"
[4] "PubMed Central, Figure II - NCBI"
[5] "Education and Health: Evaluating Theories and Evidence"
[6] "Home - The National Bureau of Asian Research (NBR)"
[7] "[XLS] Economics & Business"
[8] "PRIME PubMed | Natl Bur Econ Res Bull Aging Health journal ..."
[9] "Friedman and the Quantity Theory - Michael J. Gootzeit, 1980"
1
[1] "National Chengchi University"
[2] "Admission - National Chengchi University"
[3] "國立政治大學: NCCU"
[4] "National Chengchi University - Wikipedia"
[5] "National Chengchi University | World University Rankings | THE"
[6] "National Chengchi University : Rankings, Fees & Courses Details"
[7] "National Chengchi University - MastersPortal.com"
[8] "National Chengchi University in Taiwan - Masterstudies"
[9] "National Chengchi University in Taiwan - US News Best"
[10] "National Chengchi University | Ranking & Review - uniRank"
[11] "National Chengchi University | LinkedIn"
1
[1] "The Ohio State University"
[2] "The Ohio State University"
[3] "Ohio State Buckeyes football (Football team)"
[4] "Ohio State University - Wikipedia"
[5] "Ohio State Buckeyes | Ohio State University Athletics"
[6] "The Ohio State University (OSU) - Shiksha Study Abroad"
[7] "Ohio State University--Columbus - Profile, Rankings - U.S. News"
[8] "Welcome to Ohio University"
1
The final data frame is:
# A tibble: 10 x 2
   Affiliations       New
   <chr>              <chr>
 1 UNIV+MELBOURNE     University of Melbourne
 2 UNIV+NEWCASTLE     The University of Newcastle, Australia
 3 FORDHAM+UNIV       Fordham University
 4 PRINCETON+UNIV     Princeton University
 5 CITY+UNIV+LONDON   City, University of London
 6 UNIV+CONNECTICUT   University of Connecticut
 7 EMORY+UNIV         Emory University
 8 NATL+BUR+ECON+RES  National Bureau of Economic Research | NBER
 9 NATL+CHENGCHI+UNIV National Chengchi University
10 OHIO+STATE+UNIV    The Ohio State University
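As the final data frame shows, the loop overwrites `Affiliations` with the `+`-joined strings. One possible variation, sketched here under the assumption that the original column should stay intact, is to encode the query separately with `utils::URLencode`. The helper name `build_search_url` is hypothetical, and the sketch only builds the URLs without sending any request.

```r
# Hypothetical helper: build a search URL without modifying the input strings.
build_search_url <- function(query) {
  # Encode each element separately; reserved = TRUE also escapes
  # characters such as "&" that would otherwise break the query string.
  encoded <- vapply(query, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste0("https://www.google.com/search?q=", encoded)
}

affils <- c("UNIV MELBOURNE", "NATL BUR ECON RES")
urls <- build_search_url(affils)
print(urls)
```

Inside the loop, `build_search_url(data$Affiliations[i])` could replace the `paste0()` call, making the earlier `gsub()` step unnecessary.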