Trying to get the full names of educational institutions from a list of abbreviated names

Question   Votes: 0   Answers: 2

I have abbreviated names of educational institutions. Here is a reproducible sample:

data <- structure(list(Affiliations = c("UNIV MELBOURNE", "UNIV NEWCASTLE", 
                                        "FORDHAM UNIV", "PRINCETON UNIV", 
                                        "CITY UNIV LONDON", "UNIV CONNECTICUT", 
                                        "EMORY UNIV", "NATL BUR ECON RES", 
                                        "NATL CHENGCHI UNIV", "OHIO STATE UNIV")), 
                  row.names = c(NA, -10L), 
                  class = c("tbl_df", "tbl", "data.frame"))

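I am trying to get the full names of the institutions from this list.

For example, "University of Melbourne" for "UNIV MELBOURNE", "City, University of London" for "CITY UNIV LONDON", and "National Chengchi University" for "NATL CHENGCHI UNIV".

Currently, I am using the "searcher" package to search for each string manually in the browser, and the readline function to enter the full name.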

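library(searcher) # for the function search_startpage

data$new <- NA

for (i in seq_along(data$Affiliations)) {
  # open a browser search for the abbreviation; rlang = FALSE keeps the
  # search from being restricted to R-related results
  search_startpage(data$Affiliations[i], rlang = FALSE)
  # type the full name at the console
  data$new[i] <- readline()
}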

This is time-consuming because I have more than 1,000 affiliations. Is there an efficient way to do this with rvest or any other package?

r rvest
2 Answers
0
votes

Use a dictionary that you create as follows. First, print the unique affiliations to the console as a data.frame and paste the output into your script.

data.frame(x=sprintf("'%s'", sort(unique(data$Affiliations))))

Fill in the y column once, then wrap everything in read.table(header=TRUE, text="...") to get the dictionary.

k <- read.table(header=TRUE, text="
                      x                                       y
1    'CITY UNIV LONDON'             'City University of London'
2          'EMORY UNIV'                      'Emory University'
3        'FORDHAM UNIV'                    'Fordham University'
4   'NATL BUR ECON RES'  'National Bureau of Economic Research'
5  'NATL CHENGCHI UNIV'          'National Chengchi University'
6     'OHIO STATE UNIV'                 'Ohio State University'
7      'PRINCETON UNIV'                  'Princeton University'
8    'UNIV CONNECTICUT'             'University of Connecticut'
9      'UNIV MELBOURNE'               'University of Melbourne'
10     'UNIV NEWCASTLE'               'University of Newcastle'
")

Finally, use match and assign the matches to your data frame.

data$full <- k[match(data$Affiliations, k$x), 'y']
data
# # A tibble: 10 × 2
#   Affiliations       full                                
#   <chr>              <chr>                               
#  1 UNIV MELBOURNE     University of Melbourne              
#  2 UNIV NEWCASTLE     University of Newcastle             
#  3 FORDHAM UNIV       Fordham University                  
#  4 PRINCETON UNIV     Princeton University                
#  5 CITY UNIV LONDON   City University of London           
#  6 UNIV CONNECTICUT   University of Connecticut           
#  7 EMORY UNIV         Emory University                    
#  8 NATL BUR ECON RES  National Bureau of Economic Research
#  9 NATL CHENGCHI UNIV National Chengchi University        
# 10 OHIO STATE UNIV    Ohio State University  
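
With 1,000+ affiliations it may help to list the abbreviations that the dictionary does not cover yet. The following is only a minimal sketch of that check, using a named vector as the lookup; the toy data and the names lookup and missing_abbr are illustrative assumptions, not part of the answer above.

```r
# Toy data: "UNIV TOKYO" is deliberately absent from the dictionary
data <- data.frame(
  Affiliations = c("UNIV MELBOURNE", "EMORY UNIV", "UNIV TOKYO"),
  stringsAsFactors = FALSE
)
k <- data.frame(
  x = c("EMORY UNIV", "UNIV MELBOURNE"),
  y = c("Emory University", "University of Melbourne"),
  stringsAsFactors = FALSE
)

# Named vector: names are the abbreviations, values are the full names
lookup <- setNames(k$y, k$x)

# Indexing by name returns NA for abbreviations not in the dictionary
data$full <- unname(lookup[data$Affiliations])

# Abbreviations that still need a dictionary entry
missing_abbr <- unique(data$Affiliations[is.na(data$full)])
print(missing_abbr)
```

Anything reported in missing_abbr can then be added to the dictionary and the lookup rerun.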

0
votes

@ronak-shah

I've got what I wanted.

Here is the code:

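library(rvest) # read_html, html_nodes, html_text and the %>% pipe

# replace spaces with "+" so each name can be used in a query string
data$Affiliations <- gsub(" ", "+", data$Affiliations)

data$New <- NA

for (i in seq_len(nrow(data))) {
  url <- paste0("https://www.google.com/search?q=", data$Affiliations[i])
  # collect the result titles (h3 headings) from the search page
  x <- read_html(url) %>% html_nodes("h3") %>% html_text()
  print(x)
  # type the index of the title to keep
  data$New[i] <- x[as.numeric(readline())]
}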

I can pick the appropriate name from the search results.

 [1] "Melbourne (City in Australia)"                                 
 [2] "Melbourne City"                                                
 [3] "University of Melbourne"                                       
 [4] "Edwise International - Study Abroad Consultants - Chennai"     
 [5] "University of Melbourne"                                       
 [6] "The University of Melbourne"                                   
 [7] "The University of Melbourne (Unimelb) - Ranking, Fees"         
 [8] "The University of Melbourne : Rankings, Fees & Courses Details"
 [9] "University of Melbourne - Wikipedia"                           
[10] "Monash University - one of the top universities in Australia"  
[11] "The University of Melbourne | Study Options"                   
3
 [1] "Newcastle University: The things we do here make a difference out ..."   
 [2] "Newcastle University"                                                    
 [3] "University of Newcastle (Public university in Callaghan, Australia)"     
 [4] "Postgraduate - Newcastle University"                                     
 [5] "The University of Newcastle, Australia"                                  
 [6] "International - The University of Newcastle, Australia"                  
 [7] "Newcastle University, Exams: Rankings, Fees, Courses"                    
 [8] "The University of Newcastle - Ranking, Courses, Fees, Entry criteria ..."
 [9] "Newcastle University - Wikipedia"                                        
[10] "Newcastle University : Rankings, Fees & Courses Details"                 
[11] "Newcastle University courses and application information - SI-UK"        
[12] "Newcastle University | Apply Now for 2021 | INTO"                        
5
 [1] "Fordham University"                                             
 [2] "COVID-19 Guidelines - Fordham University"                       
 [3] "Fordham University School of Law"                               
 [4] "Academics | Fordham"                                            
 [5] "Fordham University - Wikipedia"                                 
 [6] "Fordham University - Profile, Rankings and Data | US News Best" 
 [7] "Fordham University (Gabelli) - Best Business Schools - US News" 
 [8] "Fordham University: Rankings, Fees, Courses, Admission 2021 ..."
 [9] "Fordham University - Niche"                                     
[10] "Fordham University Athletics - Official Athletics Website"      
1
[1] "Princeton University"                                             
[2] "Princeton University"                                             
[3] "Princeton"                                                        
[4] "Princeton University Graduate School"                             
[5] "Princeton University - Wikipedia"                                 
[6] "Princeton University : Rankings, Fees & Courses Details | Top"    
[7] "Princeton University - Profile, Rankings and Data | US News Best" 
[8] "Princeton University: Rankings, Fees, Courses, Admission 2021 ..."
1
[1] "City, University of London"                            
[2] "City, University of London"                            
[3] "CITY, University of London: Rankings, Fees, Courses"   
[4] "City, University of London - Wikipedia"                
[5] "City, University of London | Apply Now for 2021 | INTO"
[6] "City, University of London"                            
1
 [1] "University of Connecticut"                                        
 [2] "University of Connecticut"                                        
 [3] "University of Connecticut - Wikipedia"                            
 [4] "University of Connecticut (UCONN) - Shiksha Study Abroad"         
 [5] "University of Connecticut (Uconn) - Profile, Rankings - U.S. News"
 [6] "University of Connecticut : Rankings, Fees & Courses Details"     
 [7] "University of Connecticut - Niche"                                
 [8] "University of Bridgeport: A Leading University in Connecticut"    
 [9] "University of Connecticut | LinkedIn"                             
[10] "Southern Connecticut State University"                            
[11] "Eastern Connecticut State University"                             
1
 [1] "Home | Emory University | Atlanta GA"                                   
 [2] "Emory University"                                                       
 [3] "Emory University (Medical school in Atlanta, Georgia)"                  
 [4] "Emory University School of Law (Independent school in Atlanta, Georgia)"
 [5] "Home | Emory University | Atlanta GA"                                   
 [6] "Emory School of Medicine - Emory University"                            
 [7] "Degrees and Programs - Academics - Emory University"                    
 [8] "Explore Emory | Emory University | Atlanta GA"                          
 [9] "Emory University - Wikipedia"                                           
[10] "Emory University - Profile, Rankings and Data | US News Best"           
[11] "Emory Healthcare: Atlanta Hospitals, Clinics and Healthcare ..."        
[12] "Emory University (EU) - Shiksha Study Abroad"                           
2
[1] "National Bureau of Economic Research | NBER"                        
[2] "National Bureau of Economic Research bulletin on aging and health"  
[3] "PubMed Central, Figure 1: Natl Bur Econ Res Bull Aging Health. 2011"
[4] "PubMed Central, Figure II - NCBI"                                   
[5] "Education and Health: Evaluating Theories and Evidence"             
[6] "Home - The National Bureau of Asian Research (NBR)"                 
[7] "[XLS] Economics & Business"                                         
[8] "PRIME PubMed | Natl Bur Econ Res Bull Aging Health journal ..."     
[9] "Friedman and the Quantity Theory - Michael J. Gootzeit, 1980"       
1
 [1] "National Chengchi University"                                   
 [2] "Admission - National Chengchi University"                       
 [3] "國立政治大學: NCCU"                                             
 [4] "National Chengchi University - Wikipedia"                       
 [5] "National Chengchi University | World University Rankings | THE" 
 [6] "National Chengchi University : Rankings, Fees & Courses Details"
 [7] "National Chengchi University - MastersPortal.com"               
 [8] "National Chengchi University in Taiwan - Masterstudies"         
 [9] "National Chengchi University in Taiwan - US News Best"          
[10] "National Chengchi University | Ranking & Review - uniRank"      
[11] "National Chengchi University | LinkedIn"                        
1
[1] "The Ohio State University"                                      
[2] "The Ohio State University"                                      
[3] "Ohio State Buckeyes football (Football team)"                   
[4] "Ohio State University - Wikipedia"                              
[5] "Ohio State Buckeyes | Ohio State University Athletics"          
[6] "The Ohio State University (OSU) - Shiksha Study Abroad"         
[7] "Ohio State University--Columbus - Profile, Rankings - U.S. News"
[8] "Welcome to Ohio University"                                     
1
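
Because each choice is typed by hand, an empty or out-of-range entry would index the titles with something unexpected. Below is a rough sketch of a guard for that case; pick_title is a hypothetical helper, not part of the code above, and it simply returns NA when the typed value is not a valid index.

```r
# Hypothetical helper: return the chosen title, or NA if the input is not a
# whole number between 1 and length(titles)
pick_title <- function(titles, input) {
  idx <- suppressWarnings(as.integer(input))
  if (length(idx) != 1 || is.na(idx) || idx < 1 || idx > length(titles)) {
    return(NA_character_)
  }
  titles[idx]
}

titles <- c("Melbourne City", "University of Melbourne")
chosen  <- pick_title(titles, "2")   # valid index
skipped <- pick_title(titles, "")    # empty input, e.g. Enter pressed by mistake
```

Inside the loop, the assignment could then read data$New[i] <- pick_title(x, readline()), leaving NA rows to be revisited later.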

The final data frame is:

# A tibble: 10 x 2
   Affiliations       New                                        
   <chr>              <chr>                                      
 1 UNIV+MELBOURNE     University of Melbourne                    
 2 UNIV+NEWCASTLE     The University of Newcastle, Australia     
 3 FORDHAM+UNIV       Fordham University                         
 4 PRINCETON+UNIV     Princeton University                       
 5 CITY+UNIV+LONDON   City, University of London                 
 6 UNIV+CONNECTICUT   University of Connecticut                  
 7 EMORY+UNIV         Emory University                           
 8 NATL+BUR+ECON+RES  National Bureau of Economic Research | NBER
 9 NATL+CHENGCHI+UNIV National Chengchi University               
10 OHIO+STATE+UNIV    The Ohio State University               
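
Note that the Affiliations column now contains "+" instead of spaces, because the gsub call overwrote it. If the original strings should be kept, one option is to encode the query in a separate vector. This is a small sketch under that assumption, using utils::URLencode; the variable names queries and urls are illustrative.

```r
affiliations <- c("UNIV MELBOURNE", "CITY UNIV LONDON")

# Percent-encode each name for use in a query string; the source vector
# itself is left unchanged
queries <- vapply(affiliations, utils::URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)

urls <- paste0("https://www.google.com/search?q=", queries)
print(urls)
```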