Web scraping Glassdoor with R

Question · Votes: 0 · Answers: 2

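I have been tasked with collecting Glassdoor reviews for several hospitals, but I am struggling to extract the pros, cons, advice to management, recommendation, CEO approval, business outlook, and the small ratings drop-down. I have been able to extract everything else with the code below. Any help would be greatly appreciated.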

library(rvest)
library(tidyverse)
library(stringr)

url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)

# Extract review titles

review_titles <- page %>%
  html_nodes(".reviewLink") %>%
  html_text()

# Extract review dates

review_dates <- page %>%
  html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
  html_text()

# Extract pros (a class selector needs a leading dot)
review_pros <- page %>%
  html_nodes(".v2__EIReviewDetailsV2__fullWidth") %>%
  html_text()
print(review_pros)

# Extract review ratings

review_ratings <- page %>%
  html_nodes(".ratingNumber.mr-xsm") %>%
  html_text() %>%
  str_extract("\\d+") %>%
  as.integer()

# Extract review recommendations

recommendations <- page %>%
  html_nodes("html body.main.loggedIn.lang-en.en-US.gdGrid._initOk div#Container div.container-max-width.mx-auto.px-0.px-lg-lg.py-lg-xxl div.d-flex.row.css-zwxlu7.e1af7d9i0 main.col-12.mb-lg-0.mb-md.css-yaeagj.ej1dgw00 div#ReviewsRef div#ReviewsFeed ol.empReviews.emp-reviews-feed.pl-0 li#empReview_76309432.noBorder.empReview.cf.pb-0.mb-0 div.p-0.mb-0.mb-md-std.css-w5wad1.gd-ui-module.css-rntt2a.ec4dwm00 div.gdReview div.mt-xxsm div.mx-0 div.px-std div div.d-flex.my-std.reviewBodyCell.recommends.css-1y3jl3a.e1868oi10") %>%
  html_text()

# Convert recommendations to numeric values

recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
                                  ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))

# Create data frame

reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)

# View data frame

reviews
r web screen-scraping rvest
2 Answers
0
votes

I was able to get the pros and cons like this:

library(tidyverse)
library(rvest)

data <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?%20%20%20%20%20%20%20%20%20%20%20%20sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng" %>% 
  read_html() %>% 
  html_elements(".empReview")

tibble(
  title = data %>% 
    html_element(".reviewLink") %>% 
    html_text2(), 
  date = data %>%  
    html_element(".middle.common__EiReviewDetailsStyle__newGrey") %>% 
    html_text2(), 
  pros = data %>% 
    html_element(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>% 
    html_text2(), 
  cons = data %>%  
    html_element(".v2__EIReviewDetailsV2__fullWidth+ .v2__EIReviewDetailsV2__fullWidth span") %>% 
    html_text2() %>% 
    str_trim()
) %>% 
  separate(col = date, into = c("date", "position"), sep = " - ")

# A tibble: 10 × 5
   title                             date         position                         pros       cons 
   <chr>                             <chr>        <chr>                            <chr>      <chr>
 1 Great place to work               Mar 3, 2023  Manager                          Excellent… "Do …
 2 Don't bother                      May 10, 2023 Practice Manager                 Being loc… "Hor…
 3 Skeleton staffing                 Nov 16, 2022 Registered Nurse, Emergency Room Co-worker… "No …
 4 Nyack hospital                    Mar 7, 2023  Patient Care Associate (PCA)     The food … "No …
 5 Its ok                            Mar 22, 2023 Registered Nurse, BSN            one weeke… "sho…
 6 pca                               Jan 18, 2023 Patient Care Assistant (PCA)     good pay … "non…
 7 Just for starters                 Feb 3, 2023  Registered Nurse, Critical Care  Coworkers… "No …
 8 PCA                               Oct 22, 2022 Emergency Care Assistant         there sta… "the…
 9 Great way to support the Hospital Sep 16, 2022 Donor Relations Manager          Most ever… "Lon…
10 great place to work               Sep 5, 2022  Registered Nurse                 lots of o… "lim…
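
The reason this works where the question's approach did not can be shown on a small example. The sketch below uses invented markup (the class names `pros` and `cons` and the review text are placeholders, not Glassdoor's real classes). It selects each review container first with `html_elements()` and then calls `html_element()` once per container, so a review that lacks a field yields `NA` rather than shifting the column.

```r
library(rvest)
library(tibble)

# Hypothetical markup: two reviews, the second one has no "cons" element.
# The class names "pros" and "cons" are placeholders for illustration only.
html <- minimal_html('
  <ol>
    <li class="empReview">
      <a class="reviewLink">Great place to work</a>
      <span class="pros">Good pay</span>
      <span class="cons">Long hours</span>
    </li>
    <li class="empReview">
      <a class="reviewLink">Its ok</a>
      <span class="pros">Nice coworkers</span>
    </li>
  </ol>')

# One node per review
reviews <- html %>% html_elements(".empReview")

# html_element() returns exactly one result per review (missing where absent),
# so every column has the same length as `reviews`
demo <- tibble(
  title = reviews %>% html_element(".reviewLink") %>% html_text2(),
  pros  = reviews %>% html_element(".pros") %>% html_text2(),
  cons  = reviews %>% html_element(".cons") %>% html_text2()
)
demo
```

By contrast, calling `html_elements(".cons")` on the whole page would return only one value here, and combining it with two titles in `data.frame()` would fail because the lengths differ.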

0
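votes

The data you are looking for is stored in a script on the page. This answer is based on a similar question: "Web scraping data that is not displayed on the webpage using rvest".

It took a while of searching and trial and error to get this right. Inside the script there is a section that starts with "reviews": and ends with }]}. In this case it is the second occurrence of "reviews" that holds the data. The idea is to extract that section and convert it from JSON.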

library(stringr) 
library(xml2)
library(rvest) 
library(dplyr)

url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)


#the ratings are stored in a data structure in a script
#find all the scripts and then search
scripts<-page %>% html_elements(xpath='//script')

#search the scripts for the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))

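# Extract the review text from the script. The second "reviews" section holds the data and is almost valid JSON.
reviews <- scripts[ratingsScript] %>% 
   html_text2() %>% 
   str_extract("\"reviews\":.+?\\}\\]\\}") %>%  # match starting at the first "reviews":
   substring(10) %>%                             # drop the leading "reviews" so the next search finds the second occurrence
   str_extract("\"reviews\":.+?\\}\\]\\}")
nchar(reviews)  # debugging: length of the extracted text

# Add a leading { to make valid JSON and convert
answer <- jsonlite::fromJSON(paste("{", reviews))
# fromJSON returns a list; the reviews element is the data frame, and column names must be quoted
answer$reviews[, c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]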

There is a lot of potentially useful information in the resulting data frame: employment status, review text, reviewer IDs, star ratings, and so on.
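
To try the extract-and-parse step without hitting the site, here is a minimal offline sketch. The `script_text` string is a made-up stand-in for the script contents: the field names `ratingCeo` and `ratingBusinessOutlook` come from the code above, while `reviewId` and all values are invented placeholders. Unlike the real page, this stand-in has only one "reviews" section, so a single `str_extract()` is enough.

```r
library(stringr)
library(jsonlite)

# Hypothetical stand-in for the text of the script element.
# All values (and the reviewId field) are invented for illustration.
script_text <- paste0(
  '{"employer":{"id":1},',
  '"reviews":[{"reviewId":1,"ratingCeo":"example_a","ratingBusinessOutlook":"example_b"},',
  '{"reviewId":2,"ratingCeo":"example_c","ratingBusinessOutlook":"example_d"}]},',
  '"other":{"x":1}}'
)

# Lazily match from "reviews": up to the first }]} that closes the array
fragment <- str_extract(script_text, "\"reviews\":.+?\\}\\]\\}")

# The fragment lacks its opening brace; prepend { to make it valid JSON
parsed <- fromJSON(paste0("{", fragment))

# parsed is a list; its reviews element is a data frame with one row per review
parsed$reviews[, c("ratingCeo", "ratingBusinessOutlook")]
```

The lazy quantifier `.+?` matters here: a greedy `.+` would run on to the last `}]}` in the script and could swallow unrelated sections.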
