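I have been tasked with scraping Glassdoor reviews for different hospitals, but I am having trouble extracting the pros, cons, advice to management, recommendation, CEO approval, business outlook, and the small rating drop-down (the sub-ratings). I have been able to extract the rest with the code below. Any help would be greatly appreciated.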
library(rvest)
library(tidyverse)
library(stringr)
url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)
# Extract review titles
review_titles <- page %>%
html_nodes(".reviewLink") %>%
html_text()
# Extract review dates
review_dates <- page %>%
html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text()
#Extract Pros
review_pros <- page %>%
html_nodes("v2__EIReviewDetailsV2__fullWidth ") %>%
html_text()
print(review_pros)
# Extract review ratings
review_ratings <- page %>%
html_nodes(".ratingNumber.mr-xsm") %>%
html_text() %>%
str_extract("\\d+") %>%
as.integer()
# Extract review recommendations
recommendations <- page %>%
html_nodes("html body.main.loggedIn.lang-en.en-US.gdGrid._initOk div#Container div.container-max-width.mx-auto.px-0.px-lg-lg.py-lg-xxl div.d-flex.row.css-zwxlu7.e1af7d9i0 main.col-12.mb-lg-0.mb-md.css-yaeagj.ej1dgw00 div#ReviewsRef div#ReviewsFeed ol.empReviews.emp-reviews-feed.pl-0 li#empReview_76309432.noBorder.empReview.cf.pb-0.mb-0 div.p-0.mb-0.mb-md-std.css-w5wad1.gd-ui-module.css-rntt2a.ec4dwm00 div.gdReview div.mt-xxsm div.mx-0 div.px-std div div.d-flex.my-std.reviewBodyCell.recommends.css-1y3jl3a.e1868oi10") %>%
html_text()
# Convert recommendations to numeric values
recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))
# Create data frame
reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)
# View data frame
reviews
I was able to get the pros and cons like this:
library(tidyverse)
library(rvest)
data <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?%20%20%20%20%20%20%20%20%20%20%20%20sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng" %>%
read_html() %>%
html_elements(".empReview")
tibble(
title = data %>%
html_element(".reviewLink") %>%
html_text2(),
date = data %>%
html_element(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text2(),
pros = data %>%
html_element(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
html_text2(),
cons = data %>%
html_element(".v2__EIReviewDetailsV2__fullWidth+ .v2__EIReviewDetailsV2__fullWidth span") %>%
html_text2() %>%
str_trim()
) %>%
separate(col = date, into = c("date", "position"), sep = " - ")
# A tibble: 10 × 5
   title                             date         position                         pros       cons
   <chr>                             <chr>        <chr>                            <chr>      <chr>
 1 Great place to work               Mar 3, 2023  Manager                          Excellent… "Do …
 2 Don't bother                      May 10, 2023 Practice Manager                 Being loc… "Hor…
 3 Skeleton staffing                 Nov 16, 2022 Registered Nurse, Emergency Room Co-worker… "No …
 4 Nyack hospital                    Mar 7, 2023  Patient Care Associate (PCA)     The food … "No …
 5 Its ok                            Mar 22, 2023 Registered Nurse, BSN            one weeke… "sho…
 6 pca                               Jan 18, 2023 Patient Care Assistant (PCA)     good pay … "non…
 7 Just for starters                 Feb 3, 2023  Registered Nurse, Critical Care  Coworkers… "No …
 8 PCA                               Oct 22, 2022 Emergency Care Assistant         there sta… "the…
 9 Great way to support the Hospital Sep 16, 2022 Donor Relations Manager          Most ever… "Lon…
10 great place to work               Sep 5, 2022  Registered Nurse                 lots of o… "lim…
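The `separate()` call at the end splits the scraped date string on " - " into a date and a job title. Here is a minimal offline sketch of that step, using made-up strings in the same shape as the scraped values (the `raw` tibble is hypothetical). The `extra = "merge"` argument is an addition of mine, not part of the code above; the assumption is that a job title may itself contain " - " and should stay in one piece.

```r
library(tibble)
library(tidyr)

# Hypothetical strings shaped like the scraped "date - position" text
raw <- tibble(date = c("Mar 3, 2023 - Manager",
                       "Nov 16, 2022 - Registered Nurse - Emergency Room"))

# Split on the first " - " only; extra = "merge" keeps the remainder together
split_dates <- separate(raw, col = date, into = c("date", "position"),
                        sep = " - ", extra = "merge")
split_dates
```

Without `extra = "merge"`, `separate()` drops the extra pieces and emits a warning, which may be acceptable if the titles on the page never contain that separator.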
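The data you are looking for is stored in a script. This answer is based on a similar question: "Web scraping data that is not displayed on the web page using rvest".
It took a while of searching and trial and error to get this right. In the script there is a section that starts with "reviews": and ends with }]}. In this case it comes after the second occurrence of "reviews". The task is to extract that section and convert it from JSON.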
library(stringr)
library(xml2)
library(rvest)
library(dplyr)
url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)
# The ratings are stored in a data structure inside a <script> element.
# Find all the scripts, then search them.
scripts <- page %>% html_elements(xpath = '//script')
# Find the script that contains the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
# Extract the text of the reviews from that script. The second "reviews"
# section is the one we want; it is almost valid JSON.
reviews <- scripts[ratingsScript] %>% html_text2() %>%
  str_extract("\"reviews\":.+?\\}\\]\\}") %>% substring(10) %>% str_extract("\"reviews\":.+?\\}\\]\\}")
nchar(reviews) # debugging check
# Add a leading { to make valid JSON and convert
answer <- jsonlite::fromJSON(paste("{", reviews))
# The parsed object holds the data frame in its reviews element;
# column names must be quoted when used as an index
answer$reviews[, c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]
There is a lot of potentially useful information in the parsed answer: employment status, review text, reviewer IDs, star ratings, etc.
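To see the extraction step in isolation, here is a small offline sketch that applies the same two-pass `str_extract()` to a made-up string. The `script_text` value and the field values ("APPROVE", "POSITIVE", and so on) are hypothetical stand-ins, not what Glassdoor actually returns; only the field names `ratingCeo` and `ratingRecommendToFriend` come from the code above.

```r
library(stringr)
library(jsonlite)

# Hypothetical stand-in for the script text: "reviews" appears twice,
# and the second occurrence holds the array we want
script_text <- paste0(
  '{"page":{"reviews":{"count":2}},',
  '"data":{"reviews":[',
  '{"reviewId":1,"ratingCeo":"APPROVE","ratingRecommendToFriend":"POSITIVE"},',
  '{"reviewId":2,"ratingCeo":"NO_OPINION","ratingRecommendToFriend":"NEGATIVE"}',
  ']}}'
)

pattern <- "\"reviews\":.+?\\}\\]\\}"

# First pass matches from the first "reviews": up to the first }]}.
# substring(10) cuts off the leading "reviews" so the second pass
# starts matching at the second occurrence.
reviews_json <- script_text %>%
  str_extract(pattern) %>%
  substring(10) %>%
  str_extract(pattern)

# Prepend { to close the object, then parse
parsed <- fromJSON(paste("{", reviews_json))
parsed$reviews[, c("ratingRecommendToFriend", "ratingCeo")]
```

The array of objects is parsed into a data frame stored under `parsed$reviews`, which is why the columns are selected from that element and not from the top-level list.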