Scraping specific details/columns from text data (rvest)


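I am relatively new to web scraping, and I am interested in scraping text data from an online social forum. I can scrape the post text successfully, but I cannot organize it or extract specific details from it.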

Currently, my code is as follows:

```{r}
library(tidyverse)
library(rvest)
```

# Scrape posts 
```{r}
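pages <- 1:32

# One list element per page; each element is a character vector of post texts
hardwarezone_list <- list()

base_url <- "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/"

for (i in seq_along(pages)) {
  hardwarezone_link <- paste0(base_url, "page-", pages[i])
  hardwarezone_page <- read_html(hardwarezone_link)
  hardwarezone_list[[i]] <- hardwarezone_page %>%
    html_nodes(".bbWrapper") %>%
    html_text()
}

# rbind() stacks the per-page vectors as rows, so each row is a page and
# each column (V1, V2, ...) is the n-th post on that page
hardwarezone_table <- do.call(rbind, hardwarezone_list)
hardwarezone_table <- as.data.frame(hardwarezone_table)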
```
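
Because `do.call(rbind, ...)` places each page in its own row, the posts end up spread across columns rather than one per row. A minimal sketch of one way to get one row per post instead is shown below. It uses a made-up stand-in list (`posts_by_page`, a hypothetical name) in place of `hardwarezone_list`, so it runs without touching the forum; the column names `page` and `post` are likewise only illustrative.

```r
library(tibble)

# Stand-in for hardwarezone_list: one character vector of posts per page.
# These strings are invented purely for illustration.
posts_by_page <- list(
  c("first post on page 1", "second post on page 1"),
  c("first post on page 2")
)

# Repeat each page index once per post on that page, then flatten the posts,
# giving a long table with one row per post.
posts_long <- tibble(
  page = rep(seq_along(posts_by_page), lengths(posts_by_page)),
  post = unlist(posts_by_page)
)

print(posts_long)
```

This also sidesteps the recycling that `rbind()` applies when pages contain different numbers of posts.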

# Print data example
```{r}
dput(hardwarezone_table[1:2,c(1,2)])
```

Output:

structure(list(V1 = c(" https://www.channelnewsasia.com/ne...bs-restaurant-association-13441340?cid=FBcna \n\"You can see that F&B jobs are really not on top of the minds of Singaporeans even when there's high unemployment,\" says a business owner.", 
"I guesss majority prefer to either send food or eat food .. not prepare the food. Haha"
), V2 = c("Recession and retrenchment only happen in EDMW ", 
"\n\t\n\t\t\n\t\t\t\n\t\t\t\ttokong said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\tno thanks, those people whose pop and mom are hawkers or have been hawkers will know. \nour parents will discourage us to become hawkers. better study hard and get a job.\nf and b jobs generate no values to your cv unless it is the end of the road for you.\nf and b pay is very jialat also. if the salary cannot feed your own family, why take the job?\nthose young punks who go into f and b either has the passion or enjoys the freedom of being not an employee\n\t\t\n\t\tClick to expand...\n\t\n\nyou will be shocked how much hawkers earn. even just those drink stall make kopi, teh kind and get soft drinks, ice from supplier and sell. don't mention bubble tea that one is considered quite artisanal.\nf&b has many positions, les amis executive chef also f&b, waitress also f&b, george quek also f&b. the value of CV is dependent on how a person wanna craft his career path, and not the industry."
)), row.names = 1:2, class = "data.frame")

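Ideally, however, I would like to scrape data in which each row/observation contains the following information, rather than collecting only the post text as my code above does.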

username        post                     date         user status
tegridy_farm    why is that the case.    3/10/2022    banned
Mackey          why                      3/10/2022    Senior member
eric cartman    kyle is bad              3/10/2022    banned
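
One possible way to get a table of this shape is to select each post's container first and then pull one value per field from inside it, so that username, text, date, and status stay aligned row by row. The sketch below runs on a small inline HTML sample rather than the live forum. The class names `.message`, `.username`, `.userTitle`, and `.u-dt` are assumptions loosely modelled on XenForo-style markup and would need to be checked against the real page in the browser inspector; only `.bbWrapper` comes from the code above.

```r
library(rvest)
library(tibble)

# Hypothetical markup for two posts; the real page's structure and class
# names may differ (only .bbWrapper is taken from the original code).
sample_html <- '
<article class="message">
  <a class="username">tegridy_farm</a>
  <h5 class="userTitle">banned</h5>
  <time class="u-dt">3/10/2022</time>
  <div class="bbWrapper">why is that the case.</div>
</article>
<article class="message">
  <a class="username">Mackey</a>
  <h5 class="userTitle">Senior member</h5>
  <time class="u-dt">3/10/2022</time>
  <div class="bbWrapper">why</div>
</article>'

page <- minimal_html(sample_html)

# One node per post. html_element() (singular) then returns exactly one
# result per post, so the columns below always have matching lengths.
posts <- html_elements(page, ".message")

posts_tbl <- tibble(
  username    = posts %>% html_element(".username")  %>% html_text2(),
  post        = posts %>% html_element(".bbWrapper") %>% html_text2(),
  date        = posts %>% html_element(".u-dt")      %>% html_text2(),
  user_status = posts %>% html_element(".userTitle") %>% html_text2()
)

print(posts_tbl)
```

With the real thread, `page` would come from `read_html(hardwarezone_link)` inside the loop, and the per-page tibbles could be combined with `dplyr::bind_rows()`. Note that quoted replies sit inside `.bbWrapper`, so their text would still appear in `post` unless the quote nodes are removed first.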
r web-scraping dplyr rvest