Capturing multiple tags when scraping Reddit in R with RSelenium

Question (votes: 0, answers: 2)

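I am writing code to scrape post titles, comments, and author names from Reddit posts for a project. I can scrape the post title and the author names, but the comments are not extracted correctly.

If the post has 31 comments, every comment is extracted 31 times. The code is below for reference: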

# load packages
library(RSelenium)
library(netstat)

# start the server
rs_driver_object <- rsDriver(browser = 'firefox',verbose = FALSE, port = free_port(), chromever = NULL)

# create a client object
remDr <- rs_driver_object$client

# open a browser
remDr$open()
# maximize window
remDr$maxWindowSize()

remDr$navigate("https://www.reddit.com/r/AnimeReviews/comments/essf1u/assassination_classroom_is_a_1010_the_charm_the/")

Sys.sleep(2)

# scroll to the end of the webpage
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(2)
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")

load_more_comments <- remDr$findElement(using = 'xpath', '//*[@id="comment-tree"]/faceplate-partial/div[1]/button')
load_more_comments$clickElement()
#load_more_comments$refresh()

#pickup title
title <- remDr$findElement(using = 'xpath', '//*[@id="main-content"]/shreddit-title')$getElementAttribute('title')

#comments
comment_list <- remDr$findElements(using = 'tag name', 'shreddit-comment')
#print(typeof(comment_list))

for (each_comment in comment_list) {
  print(paste("Author --->", each_comment$getElementAttribute('author')))
  
  p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")

  # Extract and print the text from each <p> tag
  for (p_tag in p_tags) {
    print(p_tag$getElementText())
  }

}

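Please refer to the screenshot below:

I don't know why each comment isn't extracted just once. There seems to be a problem with how the following line works:

p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")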

As the code above shows, I am using RSelenium in R for web scraping. I am trying to scrape Reddit comments, but each one appears multiple times instead of once.

r selenium-webdriver web-scraping reddit rselenium
2 Answers

0 votes

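findElements searches the entire HTML document, even when it is called on a single element, so every comment's <p> tags are returned for each comment. You need to use findChildElements, which searches only beneath the element it is called on. This should work (replace your last loop):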

lapply(comment_list, \(c) {
  author <- unlist(c$getElementAttribute('author'))
  comment <- unlist(lapply(c$findChildElements(using = "xpath", value = ".//div[3]/div/p"), \(p) {
    p$getElementText()
  }))
  
  list(author = author, comment = comment)
})

#> [[1]]$author
#> [1] "dotti1999"
#> 
#> [[1]]$comment
#> [1] "this shit was fucking insane"                                                                  
#> [2] "honestly I adored this anime when I first watched it..."
#> ...
#> [[3]]$author
#> [1] "[deleted]"
#> 
#> [[3]]$comment
#> [1] "Give me a be If premise and I will give it a watch"
#>  ...
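
The nested list printed above can be awkward to work with later. As a minimal sketch, assuming every entry has exactly one author and zero or more comment paragraphs, it could be flattened into a data frame with one row per comment. The sample data and the helper name comments_to_df() are hypothetical and are not part of RSelenium or the original answer.

```r
# Hypothetical sample mimicking the structure returned by the lapply() call:
# one list per comment, with a single author and one string per <p> tag.
results <- list(
  list(author = "user_a", comment = c("First paragraph.", "Second paragraph.")),
  list(author = "user_b", comment = "Only paragraph."),
  list(author = "user_c", comment = NULL)  # comment with no matching <p> tags
)

# comments_to_df() is an illustrative helper, not part of RSelenium.
# It assumes x$author is always a single string.
comments_to_df <- function(results) {
  rows <- lapply(results, function(x) {
    data.frame(
      author = x$author,
      # Join the paragraphs of one comment into a single string;
      # an empty comment becomes "".
      comment = paste(x$comment, collapse = "\n"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

comments_df <- comments_to_df(results)
print(comments_df)
```

With real data, `results` would be the value returned by the lapply() call over comment_list. Entries whose author attribute is missing would need to be handled before calling the helper.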

Note that this still does not seem to capture the replies to comments.


0 votes

PBulls's answer works. Below is an alternative, although it requires an additional cleaning step.

for (each_comment in comment_list) {

  author <- each_comment$getElementAttribute('author')
  
  # Get the HTML content of the comment
  comment_html <- each_comment$getElementAttribute("innerHTML")
  
  # Extract comment text using regex
  comment_text <- gsub("<.*?>", "", comment_html)
  comment_text <- gsub("\n", "", comment_text)
  # Print author and comment text
  print(paste("Author --->", author))
  print(comment_text)

}
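
The additional cleaning step mentioned above could look something like the following sketch. It runs on a made-up HTML string rather than a live page, and the helper name strip_html() is hypothetical. It assumes that replacing tags with spaces, decoding the `&amp;` entity, and collapsing whitespace is enough for the text at hand; other entities would need their own handling.

```r
# Hypothetical innerHTML string standing in for the value of
# each_comment$getElementAttribute("innerHTML")
comment_html <- "<div>\n  <p>Great   review!</p>\n  <p>I agree &amp; will rewatch.</p>\n</div>"

# strip_html() is an illustrative helper, not part of RSelenium.
strip_html <- function(html) {
  text <- gsub("<[^>]*>", " ", html)              # replace each tag with a space
  text <- gsub("&amp;", "&", text, fixed = TRUE)  # decode one common entity
  text <- gsub("[[:space:]]+", " ", text)         # collapse runs of whitespace
  trimws(text)                                    # drop leading/trailing spaces
}

cleaned <- strip_html(comment_html)
print(cleaned)
```

Replacing tags with a space instead of an empty string keeps adjacent paragraphs from running together. Because innerHTML holds all markup inside the element, text from anything else nested there comes along too, so this approach may need further filtering on a real page.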
