Does anyone know whether I can scrape this site or this one with httr and rvest, or whether I should use Selenium or PhantomJS instead?
Both sites seem to load their listings via AJAX, and I can't seem to get past it.
Basically, what I'm after is the following:
library(rvest)
library(magrittr)

# I want this to return the titles of the listings, but I get character(0)
"https://www.sahibinden.com/satilik" %>%
  read_html() %>%
  html_nodes(".searchResultsItem .classifiedTitle") %>%
  html_text()
# I want this to return the prices of the listings, but I get 503
"https://www.hurriyetemlak.com/konut" %>%
  read_html() %>%
  html_nodes(".listing-item .list-view-price") %>%
  html_text()
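For what it's worth, the selector pipeline itself is sound; against static HTML the same chain extracts text as expected, so the empty result points at the AJAX rendering rather than the selectors. A minimal offline sketch (the HTML fragment below is invented to mimic the page's markup):

```r
library(rvest)
library(magrittr)

# Invented fragment mimicking the listing markup, to show the selector
# chain works once real, fully rendered HTML is available.
html <- '<table>
  <tr class="searchResultsItem"><td><a class="classifiedTitle">Flat A</a></td></tr>
  <tr class="searchResultsItem"><td><a class="classifiedTitle">Flat B</a></td></tr>
</table>'

titles <- html %>%
  read_html() %>%
  html_nodes(".searchResultsItem .classifiedTitle") %>%
  html_text()

print(titles)  # "Flat A" "Flat B"
```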
Any ideas using V8, or simulating a session, are welcome.
Also, any plain curl solution is welcome; I'll try to translate it into httr later :)
Thanks
You have to set cookies for the request to succeed.
You should first check whether the site forbids scraping:
robotstxt::paths_allowed(paths = "https://www.sahibinden.com/satilik", warn = FALSE)
-> robotstxt does not seem to forbid it, so I will share the "theoretical" code, but not the required cookie data, since that always depends on the user.
The full code would look like this:
library(xml2)
library(httr)
library(rvest)  # needed for html_nodes() and html_text()
library(magrittr)
library(DT)
url <- "https://www.sahibinden.com/satilik"
YOUR_COOKIE_DATA <- NULL
if (is.null(YOUR_COOKIE_DATA)) {
  stop("You did not set your cookie data. ",
       "Also please check whether the terms of use allow scraping.")
}
response <- url %>% GET(add_headers(.headers = c(Cookie = YOUR_COOKIE_DATA))) %>%
content(type = "text", encoding = "UTF-8")
xpathes <- data.frame(
  XPath0 = 'td[2]',
  XPath1 = 'td[3]/a[1]',
  XPath2 = 'td/span[1]',
  XPath3 = 'td/span[2]',
  XPath4 = 'td[4]',
  XPath5 = 'td[5]',
  XPath6 = 'td[6]',
  XPath7 = 'td[7]',
  XPath8 = 'td[8]',
  stringsAsFactors = FALSE  # keep the XPaths as character strings, not factors
)
nodes <- response %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/div/div/form/div/div/table/tbody/tr")
output <- lapply(xpathes, function(xpath) {
  lapply(nodes, function(node) {
    html_nodes(x = node, xpath = xpath) %>%
      {ifelse(length(.), yes = html_text(.), no = NA)}
  }) %>% unlist
})
output %>% data.frame %>% DT::datatable()
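The per-node `ifelse(length(.), ...)` guard is what keeps the columns aligned when a cell is missing: a row whose XPath matches nothing contributes NA instead of silently shrinking the vector. A small self-contained illustration of the same pattern (the two-row table below is made up):

```r
library(xml2)
library(rvest)
library(magrittr)

# Made-up table; the second row has no link in its second cell.
doc <- read_html(
  "<table><tr><td>1</td><td><a>first</a></td></tr>
          <tr><td>2</td></tr></table>")
rows <- html_nodes(doc, xpath = "//tr")

# Same guard as above: NA where the XPath matches nothing,
# so every column keeps exactly one entry per row.
links <- sapply(rows, function(node) {
  html_nodes(node, xpath = "td[2]/a") %>%
    {ifelse(length(.), yes = html_text(.), no = NA)}
})
print(links)  # "first" NA
```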
On the right to scrape website data: I tried to follow Should questions that violate API Terms of Service be flagged?. In this case, though, it is only a "potential violation".
Reading the cookies programmatically:
I am not sure whether the browser can be skipped entirely:
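As for assembling `YOUR_COOKIE_DATA` once you have copied the name/value pairs out of the browser's developer tools: the Cookie request header is just `name=value` pairs joined with `"; "`, so (with made-up cookie names and values) something like:

```r
# Hypothetical cookie pairs copied from the browser; names/values are invented.
cookies <- c(st = "abc123", vid = "42")

# Collapse into the single string expected by the Cookie request header.
YOUR_COOKIE_DATA <- paste(names(cookies), cookies, sep = "=", collapse = "; ")
print(YOUR_COOKIE_DATA)  # "st=abc123; vid=42"
```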