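I am trying to scrape data from a website using rvest. I read in the page's HTML and extracted the form. I then used rvest::html_form_set() to modify the form fields and tried to submit it. Looking at the form, I realized there is no submit button: the only button on the page is an anchor tag whose href is a script. I tried rvest::session_follow_link(), but could not get the data. Here is the code that does not work: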
library(magrittr)  # provides %>%
trademark_search_page <- rvest::session('https://ipindiaonline.gov.in/tmrpublicsearch/frmmain.aspx')
search_form <- rvest::html_form(trademark_search_page)[[1]]
search_form <- search_form %>%
  rvest::html_form_set(`ctl00$ContentPlaceHolder1$TBWordmark` = 'Bull',
                       `ctl00$ContentPlaceHolder1$TBClass` = 32)
resp <- trademark_search_page %>%
  rvest::session_submit(search_form) %>%
  rvest::session_follow_link(xpath = '//a[@id = "ContentPlaceHolder1_BtnSearch"]')
Any suggestions on what I should do?
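I think this may be tricky with rvest, because the button's href runs JavaScript rather than pointing to a URL. If you are open to other tools, here is how to do it with RSelenium: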
# load libraries
library(RSelenium)
# define url ---------------------------------------------------------
url <- "https://ipindiaonline.gov.in/tmrpublicsearch/frmmain.aspx"
# define search terms ------------------------------
word_mark <- "Bull"
class_search_term <- "32"
# start RSelenium ------------------------------------------------------------
rD <- rsDriver(browser="firefox", port=4548L, chromever = NULL)
remDr <- rD[["client"]]
# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)
# fill in the form ------------------------------------------------
# this finds the html element for each part of the form
# and fills it in with the value we want
# Wordmark
remDr$findElement(using = "id", value = "ContentPlaceHolder1_TBWordmark")$sendKeysToElement(list(word_mark))
# Class
remDr$findElement(using = "id", value = "ContentPlaceHolder1_TBClass")$sendKeysToElement(list(class_search_term))
# click submit button ---------------------------------------
remDr$findElements("id", "ContentPlaceHolder1_BtnSearch")[[1]]$clickElement()
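The click triggers a page update, so the results may not be present the instant clickElement() returns. A minimal sketch of a polling wait is below. wait_for_elements is a hypothetical helper (not part of RSelenium), and the 10 second timeout and 0.5 second poll interval are arbitrary assumptions; it relies only on the client's findElements() method.

```r
# Hypothetical helper: poll until at least one element matches the CSS
# selector, or stop with an error once the timeout has elapsed.
wait_for_elements <- function(remDr, css, timeout = 10, poll = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElements() returns an empty list when nothing matches
    found <- remDr$findElements(using = "css selector", value = css)
    if (length(found) > 0) {
      return(found)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for elements matching: ", css)
    }
    Sys.sleep(poll)
  }
}
```

For example, calling wait_for_elements(remDr, ".LnkshowDetails") right after the click would block until the details links appear, assuming the results page uses that class.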
Once you are on the results page, you can use rvest to get the list of "more details" links:
library(rvest)
library(magrittr)
# pull html from page
html <- remDr$getPageSource()[[1]]
# find all the html elements with the .LnkshowDetails class
more_details_buttons <- html %>%
  read_html() %>%
  html_nodes(".LnkshowDetails") %>%
  html_attr("id")
You can then loop over all the buttons and click them, or extract the data.
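One way the loop could look is sketched below. collect_details is a hypothetical helper, and it makes assumptions I have not verified against this site: that each details link opens in the same window, that goBack() returns to the results list with the same element ids, and that a fixed pause is enough for each page to load. Adjust it if the details open elsewhere.

```r
# Hypothetical helper: click each details link by id, keep the resulting
# page source, then navigate back to the results list.
collect_details <- function(remDr, button_ids, pause = 2) {
  pages <- vector("list", length(button_ids))
  for (i in seq_along(button_ids)) {
    # locate the link again on every iteration, since the page reloads
    remDr$findElement(using = "id", value = button_ids[i])$clickElement()
    Sys.sleep(pause)                          # crude wait for the details page
    pages[[i]] <- remDr$getPageSource()[[1]]  # raw HTML as a string
    remDr$goBack()                            # assumption: back restores the list
    Sys.sleep(pause)
  }
  pages
}
```

Passing the vector of ids from the previous step gives a list of HTML strings, each of which can be parsed with read_html() and queried with html_nodes() as before.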