Clicking a submit link in rvest

Question (votes: 0, answers: 1)

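I'm trying to scrape data from a website using rvest. I read in the page's HTML and extracted the form. I then used rvest::html_form_set() to change the form's fields and tried to submit it. Looking at the form, I realized it has no submit button. The button shown on the site is an anchor tag whose href is a script. I tried rvest::session_follow_link(), but I couldn't get the data. Here is the code that doesn't work: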

library(magrittr)  # provides %>%

# open a session and grab the first form on the page
trademark_search_page <- rvest::session('https://ipindiaonline.gov.in/tmrpublicsearch/frmmain.aspx')
search_form <- rvest::html_form(trademark_search_page)[[1]]

# fill in the word mark and class fields
search_form <- search_form %>%
  rvest::html_form_set(`ctl00$ContentPlaceHolder1$TBWordmark` = 'Bull',
                       `ctl00$ContentPlaceHolder1$TBClass` = 32)

# submit, then try to follow the search link (this does not return the data)
resp <- trademark_search_page %>%
  rvest::session_submit(search_form) %>%
  rvest::session_follow_link(xpath = '//a[@id = "ContentPlaceHolder1_BtnSearch"]')

Any suggestions on what I should do?

r rvest httr
1 Answer
Votes: 1

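I think this may be tricky with rvest, because the button refers to a JavaScript script and rvest does not execute JavaScript. If you're open to other tools, here is how to do it with RSelenium: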

# load libraries
library(RSelenium)

# define url ---------------------------------------------------------
url <- "https://ipindiaonline.gov.in/tmrpublicsearch/frmmain.aspx"


# define search terms ------------------------------
word_mark <- "Bull"
class_search_term <- "32"

# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4548L, chromever = NULL)
remDr <- rD[["client"]]

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)


# fill in the form ------------------------------------------------
# this finds the html element for each part of the form
# and fills it in with the value we want

# Wordmark
remDr$findElement(using = "id", value = "ContentPlaceHolder1_TBWordmark")$sendKeysToElement(list(word_mark))

# Class
remDr$findElement(using = "id", value = "ContentPlaceHolder1_TBClass")$sendKeysToElement(list(class_search_term))


# click submit button ---------------------------------------

remDr$findElement(using = "id", value = "ContentPlaceHolder1_BtnSearch")$clickElement()
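
The click triggers a page update, so it may help to wait until the results have appeared before reading the page source. Below is a minimal sketch of a polling helper. The function name wait_for_elements and the timeout and interval defaults are illustrative assumptions, not part of RSelenium; it relies only on findElements() returning an empty list when nothing matches.

```r
# Poll until at least one element matches the locator, or give up.
# `remDr` is an RSelenium client; `timeout` and `interval` (in seconds)
# are illustrative defaults, not values taken from the site.
wait_for_elements <- function(remDr, using, value, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- remDr$findElements(using = using, value = value)
    if (length(found) > 0) {
      return(found)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for element: ", value)
    }
    Sys.sleep(interval)
  }
}

# Example (needs a live session): wait for the "more details" links
# links <- wait_for_elements(remDr, "css selector", ".LnkshowDetails")
```

If the results never show up, the helper stops with an error instead of silently returning the old page.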

Clicking the search button takes you to the results page.

Once you're on this page, you can use rvest to get the list of "more details" links:

library(rvest)
library(magrittr)

# pull html from page
html <- remDr$getPageSource()[[1]]

# find all the html elements with the .LnkshowDetails class
# and keep their id attributes
more_details_buttons <- html %>% read_html() %>% 
  html_nodes(".LnkshowDetails") %>%
  html_attr("id")

You can then loop over all the buttons and either click them or extract the data.
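
A rough sketch of such a loop follows. The names click_each and after_click are hypothetical helpers introduced here for illustration; by default the callback just captures the page source after each click. It assumes the results list is still on the page after each click. If a click navigates away, you would likely need to return first (for example with remDr$goBack()) before looking up the next id.

```r
# Click each element by id and collect whatever `after_click` returns.
# `click_each` and `after_click` are hypothetical names for this sketch.
click_each <- function(remDr, ids,
                       after_click = function(remDr) remDr$getPageSource()[[1]]) {
  results <- vector("list", length(ids))
  names(results) <- ids
  for (i in seq_along(ids)) {
    # look the element up again on every pass, then click it
    remDr$findElement(using = "id", value = ids[[i]])$clickElement()
    # store the callback's result (wrapped so a NULL does not drop the slot)
    results[i] <- list(after_click(remDr))
  }
  results
}

# Example (needs the live session from above); `ids` is the character
# vector of element ids extracted from the results page:
# pages <- click_each(remDr, ids)
```

Each entry in the returned list can then be parsed with read_html() in the same way as the results page.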
