Web scraping in a loop in R


I want to scrape a list of drugs from the BNF website https://bnf.nice.org.uk/drug/

Let's take carbamazepine as an example - https://bnf.nice.org.uk/drug/carbamazepine.html#indicationsAndDoses

I would like the code below to loop through each indication for this drug and return the patient group and dose for each indication. This becomes a problem when I eventually want a data frame, because there are 7 indications but around 9 patient groups and doses, so the lengths don't match up.

At the moment, the indication variable I get looks like -

[1] "Focal and secondary generalised tonic-clonic seizures"  
[2] "Primary generalised tonic-clonic seizures"              
[3] "Trigeminal neuralgia"                                   
[4] "Prophylaxis of bipolar disorder unresponsive to lithium"
[5] "Adjunct in acute alcohol withdrawal "                   
[6] "Diabetic neuropathy"                                    
[7] "Focal and generalised tonic-clonic seizures"

and a patient group variable that looks like -

[1] "For \n                        Adult\n                    "                 
[2] "For \n                        Elderly\n                    "               
[3] "For \n                        Adult\n                    "                 
[4] "For \n                        Adult\n                    "                 
[5] "For \n                        Adult\n                    "                 
[6] "For \n                        Adult\n                    "                 
[7] "For \n                        Adult\n                    "                 
[8] "For \n                        Child 1 month–11 years\n                    "
[9] "For \n                        Child 12–17 years\n                    " 
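The embedded newlines and padding in `pt_group` can be collapsed with base R before assembling the data frame; a minimal sketch:

```r
## Collapse runs of whitespace to a single space, then trim the ends
pt_group_clean <- trimws(gsub("\\s+", " ", pt_group))
## "For \n                        Adult\n    " becomes "For Adult"
```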

What I would like is -

Indication                                                         Pt group     
[1] "Focal and secondary generalised tonic-clonic seizures"       For Adult
[1] "Focal and secondary generalised tonic-clonic seizures"       For elderly
[2] "Primary generalised tonic-clonic seizures"                   For Adult

and so on...

Here is my code -

library(rvest)
library(dplyr)

url_list <- paste0("https://bnf.nice.org.uk/drug/", druglist, ".html#indicationsAndDoses")

url_list


## The scraping bit - we are going to extract key bits of information for each drug in the list and create a data frame

drug_table <- data.frame() # an empty data frame


for (i in seq_along(url_list)) {

  page <- read_html(url_list[i]) # parse each page once

  ## Extract drug name
  drug <- page %>%
    html_nodes("span") %>%
    html_text() %>%
    .[7]

  ## Extract indications
  indication <- page %>%
    html_nodes(".indication") %>%
    html_text() %>%
    unique()

  ## Extract patient groups
  pt_group <- page %>%
    html_nodes(".patientGroupList") %>%
    html_text()

  ln <- length(pt_group)

  ## Extract dose info per patient group
  dose <- page %>%
    html_nodes("p") %>%
    html_text() %>%
    .[2:(1 + ln)]

  ## Combine patient group and dose
  dose1 <- cbind(pt_group, dose)

  ## Create the data frame - this is where it breaks, because
  ## indication (7 values) and dose1 (~9 rows) have different lengths
  drug_df <- data.frame(Drug = drug, Indication = indication, Dose = dose1)

  ## Combine data
  drug_table <- bind_rows(drug_table, drug_df)
}
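One way to keep each patient group attached to its own indication is to select one node per indication block and descend into it, instead of pulling flat lists from the whole page. The sketch below is an assumption-laden outline: the `.indicationAndDoseGroup` wrapper selector is a guess and must be checked against the page's actual markup in the Inspect Element window.

```r
library(rvest)
library(dplyr)

page <- read_html("https://bnf.nice.org.uk/drug/carbamazepine.html#indicationsAndDoses")

## One node per indication block; this selector is an assumption,
## verify the real class name with Inspect Element.
sections <- page %>% html_nodes(".indicationAndDoseGroup")

drug_df <- bind_rows(lapply(sections, function(sec) {
  ## First indication heading within this block
  ind <- sec %>% html_node(".indication") %>% html_text(trim = TRUE)
  ## All patient groups within the SAME block
  grp <- sec %>% html_nodes(".patientGroupList") %>% html_text(trim = TRUE)
  if (length(grp) == 0) grp <- NA_character_  # keep indications with no group
  data.frame(Indication = ind, `Pt group` = grp, check.names = FALSE)
}))
```

Because `grp` can hold several patient groups, `data.frame()` repeats `ind` once per group, which gives exactly the long Indication / Pt group pairing shown above.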
r web-scraping

1 Answer

That site is actually blocked for me! I can't see anything there, but I can show you how, basically, this should be done.

The html_nodes() function pulls out every element matching a tag or CSS selector as a node set, which you can then convert to text or a data frame.

library(rvest)

## Loading required package: xml2

# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"

    scistarter_html <- read_html(URL)
    scistarter_html

## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n    \n    \n    <svg style="position: absolute; width: 0; he ...

This isn't useful on its own, but it confirms that we can retrieve the same HTML code we see in the browser. Now we start filtering the HTML to find the data we are after.

The data we want is stored in a table, which we can tell by looking at the "Inspect Element" window.

This grabs every node that contains a link:

    scistarter_html %>%
      html_nodes("a") %>%
      head()

## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] <a href="/dashboard">My Account</a>
## [3] <a href="/finder" class="is-active">Project Finder</a>
## [4] <a href="/events">Event Finder</a>
## [5] <a href="/people-finder">People Finder</a>
## [6] <a href="#dialog-login" rel="modal:open">log in</a>

In a more complicated example, we could use this to "crawl" across pages, but that's for another day.

Every div on the page:

    scistarter_html %>%
      html_nodes("div") %>%
      head()

## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n        <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n          < ...
## [3] <div class="nav-tools">\n            <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n              <div class="field">\n ...
## [5] <div class="field">\n                <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n                    <d ...

… the nav-tools div. This selects by CSS class, i.e. where class="nav-tools":

    scistarter_html %>%
      html_nodes("div.nav-tools") %>%
      head()

## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n            <div class="nav-tools__search"> ...

We can call the nodes by id as follows.

    scistarter_html %>%
      html_nodes("div#project-listing") %>%
      head()

## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n          \n       ...

All the tables, as follows:

    scistarter_html %>%
      html_nodes("table") %>%
      head()

## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
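Once the table nodes are selected, rvest can parse them straight into data frames with html_table(); a minimal sketch (the resulting columns depend on the live page):

```r
## Parse every matched <table> node into a data frame;
## fill = TRUE pads rows that are missing cells.
tables <- scistarter_html %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

str(tables[[1]]) # inspect the first parsed table
```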

See the (related) link below for more detail.

https://rpubs.com/Radcliffe/superbowl
