I want to use web scraping to read the information in each record here


I want to extract the following information from each mission on this website:

https://aad.archives.gov/aad/display-partial-records.jsp?dt=1802&sc=23947%2C23905%2C23906%2C23880%2C23907%2C23889%2C23890%2C23892%2C23893%2C23894&cat=all&tf=F&bc=%2Csl%2Cfd&q=&as_alq=&as_anq=&as_epq=&as_woq=&nfo_23947=V%2C1%2C1900&cl_23947=&nfo_23905=V%2C25%2C1900&op_23905=0&txt_23905=&nfo_23906=V%2C2%2C1900&cl_23906=03&nfo_23880=D%2C6%2C1966&op_23880=3&txt_23880=&txt_23880=&nfo_23907=D%2C6%2C1966&op_23907=3&txt_23907=&txt_23907=&nfo_23889=V%2C10%2C1900&op_23889=0&txt_23889=&nfo_23890=V%2C10%2C1900&op_23890=0&txt_23890=&nfo_23892=V%2C1%2C1900&cl_23892=E%2CX%2CA%2C7%2C%3D%2CQ%2CR%2CI%2C3%2CV&nfo_23893=V%2C2%2C1900&cl_23893=J0&nfo_23894=N%2C5%2C1900&op_23894=6&txt_23894=0&txt_23894=&rpp=50

  • Operation name
  • Operation type code
  • Major province code
  • UTM map coordinates
  • Number destroyed or killed

Using rvest, I tried to extract the href of each mission under "View Record", but had no luck:

 results <- read_html("https://aad.archives.gov/aad/display-partial-records.jsp?dt=1802&sc=23947%2C23905%2C23906%2C23880%2C23907%2C23889%2C23890%2C23892%2C23893%2C23894&cat=all&tf=F&bc=%2Csl%2Cfd&q=&as_alq=&as_anq=&as_epq=&as_woq=&nfo_23947=V%2C1%2C1900&cl_23947=&nfo_23905=V%2C25%2C1900&op_23905=0&txt_23905=&nfo_23906=V%2C2%2C1900&cl_23906=03&nfo_23880=D%2C6%2C1966&op_23880=3&txt_23880=&txt_23880=&nfo_23907=D%2C6%2C1966&op_23907=3&txt_23907=&txt_23907=&nfo_23889=V%2C10%2C1900&op_23889=0&txt_23889=&nfo_23890=V%2C10%2C1900&op_23890=0&txt_23890=&nfo_23892=V%2C1%2C1900&cl_23892=E%2CX%2CA%2C7%2C%3D%2CQ%2CR%2CI%2C3%2CV&nfo_23893=V%2C2%2C1900&cl_23893=J0&nfo_23894=N%2C5%2C1900&op_23894=6&txt_23894=0&txt_23894=")
  

missions_url <- results %>% 
  html_nodes("tbody td:nth-child(1)") %>% 
  html_text()

Please tell me how to extract the information listed above. Thank you.

r web-scraping rvest
1 Answer

I was able to do this with the following code:

library(rvest)
library(RDCOMClient)  # Windows only; drives Internet Explorer through COM

url <- "https://aad.archives.gov/aad/display-partial-records.jsp?dt=1802&sc=23947%2C23905%2C23906%2C23880%2C23907%2C23889%2C23890%2C23892%2C23893%2C23894&cat=all&tf=F&bc=%2Csl%2Cfd&q=&as_alq=&as_anq=&as_epq=&as_woq=&nfo_23947=V%2C1%2C1900&cl_23947=&nfo_23905=V%2C25%2C1900&op_23905=0&txt_23905=&nfo_23906=V%2C2%2C1900&cl_23906=03&nfo_23880=D%2C6%2C1966&op_23880=3&txt_23880=&txt_23880=&nfo_23907=D%2C6%2C1966&op_23907=3&txt_23907=&txt_23907=&nfo_23889=V%2C10%2C1900&op_23889=0&txt_23889=&nfo_23890=V%2C10%2C1900&op_23890=0&txt_23890=&nfo_23892=V%2C1%2C1900&cl_23892=E%2CX%2CA%2C7%2C%3D%2CQ%2CR%2CI%2C3%2CV&nfo_23893=V%2C2%2C1900&cl_23893=J0&nfo_23894=N%2C5%2C1900&op_23894=6&txt_23894=0&txt_23894=&rpp=50"
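# Load the page in Internet Explorer so the HTML is read from the rendered document
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(10)  # crude fixed wait for the page to finish loading
doc <- IEApp$document()
web_Obj_Table <- doc$getElementByID("queryResults")  # results table element (not used below)
# Parse the rendered HTML with rvest and extract every table on the page
html_Content <- read_html(doc$Body()$innerHtml())
list_Html_Table <- html_table(html_Content)
list_Html_Table[[2]]  # the second table holds the query results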

# A tibble: 51 × 11
   `View Record` `FORCE NATIONALITY` `OPERATION NAME`   `MAJOR PROVINCE CODE` `INITIATION DATE` `TERMINATION DATE` `BRIGADE DESIGNATION` `DIVISION DESIGNATION` `LOSS NATIONALITY`
   <lgl>         <chr>               <chr>              <chr>                 <chr>             <chr>              <chr>                 <chr>                  <chr>             
 1 NA            ""                  ""                 ""                    ""                ""                 ""                    ""                     ""                
 2 NA            "RVN"               "NGU HOANH SON"    "Quang Nam"           "10/22/2065"      ""                 ""                    ""                     "RVN"             
 3 NA            "Marine"            "SUWANNEE"         "Quang Nam"           "08/13/1966"      ""                 "9  MAR"              "3  MAR"               "RVN"             
 4 NA            "RVN"               "HOA TUYEN 147"    "Quang Nam"           "08/19/1966"      "08/27/1966"       ""                    ""                     "RVN"             
 5 NA            "RVN"               "HOA TUYEN 149"    "Quang Nam"           "09/01/1966"      ""                 "51 INF"              "2  INF"               "RVN"             
 6 NA            "RVN"               "HOA TUYEN 149"    "Quang Nam"           "09/01/1966"      "09/04/1966"       "INF"                 "2  INF"               "RVN"             
 7 NA            "RVN"               "HOA TUYEN 153"    "Quang Nam"           "09/19/1966"      "09/24/1966"       ""                    ""                     "RVN"             
 8 NA            "RVN"               "TAO THANH DUY"    "Quang Nam"           "09/24/1966"      "09/26/1966"       ""                    ""                     "RVN"             
 9 NA            "RVN"               "HOA TUYEN 154"    "Quang Nam"           "09/30/1966"      ""                 ""                    ""                     "RVN"             
10 NA            "RVN"               "TRUY KICH TRD 51" "Quang Nam"           "10/16/1966"      ""                 "51 INF"              "2  INF"               "RVN"             
# ℹ 41 more rows
# ℹ 2 more variables: `LOSS CODE` <chr>, `NUMBER DESTROYED OF KILLED` <int>
# ℹ Use `print(n = ...)` to see more rows
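
The `View Record` column comes back as `NA` because `html_table()` keeps only the cell text, and those cells appear to hold just a link. A minimal sketch of pulling the link targets as well, assuming each first-column cell contains an `<a>` element: the HTML snippet, the `record-detail.jsp` paths, and the names `sample_html`, `base_url`, and `record_urls` are hypothetical stand-ins, not taken from the live page.

```r
library(rvest)

# Hypothetical stand-in for the rendered page body; the real markup may differ.
sample_html <- '
<table id="queryResults">
  <thead><tr><th>View Record</th><th>OPERATION NAME</th></tr></thead>
  <tbody>
    <tr><td><a href="record-detail.jsp?dt=1802&amp;rid=1"><img src="v.gif"></a></td><td>SUWANNEE</td></tr>
    <tr><td><a href="record-detail.jsp?dt=1802&amp;rid=2"><img src="v.gif"></a></td><td>HOA TUYEN 147</td></tr>
  </tbody>
</table>'

page <- read_html(sample_html)

# html_text() returns the visible text of a cell; the link target lives in the
# href attribute, so select the <a> elements and read that attribute instead.
links <- page %>%
  html_nodes("#queryResults tbody td:nth-child(1) a") %>%
  html_attr("href")

# Resolve relative links against the directory of the results page
# (assumed here to be the /aad/ path from the question's URL).
base_url <- "https://aad.archives.gov/aad/"
record_urls <- xml2::url_absolute(links, base_url)

# Attach the links to the parsed table, one per row.
records <- html_table(html_node(page, "#queryResults"))
records$record_url <- record_urls
```

With the answer's approach, `html_Content` would take the place of `page`. Each entry in `record_urls` could then be fetched and parsed in turn to look for fields such as the operation type code and UTM coordinates; the layout of those detail pages is not shown here, so that step is left out.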