readLines() cannot open the connection

Question (votes: 0, answers: 1)

I am working in RStudio (RStudio 2023.03.0+386 "Cherry Blossom") and trying to call readLines() on an HTTP address that I know is correct.

The code is as follows:

con <- url("http://biostat.jhsph.edu/~jleek/contact.html")
htmlCode <- readLines(con)
close(con)

The error I get is:

Error in readLines(con) : 
    cannot open the connection to 'https://biostat.jhsph.edu/~jleek/contact.html'
In addition: Warning message:
  In readLines(con) :
    URL 'https://biostat.jhsph.edu/~jleek/contact.html': status was 'SSL connect error'

Here is the sessionInfo() output:

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RMySQL_0.10.25 DBI_1.1.3      sqldf_0.4-11   RSQLite_2.3.1  gsubfn_0.7     proto_1.0.0    httpuv_1.6.9
[8] httr_1.4.5     readr_2.1.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10      rstudioapi_0.14  magrittr_2.0.3   hms_1.1.3        bit_4.0.5        R6_2.5.1
 [7] rlang_1.1.0      fastmap_1.1.1    fansi_1.0.4      blob_1.2.4       tcltk_4.2.3      tools_4.2.3
[13] utf8_1.2.3       cli_3.6.0        bit64_4.0.5      tibble_3.2.0     lifecycle_1.0.3  tzdb_0.3.0
[19] later_1.3.0      vctrs_0.6.0      promises_1.2.0.1 cachem_1.0.7     memoise_2.0.1    glue_1.6.2
[25] compiler_4.2.3   pillar_1.9.0     chron_2.3-60     pkgconfig_2.0.3

Tags: r, windows, ssl, web-scraping, readlines

1 Answer (0 votes)

Your code actually works fine for me, but I am running Linux, so it is hard to say what goes wrong on Windows. You may need to install OpenSSL.
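
Before installing anything, it may help to check what the R build itself reports about its libcurl and SSL support. The following is only a diagnostic sketch: curl_ssl_info() is a hypothetical helper name, and it relies solely on base R's capabilities() and libcurlVersion(). It does not prove that OpenSSL is the culprit; it only shows which TLS library this build's libcurl was compiled against.

```r
# Hypothetical helper: summarise the libcurl/SSL support of this R build.
curl_ssl_info <- function() {
  v <- libcurlVersion()  # character string with descriptive attributes
  list(
    has_libcurl     = unname(capabilities("libcurl")),
    libcurl_version = as.character(v),
    ssl_backend     = attr(v, "ssl_version"),          # TLS library name/version
    supports_https  = "https" %in% attr(v, "protocols")
  )
}

str(curl_ssl_info())
```

If ssl_backend names something other than OpenSSL, installing OpenSSL separately is unlikely to change how url() behaves.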

You can try a different method argument for url():

con <- url("https://biostat.jhsph.edu/~jleek/contact.html", method='libcurl')
htmlCode <- readLines(con)
close(con)
head(htmlCode, 5)
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
# [2] ""                                                                                                                 
# [3] "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"                                        
# [4] ""                                                                                                                 
# [5] "<head>"    
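
Since it is not known in advance which method works on a given machine, one option is to try several in turn. The sketch below assumes a hypothetical helper, read_url_with_methods(); its default list of methods and the opener argument (which exists only so the connection function can be swapped out) are illustrative choices, not part of the original answer. A method that is unsupported on the current platform simply raises an error and is skipped.

```r
# Hypothetical helper: try each url() method until one succeeds.
read_url_with_methods <- function(address,
                                  methods = c("libcurl", "wininet", "default"),
                                  opener = url) {
  for (m in methods) {
    lines <- tryCatch({
      con <- opener(address, method = m)
      # Always close the connection, whether or not reading succeeds.
      tryCatch(readLines(con, warn = FALSE), finally = close(con))
    }, error = function(e) {
      message("method '", m, "' failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(lines)) return(lines)
  }
  stop("all methods failed for ", address)
}

# Usage (requires network access):
# htmlCode <- read_url_with_methods("https://biostat.jhsph.edu/~jleek/contact.html")
# head(htmlCode, 5)
```

The messages printed for the failing methods show which backend produces the SSL error, which is often more informative than the final result.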

Or without url():

htmlCode <- readLines('https://biostat.jhsph.edu/~jleek/contact.html')
head(htmlCode, 1)
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"

Or, as a workaround, try downloading the file first and then reading it (note that download.file also has a method argument):

tmp <- tempfile()
download.file('https://biostat.jhsph.edu/~jleek/contact.html', tmp)  
htmlCode <- readLines(tmp)
unlink(tmp)
head(htmlCode, 1)
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
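
The download-then-read workaround can be wrapped so that the method is explicit, the return code is checked, and the temporary file is always removed. This is a minimal sketch: download_then_read() is a hypothetical name, "libcurl" is just one plausible choice of method, and the downloader argument is there only so download.file can be replaced by a stand-in.

```r
# Hypothetical helper: download to a temporary file, then read it.
download_then_read <- function(address, method = "libcurl",
                               downloader = download.file) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp), add = TRUE)  # clean up even if something fails

  # download.file() returns 0 (invisibly) on success.
  status <- downloader(address, tmp, method = method, quiet = TRUE)
  if (!identical(as.integer(status), 0L)) {
    stop("download failed with status ", status)
  }
  readLines(tmp, warn = FALSE)
}

# Usage (requires network access):
# htmlCode <- download_then_read("https://biostat.jhsph.edu/~jleek/contact.html")
# head(htmlCode, 1)
```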

Or use a package, for example:

XML::htmlTreeParse(RCurl::getURL('https://biostat.jhsph.edu/~jleek/contact.html'))$children$html
# <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
#   <head>
#   <meta name="Description" content="Welcome to Jeff Leek&apos;s Research Group"/>
# ...

Hope this helps.
