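I have several web pages that I would like to study in Markdown format. The problem I am facing is that the Markdown output can be very messy and full of useless tags. I would like to get all of the text between the ::: fences of a block with a specific name. Here is my attempt at a reproducible example (I cut part of the output because it is really large):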
library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"
page = read_html(link)
xml2::write_html(page, file = "SO_page.html")
pandoc_convert("SO_page.html", to = "markdown")
::: site-footer--col
##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
- [About](https://stackoverflow.co/){.js-gps-track .-link
gps-track="footer.click({ location: 4, link: 1 })"}
- [Press](https://stackoverflow.co/company/press/){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 27 })"}
- [Work
Here](https://stackoverflow.co/company/work-here/){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 9 })"}
- [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
gps-track="footer.click({ location: 4, link: 7 })"}
- [Privacy
Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 8 })"}
- [Terms of
Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 37 })"}
- [Contact Us](/contact){.js-gps-track .-link
gps-track="footer.click({ location: 4, link: 13 })"}
- [Cookie Settings]{#consent-footer-link}
- [Cookie
Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 39 })"}
:::
Created on 2024-04-29 with reprex v2.1.0
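Now I would like to get all of the text inside the site-footer--col blocks. The problem is that there are many callout blocks with different names. It is also unclear where a block ends, since the closing ::: carries no name; in an IDE it is only shown in a different color. So I was wondering if anyone knows how to extract the text of a specific callout block? Note that I don't want to use the HTML output, only the Markdown output, because of its format.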
Do I understand correctly that you need to extract the text between ::: site-footer--col and the next :::?
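I modified the pandoc_convert() call to write the result to SO_page.md so that I can read it back in as text, then used stringr::str_extract_all() to pull out the desired text.
The arguments dotall = TRUE and multiline = TRUE let the regular expression work across multiple lines of the document: with dotall, . also matches newlines, and with multiline, ^ matches at the start of every line.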
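To see what the two flags change, here is a minimal sketch on a small inline string. The string md is a made-up miniature of the pandoc output, not the real page, and the names no_flags and with_flags are only for illustration.

```r
library(stringr)

# Hypothetical miniature of the pandoc output: one target block, one other block.
md <- paste0(
  "intro\n",
  "::: site-footer--col\n- [About](/about)\n- [Press](/press)\n:::\n\n",
  "::: other-block\ntext\n:::\n"
)

# Without the flags, "." stops at a newline and "^" only matches at the
# very start of the string, so a block spanning several lines is not found.
no_flags <- regex("\\n::: site-footer--col.+?^:::")
without <- str_extract_all(md, no_flags)[[1]]
length(without)

# dotall = TRUE lets "." cross newlines; multiline = TRUE lets "^" match
# at the start of each line, so the lazy ".+?" stops at the first closing fence.
with_flags <- regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)
blocks <- str_extract_all(md, with_flags)[[1]]
cat(blocks, sep = "\n\n")
```

Because the pattern starts with \\n, a block on the very first line of the file would not be matched; that does not seem to matter for this page, where the footer comes last.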
library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"
page = read_html(link)
xml2::write_html(page, file = "SO_page.html")
pandoc_convert("SO_page.html", to = "markdown", output = "SO_page.md")
markdown <- readr::read_file("SO_page.md")
pattern <- stringr::regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)
footers <- stringr::str_extract_all(markdown, pattern)[[1]]
cat(footers, sep = "\n\n")
#>
#> ::: site-footer--col
#> ##### [Stack Overflow](https://stackoverflow.com){.js-gps-track gps-track="footer.click({ location: 4, link: 15})"} {#stack-overflow .-title}
#>
#> - [Questions](/questions){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 16})"}
#> - [Help](/help){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 3 })"}
#> :::
#>
#>
#> ::: site-footer--col
#> ##### [Products](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 19 })"} {#products .-title}
#>
#> - [Teams](https://stackoverflow.co/teams/){.js-gps-track .-link
#> ga="[\"teams traffic\",\"footer - site nav\",\"stackoverflow.com/teams\",null,{\"dimension4\":\"teams\"}]"
#> gps-track="footer.click({ location: 4, link: 29 })"}
#> - [Advertising](https://stackoverflow.co/advertising/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 21 })"}
#> - [Collectives](https://stackoverflow.co/collectives/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 40 })"}
#> - [Talent](https://stackoverflow.co/talent/){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 20 })"}
#> :::
#>
#>
#> ::: site-footer--col
#> ##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
#>
#> - [About](https://stackoverflow.co/){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 1 })"}
#> - [Press](https://stackoverflow.co/company/press/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 27 })"}
#> - [Work
#> Here](https://stackoverflow.co/company/work-here/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 9 })"}
#> - [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 7 })"}
#> - [Privacy
#> Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 8 })"}
#> - [Terms of
#> Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 37 })"}
#> - [Contact Us](/contact){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 13 })"}
#> - [Cookie Settings]{#consent-footer-link}
#> - [Cookie
#> Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 39 })"}
#> :::
Created on 2024-04-29 with reprex v2.1.0
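If only the inner text is wanted, without the ::: fence lines, a line-based pass in base R is one possible alternative to the regular expression above. This is a rough sketch, not part of the original answer: extract_divs() is a hypothetical helper, and it assumes the opening fence is exactly "::: name", the closing fence is a line containing only :::, and the target blocks contain no nested ::: blocks.

```r
# Hypothetical helper: return the inner lines of every block opened by
# "::: <name>" and closed by the next line that contains only ":::".
# Assumption: target blocks do not contain nested ":::" blocks.
extract_divs <- function(lines, name) {
  open_line <- paste(":::", name)
  blocks <- list()
  current <- character(0)
  inside <- FALSE
  for (line in lines) {
    if (!inside && identical(trimws(line), open_line)) {
      # Opening fence of a target block: start collecting.
      inside <- TRUE
      current <- character(0)
    } else if (inside && grepl("^:::\\s*$", line)) {
      # Closing fence: store the collected lines and stop collecting.
      blocks[[length(blocks) + 1]] <- current
      inside <- FALSE
    } else if (inside) {
      current <- c(current, line)
    }
  }
  blocks
}

# Made-up sample lines standing in for the real Markdown file.
md_lines <- c(
  "::: other-block", "skip me", ":::",
  "::: site-footer--col", "- [About](/about)", "- [Press](/press)", ":::"
)
result <- extract_divs(md_lines, "site-footer--col")
print(result)
```

On the real file, the lines could be read with readLines("SO_page.md") and passed in the same way.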