How to get text between specific colons in Markdown


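I have multiple web pages that I want to examine in Markdown format. The problem I'm facing is that the Markdown output can be really messy, with useless tags in it. I would like to get all the text between specific `:::` colons that carry a certain name. Here I tried to make a reproducible example (I cut part of the output because it is really large):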

library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"

page = read_html(link)
xml2::write_html(page, file = "SO_page.html")

pandoc_convert("SO_page.html", to = "markdown")

::: site-footer--col
##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}

-   [About](https://stackoverflow.co/){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 1 })"}
-   [Press](https://stackoverflow.co/company/press/){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 27 })"}
-   [Work
    Here](https://stackoverflow.co/company/work-here/){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 9 })"}
-   [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 7 })"}
-   [Privacy
    Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 8 })"}
-   [Terms of
    Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 37 })"}
-   [Contact Us](/contact){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 13 })"}
-   [Cookie Settings]{#consent-footer-link}
-   [Cookie
    Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 39 })"}
:::

Created on 2024-04-29 with reprex v2.1.0

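Now I would like all the text inside the `site-footer--col` colons. The problem is that there are many callout blocks with specific names. It is also unclear where a block ends; in an IDE the closing colons are only shown in a different color. So I was wondering if anyone knows how to extract the text of a specific callout block? Note that I don't want to use the HTML output, only the Markdown output, because of its format.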

r markdown pandoc
1 Answer

Do I understand correctly that you need to extract the text between `::: site-footer--col` and the next `:::`?

I modified the `pandoc_convert()` call to write the result to SO_page.md so that I can read it in as text, and then used `stringr::str_extract_all()` to pull out the desired text.

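The arguments `dotall = TRUE` and `multiline = TRUE` allow us to search the document with a regular expression that spans multiple lines: `dotall` lets `.` match newline characters, and `multiline` lets `^` match at the start of every line rather than only at the start of the string.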
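To see what the two flags change, here is a minimal sketch on a toy string. The string `md` and the variable names are made up for illustration; only `stringr::regex()`, `str_detect()` and `str_extract_all()` come from the answer's toolkit.

```r
library(stringr)

# Toy Markdown with two fenced divs; only the first one has the wanted name.
md <- "intro\n::: site-footer--col\n- About\n- Press\n:::\n\n::: other\n- x\n:::\n"

# Without the flags, `.` does not match newlines, so the pattern cannot
# span the lines between the opening and the closing fence.
plain <- regex("\\n::: site-footer--col.+?^:::")
found_plain <- str_detect(md, plain)

# With both flags, the lazy `.+?` runs across lines and stops at the first
# line that starts with `:::`.
flagged <- regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)
found_flagged <- str_extract_all(md, flagged)[[1]]

print(found_plain)
cat(found_flagged, "\n")
```

Because the match is lazy, each hit ends at the first following line that begins with `:::`, whether or not that line is the matching closing fence.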

library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"

page = read_html(link)
xml2::write_html(page, file = "SO_page.html")

pandoc_convert("SO_page.html", to = "markdown", output = "SO_page.md")

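# Read the converted Markdown back in as a single string
markdown <- readr::read_file("SO_page.md")

# Lazily match from each "::: site-footer--col" line to the next line starting with ":::"
pattern <- stringr::regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)

# One element per matched block
footers <- stringr::str_extract_all(markdown, pattern)[[1]]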

cat(footers, sep = "\n\n")
#> 
#> ::: site-footer--col
#> ##### [Stack Overflow](https://stackoverflow.com){.js-gps-track gps-track="footer.click({ location: 4, link: 15})"} {#stack-overflow .-title}
#> 
#> -   [Questions](/questions){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 16})"}
#> -   [Help](/help){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 3 })"}
#> :::
#> 
#> 
#> ::: site-footer--col
#> ##### [Products](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 19 })"} {#products .-title}
#> 
#> -   [Teams](https://stackoverflow.co/teams/){.js-gps-track .-link
#>     ga="[\"teams traffic\",\"footer - site nav\",\"stackoverflow.com/teams\",null,{\"dimension4\":\"teams\"}]"
#>     gps-track="footer.click({ location: 4, link: 29 })"}
#> -   [Advertising](https://stackoverflow.co/advertising/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 21 })"}
#> -   [Collectives](https://stackoverflow.co/collectives/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 40 })"}
#> -   [Talent](https://stackoverflow.co/talent/){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 20 })"}
#> :::
#> 
#> 
#> ::: site-footer--col
#> ##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
#> 
#> -   [About](https://stackoverflow.co/){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 1 })"}
#> -   [Press](https://stackoverflow.co/company/press/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 27 })"}
#> -   [Work
#>     Here](https://stackoverflow.co/company/work-here/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 9 })"}
#> -   [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 7 })"}
#> -   [Privacy
#>     Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 8 })"}
#> -   [Terms of
#>     Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 37 })"}
#> -   [Contact Us](/contact){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 13 })"}
#> -   [Cookie Settings]{#consent-footer-link}
#> -   [Cookie
#>     Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 39 })"}
#> :::

Created on 2024-04-29 with reprex v2.1.0
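One caveat of the regex approach is that it stops at the first line starting with `:::`, which may belong to a nested block rather than to the block you asked for. If that matters for your pages, a line-based pass that tracks nesting depth is one possible alternative. The sketch below is not part of the original answer: `fence_classes()` and `extract_fenced_divs()` are hypothetical helper names, it uses base R only, and it assumes that an opening fence is three or more colons followed by attributes, that a closing fence is three or more colons alone on a line, and that no code block in the document contains such lines.

```r
# Return the class-like tokens on an opening fence line, e.g.
# "::: site-footer--col" or "::: {#id .site-footer--col}".
fence_classes <- function(line) {
  attrs <- sub("^:{3,}[[:space:]]*", "", line)            # drop leading colons
  attrs <- sub("[[:space:]]*:*[[:space:]]*$", "", attrs)  # drop optional trailing colons
  attrs <- gsub("[{}]", "", attrs)                        # drop attribute braces
  tokens <- strsplit(attrs, "[[:space:]]+")[[1]]
  sub("^\\.", "", tokens)                                 # ".cls" becomes "cls"
}

# Collect the inner lines of every fenced div carrying `class`,
# keeping nested divs intact by counting open and close fences.
extract_fenced_divs <- function(lines, class) {
  opens  <- grepl("^:{3,}[[:space:]]*[^:[:space:]]", lines)
  closes <- grepl("^:{3,}[[:space:]]*$", lines)
  out <- list()
  depth <- 0L                 # current nesting depth over all fenced divs
  start <- NA_integer_        # line index of the wanted opening fence
  start_depth <- NA_integer_  # depth at which that fence was opened
  for (i in seq_along(lines)) {
    if (opens[i]) {
      depth <- depth + 1L
      if (is.na(start) && class %in% fence_classes(lines[i])) {
        start <- i
        start_depth <- depth
      }
    } else if (closes[i]) {
      if (!is.na(start) && depth == start_depth) {
        # Keep lines start + 1 .. i - 1 (the fences themselves are dropped)
        out[[length(out) + 1L]] <- lines[seq_len(i - start - 1L) + start]
        start <- NA_integer_
      }
      depth <- max(depth - 1L, 0L)
    }
  }
  out
}

# Toy input: one wanted block with a nested div, one with brace attributes.
md_lines <- c(
  ":::: outer",
  "::: site-footer--col",
  "- a",
  "::: inner",
  "- b",
  ":::",
  ":::",
  "::::",
  "::: {#x .site-footer--col}",
  "- c",
  ":::"
)
blocks <- extract_fenced_divs(md_lines, "site-footer--col")
print(blocks)
```

On the real file, the lines could be read with `readLines("SO_page.md")` and passed in place of `md_lines`.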
