XML parsing returns a string with newlines

Question    Votes: 1    Answers: 2

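I'm trying to parse the XML of a sitemap and then loop over all the addresses to fetch the details of each post, in Go. But I'm getting this strange error:

: first path segment in URL cannot contain colon

Here is the code snippet: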

type SitemapIndex struct {
    Locations []Location `xml:"sitemap"`
}

type Location struct {
    Loc string `xml:"loc"`
}

func (l Location) String() string {
    return fmt.Sprintf(l.Loc)
}

func main() {
    resp, _ := http.Get("https://www.washingtonpost.com/news-sitemaps/index.xml")
    bytes, _ := ioutil.ReadAll(resp.Body)
    var s SitemapIndex
    xml.Unmarshal(bytes, &s)
    for _, Location := range s.Locations {
        fmt.Printf("Location: %s", Location.Loc)
        resp, err := http.Get(Location.Loc)
        fmt.Println("resp", resp)
        fmt.Println("err", err)
    }
}

And the output:

Location: 
https://www.washingtonpost.com/news-sitemaps/politics.xml
resp <nil>
err parse 
https://www.washingtonpost.com/news-sitemaps/politics.xml
: first path segment in URL cannot contain colon
Location: 
https://www.washingtonpost.com/news-sitemaps/opinions.xml
resp <nil>
err parse 
https://www.washingtonpost.com/news-sitemaps/opinions.xml
: first path segment in URL cannot contain colon
...
...
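
The line break right after "Location: " in this output suggests that the value itself begins with whitespace. As a rough sketch (the describe helper and the hard-coded strings are assumptions for illustration, not part of the original program), formatting a value with %q makes such characters visible, and url.Parse shows whether net/url accepts the string:

```go
package main

import (
	"fmt"
	"net/url"
)

// describe is a hypothetical helper: it returns the Go-quoted form of raw
// and reports whether net/url can parse it.
func describe(raw string) (quoted string, ok bool) {
	quoted = fmt.Sprintf("%q", raw) // %q escapes \n, \t and similar characters
	_, err := url.Parse(raw)
	return quoted, err == nil
}

func main() {
	// Assumed value: a URL wrapped in newlines, as the output above hints.
	wrapped := "\nhttps://www.washingtonpost.com/news-sitemaps/politics.xml\n"
	// The same URL written as a clean literal.
	clean := "https://www.washingtonpost.com/news-sitemaps/politics.xml"

	for _, s := range []string{wrapped, clean} {
		quoted, ok := describe(s)
		fmt.Println(quoted, "parses:", ok)
	}
}
```

Printing Location.Loc with %q inside the original loop would give the same kind of view of the real data. The exact error text for a rejected string may differ between Go versions.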

My guess is that Location.Loc comes back with a newline before and after the actual address. For example: \nLocation: https://www.washingtonpost.com/news-sitemaps/politics.xml\n

Because hard-coding the URL works as expected:

for _, Location := range s.Locations {
    fmt.Printf("Location: %s", Location.Loc)
    test := "https://www.washingtonpost.com/news-sitemaps/politics.xml"
    resp, err := http.Get(test)
    fmt.Println("resp", resp)
    fmt.Println("err", err)
}

In the output, you can see that the error is nil:

Location: 
https://www.washingtonpost.com/news-sitemaps/politics.xml
resp &{200 OK 200 HTTP/2.0 2 0 map[Server:[nginx] Arc-Service:[api] Arc-Org-Name:[washpost] Expires:[Sat, 02 Feb 2019 05:32:38 GMT] Content-Security-Policy:[upgrade-insecure-requests] Arc-Deployment:[washpost] Arc-Organization:[washpost] Cache-Control:[private, max-age=60] Arc-Context:[index] Arc-Application:[Feeds] Vary:[Accept-Encoding] Content-Type:[text/xml; charset=utf-8] Arc-Servername:[api.washpost.arcpublishing.com] Arc-Environment:[index] Arc-Org-Env:[washpost] Arc-Route:[/feeds] Date:[Sat, 02 Feb 2019 05:31:38 GMT]] 0xc000112870 -1 [] false true map[] 0xc00017c200 0xc0000ca370}
err <nil>
Location: 
...
...

But I'm very new to Go, so I have no idea what's wrong. Could you tell me where I went wrong?

xml go
2 answers
1
vote

You're right, the problem does come from the newlines. As you can see, you are calling Printf without adding any \n, yet one shows up at the beginning of the output and one at the end.

You can use strings.Trim to remove these newlines. Here is an example that works with the sitemap you are trying to parse. Once the string is trimmed, you will be able to call http.Get on it without any error.

func main() {
    var s SitemapIndex
    // bytes holds the sitemap XML, read as in the question
    xml.Unmarshal(bytes, &s)

    for _, Location := range s.Locations {
        loc := strings.Trim(Location.Loc, "\n")
        fmt.Printf("Location: %s\n", loc)
    }
}

This code prints the locations without any newlines, as expected:

Location: https://www.washingtonpost.com/news-sitemaps/politics.xml
Location: https://www.washingtonpost.com/news-sitemaps/opinions.xml
Location: https://www.washingtonpost.com/news-sitemaps/local.xml
Location: https://www.washingtonpost.com/news-sitemaps/sports.xml
Location: https://www.washingtonpost.com/news-sitemaps/national.xml
Location: https://www.washingtonpost.com/news-sitemaps/world.xml
Location: https://www.washingtonpost.com/news-sitemaps/business.xml
Location: https://www.washingtonpost.com/news-sitemaps/technology.xml
Location: https://www.washingtonpost.com/news-sitemaps/lifestyle.xml
Location: https://www.washingtonpost.com/news-sitemaps/entertainment.xml
Location: https://www.washingtonpost.com/news-sitemaps/goingoutguide.xml

The reason you have these newlines in the Location.Loc field is the XML returned by this URL. The entries have the following form:

<sitemap>
<loc>
https://www.washingtonpost.com/news-sitemaps/goingoutguide.xml
</loc>
</sitemap>

As you can see, there are newlines before and after the content of the loc elements.


1
vote

See the comments embedded in the modified code, which describe and fix the problem.
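
A minimal sketch of the trimming fix discussed in this thread, not the code from either answer. It assumes a hypothetical helper named cleanLocations, an inline sample document with an assumed <sitemapindex> root element, and it swaps in strings.TrimSpace in place of strings.Trim(..., "\n") so that carriage returns, spaces, and tabs around the URL are removed as well. Each cleaned value is also checked with url.Parse before it is returned:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/url"
	"strings"
)

type SitemapIndex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

// sample mimics the layout of the sitemap entries, with the URL on its own
// line inside <loc>. The <sitemapindex> root element name is an assumption.
const sample = `<sitemapindex>
<sitemap>
<loc>
https://www.washingtonpost.com/news-sitemaps/politics.xml
</loc>
</sitemap>
<sitemap>
<loc>
https://www.washingtonpost.com/news-sitemaps/opinions.xml
</loc>
</sitemap>
</sitemapindex>`

// cleanLocations (hypothetical helper) decodes a sitemap index and returns
// every <loc> value with the surrounding whitespace removed.
func cleanLocations(data []byte) ([]string, error) {
	var s SitemapIndex
	if err := xml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(s.Locations))
	for _, l := range s.Locations {
		loc := strings.TrimSpace(l.Loc) // drops the leading and trailing newline
		if _, err := url.Parse(loc); err != nil {
			return nil, fmt.Errorf("invalid loc %q: %w", loc, err)
		}
		locs = append(locs, loc)
	}
	return locs, nil
}

func main() {
	locs, err := cleanLocations([]byte(sample))
	if err != nil {
		fmt.Println("err", err)
		return
	}
	for _, loc := range locs {
		fmt.Printf("Location: %s\n", loc)
	}
}
```

In the original program, the returned strings could then be passed to http.Get in place of Location.Loc.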
