Using xml.etree.ElementTree to process XML with xmlns="http://www.w3.org/2005/Atom"


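I am trying to process data fetched from the web. Once decoded, the raw data is the bytes of an XML file. I have some old code that magically worked, but it has been a while and I am no longer sure what it does.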

import urllib, urllib.request
url = 'http://export.arxiv.org/api/query?search_query=all:electron+OR+query?id_list=hep-th/9112001&start=0&max_results=2'
data = urllib.request.urlopen(url)

It can be decoded with

data.read().decode('utf-8')

The result is a file of the form

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dall%3Aelectron%20OR%20query%3Fid_list%3Dhep-th%2F9112001%26id_list%3D%26start%3D0%26max_results%3D2" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=all:electron OR query?id_list=hep-th/9112001&amp;id_list=&amp;start=0&amp;max_results=2</title>
  <id>http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY</id>
  <updated>2023-03-20T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">194139</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">2</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/cond-mat/0102536v1</id>
    <updated>2001-02-28T20:12:09Z</updated>
    <published>2001-02-28T20:12:09Z</published>
    <title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
    <summary>  The effect of the electron-electron cusp on the convergence of configuration
interaction (CI) wave functions is examined. By analogy with the
pseudopotential approach for electron-ion interactions, an effective
electron-electron interaction is developed which closely reproduces the
scattering of the Coulomb interaction but is smooth and finite at zero
electron-electron separation. The exact many-electron wave function for this
smooth effective interaction has no cusp at zero electron-electron separation.
We perform CI and quantum Monte Carlo calculations for He and Be atoms, both
with the Coulomb electron-electron interaction and with the smooth effective
electron-electron interaction. We find that convergence of the CI expansion of
the wave function for the smooth electron-electron interaction is not
significantly improved compared with that for the divergent Coulomb interaction
for energy differences on the order of 1 mHartree. This shows that, contrary to
popular belief, description of the electron-electron cusp is not a limiting
factor, to within chemical accuracy, for CI calculations.
</summary>
    <author>
      <name>David Prendergast</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>M. Nolan</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
    </author>
    <author>
      <name>Claudia Filippi</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>Stephen Fahy</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>J. C. Greer</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
    </author>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1063/1.1383585</arxiv:doi>
    <link title="doi" href="http://dx.doi.org/10.1063/1.1383585" rel="related"/>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">11 pages, 6 figures, 3 tables, LaTeX209, submitted to The Journal of
  Chemical Physics</arxiv:comment>
    <arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">J. Chem. Phys. 115, 1626 (2001)</arxiv:journal_ref>
    <link href="http://arxiv.org/abs/cond-mat/0102536v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/cond-mat/0102536v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/astro-ph/0608371v1</id>
    <updated>2006-08-17T14:05:46Z</updated>
    <published>2006-08-17T14:05:46Z</published>
    <title>Electron thermal conductivity owing to collisions between degenerate
  electrons</title>
    <summary>  We calculate the thermal conductivity of electrons produced by
electron-electron Coulomb scattering in a strongly degenerate electron gas
taking into account the Landau damping of transverse plasmons. The Landau
damping strongly reduces this conductivity in the domain of ultrarelativistic
electrons at temperatures below the electron plasma temperature. In the inner
crust of a neutron star at temperatures T &lt; 1e7 K this thermal conductivity
completely dominates over the electron conductivity due to electron-ion
(electron-phonon) scattering and becomes competitive with the the electron
conductivity due to scattering of electrons by impurity ions.
</summary>
    <author>
      <name>P. S. Shternin</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Ioffe Physico-Technical Institute</arxiv:affiliation>
    </author>
    <author>
      <name>D. G. Yakovlev</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Ioffe Physico-Technical Institute</arxiv:affiliation>
    </author>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1103/PhysRevD.74.043004</arxiv:doi>
    <link title="doi" href="http://dx.doi.org/10.1103/PhysRevD.74.043004" rel="related"/>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">8 pages, 3 figures</arxiv:comment>
    <arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">Phys.Rev. D74 (2006) 043004</arxiv:journal_ref>
    <link href="http://arxiv.org/abs/astro-ph/0608371v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/astro-ph/0608371v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
    <category term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>

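According to online tutorials, XML is tree-structured data whose content is enclosed in tags such as

<a></a>

much like HTML, which is why

import xml.etree.ElementTree as ET

could be very useful. However, when I use the code from the Python standard library documentation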

#https://docs.python.org/3/library/xml.etree.elementtree.html
root=ET.fromstring(data.read().decode('utf-8'))
for child in root:
    print(child.tag, child.attrib)

the result is not what I wanted:

{http://www.w3.org/2005/Atom}link {'href': 'http://arxiv.org/api/query?search_query%3Dall%3Aelectron%20OR%20query%3Fid_list%3Dhep-th%2F9112001%26id_list%3D%26start%3D0%26max_results%3D100', 'rel': 'self', 'type': 'application/atom+xml'}
{http://www.w3.org/2005/Atom}title {'type': 'html'}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}updated {}
{http://a9.com/-/spec/opensearch/1.1/}totalResults {}
{http://a9.com/-/spec/opensearch/1.1/}startIndex {}
{http://a9.com/-/spec/opensearch/1.1/}itemsPerPage {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}

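This is obviously wrong. I suspect it is because of the extra line

<?xml version="1.0" encoding="UTF-8"?>

before the body

<feed xmlns="http://www.w3.org/2005/Atom">

From some reading, this seems to be a "namespace", but I am not sure how to use it.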
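
For what it is worth, the XML declaration line is unlikely to be the cause: ElementTree writes a namespaced tag as `{namespace-uri}localname`, so `{http://www.w3.org/2005/Atom}entry` is the expected spelling of `<entry>` inside this feed. A minimal sketch of a namespace-aware search follows. It assumes a shortened inline sample instead of the live arXiv response, and the prefix `atom`, the `NS` mapping, and the `clean` helper are names chosen here for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the feed shown above (same root element and namespace;
# the summaries are abbreviated for this sketch).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="html">ArXiv Query</title>
  <entry>
    <title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
    <summary>  The effect of the electron-electron cusp is examined.
</summary>
  </entry>
  <entry>
    <title>Electron thermal conductivity owing to collisions between degenerate
  electrons</title>
    <summary>  We calculate the thermal conductivity of electrons.
</summary>
  </entry>
</feed>"""

# "atom" is an arbitrary local alias; only the URI has to match the document.
NS = {"atom": "http://www.w3.org/2005/Atom"}


def clean(text):
    """Collapse the line breaks and indentation the feed leaves inside text."""
    return " ".join((text or "").split())


root = ET.fromstring(SAMPLE)

titles = []
summaries = []
# findall with a prefix:name path and a namespaces mapping matches direct children.
for entry in root.findall("atom:entry", NS):
    titles.append(clean(entry.findtext("atom:title", namespaces=NS)))
    summaries.append(clean(entry.findtext("atom:summary", namespaces=NS)))

for title, summary in zip(titles, summaries):
    print(title, "->", summary)
```

Because the search starts from each `entry` element, the feed-level `<title>` is not picked up, and each title stays paired with the summary of the same article.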

Could you explain what I should do to search for the title of an article and the summary of that same article?

For example,

<title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
<summary>  The effect of the elec... </summary>

These are clearly under the same node. How can I identify these nodes and their fields and collect them into two lists

[node1, node2, ..., nodeN] #This is to identify and label the <entry></entry>
[id, updated, title,...,category]

so that some function can be used to extract the content, i.e.

ET.{somefunction}(node1,id)=http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY
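
One way the two lists and the lookup could be sketched is shown below. ElementTree has no single function with exactly that shape, so `get_field` and `local_name` are hypothetical helpers written for this example, and the inline sample is a trimmed copy of the feed rather than the live response.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed above, kept small for illustration.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY</id>
  <entry>
    <id>http://arxiv.org/abs/cond-mat/0102536v1</id>
    <updated>2001-02-28T20:12:09Z</updated>
    <title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1063/1.1383585</arxiv:doi>
    <category term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/astro-ph/0608371v1</id>
    <updated>2006-08-17T14:05:46Z</updated>
    <title>Electron thermal conductivity</title>
    <category term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

# Prefixes are local aliases chosen here; the URIs come from the document.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def local_name(tag):
    """Strip the '{uri}' part from an ElementTree tag (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


def get_field(node, name, prefix="atom"):
    """Return the text of the first child called name, or None (hypothetical helper)."""
    return node.findtext(f"{prefix}:{name}", default=None, namespaces=NS)


root = ET.fromstring(SAMPLE)

# First list: the <entry> elements themselves.
nodes = root.findall("atom:entry", NS)

# Second list: every distinct child name seen in any entry, in first-seen order.
fields = list(dict.fromkeys(local_name(child.tag) for node in nodes for child in node))

print(fields)
print(get_field(nodes[0], "id"))
print(get_field(nodes[0], "doi", prefix="arxiv"))
# Values stored in attributes (such as the category term) are read with .get().
print(nodes[0].find("atom:category", NS).get("term"))
```

Elements from the arXiv extension namespace need their own prefix, which is why `get_field` takes a `prefix` argument. An entry without the requested child yields `None` rather than raising.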

Since the site uses http://www.w3.org/2005/Atom, can this "namespace" provide some kind of convention or shortcut?
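
As far as I can tell, the namespace URI is only an identifier and ElementTree does not load any schema from it, so the shortcuts are purely about how the search path is written. The sketch below compares three spellings on a tiny inline sample: the full `{uri}name` form, the `{*}` wildcard, and an empty-string prefix as the default namespace. The last two assume Python 3.8 or newer, and `ATOM` is a constant named here for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample with the same default namespace as the feed above.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First article</title></entry>
  <entry><title>Second article</title></entry>
</feed>"""

ATOM = "http://www.w3.org/2005/Atom"
root = ET.fromstring(SAMPLE)

# 1. Full "{uri}name" form: works on any Python 3 version, but is verbose.
titles_full = [e.findtext(f"{{{ATOM}}}title") for e in root.findall(f"{{{ATOM}}}entry")]

# 2. "{*}" wildcard: matches the local name in any namespace (Python 3.8+).
titles_star = [e.findtext("{*}title") for e in root.findall("{*}entry")]

# 3. Empty-string prefix: unprefixed names in the path are looked up in the
#    given namespace (Python 3.8+).
DEFAULT = {"": ATOM}
titles_default = [
    e.findtext("title", namespaces=DEFAULT) for e in root.findall("entry", DEFAULT)
]

print(titles_full, titles_star, titles_default)
```

The wildcard is the shortest to type but would also match an `entry` element from an unrelated namespace, so the explicit forms are the safer choice when a feed mixes vocabularies.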

python-3.x xml elementtree