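I am querying DBpedia's Virtuoso endpoint through RDFLib to get all entities of type dbo:Politician that have no other occupation. I noticed that the results I get when paging through the query with an increasing OFFSET and a LIMIT of 10000 do not include all results.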

    from rdflib import Graph
    from rdflib.plugins.sparql import prepareQuery
    from rdflib.plugins.sparql.results.jsonresults import JSONResultParser


    def get_persons_for_occupation(occupation_URI):
        offset = 0
        limit = 10000  # DBPedia's Virtuoso SPARQL limit
        persons = []
        while True:
            g = Graph()
            try:
                query = """
                PREFIX dbo: <http://dbpedia.org/ontology/>
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                PREFIX foaf: <http://xmlns.com/foaf/0.1/>
                PREFIX owl: <http://www.w3.org/2002/07/owl#>
                SELECT *
                WHERE {
                    SERVICE <https://dbpedia.org/sparql> {
                        SELECT ?person_with_occupation ?wikipedia_URL ?wikidata_URI
                        WHERE {
                            ?person_with_occupation rdf:type/rdfs:subClassOf* %s.
                            # TODO for debugging
                            # FILTER(regex(?person_with_occupation, "Trump"))
                            # Discard persons with an occupation (class) different than ours
                            FILTER NOT EXISTS {
                                ?person_with_occupation a ?other_occupation.
                                # That is not the occupation itself
                                FILTER ((?other_occupation != %s)
                                    &&
                                    # That is not a subclass of ours (* allows for indirect subclasses through the type hierarchy)
                                    NOT EXISTS { ?other_occupation rdfs:subClassOf* %s }
                                    &&
                                    # And that is a subclass of dbo:Person
                                    EXISTS { ?other_occupation rdfs:subClassOf dbo:Person })
                            }
                            # They have a Wikipedia article
                            ?person_with_occupation foaf:isPrimaryTopicOf ?wikipedia_URL.
                            # And also an equivalent URI in Wikidata (in order to get its PageRank)
                            ?person_with_occupation owl:sameAs ?wikidata_URI.
                            FILTER (STRSTARTS(STR(?wikidata_URI), "http://www.wikidata.org"))
                        }
                        LIMIT 10000
                        OFFSET %s
                    }
                }
                """ % (occupation_URI, occupation_URI, occupation_URI, offset)
                qres = g.query(prepareQuery(query))
            # NOTE: the import for SPARQLResult is not shown in the original snippet
            except SPARQLResult as e:
                # Received correct but partial results (on the final offset),
                # we don't want it to be an exception
                if e.response.status_code == 206:
                    qres = JSONResultParser().parse(e.response.content)
                else:
                    raise
            n_results = len(qres)
            if n_results == 0:
                break
            # for row in qres:
            #     do stuff
            offset += limit
        return persons

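where

    occupation_URI = "dbo:Politician"

When collecting all results, I noticed that I get 27792 entities, but if I ask for a COUNT there are 74128 of them. In particular, some entities such as Donald Trump are not returned, but if I FILTER for them, they are returned. Is there a hard limit that I am not aware of?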
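This is probably caused by the "Anytime Query" feature/odd behaviour/bug, which lets Virtuoso return incomplete results without telling the client in a standards-compliant way. This even happens inside aggregates, which may explain the differing COUNT results. (Details can be found in the long and inconclusive discussion in openlink/virtuoso-opensource#112.) Clients can recognise incomplete results by checking for the HTTP response header X-SQL-State: S1TAT. (But which client already does that?)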
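
The header check can be sketched with nothing but the standard library. This is a rough illustration, not a tested client: `is_partial_result` and `run_query` are hypothetical helper names, the query is sent directly over HTTP rather than through RDFLib, and the only behaviour assumed about Virtuoso is the `X-SQL-State: S1TAT` header mentioned above.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # same endpoint as in the question


def is_partial_result(headers):
    """Return True if the response headers carry X-SQL-State: S1TAT.

    `headers` is any mapping-like object with .items(); header names are
    compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "x-sql-state" and value.strip().upper() == "S1TAT":
            return True
    return False


def run_query(query, endpoint=ENDPOINT):
    """Send a SPARQL query via HTTP GET; return (raw_body, partial_flag)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        body = response.read()
        partial = is_partial_result(response.headers)
    return body, partial
```

A caller could then treat `partial == True` as a signal to retry the page, shrink the page size, or at least log that the page may be incomplete, instead of silently accepting it.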
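The paging loop can also advance by the number of rows actually received rather than by the fixed limit:

    offset += n_results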
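
A minimal sketch of a paging loop that advances the offset by the number of rows received, so that a short page does not cause rows to be skipped. `collect_all` and `fetch_page(offset, limit)` are hypothetical names; `fetch_page` stands in for whatever function runs the query for one page and returns its rows as a list.

```python
def collect_all(fetch_page, limit=10000):
    """Collect rows page by page, advancing by the rows actually returned."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            # An empty page is taken to mean that no rows are left
            break
        rows.extend(page)
        # Advance by len(page), not by limit: the server may return fewer
        # rows than requested even when more rows exist
        offset += len(page)
    return rows
```

OFFSET-based paging is only predictable when the query fixes the row order with ORDER BY, so a loop like this probably needs that clause in the paged query as well. It also cannot detect a page that is incomplete in some other way than being short, which is where the header check comes in.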