Strange behavior of LIMIT and OFFSET when querying DBPedia

Question (votes: 0, answers: 1)

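I am querying DBPedia's Virtuoso endpoint through RDFLib in order to get all entities of type dbo:Politician that have no other occupation. I have noticed that the results I get when running the query repeatedly, incrementing OFFSET by LIMIT (10000) each time, do not contain all the results: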

def get_persons_for_occupation(occupation_URI):
    offset = 0
    limit = 10000 # DBPedia's Virtuoso SPARQL limit
    persons = []

    while True:
        g = Graph()

        try:
            query = """
                PREFIX dbo: <http://dbpedia.org/ontology/>
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                PREFIX foaf: <http://xmlns.com/foaf/0.1/>
                PREFIX owl: <http://www.w3.org/2002/07/owl#>

                SELECT *
                WHERE {
                    SERVICE <https://dbpedia.org/sparql> {
                        SELECT ?person_with_occupation ?wikipedia_URL ?wikidata_URI
                        WHERE {
                            ?person_with_occupation rdf:type/rdfs:subClassOf* %s.
                            # TODO for debugging
                            # FILTER(regex(?person_with_occupation, "Trump"))

                            # Discard persons with an occupation (class) different than ours
                            FILTER NOT EXISTS {
                                ?person_with_occupation a ?other_occupation.
                                # That is not the occupation itself
                                FILTER ((?other_occupation != %s)
                                        &&
                                        # That is not a subclass of ours (* allows for indirect subclasses through the type hierarchy)
                                        NOT EXISTS { ?other_occupation rdfs:subClassOf* %s }
                                        &&
                                        # And that is a subclass of dbo:Person
                                        EXISTS { ?other_occupation rdfs:subClassOf dbo:Person })
                            }

                            # They have a Wikipedia article
                            ?person_with_occupation foaf:isPrimaryTopicOf ?wikipedia_URL.

                            # And also an equivalent URI in Wikidata (in order to get its PageRank)
                            ?person_with_occupation owl:sameAs ?wikidata_URI.
                            FILTER (STRSTARTS(STR(?wikidata_URI), "http://www.wikidata.org"))
                        }
                        LIMIT 10000
                        OFFSET %s
                    }
                }
                """ % (occupation_URI, occupation_URI, occupation_URI, offset)

            qres = g.query(prepareQuery(query))

        except SPARQLResult as e:
            # Received correct but partial results (on the final offset), 
            # we don't want it to be an exception
            if e.response.status_code == 206:
                qres = JSONResultParser().parse(e.response.content)
            else:
                raise

        n_results = len(qres)
        if n_results == 0:
            break

        #for row in qres:
            #do stuff
        
        offset += limit

where

occupation_URI = "dbo:Politician"

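When collecting all the results, I get 27792 entities, but if I ask for a COUNT there are 74128. In particular, some entities such as Donald Trump are not returned by the paged query, yet they are returned if I FILTER for them. Is there a hard limit that I am not aware of?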

sparql rdf dbpedia virtuoso rdflib
1 Answer
0 votes

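This is probably caused by the "Anytime Query" feature/strange behavior/bug, which lets Virtuoso return incomplete results without telling the client in a standards-compliant way. This even happens inside aggregates, which might explain the differing COUNT results. (Details can be found in the long and inconclusive discussion at openlink/virtuoso-opensource#112.) A client can recognize incomplete results by checking for the HTTP response header X-SQL-State: S1TAT. (But which client already does that?)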
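As a rough illustration of that header check, here is a minimal sketch using only the Python standard library. The helper names `is_partial_result` and `run_query` are hypothetical and not part of RDFLib or Virtuoso; the only thing taken from the answer is the header name and value (X-SQL-State: S1TAT). It assumes the endpoint accepts a GET request with a `query` parameter, as SPARQL endpoints usually do.

```python
import urllib.parse
import urllib.request
from typing import Mapping, Tuple

# Header value that, per the answer above, marks an incomplete result.
PARTIAL_STATE = "S1TAT"


def is_partial_result(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers carry X-SQL-State: S1TAT.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers.items():
        if name.lower() == "x-sql-state" and value.strip() == PARTIAL_STATE:
            return True
    return False


def run_query(endpoint: str, query: str) -> Tuple[bytes, bool]:
    """Hypothetical helper: send a SPARQL query over HTTP GET.

    Returns the raw response body and a flag telling whether the
    server marked the result as partial.
    """
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        body = response.read()
        partial = is_partial_result(dict(response.headers.items()))
    return body, partial
```

Calling `run_query("https://dbpedia.org/sparql", some_query)` would then return the body together with the partial flag, so the caller can decide whether to retry or keep paging. Note that a query wrapped in SERVICE and evaluated by RDFLib does not expose these headers to your code, so this check only applies when you talk to the endpoint directly.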

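In your case, I would simply change the last line of your code so that the offset is increased by the actual number of rows (bindings) received: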

offset += n_results
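To show how the paging loop behaves with that change, here is a minimal sketch. `fetch_all` and its `fetch_page` argument are hypothetical names: `fetch_page(offset, limit)` stands in for whatever runs the query with the given OFFSET and LIMIT and returns the rows. The sketch assumes that an empty page means there is nothing left, as in the original loop.

```python
from typing import Callable, List, Sequence


def fetch_all(
    fetch_page: Callable[[int, int], Sequence[dict]],
    limit: int = 10000,
) -> List[dict]:
    """Collect all rows, advancing by the number of rows actually received.

    If the server returns fewer rows than `limit` (a partial result),
    the next request starts right after the last row we got, instead
    of skipping ahead by a full `limit`.
    """
    offset = 0
    rows: List[dict] = []
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break  # assumption: an empty page means no more results
        rows.extend(page)
        offset += len(page)  # not `offset += limit`
    return rows
```

With `offset += limit`, every short page silently drops the rows between the end of that page and the next offset, which would be consistent with the gap between 27792 and 74128. Paging with LIMIT and OFFSET is also only predictable when the query has an ORDER BY, so it may be worth adding one to the inner SELECT.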
