Uploading large RDF data to an Apache Jena Fuseki server with Python: "form too large" error

Question

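I am trying to upload RDF data stored in a .ttl file on my computer to an Apache Jena Fuseki server. I run Apache Jena Fuseki as a standalone server, following the guidance on the Apache Jena Fuseki page (https://jena.apache.org/documentation/fuseki2/fuseki-webapp.html#fuseki-web-application) and an online article (https://medium.com/@fadirra/setting-up-jena-fuseki-with-update-in-windows-10-2c8a2802ee8f). The server appears to be running when I visit localhost:3030. The code I wrote to upload the data seems to work fine for smaller files. For large files, however, the data is not uploaded. Looking at the server logs, I found the following error: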

Caused by: java.lang.IllegalStateException: form too large > 20000000
        at org.eclipse.jetty.server.FormFields.checkMaxLength(FormFields.java:318) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.FormFields.parse(FormFields.java:307) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.FormFields.parse(FormFields.java:39) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.io.content.ContentSourceCompletableFuture.parse(ContentSourceCompletableFuture.java:104) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1212) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.handler.ContextRequest$OnContextDemand.run(ContextRequest.java:74) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.util.thread.SerializedInvoker$Link.run(SerializedInvoker.java:191) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.internal.HttpConnection$DemandContentCallback.succeeded(HttpConnection.java:679) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[fuseki-server.jar:5.0.0]

Here is the code I use to upload the RDF data:

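from SPARQLWrapper import SPARQLWrapper, POST

input_location = "C:/......../Added_Triples.ttl"

# Read the whole Turtle file into memory
with open(input_location, 'r') as f:
    content = f.read()

#print(type(content))
# Drop the @prefix lines; the prefixes are declared again in the SPARQL update below
rdf_string_no_prefixes = "\n".join(line for line in content.split("\n") if not line.startswith("@prefix"))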

update_query = """ 
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX CSRO: <http://www.semanticweb.org/aagr657/ontologies/2023/9/CraneSpaceRepresentationOntology#>
    PREFIX LinkOnt: <http://purl.org/ConstructLinkOnt/LinkOnt#>
    PREFIX bot: <https://w3id.org/bot#>
    PREFIX expr: <https://w3id.org/express#>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    PREFIX geom: <http://rdf.bg/geometry.ttl#>
    PREFIX ifc: <https://standards.buildingsmart.org/IFC/DEV/IFC2X3/TC1/OWL>
    PREFIX inst: <https://www.ugent.be/myAwesomeFirstBIMProject#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX sf: <http://www.opengis.net/ont/sf#>
    PREFIX omg: <https://w3id.org/omg#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX lbd: <https://linkedbuildingdata.org/LBD#>
    PREFIX props: <http://lbd.arch.rwth-aachen.de/props#>
    PREFIX unit: <http://qudt.org/vocab/unit/>
    PREFIX IFC4-PSD: <https://www.linkedbuildingdata.net/IFC4-PSD#>
    PREFIX smls: <https://w3id.org/def/smls-owl#>
    PREFIX fog: <https://w3id.org/fog#>
    PREFIX cc: <http://creativecommons.org/ns#>
    PREFIX dce: <http://purl.org/dc/elements/1.1/>
    PREFIX express: <https://w3id.org/express#>
    PREFIX list: <https://w3id.org/list#>
    PREFIX vann: <http://purl.org/vocab/vann/>
    PREFIX expr: <https://w3id.org/express#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX : <https://standards.buildingsmart.org/IFC/DEV/IFC2x3/TC1/OWL#>

    INSERT DATA {
        %s
    }
    """ % (rdf_string_no_prefixes)
sparql = SPARQLWrapper("http://localhost:3030/your-dataset/update")
sparql.setMethod(POST)
sparql.setQuery(update_query)

# Execute the SPARQL Update query
sparql.query()

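I read some questions on Stack Overflow about similar errors on other servers, and they suggested editing the jetty.xml file. In my case, however, I cannot find any such file on my computer. As mentioned above, the code works fine for smaller files; the problem only occurs with larger ones.

For now, I am splitting the larger RDF file into smaller chunks and uploading them separately. However, this takes a lot of time, because the time needed for chunking keeps growing. I therefore do not want to use this as a solution.

Any help on how to solve this without chunking would be greatly appreciated. Ideally, I would like to upload the whole graph file in one go, in the shortest possible time.

I also tried the requests.post method with the following code: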

```
import requests

file_location = "C:/.........../Added_Triples.ttl"
sparql_endpoint = "http://localhost:3030/construction_dataset_2/update"  # Adjust the URL accordingly

headers = {'Content-Type': 'text/turtle;charset=utf-8'}
data = open(file_location, 'r').read()
response = requests.post(sparql_endpoint, headers=headers, data=data)
```

The error I am getting is as follows:
```
Exception has occurred: ConnectionError
('Connection aborted.', ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
```

Also, the server logs show the following:
```
11:40:03 INFO  Fuseki          :: [23] 415 Unsupported Media Type (0 ms)
11:40:03 INFO  Fuseki          :: [24] POST http://localhost:3030/construction_dataset_2/update
11:40:03 INFO  Fuseki          :: [24] 415 Unsupported Media Type (0 ms)
11:42:17 INFO  Fuseki          :: [25] POST http://localhost:3030/construction_dataset_2/update
```
Answer

Instead of using a form and INSERT DATA (here via SPARQLWrapper), try POSTing the file itself, with the Content-Type header set appropriately.
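A minimal sketch of that suggestion, using only the Python standard library. It assumes the dataset URL accepts a direct Turtle POST, as the curl command in this answer does. The helper names `build_turtle_upload` and `upload_turtle` and the URL `http://localhost:3030/ds` are illustrative, not part of Fuseki or of the question's code. The open file is passed as the request body, so it is read in blocks rather than loaded into one large string or wrapped in a form.

```python
import os
import urllib.request


def build_turtle_upload(ttl_path, dataset_url):
    """Build a POST request whose body is the Turtle file itself, not a form."""
    body = open(ttl_path, "rb")  # binary mode: the bytes are sent unchanged
    return urllib.request.Request(
        dataset_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "text/turtle; charset=utf-8",
            # An explicit length lets the body be sent without chunked encoding.
            "Content-Length": str(os.path.getsize(ttl_path)),
        },
    )


def upload_turtle(ttl_path, dataset_url):
    """Send the file and return the HTTP status code (raises on HTTP errors)."""
    request = build_turtle_upload(ttl_path, dataset_url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    finally:
        request.data.close()  # always release the file handle
```

With a running server, a call might look like `upload_turtle("Added_Triples.ttl", "http://localhost:3030/ds")`, with the dataset name adjusted to your own.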

Or use an external process:

curl -XPOST -T DATA.ttl --header "Content-type: text/turtle" http://localhost:3030/ds
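If the upload has to be triggered from a Python script, the same curl command can be started with `subprocess`. This is only a sketch: it assumes `curl` is installed and on the PATH, and the function names are hypothetical.

```python
import subprocess


def curl_upload_command(ttl_path, dataset_url):
    """Return the curl argument list from the answer, with file and URL filled in."""
    return [
        "curl", "-XPOST", "-T", ttl_path,
        "--header", "Content-type: text/turtle",
        dataset_url,
    ]


def run_curl_upload(ttl_path, dataset_url):
    """Run curl and return whatever the server printed in response."""
    completed = subprocess.run(
        curl_upload_command(ttl_path, dataset_url),
        capture_output=True,
        text=True,
        check=True,  # raises only if curl itself exits non-zero
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with Windows paths. Note that curl exits with status 0 on most HTTP error responses unless it is given `--fail`, so the returned output is worth inspecting.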

Or load the database (TDB2) before starting the server. That way you can use the TDB2 bulk loader.
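A sketch of that offline route, assuming the Apache Jena command-line tools are installed and `tdb2.tdbloader` is on the PATH (on Windows the scripts are shipped as batch files, so the executable name may differ). The database directory `DB` and the helper names are placeholders. As the answer says, the load runs before the server is started; Fuseki is then started against the same database directory.

```python
import subprocess


def tdb2_load_command(db_dir, ttl_path):
    """Return the argument list for Jena's TDB2 bulk loader."""
    return ["tdb2.tdbloader", "--loc", db_dir, ttl_path]


def bulk_load(db_dir, ttl_path):
    """Load ttl_path into the TDB2 database at db_dir; raises if the loader fails."""
    subprocess.run(tdb2_load_command(db_dir, ttl_path), check=True)
```

For example, `bulk_load("DB", "Added_Triples.ttl")` would build the database once, without sending the data through HTTP at all.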
