I am looking for a way to compute cosine similarity using SPARQL.
The vectors are described in the RDF data as follows:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<http://example.org/london> rdfs:label "London" ;
rdf:_1 0.011788688 ;
rdf:_2 0.006153286 ;
rdf:_3 -0.0034582422 ;
...
rdf:_1536 -0.020006698 .
<http://example.org/united-kingdom> rdfs:label "United Kingdom" ;
rdf:_1 0.007484864 ;
rdf:_2 -0.022806747 ;
rdf:_3 -0.010839927 ;
...
rdf:_1536 0.001866414 .
<http://example.org/united-states> rdfs:label "United States of America" ;
rdf:_1 0.0070878486 ;
rdf:_2 -0.02133514 ;
rdf:_3 -0.000050822895 ;
...
rdf:_1536 -0.012027864 .
My query looks like this:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX afn: <http://jena.apache.org/ARQ/function#>
SELECT ?embed1 ?embed2
       ((SUM(?dot) / (afn:sqrt(SUM(?v1_squared)) * afn:sqrt(SUM(?v2_squared)))) AS ?similarity)
WHERE {
  ?embed1 ?p ?v1 .
  ?embed2 ?p ?v2 .
  FILTER (STRSTARTS(STR(?p), STR(rdf:_)))
  BIND(?v1 * ?v1 AS ?v1_squared)
  BIND(?v2 * ?v2 AS ?v2_squared)
  BIND(?v1 * ?v2 AS ?dot)
}
GROUP BY ?embed1 ?embed2
ORDER BY DESC(?similarity)
It requires the afn:sqrt function from Jena's ARQ library, because standard SPARQL 1.1 does not provide a sqrt function.
It seems to work, but it may perform poorly on large data:
----------------------------------------------------------------------------------------------------
| embed1 | embed2 | similarity |
====================================================================================================
| <http://example.org/united-kingdom> | <http://example.org/united-kingdom> | 1.0000000000000002e0 |
| <http://example.org/london> | <http://example.org/london> | 1.0e0 |
| <http://example.org/united-states> | <http://example.org/united-states> | 1.0e0 |
| <http://example.org/united-states> | <http://example.org/united-kingdom> | 0.8804311835944831e0 |
| <http://example.org/united-kingdom> | <http://example.org/united-states> | 0.8804311835944831e0 |
| <http://example.org/london> | <http://example.org/united-kingdom> | 0.8510995877458968e0 |
| <http://example.org/united-kingdom> | <http://example.org/london> | 0.8510995877458968e0 |
| <http://example.org/london> | <http://example.org/united-states> | 0.7855264600385297e0 |
| <http://example.org/united-states> | <http://example.org/london> | 0.7855264600385297e0 |
----------------------------------------------------------------------------------------------------
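To cross-check results like these outside the triple store, here is a minimal Python sketch of the same computation the query performs: join two subjects on a shared component, then aggregate the dot product and the two sums of squares per pair. The subjects "a", "b", "c" and their 3-dimensional values are made-up toy data, not the real 1536-dimensional embeddings, and the function name cosine_similarities is just an illustrative choice.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (subject, component index, value) rows standing in for
# the rdf:_1 ... rdf:_n statements. The values are invented for illustration.
triples = [
    ("a", 1, 1.0), ("a", 2, 2.0), ("a", 3, 2.0),
    ("b", 1, 2.0), ("b", 2, 4.0), ("b", 3, 4.0),
    ("c", 1, 2.0), ("c", 2, -1.0), ("c", 3, 0.0),
]


def cosine_similarities(rows):
    """Mirror the query: join on the shared component, then aggregate per pair."""
    # Group values by component, like matching both triple patterns on ?p.
    by_component = defaultdict(list)
    for subject, index, value in rows:
        by_component[index].append((subject, value))

    # Per (subject1, subject2) pair: [dot product, sum of v1^2, sum of v2^2].
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for values in by_component.values():
        for s1, v1 in values:
            for s2, v2 in values:
                acc = sums[(s1, s2)]
                acc[0] += v1 * v2
                acc[1] += v1 * v1
                acc[2] += v2 * v2

    return {
        pair: dot / (math.sqrt(sq1) * math.sqrt(sq2))
        for pair, (dot, sq1, sq2) in sums.items()
    }


similarities = cosine_similarities(triples)
for pair, value in sorted(similarities.items(), key=lambda item: -item[1]):
    print(pair, value)
```

Like the query, this sketch evaluates every ordered pair, including each subject against itself, so n subjects with d components cost on the order of n² × d multiplications. That is consistent with the concern about larger data sets.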