我有以下海王星查询
g.V().hasLabel('User')
.has('user_id', 1004)
.repeat(both('USES_UPI','USES_ACCOUNT','USES_HARDWARE_ID','USES_GAID','HAS_COOKIES').simplePath().dedup())
.times(3)
.hasLabel('Gaid')
.dedup()
.count()
而且花费太多时间。
我尝试分析查询
Original Traversal
==================
[GraphStep(vertex,[]), HasStep([~label.eq(User), user_id.eq(159017810)]), RepeatStep([VertexStep(BOTH,[USES_UPI, USES_ACCOUNT, USES_HARDWARE_ID, USES_GAID, HAS_COOKIES],vertex), PathFilterStep(simple,null,null), DedupGlobalStep(null,null), RepeatEndStep],until(loops(3)),emit(false)), HasStep([~label.eq(Gaid)]), DedupGlobalStep(null,null), CountGlobalStep]
Optimized Traversal
===================
Neptune steps:
[
NeptuneGraphQueryStep(Vertex) {
JoinGroupNode {
PatternNode[(?1, <user_id>, ?9, ?) . project distinct ?1 . ContainsFilter(?9 in (159017810^^<INT>, 159017810^^<LONG>, 1.59017808E8^^<FLOAT>, 1.5901781E8^^<DOUBLE>)) .], {estimatedCardinality=1, expectedTotalOutput=1, indexTime=0, joinTime=0, numSearches=1, actualTotalOutput=1}
PatternNode[(?1, <~label>, ?2=<User>, <~>) . project ask .], {estimatedCardinality=1327714, expectedTotalOutput=679, actualTotalOutput=379683, indexTime=0, joinTime=0, numSearches=1}
RepeatNode {
Repeat {
JoinGroupNode {
UnionNode {
PatternNode[(?3, ?6, ?4, ?7) . project ?3,?4 . IsEdgeIdFilter(?7) . ContainsFilter(?6 in (<USES_UPI>, <USES_ACCOUNT>, <USES_HARDWARE_ID>, <USES_GAID>, <HAS_COOKIES>)) .], {cacheJoin=true, estimatedCardinality=1923966, indexTime=354, joinTime=26212, numSearches=379683}
PatternNode[(?4, ?6, ?3, ?7) . project ?3,?4 . IsEdgeIdFilter(?7) . ContainsFilter(?6 in (<USES_UPI>, <USES_ACCOUNT>, <USES_HARDWARE_ID>, <USES_GAID>, <HAS_COOKIES>)) .], {cacheJoin=true, estimatedCardinality=1923966, indexTime=356, joinTime=19633, numSearches=379683}
}, annotations={estimatedCardinality=3847932}
SimplePathFilter(?1, ?4)) .
}
}
LoopsCondition {
LoopsFilter(?3,eq(3))
}
}, annotations={emitFirst=false, untilFirst=false, repeatMode=BFS, dedup=true}
}, annotations={path=[Vertex(?1):GraphStep, Repeat[̶V̶e̶r̶t̶e̶x̶(̶?̶3̶)̶:̶G̶r̶a̶p̶h̶S̶t̶e̶p̶, Vertex(?4):VertexStep, ̶V̶e̶r̶t̶e̶x̶(̶?̶8̶)̶:̶V̶e̶r̶t̶e̶x̶S̶t̶e̶p̶]], joinStats=true, optimizationTime=2, maxVarId=10, executionTime=107826}
},
NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: NeptuneHasStep([~label.eq(Gaid)]),
Neptune steps:
[
NeptuneMemoryTrackerStep
]
+ not converted into Neptune steps: DedupGlobalStep(null,null),CountGlobalStep,
WARNING: >> [NeptuneHasStep([~label.eq(Gaid)]), DedupGlobalStep(null,null)] << (or one of the children for each step) is not supported natively yet
Runtime (ms)
============
Query Execution: 107828.341
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneGraphQueryStep(Vertex) 743022 743022 52892.767 49.06
NeptuneTraverserConverterStep 743022 743022 8142.297 7.55
NeptuneHasStep([~label.eq(Gaid)]) 4138 4138 46695.267 43.32
DedupGlobalStep(null,null) 4138 4138 48.927 0.05
CountGlobalStep 1 1 22.215 0.02
>TOTAL - - 107801.475 -
Repeat Metrics
==============
Iteration Visited Output Until Emit Next
------------------------------------------------------
0 1 0 0 0 1
1 3 0 0 0 3
2 379679 0 0 0 379679
3 743022 743022 743022 0 0
------------------------------------------------------
1122705 743022 743022 0 379683
Predicates
==========
# of predicates: 26
Results
=======
Count: 1
Index Operations
================
Query execution:
# of statement index ops: 1,502,390
# of unique statement index ops: 1,502,390
Duplication ratio: 1.0
# of terms materialized: 162
现在需要几秒钟的时间,可以以某种方式优化吗?
所有用户都有唯一的 ID 吗?如果是这样,您可能可以通过将用户 ID 存储为实际顶点 ID 来加速查询的第一部分,而不必费心进行标签检查和查找属性值。您使用数字作为 ID 的事实也会导致延迟增加,因为 Neptune 正在经历应用类型提升作为查询执行的一部分的过程。在本例中,只需使用
user_1004
之类的内容作为用户顶点的 ID。然后你可以这样做:
g.V('user_1004').
repeat(
both(
'USES_UPI',
'USES_ACCOUNT',
'USES_HARDWARE_ID',
'USES_GAID',
'HAS_COOKIES').
simplePath().
dedup()).
times(3).
hasLabel('Gaid').
dedup().
count()
在第二跳之后,图中还出现了大量的扇出。如果图中有超级节点,您可能需要派生一种方法来检测这些超级节点并缓解它们。在大多数 ID 图模式中,超级节点提供的价值很少。因此,在节点上放置递增计数器属性来表示其扇出可以更轻松地避免遍历超级节点。然后在
repeat()
中,您可以添加类似 has('associated_ids', lt(10))
的内容来过滤掉具有更多您关心的下游连接的任何节点。