SQL中图的连接组件数

问题描述 投票:1回答:2

我的[[PostgreSQL数据库中有一个图,为便于示例,我们将其定义如下:

CREATE TABLE nodes (node_id INTEGER); CREATE TABLE roads (road_id INTEGER, nodes INTEGER[]); INSERT INTO nodes VALUES (1), (2), (3), (4), (5); INSERT INTO roads VALUES (1, {1, 2}), (2, {3, 4}));
我想创建返回图的connected components数的SQL查询,在此示例中,该数是

3

,因为节点1/2被连接,3/4也被连接,而5没有被连接连接到任何东西。我曾尝试在SQL中搜索find&union实现,但无济于事,然后我求助于CTEs,但我不能自己做,我在想这样的事情:

WITH RECURSIVE cc(iterator_id, node_id, rank, iterator) AS ( SELECT row_number() OVER(), n.node_id, row_number() OVER (), 1 FROM nodes AS n UNION ALL # Something here that does the magic ) SELECT COUNT(DISTINCT rank) AS no_of_cc FROM cc, (SELECT COUNT(*) FROM nodes) AS last_iterator_id WHERE iterator = last_iterator_id;

在每次迭代中,我们更新iterator_id <=迭代器的行的等级。我们迭代直到iterator等于最大iterator_id但我想不到递归部分。

您能帮我找到连接的组件数量吗?

sql postgresql graph common-table-expression recursive-query
2个回答
1
投票
你现在是什么?尽管我向您推荐了使用PL / Python编写存储过程的建议,但后来我还是决定编写该单SQL查询只是为了好玩。这就是我所做的。我使用了RECURSIVE CTE

WITH RECURSIVE graph_search(node_id, connected_to, path, cycle) AS ( SELECT node_id, connected_to, ARRAY[node_id], false FROM paths UNION SELECT p.node_id, p.connected_to, gs.path || p.node_id, p.node_id=ANY(gs.path) FROM graph_search gs JOIN paths p ON gs.connected_to = p.node_id AND NOT gs.cycle ), paths AS ( SELECT node_id, connected_to FROM ( SELECT n.node_id, unnest(r.nodes) AS connected_to FROM nodes n JOIN roads r ON n.node_id = ANY(r.nodes) ) sub WHERE node_id <> connected_to ) SELECT count(DISTINCT component) FROM ( SELECT node_id, array_agg(DISTINCT reachable_node ORDER BY reachable_node) as component FROM ( SELECT node_id, unnest(path) as reachable_node from graph_search ) sub GROUP BY node_id UNION ALL /*need to append lonely nodes - they are components for themselves*/ SELECT node_id, ARRAY[node_id] FROM nodes WHERE node_id NOT IN (SELECT node_id from paths) ) sub;

    首先,我需要图形本身的不同表示形式。名为CTE的普通paths创建带有连接节点对的两列表。
  • 然后我稍微修改了example from PostgreSQL manual,所以我有了节点列表以及从中可以访问的每个节点。
  • 聚合为我提供了图形的组成部分。
  • 最后,我计算不同的组成部分。

0
投票
如果节点数太大,上述解决方案将不起作用。

最有效的解决方案(只要您有足够的RAM来读取所有数据)是使用C或C ++之类的语言将数据读取到内存中并在其中执行计算。

但是如果数据量太大而您别无选择,那么您可以这样进行:

((plpgssql实现,假设我们有表路(node1,node2))

CREATE TABLE node AS SELECT DISTINCT node1 AS id, node1 AS color FROM roads CREATE OR REPLACE FUNCTION merge_node() RETURNS VOID AS $$ DECLARE left_to_do INT := 1; counter INT :=1; row record; BEGIN DROP TABLE IF EXISTS t; CREATE TEMP TABLE t ( node1 INT, prev INT, next INT ); WHILE left_to_do > 0 LOOP WITH joined_table AS ( SELECT roads.node1, MAX (v1.color) AS prev, MAX (v2.color) AS next FROM roads JOIN node v1 ON roads.node1 = v1.id JOIN node v2 ON roads.node2 = v2.id GROUP BY roads.node1 ) INSERT INTO t (node1, prev, next) SELECT node1, prev, next FROM joined_table WHERE prev < next; SELECT COUNT(*) INTO left_to_do FROM t; UPDATE node color SET color = t.next FROM t WHERE color.id = t.node1; DELETE FROM t; counter := counter + 1; END LOOP; END; $$ LANGUAGE plpgsql;

如果节点度数比节点数低,这应该会更好。在带有240万个节点和2400万个边缘的图形上对其进行了测试,并用了大约30-60分钟的索引时间。 (相比之下,在C ++中,它花费2.5分钟的时间大部分时间是从csv读取数据/将数据写入csv)
© www.soinside.com 2019 - 2024. All rights reserved.