Partition a sparse matrix into smaller blocks based on the number of non-zero values


I have the following code:

import numpy as np
from scipy import sparse

def myhorsplit(matrix, numPartitions):
    csr = sparse.csr_matrix(matrix)
    # Number of non-zeros in each column
    input_array = csr.getnnz(0)
    print(input_array)

    # Greedy assignment: place each count into the partition
    # with the smallest running sum
    partitions = [[] for _ in range(numPartitions)]
    current_sums = [0] * numPartitions
    for element in input_array:
        # Find the partition with the smallest current sum
        min_partition = min(range(numPartitions), key=lambda i: current_sums[i])

        # Add the element to the selected partition
        partitions[min_partition].append(element)
        current_sums[min_partition] += element

    for i, partition in enumerate(partitions):
        print(f"Partition {i}: {partition}")
    print("Rows Result:")
    print(csr.tolil().rows[0])
    print("Get Column Result:")
    print(csr.tolil().getcol(1))

    return partitions

# Create an 8x8 adjacency matrix with the modified element
adjacency_matrix = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 0]
]
csr_matrix = sparse.csr_matrix(adjacency_matrix)
# horizontalSplit(csr_matrix,4)
myhorsplit(csr_matrix,4)

This prints:

[4 2 5 2 3 2 1 2]
Partition 0: [4, 1]
Partition 1: [2, 3]
Partition 2: [5]
Partition 3: [2, 2, 2]
Rows Result:
[0, 1, 2, 3]
Get Column Result:
  (0, 0)        1
  (2, 0)        1

I can't find a way to partition this sparse matrix into smaller blocks so that the first 3 blocks (assuming I want 4 blocks) each get total non-zeros / 4 of the non-zero values, and the last block gets the remainder.

python sparse-matrix

1 Answer

You're going through a lot of seemingly unnecessary hassle for a parallelization partitioning problem. This will behave better in terms of setup runtime if you keep the partitions contiguous:

def myhorsplit(
    matrix: sparse.sparray, n_compute_units: int = 4,
) -> list[sparse.sparray]:
    # Cumulative non-zero count up to and including each row
    nnz = matrix.getnnz(axis=1).cumsum()
    total = nnz[-1]
    # Ideal (fractional) non-zero counts at which each partition should start
    ideal_breaks = np.arange(0, total, total/n_compute_units)
    # Row indices closest to those ideal break points; None closes the last slice
    break_idx = [*nnz.searchsorted(ideal_breaks), None]
    return [
        matrix[i: j, :]
        for i, j in zip(break_idx[:-1], break_idx[1:])
    ]

For small arrays the output is not very accurate:

Partition 0: 4 ones, shape (1, 8)
[[1 1 1 1 0 0 0 0]]
Partition 1: 5 ones, shape (2, 8)
[[1 0 1 0 0 0 0 0]
 [1 1 0 1 0 0 0 0]]
Partition 2: 6 ones, shape (3, 8)
[[1 0 1 0 0 0 0 0]
 [0 0 1 0 0 1 0 1]
 [0 0 0 0 1 0 0 0]]
Partition 3: 6 ones, shape (2, 8)
[[0 0 0 0 1 1 0 1]
 [0 0 1 0 1 0 1 0]]
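For reference, the small-array report above can be reproduced with a driver like the following (the printing loop is my own sketch wrapped around the answer's function):

```python
import numpy as np
from scipy import sparse

def myhorsplit(matrix, n_compute_units=4):
    # Contiguous split: cut rows where the cumulative nnz crosses ideal breaks
    nnz = matrix.getnnz(axis=1).cumsum()
    total = nnz[-1]
    ideal_breaks = np.arange(0, total, total / n_compute_units)
    break_idx = [*nnz.searchsorted(ideal_breaks), None]
    return [matrix[i:j, :] for i, j in zip(break_idx[:-1], break_idx[1:])]

adjacency_matrix = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 0],
]
parts = myhorsplit(sparse.csr_matrix(adjacency_matrix), 4)
for i, p in enumerate(parts):
    print(f"Partition {i}: {p.nnz} ones, shape {p.shape}")
    print(p.toarray())
```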

But your actual array is large, and the accuracy improves as the array size increases; with a large, randomly generated array:

Partition 0: 62316 ones, shape (2498, 50)
Partition 1: 62339 ones, shape (2506, 50)
Partition 2: 62332 ones, shape (2494, 50)
Partition 3: 62348 ones, shape (2502, 50)

If you want to do better in partition accuracy and are allowed arbitrary, non-contiguous assignments, then this is a knapsack-style problem; it will take its own time to run and would use e.g.

scipy.optimize.milp
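As a hedged sketch of what that could look like (this formulation, the variable names, and the per-row weights are my own, not part of the answer): minimize the maximum partition load `t`, with binary variables `x[i, j]` assigning row `i` to compute unit `j`:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Per-row non-zero counts of the 8x8 example matrix (sums to 21)
weights = np.array([4, 2, 3, 2, 3, 1, 3, 3])
n, k = len(weights), 4
nvars = n * k + 1          # x[i, j] flattened row-major, plus the makespan t

c = np.zeros(nvars)
c[-1] = 1.0                # objective: minimize t

# Each row is assigned to exactly one compute unit: sum_j x[i, j] == 1
A_assign = np.zeros((n, nvars))
for i in range(n):
    A_assign[i, i * k:(i + 1) * k] = 1.0

# Each unit's load stays below t: sum_i weights[i] * x[i, j] - t <= 0
A_load = np.zeros((k, nvars))
for j in range(k):
    for i in range(n):
        A_load[j, i * k + j] = weights[i]
    A_load[j, -1] = -1.0

integrality = np.ones(nvars)
integrality[-1] = 0        # t is continuous, the x's are binary
bounds = Bounds(np.zeros(nvars),
                np.concatenate([np.ones(n * k), [np.inf]]))

res = milp(
    c,
    constraints=[LinearConstraint(A_assign, 1, 1),
                 LinearConstraint(A_load, -np.inf, 0)],
    integrality=integrality,
    bounds=bounds,
)
assignment = res.x[:-1].round().reshape(n, k).argmax(axis=1)
print("row -> unit:", assignment, "max load:", res.fun)
```

For these weights the optimum is a maximum load of 6 (the total is 21, so with integer loads no unit can stay below ceil(21/4) = 6). This is exact but scales much worse than the contiguous slicing above.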
