How to shuffle a 2D binary matrix while preserving marginal distributions

Question

Suppose I have an (n * m) binary matrix df similar to the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   0   1   1   0
1 | 0   1   0   0   0   0   0   0
2 | 0   0   0   0   1   0   0   0
3 | 0   0   0   0   0   1   0   1
4 | 0   1   1   0   1   0   0   0
5 | 1   0   1   1   1   0   0   1

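I want to shuffle the values in the matrix to create a new_df of the same shape, such that both marginal distributions (the row sums and the column sums) stay the same, for example: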

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   1   0   0   1
1 | 0   0   0   0   1   0   0   0
2 | 0   0   0   0   0   0   0   1
3 | 0   1   1   0   0   0   0   0
4 | 1   0   0   0   1   1   0   0
5 | 0   1   1   1   0   1   1   0

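In the new matrix, the sum of each row equals the sum of the corresponding row in the original matrix, and likewise each column in the new matrix has the same sum as the corresponding column in the original matrix.

A candidate solution is easy to check: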

# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))  

# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))

If n * m is small, I can get away with a brute-force approach to the shuffle:

def shuffle_2d(df):
    """Shuffles a multidimensional binary array, preserving marginal distributions"""
    # get a list of indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # Series.iteritems() was removed in pandas 2.0
            if df.loc[i_row, i_col] == 1:
                rowlist.append(i_row)
                collist.append(i_col)

    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)

    # shuffle until you get no repeat coordinates 
    # this is so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        # pd.value_counts() is deprecated; count duplicate (row, col) pairs via a Series
        repeats = pd.Series(pairs).value_counts().max()

    # populate new data frame at indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1

    return new_df

The problem is that the brute-force approach scales poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)

As a quick demo, for an n * n matrix, the number of attempts needed to get an acceptable shuffle looks like this (from a single run):

 n   attempts
 2   1
 3   2
 4   4
 5   1
 6   1
 7   11
 8   9
 9   22
10   4416
11   800
12   66
13   234
14   5329
15   26501
16   27555
17   5932
18   668902
...

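Is there a straightforward solution that preserves the exact marginal distributions (or tells you when no other pattern exists that preserves them)?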
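For the "is any other pattern possible" part, here is a minimal sketch. It assumes the classical result (due to Ryser) that any two 0/1 matrices with the same row and column sums can be turned into one another by 2x2 "checkerboard" swaps, so a matrix is the only one with its margins exactly when it contains no swappable 2x2 checkerboard. The function name has_alternative is made up for this illustration.

```python
import numpy as np


def has_alternative(matrix):
    """Return True if a 0/1 matrix contains a swappable 2x2 checkerboard.

    Assumption: `matrix` is array-like and contains only 0/1 entries.
    """
    a = np.asarray(matrix).astype(bool)
    n_rows = a.shape[0]
    for i in range(n_rows):
        for j in range(i + 1, n_rows):
            # Row i has a 1 where row j has a 0, and somewhere else row j
            # has a 1 where row i has a 0: those two columns, together with
            # rows i and j, form a checkerboard that can be swapped.
            if np.any(a[i] & ~a[j]) and np.any(~a[i] & a[j]):
                return True
    return False
```

If this returns False, the row and column sums pin the matrix down completely and any shuffle can only return the original.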

As a fallback, I could also use an approximation algorithm that minimizes the sum of squared errors on each row.

Thanks! =)

EDIT: For some reason I didn't find the existing answers before I wrote this question, but after posting it they all showed up in the sidebar:

Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?

Randomize matrix in perl, keeping row and column totals the same

Sometimes all you need to do is ask...


Answer 1

Thanks mostly to https://stackoverflow.com/a/2137012/6361632 for the inspiration, here is a solution that seems to work:
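A minimal sketch of one way to do this, assuming the usual 2x2 "checkerboard swap" idea: pick two rows and two columns at random, and if the four cells form [[1, 0], [0, 1]] or [[0, 1], [1, 0]], flip them. Each swap leaves every row sum and column sum unchanged. The function name shuffle_2d_swaps and the n_swaps and seed parameters are made up for this illustration, and the default number of swaps is only a guess, not a tuned value. Treat it as an illustrative sketch rather than a definitive implementation.

```python
import numpy as np
import pandas as pd


def shuffle_2d_swaps(df, n_swaps=None, seed=None):
    """Shuffle a 0/1 DataFrame with repeated 2x2 checkerboard swaps.

    Every swap keeps all row sums and column sums unchanged.
    `n_swaps` and `seed` are illustrative parameters (assumptions).
    """
    rng = np.random.default_rng(seed)
    a = df.to_numpy().copy()
    n_rows, n_cols = a.shape
    if n_rows < 2 or n_cols < 2:
        # No 2x2 submatrix exists, so nothing can be swapped.
        return pd.DataFrame(a, index=df.index, columns=df.columns)
    if n_swaps is None:
        # Heuristic guess at "enough" attempts; tune for your data.
        n_swaps = 10 * a.size

    for _ in range(n_swaps):
        r1, r2 = rng.choice(n_rows, size=2, replace=False)
        c1, c2 = rng.choice(n_cols, size=2, replace=False)
        # Swappable only if the 2x2 submatrix is a checkerboard:
        # equal values on each diagonal, different values across diagonals.
        if (a[r1, c1] == a[r2, c2]
                and a[r1, c2] == a[r2, c1]
                and a[r1, c1] != a[r1, c2]):
            a[r1, c1], a[r1, c2] = a[r1, c2], a[r1, c1]
            a[r2, c1], a[r2, c2] = a[r2, c2], a[r2, c1]

    return pd.DataFrame(a, index=df.index, columns=df.columns)


# Usage, with the same checks as in the question
df = pd.DataFrame(np.random.binomial(1, .3, size=(6, 8)))
new_df = shuffle_2d_swaps(df, seed=0)
assert all(df.sum(axis=1) == new_df.sum(axis=1))
assert all(df.sum(axis=0) == new_df.sum(axis=0))
```

Unlike the brute-force version, this never has to reject a whole candidate matrix, so the cost grows with the number of swap attempts rather than with the chance of drawing a collision-free permutation. How many swaps are needed for a well-mixed result is not something this sketch settles.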
