How do I remove as few rows as possible from a two-column dataset so that each column contains only unique values?

Question

I'm working with a pandas DataFrame. I considered using maximum flow in networkx, but that seems like overkill. Is there another option?

I tried the following:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'column1': [1, 2, 3, 1, 3, 4],
                   'column2': [5, 6, 7, 8, 9, 7]})

print("Original DataFrame:")
print(df)

# Naive attempt: drop duplicates one column at a time
def remove_duplicate_rows(df):
    # Keep only the first row for each value in column1
    df.drop_duplicates(subset='column1', inplace=True)

    # Then keep only the first row for each remaining value in column2
    df.drop_duplicates(subset='column2', inplace=True)

    # Note: inplace=True also modifies the DataFrame passed in by the caller
    return df

# Apply the function to the DataFrame
result = remove_duplicate_rows(df)

print("\nResulting DataFrame:")
print(result)

Output:

Original DataFrame:
   column1  column2
0        1        5
1        2        6
2        3        7
3        1        8
4        3        9
5        4        7

Resulting DataFrame:
   column1  column2
0        1        5
1        2        6
2        3        7

This removes too many rows. A valid output would be, for example:

Resulting DataFrame:
   column1  column2
0        1        5
1        2        6
2        3        9
3        4        7

Answer 1

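I tried the approach below. It works for your example; you may want to try it on other examples to see whether it still holds up.

This is what I did:

Count how often each value occurs in one of the two columns and add the counts as a third column
Sort the DataFrame by that count column
Drop duplicates on the other column, keeping the "first" row (that is, the row whose value in the counted column occurs less often)
Drop duplicates on the counted column as well, just to make sure the order no longer matters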

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'column1': [1, 2, 3, 1, 3, 4],
                   'column2': [5, 6, 7, 8, 9, 7]})

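# Count how often each column2 value occurs and store it as a helper column
df['column2_count'] = df['column2'].map(df['column2'].value_counts())
# Put rows with rarer column2 values first (stable sort keeps ties in their original order)
df.sort_values('column2_count', kind='stable', inplace=True)
# For each column1 value, keep the row whose column2 value is rarest
df.drop_duplicates(subset=['column1'], keep='first', inplace=True)
# Remove any duplicates still left in column2
df.drop_duplicates(subset=['column2'], keep='first', inplace=True)
result = df[['column1', 'column2']].reset_index(drop=True)
print(result)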

Of course, it should work regardless of which column you pick first.
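
The count-and-sort approach is a heuristic and is not guaranteed to keep the maximum number of rows on every input. If you need that guarantee without pulling in networkx, one option is to treat each row as an edge between its column1 value and its column2 value and look for a maximum bipartite matching. Below is a minimal sketch using the classic augmenting-path method in plain Python. The function name `max_unique_rows` and the list-of-pairs input are my own assumptions for illustration (the pairs could come from something like `zip(df['column1'], df['column2'])`), and the recursion is only meant for small inputs.

```python
def max_unique_rows(rows):
    """Return sorted positions of a largest subset of rows in which
    no column-1 value and no column-2 value is repeated.

    rows: list of (column1_value, column2_value) pairs (hypothetical input shape).
    """
    # For each column-1 value, collect its (column-2 value, row position) edges.
    adj = {}
    for pos, (a, b) in enumerate(rows):
        adj.setdefault(a, []).append((b, pos))

    # column-2 value -> (column-1 value, row position) currently matched to it
    match_right = {}

    def try_assign(a, seen):
        # Try to give column-1 value `a` a row, re-routing earlier choices if needed.
        for b, pos in adj[a]:
            if b in seen:
                continue
            seen.add(b)
            if b not in match_right or try_assign(match_right[b][0], seen):
                match_right[b] = (a, pos)
                return True
        return False

    for a in adj:
        try_assign(a, set())

    return sorted(pos for _, pos in match_right.values())


# The data from the question, as (column1, column2) pairs
rows = [(1, 5), (2, 6), (3, 7), (1, 8), (3, 9), (4, 7)]
kept = max_unique_rows(rows)
print(kept)                     # [0, 1, 4, 5]
print([rows[i] for i in kept])  # [(1, 5), (2, 6), (3, 9), (4, 7)]
```

With the positions in hand, something like `df.iloc[kept].reset_index(drop=True)` on the original DataFrame should give the reduced frame. For the sample data this keeps four rows, the same as the valid output shown in the question.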
