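I am working with a pandas DataFrame and want to remove as few rows as possible so that neither column contains duplicate values. I considered using maximum flow in networkx, but I think that is overkill. Are there other options?
I tried the following: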
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'column1': [1, 2, 3, 1, 3, 4],
                   'column2': [5, 6, 7, 8, 9, 7]})
print("Original DataFrame:")
print(df)

# Function that is meant to remove the lowest possible number of rows
def remove_duplicate_rows(df):
    # Drop rows that repeat a value in column1, then rows that repeat a value in column2
    df.drop_duplicates(subset='column1', inplace=True)
    df.drop_duplicates(subset='column2', inplace=True)
    # Return the deduplicated DataFrame
    return df

# Apply the function to the DataFrame
result = remove_duplicate_rows(df)
print("\nResulting DataFrame:")
print(result)
Output:
Original DataFrame:
   column1  column2
0        1        5
1        2        6
2        3        7
3        1        8
4        3        9
5        4        7

Resulting DataFrame:
   column1  column2
0        1        5
1        2        6
2        3        7
This removes too many rows. A valid output would be, for example:
Resulting DataFrame:
   column1  column2
0        1        5
1        2        6
2        3        9
3        4        7
I tried the following approach. It works for your example; you may want to try it on other examples to check that it still holds.
Here is what I did:
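import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'column1': [1, 2, 3, 1, 3, 4],
                   'column2': [5, 6, 7, 8, 9, 7]})

# Count how often each column2 value occurs and put the rarest values first
df['column2_count'] = df['column2'].map(df['column2'].value_counts())
df.sort_values('column2_count', inplace=True)

# Keep the first row for each column1 value, then for each column2 value
df.drop_duplicates(subset=['column1'], keep='first', inplace=True)
df.drop_duplicates(subset=['column2'], keep='first', inplace=True)

# Drop the helper column and renumber the remaining rows
result = df[['column1', 'column2']].reset_index(drop=True)
print(result)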
Of course, it should work whichever column you start with.
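If the sort-and-drop heuristic turns out to miss the optimum on other inputs, one alternative that avoids a full max-flow setup is a plain bipartite matching: treat each row as an edge between its column1 value and its column2 value, and keep a largest set of rows in which no value repeats. The sketch below is only an illustration under that assumption; `max_unique_rows` is a hypothetical helper name, and it works on a list of `(column1, column2)` pairs rather than on the DataFrame itself. It uses simple recursive augmenting paths, so it is meant for small to moderate inputs.

```python
def max_unique_rows(pairs):
    """Return indices of a largest set of rows with no repeated value in either column."""
    # For every column1 value, the indices of the rows that carry it.
    rows_by_left = {}
    for idx, (a, _) in enumerate(pairs):
        rows_by_left.setdefault(a, []).append(idx)

    # column2 value -> index of the row currently kept for that value.
    row_for_right = {}

    def try_assign(a, visited):
        # Try to keep some row for column1 value `a`,
        # re-routing earlier choices when a column2 value is already taken.
        for idx in rows_by_left[a]:
            b = pairs[idx][1]
            if b in visited:
                continue
            visited.add(b)
            if b not in row_for_right or try_assign(pairs[row_for_right[b]][0], visited):
                row_for_right[b] = idx
                return True
        return False

    for a in rows_by_left:
        try_assign(a, set())

    return sorted(row_for_right.values())


# The example from the question.
pairs = [(1, 5), (2, 6), (3, 7), (1, 8), (3, 9), (4, 7)]
keep = max_unique_rows(pairs)
print(keep)  # [0, 1, 4, 5]
```

With a DataFrame, the pairs could be built with `list(zip(df['column1'], df['column2']))` and the kept rows selected with `df.iloc[keep]`. As a cross-check, for the pairs `[(1, 6), (2, 6), (2, 7), (3, 7), (3, 5)]` this sketch keeps three rows, whereas the sort-and-drop approach can end up with only two when tied rows stay in their original order.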