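I have a DataFrame in which some rows contain repeated values. For the columns Num1 to Num7, I want to return the row number and the values of every row in df that has more than one duplicated value.
Alternatively, I would like to return the row number and only the duplicated values. Either is fine with me.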
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,2,4,5,6,7,7],
[2,5,6,7,22,23,34,48],
[3,3,5,6,7,45,46,48],
[4,6,7,14,29,32,6,29], # duplicate 6 and 29
[5,6,7,13,23,33,35,7], # duplicate 7 but nothing else
[6,1,6,7,8,9,10,8],
[7,0,2,5,7,19,7,5]], # duplicate 7,5
columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5','Num6','Num7'])
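# Desired output, option 1: the matching rows with their distinct values
df2 = pd.DataFrame([[4,6,7,14,29,32], # Return 6 and 29
[7,0,2,5,7,19]], # Return 7,5
columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5'])
# Desired output, option 2: only the duplicated values
df3 = pd.DataFrame([[4,6,29], # Return 6 and 29
[7,7,5]], # Return 7,5
columns = ['Row_Num', 'Num1','Num2'])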
IIUC, you can get the result with Series.duplicated plus a bit of data manipulation:
df = df.set_index('Row_Num') # use Row_Num as the index
# True where a value repeats an earlier value in the same row
df_duplicated = df.transform(lambda x: x.duplicated(), axis=1)
mask = df_duplicated.sum(axis=1) >= 2 # rows with at least two repeated values
# first result: the first occurrences in those rows
res1 = df[mask][~df_duplicated[mask]].dropna(axis=1)
# second result: the repeated values in those rows
res2 = df[mask][df_duplicated[mask]].dropna(axis=1)
Output:
result1:
         Num1  Num2  Num3  Num4  Num5
Row_Num
4           6     7    14    29    32
7           0     2     5     7    19
result2:
         Num6  Num7
Row_Num
4           6    29
7           7     5
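To see where these two results come from, here is a minimal sketch that applies Series.duplicated to row 4 alone. The names `row`, `flags`, `repeat_count`, `repeated` and `first_seen` are introduced here for illustration only and are not part of the code above.

```python
import pandas as pd

# Row 4 of the example data (columns Num1..Num7).
row = pd.Series(
    [6, 7, 14, 29, 32, 6, 29],
    index=['Num1', 'Num2', 'Num3', 'Num4', 'Num5', 'Num6', 'Num7'],
)

# duplicated() is False for the first occurrence of a value
# and True for every later occurrence.
flags = row.duplicated()

repeat_count = int(flags.sum())    # number of repeated occurrences in the row
repeated = row[flags].tolist()     # the values that end up in the second result
first_seen = row[~flags].tolist()  # the values that end up in the first result
```

Note that the count is of repeated occurrences, not of distinct repeated values, so a row containing the same value three times would also reach a count of 2.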
To match your expected output exactly, just call reset_index and rename the columns of the second result.
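That last step could look roughly like the sketch below. It rebuilds the example data so that it runs on its own, and it uses `apply` and `where` in place of `transform` and chained boolean indexing, which should give the same second result. The names `dup`, `mask` and `out` are illustrative, and the column mapping is hard-coded on the assumption that the repeated values sit in Num6 and Num7, as they do in this data.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 2, 4, 5, 6, 7, 7],
     [2, 5, 6, 7, 22, 23, 34, 48],
     [3, 3, 5, 6, 7, 45, 46, 48],
     [4, 6, 7, 14, 29, 32, 6, 29],
     [5, 6, 7, 13, 23, 33, 35, 7],
     [6, 1, 6, 7, 8, 9, 10, 8],
     [7, 0, 2, 5, 7, 19, 7, 5]],
    columns=['Row_Num', 'Num1', 'Num2', 'Num3', 'Num4', 'Num5', 'Num6', 'Num7'],
).set_index('Row_Num')

# True where a value repeats an earlier value in the same row.
dup = df.apply(lambda row: row.duplicated(), axis=1).astype(bool)
mask = dup.sum(axis=1) >= 2  # rows with at least two repeated values

# Keep only the repeated values; columns containing NaN are dropped.
res2 = df[mask].where(dup[mask]).dropna(axis=1)

# Restore integer dtype, rename the columns, and turn Row_Num back into a column.
out = (
    res2.astype(int)
        .rename(columns={'Num6': 'Num1', 'Num7': 'Num2'})
        .reset_index()
)
```

Keep in mind that `dropna(axis=1)` only produces this compact shape because both selected rows repeat their values in the same columns. If the repeats landed in different columns for different rows, every column would contain a NaN and be dropped.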