How to compare two Pandas DataFrames containing text, numbers, and None values

Question (votes: 0, answers: 2)

I have two DataFrames, df1 and df2. Apart from None values, both contain text and numeric data. However, df1 has integers, while df2 has floats.

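I tried checking them for equality with df1.equals(df2), but that reports them as unequal because of the type difference (integers vs. floats). I also tried np.allclose(df1, df2, equal_nan=True), but that fails with

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

(I assume this is because of the text data.)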

How can I check whether df1 and df2 contain the same data?
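A minimal reproduction of both failures might look like the sketch below. The frames df_int and df_float are hypothetical stand-ins with the same values, where only the dtype of the num column differs.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same values, but "num" is int64 in one and float64 in the other
df_int = pd.DataFrame({"text": ["hello", "world", None], "num": [1, 2, 3]})
df_float = pd.DataFrame({"text": ["hello", "world", None], "num": [1.0, 2.0, 3.0]})

# equals() requires matching dtypes, so int64 vs. float64 compares as unequal
equals_result = df_int.equals(df_float)

# allclose() cannot do numeric work on the text column, so it raises TypeError
try:
    np.allclose(df_int, df_float, equal_nan=True)
    allclose_error = None
except TypeError as exc:
    allclose_error = exc

print(equals_result, type(allclose_error).__name__)
```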

python pandas dataframe equality
2 Answers
0
votes

Unfortunately, there doesn't seem to be any simple function that checks equality in this situation, so we have to build our own.

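To do the check, we split the columns according to whether they hold text data ("object" dtype) or numeric data, and then compare the numeric columns and the text columns separately. We also handle None values by converting them to NaN so that NumPy can deal with them.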

The code is as follows:

import numpy as np
import pandas as pd

def compare_mixed_dataframes(df1, df2) -> bool:
    # (This code was written by GPT-4, but I've tested it and it works)
    # Get the column names of numerical columns (taken from df1)
    num_cols = df1.select_dtypes(include=[np.number]).columns

    # Convert numerical columns to float and replace None with NaN
    df1_num = df1[num_cols].astype(float).fillna(np.nan)
    df2_num = df2[num_cols].astype(float).fillna(np.nan)

    # Compare numerical columns with a tolerance value using numpy.allclose()
    num_comparison = np.allclose(df1_num, df2_num, rtol=1e-05, atol=1e-08, equal_nan=True)

    # Compare text columns using pandas.DataFrame.equals()
    string_cols = df1.select_dtypes(include=['object']).columns
    str_comparison = df1[string_cols].equals(df2[string_cols])

    # Combine the results of the numerical and text column comparisons
    return num_comparison and str_comparison

If you want to test the code yourself, here is a quick test script:

# Also written by GPT-4, but edited by me to contain a more advanced test case
# I have also checked to make sure that this works
import numpy as np
import pandas as pd

def compare_mixed_dataframes(df1, df2):
    # Get the column names of numerical columns
    num_cols = df1.select_dtypes(include=[np.number]).columns
    
    # Convert numerical columns to float and replace None with NaN
    df1_num = df1[num_cols].astype(float).fillna(np.nan)
    df2_num = df2[num_cols].astype(float).fillna(np.nan)

    # Compare numerical columns with a tolerance value using numpy.allclose()
    num_comparison = np.allclose(df1_num, df2_num, rtol=1e-05, atol=1e-08, equal_nan=True)

    # Compare sentence columns using pandas.DataFrame.equals()
    string_cols = df1.select_dtypes(include=['object']).columns
    str_comparison = df1[string_cols].equals(df2[string_cols])

    # Combine the results of numerical and sentence columns comparisons
    return num_comparison and str_comparison

# Create example DataFrames with mixed types (ints, floats, text, and Nones)
data1 = {'text': ['hello', 'world', None],
         'num': [None, 2, 3]}
df1 = pd.DataFrame(data1)

data2 = {'text': ['hello', 'world', None],
         'num': [None, 2.0, 3.0]}
df2 = pd.DataFrame(data2)

# DataFrames with different numbers
data3 = {'text': ['hello', 'world', None],
         'num': [None, 2, 4]}
df3 = pd.DataFrame(data3)

# Test the custom function with same and different DataFrames
print(compare_mixed_dataframes(df1, df2))  # True
print(compare_mixed_dataframes(df1, df3))  # False
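As a possible alternative to a hand-written function, the pandas testing helper pd.testing.assert_frame_equal accepts check_dtype=False, which ignores the int/float difference while still comparing text and missing values. Below is a minimal sketch: frames_match is a hypothetical wrapper name, the example frames are made up, and the helper's default tolerances are assumed to be acceptable.

```python
import pandas as pd

def frames_match(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True if both frames hold the same data, ignoring dtype differences."""
    try:
        # Raises AssertionError on any mismatch in shape, labels, or values
        pd.testing.assert_frame_equal(left, right, check_dtype=False)
    except AssertionError:
        return False
    return True

# Hypothetical frames: ints vs. floats, with None in the text column
df_a = pd.DataFrame({"text": ["hello", "world", None], "num": [1, 2, 3]})
df_b = pd.DataFrame({"text": ["hello", "world", None], "num": [1.0, 2.0, 3.0]})
df_c = pd.DataFrame({"text": ["hello", "world", None], "num": [1, 2, 4]})

print(frames_match(df_a, df_b))
print(frames_match(df_a, df_c))
```

Unlike the function above, this also reports a mismatch when the two frames have different columns or a different number of rows.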

0
votes

Example

data1 = {'text': ['hello', 'world', None],
         'num': [None, 2, 3]}
df1 = pd.DataFrame(data1)

data2 = {'text': ['hello', 'world', None],
         'num': [None, 2.0, 3.0]}
df2 = pd.DataFrame(data2)

Code

df1.equals(df2.astype(df1.dtypes))

Output:

True
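Note that in the example frames the None in num already makes that column float64 in both frames, so the cast is not really exercised there. A sketch with genuine integers (df_int and df_float are hypothetical names) may show the effect of the cast more clearly:

```python
import pandas as pd

# Hypothetical frames where "num" really is int64 in one and float64 in the other
df_int = pd.DataFrame({"text": ["hello", "world", None], "num": [1, 2, 3]})
df_float = pd.DataFrame({"text": ["hello", "world", None], "num": [1.0, 2.0, 3.0]})

# Without the cast, the dtype difference alone makes equals() return False
plain = df_int.equals(df_float)

# Casting df_float to df_int's per-column dtypes removes that difference
aligned = df_int.equals(df_float.astype(df_int.dtypes))

print(plain, aligned)
```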

If you are worried about errors being raised when converting the dtypes, use the code below.

df1.equals(df2.astype(df1.dtypes, errors='ignore'))

If the dtypes cannot be converted to match, then the DataFrames are not the same anyway.
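One caveat worth checking: casting floats down to a NumPy integer dtype drops the fractional part without raising, so 2.5 would be compared as 2. The sketch below shows that pitfall and the cast in the other direction (ints up to floats), which avoids it. The frames df_int and df_frac are hypothetical.

```python
import pandas as pd

# Hypothetical frames that differ only in the middle value (2 vs. 2.5)
df_int = pd.DataFrame({"num": [1, 2, 3]})
df_frac = pd.DataFrame({"num": [1.0, 2.5, 3.0]})

# Casting floats to int64 truncates 2.5 to 2, so the frames wrongly look equal
cast_down = df_int.equals(df_frac.astype(df_int.dtypes))

# Casting the ints up to float64 keeps 2.5 intact and the difference is detected
cast_up = df_int.astype(df_frac.dtypes).equals(df_frac)

print(cast_down, cast_up)
```

If the data may contain non-integer floats, casting the integer frame to the float frame's dtypes looks like the safer direction.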
