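I apologize for the wall of code below, and for the poor formatting. I have tried every way I can think of to find out why these DataFrames return False when I apply either DataFrame.equals() or, later, DF1 == DF2. I cannot find any difference between them.
I obtained the second DataFrame (dftest) by applying a groupby to the first (bdf) on every column except ORDER_QTY. Since both DataFrames have the same number of rows, I assumed nothing had changed (which did not surprise me). To be sure, though, I compared them with bdf.equals(dftest), and it returned False. That was after I had made sure the columns were in the same order. The only other thing I noticed is that the DataFrames are not the same size in memory. Otherwise I'm lost...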
In:
dftest = bdf.groupby(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'FW_END_DT', 'BPS_INCLUDE']).sum().reset_index()
dftest = dftest[['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE']]
print(bdf.equals(dftest))
print(bdf.columns)
print(dftest.columns)
Out:
False
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
dtype='object')
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
dtype='object')
^ The columns appear to be identical, yet bdf.equals(dftest) returns False.
In:
bdf.info()
dftest.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE 53025 non-null object
CUST 53025 non-null object
ORDER_NUMBER 53025 non-null object
ORDER_DATE 53025 non-null datetime64[ns]
PURCHASE_ORDER 53025 non-null object
CHANNEL 53025 non-null object
SHIP_TO 53025 non-null object
PROD_LINE 53025 non-null object
GROUP_NUMBER 53025 non-null object
DESCRIPTION 53025 non-null object
ITEM 53025 non-null object
ORDER_QTY 53025 non-null int64
FW_END_DT 53025 non-null datetime64[ns]
BPS_INCLUDE 53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 6.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE 53025 non-null object
CUST 53025 non-null object
ORDER_NUMBER 53025 non-null object
ORDER_DATE 53025 non-null datetime64[ns]
PURCHASE_ORDER 53025 non-null object
CHANNEL 53025 non-null object
SHIP_TO 53025 non-null object
PROD_LINE 53025 non-null object
GROUP_NUMBER 53025 non-null object
DESCRIPTION 53025 non-null object
ITEM 53025 non-null object
ORDER_QTY 53025 non-null int64
FW_END_DT 53025 non-null datetime64[ns]
BPS_INCLUDE 53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 5.7+ MB
^ Everything looks the same except for the size (memory usage), as I said.
In:
common = bdf.merge(dftest,on=['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'], how='outer', indicator=True)
print(common[common['_merge'] != 'both'])
Out:
Empty DataFrame
Columns: [SITE, CUST, ORDER_NUMBER, ORDER_DATE, PURCHASE_ORDER, CHANNEL, SHIP_TO, PROD_LINE, GROUP_NUMBER, DESCRIPTION, ITEM, ORDER_QTY, FW_END_DT, BPS_INCLUDE, _merge]
Index: []
I tried merging and selecting the rows that are not in both DataFrames:
In:
bdf[(~bdf.SITE.isin(common.SITE))&(~bdf.CUST.isin(common.CUST))&(~bdf.ORDER_NUMBER.isin(common.ORDER_NUMBER))&(~bdf.ORDER_DATE.isin(common.ORDER_DATE))&(~bdf.PURCHASE_ORDER.isin(common.PURCHASE_ORDER))&(~bdf.CHANNEL.isin(common.CHANNEL))&(~bdf.SHIP_TO.isin(common.SHIP_TO))&(~bdf.PROD_LINE.isin(common.PROD_LINE))&(~bdf.GROUP_NUMBER.isin(common.GROUP_NUMBER))&(~bdf.DESCRIPTION.isin(common.DESCRIPTION))&(~bdf.ITEM.isin(common.ITEM))&(~bdf.ORDER_QTY.isin(common.ORDER_QTY))&(~bdf.FW_END_DT.isin(common.FW_END_DT))&(~bdf.BPS_INCLUDE.isin(common.BPS_INCLUDE))]
Out:
SITE CUST ORDER_NUMBER ORDER_DATE PURCHASE_ORDER CHANNEL SHIP_TO PROD_LINE GROUP_NUMBER DESCRIPTION ITEM ORDER_QTY FW_END_DT BPS_INCLUDE
Nothing.
In:
(bdf == dftest).all().all()
Out:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-6c2f52f55e60> in <module>()
----> 1 (bdf == dftest).all().all()
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in f(self, other)
1611 # Another DataFrame
1612 if not self._indexed_same(other):
-> 1613 raise ValueError('Can only compare identically-labeled '
1614 'DataFrame objects')
1615 return self._compare_frame(other, func, str_rep)
ValueError: Can only compare identically-labeled DataFrame objects
Are they not identically labeled?
While searching around, I found a suggestion to try the following:
In:
bdf.eq(dftest)
Out:
SITE CUST ORDER_NUMBER ORDER_DATE PURCHASE_ORDER CHANNEL SHIP_TO PROD_LINE GROUP_NUMBER DESCRIPTION ITEM ORDER_QTY FW_END_DT BPS_INCLUDE
0 False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False
5 False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52995 False False False False False False False False False False False False False False
106050 rows × 14 columns
In this case it looks like every pair of cells fails to match... :(
Am I missing something completely obvious?
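One reading that is consistent with the output above: bdf.info() reports a generic Index while dftest.info() reports a RangeIndex, and eq() returned 106050 rows, exactly twice 53025. That would happen if the two frames carry different row labels, so nothing aligns. The sketch below reproduces the pattern on tiny made-up frames (the names left and right and all values are hypothetical, and string labels are only an assumption about what bdf's index holds). Note also that groupby sorts by the group keys by default, so row order may differ as well.

```python
import pandas as pd

# Hypothetical miniature frames; values and labels are made up for illustration.
data = {"ITEM": ["a", "b", "c"], "ORDER_QTY": [1, 2, 3]}
left = pd.DataFrame(data, index=["0", "1", "2"])  # string labels -> generic Index
right = pd.DataFrame(data)                        # default RangeIndex 0..2

# equals() compares the row labels as well as the values.
print(left.equals(right))

# == refuses to compare frames whose labels differ.
try:
    left == right
except ValueError as exc:
    print("ValueError:", exc)

# eq() aligns on the union of the labels, so the row count doubles
# and every cell compares as False.
aligned = left.eq(right)
print(aligned.shape)

# Comparing values only, after discarding the row labels on both sides.
same_values = left.reset_index(drop=True).equals(right.reset_index(drop=True))
print(same_values)
```

If this is what is going on, resetting the index on both frames (and sorting both by the same key columns first) before calling equals() would be one way to compare the values alone.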
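Do you have any NaN/null/missing values in your data?
If so, groupby.sum() can replace such values, for example with 0 in the case of numeric dtypes.
If that is the culprit, the result of groupby.first() should be identical to the original input.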
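A minimal sketch of that suggestion, on a made-up frame where every group contains exactly one row (as seems to be the case in the question, since the row count did not change). The frame df and its values are hypothetical; the point is only that sum() turns a missing value into 0 while first() leaves it missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per ITEM, with one missing quantity.
df = pd.DataFrame({"ITEM": ["a", "b", "c"],
                   "ORDER_QTY": [1.0, np.nan, 3.0]})

summed = df.groupby("ITEM").sum().reset_index()
first = df.groupby("ITEM").first().reset_index()

# sum() skips NaN, so the all-NaN group adds up to 0.0.
print(summed["ORDER_QTY"].tolist())

# The summed frame no longer matches the input...
print(df.equals(summed))

# ...while first() keeps the NaN, and equals() treats NaNs in the
# same position as equal.
print(df.equals(first))
```

Checking bdf.isna().sum() is a quick way to see whether this applies; the info() output in the question reports no nulls, so it may not be the cause there.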