I am trying to sort a Pandas DataFrame so that the NaT values end up at the top. I am using the df.sort_values function:
df=df.sort_values(by='date_of_last_hoorah_given')
This works, and I get a sorted DataFrame with the NaT values at the bottom:
date_of_last_hoorah_given email first_name \
16 2016-12-19 07:36:08.000000 [email protected] Mindy
29 2016-12-19 07:36:08.000000 [email protected] Judi
7 2016-12-19 07:36:08.000000 [email protected] Chao
21 2016-12-19 07:36:08.000000 [email protected] Bala
12 2016-12-19 07:36:08.000000 [email protected] Pushpa
30 2016-12-22 07:36:08.000000 [email protected] Sparrow
28 2016-12-22 07:36:08.000000 [email protected] Sanjeev
27 2016-12-22 07:36:08.000000 [email protected] Twinklenose
25 2016-12-22 07:36:08.000000 [email protected] Sweetgaze
23 2016-12-22 07:36:08.000000 [email protected] Shreya
19 2016-12-22 07:36:08.000000 [email protected] Jiahao
15 2016-12-22 07:36:08.000000 [email protected] Janine
14 2016-12-22 07:36:08.000000 [email protected] Arlo
0 2016-12-22 07:36:08.000000 [email protected] Aditya
11 2016-12-22 07:36:08.000000 [email protected] Shirley
2 2016-12-22 07:36:08.000000 [email protected] Minerva
3 2016-12-22 07:36:08.000000 [email protected] Colby
13 2016-12-22 07:36:08.000000 [email protected] Beverly
6 2016-12-22 07:36:08.000000 [email protected] Guanting
5 2016-12-22 07:36:08.000000 [email protected] Chen
18 2016-12-22 10:55:03.474683 [email protected] Fen
9 2016-12-23 07:36:08.000000 [email protected] Kourtney
10 2016-12-23 14:30:55.206581 [email protected] Kailee
4 2016-12-24 07:36:08.000000 [email protected] Jing
31 2016-12-24 16:02:28.945809 [email protected] Rich
24 2016-12-25 07:36:08.000000 [email protected] Ganesh
8 2016-12-26 07:36:08.000000 [email protected] Xia
20 2016-12-27 07:36:08.000000 [email protected] Kinley
22 2016-12-28 07:36:08.000000 [email protected] Honeygleam
26 2016-12-28 15:29:48.629929 [email protected] Indira
17 2016-12-29 02:27:11.125078 [email protected] Ileen
32 2016-12-29 15:38:02.335296 [email protected] Ragnar
1 NaT [email protected] Flitterbeam
But when I try to move them to the top with the following code:
df=df.sort_values(by='date_of_last_hoorah_given',ascending=[1,0])
I get a ValueError: Length of ascending (2) != length of by (1). The full stack trace is below:
ValueError Traceback (most recent call last)
<ipython-input-107-948a8354aeeb> in <module>()
1 cd = ClientData(1)
----> 2 cd.get_inactive_users()
<ipython-input-106-ed230054ea86> in get_inactive_users(self)
346 inactive_users_result.append(user_dict)
347 df=pd.DataFrame(inactive_users_result)
--> 348 df=df.sort_values(by='date_of_last_hoorah_given',ascending=[1,0])
349 print(df)
C:\Users\aditya\Anaconda3\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
3126 if com.is_sequence(ascending) and len(by) != len(ascending):
3127 raise ValueError('Length of ascending (%d) != length of by (%d)' %
-> 3128 (len(ascending), len(by)))
3129 if len(by) > 1:
3130 from pandas.core.groupby import _lexsort_indexer
ValueError: Length of ascending (2) != length of by (1)
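The problem is that NaT sorts as if it were the largest value, so it always ends up last. To sort the dates in ascending order while putting NaT first (at the top), you need to sort on two criteria.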
np.lexsort sorts by any number of keys and returns an array of indices, much like np.argsort, that puts the rows in sorted order.
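Also note that I put the notnull condition last in the list of keys passed to np.lexsort. np.lexsort treats the last key as the primary sort key and works backwards from there; that is simply its convention.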
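So the primary key should be df.date_of_last_hoorah_given.notnull(). Rows that are not null get True, which sorts after False, so the NaT rows (False) come first. Within each group, the rows are then ordered by the dates themselves.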
import numpy as np
dates = df.date_of_last_hoorah_given
# The last key (notnull) is the primary key, so NaT rows (False) sort first
sort_slice = np.lexsort([dates.values, dates.notnull().values])
df.iloc[sort_slice]
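For anyone who wants to try the np.lexsort approach without the original data, here is a minimal self-contained sketch. The four-row DataFrame is made up for illustration (only the column name and a few first names are borrowed from the question), so treat it as an assumption rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({
    "first_name": ["Mindy", "Flitterbeam", "Ragnar", "Fen"],
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-19 07:36:08", None, "2016-12-29 15:38:02", "2016-12-22 10:55:03"]
    ),
})

dates = df["date_of_last_hoorah_given"]

# Keys are listed from least to most significant: the last key (notnull)
# is the primary one, so rows where it is False (NaT) come first, and the
# dates themselves only break ties within each group.
order = np.lexsort([dates.values, dates.notnull().values])
result = df.iloc[order]

print(result)
```

The NaT row should be printed first, followed by the remaining rows in ascending date order.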
Or, as the OP pointed out in the comments, this gives the same result and is more direct:
df.sort_values('date_of_last_hoorah_given', na_position='first')
date_of_last_hoorah_given email first_name
1 NaT [email protected] Flitterbeam
16 2016-12-19 07:36:08.000000 [email protected] Mindy
29 2016-12-19 07:36:08.000000 [email protected] Judi
7 2016-12-19 07:36:08.000000 [email protected] Chao
21 2016-12-19 07:36:08.000000 [email protected] Bala
12 2016-12-19 07:36:08.000000 [email protected] Pushpa
30 2016-12-22 07:36:08.000000 [email protected] Sparrow
28 2016-12-22 07:36:08.000000 [email protected] Sanjeev
27 2016-12-22 07:36:08.000000 [email protected] Twinklenose
25 2016-12-22 07:36:08.000000 [email protected] Sweetgaze
23 2016-12-22 07:36:08.000000 [email protected] Shreya
19 2016-12-22 07:36:08.000000 [email protected] Jiahao
15 2016-12-22 07:36:08.000000 [email protected] Janine
14 2016-12-22 07:36:08.000000 [email protected] Arlo
0 2016-12-22 07:36:08.000000 [email protected] Aditya
11 2016-12-22 07:36:08.000000 [email protected] Shirley
2 2016-12-22 07:36:08.000000 [email protected] Minerva
3 2016-12-22 07:36:08.000000 [email protected] Colby
13 2016-12-22 07:36:08.000000 [email protected] Beverly
6 2016-12-22 07:36:08.000000 [email protected] Guanting
5 2016-12-22 07:36:08.000000 [email protected] Chen
18 2016-12-22 10:55:03.474683 [email protected] Fen
9 2016-12-23 07:36:08.000000 [email protected] Kourtney
10 2016-12-23 14:30:55.206581 [email protected] Kailee
4 2016-12-24 07:36:08.000000 [email protected] Jing
31 2016-12-24 16:02:28.945809 [email protected] Rich
24 2016-12-25 07:36:08.000000 [email protected] Ganesh
8 2016-12-26 07:36:08.000000 [email protected] Xia
20 2016-12-27 07:36:08.000000 [email protected] Kinley
22 2016-12-28 07:36:08.000000 [email protected] Honeygleam
26 2016-12-28 15:29:48.629929 [email protected] Indira
17 2016-12-29 02:27:11.125078 [email protected] Ileen
32 2016-12-29 15:38:02.335296 [email protected] Ragnar
You cannot pass 2 values in ascending=[1,0], because you are sorting by only one column:
If you need a descending sort, use ascending=False; True is the default:
df=df.sort_values(by='date_of_last_hoorah_given',ascending=False)
print (df)
date_of_last_hoorah_given email first_name
1 NaT [email protected] Flitterbeam
32 2016-12-29 15:38:02.335296 [email protected] Ragnar
17 2016-12-29 02:27:11.125078 [email protected] Ileen
26 2016-12-28 15:29:48.629929 [email protected] Indira
22 2016-12-28 07:36:08.000000 [email protected] Honeygleam
20 2016-12-27 07:36:08.000000 [email protected] Kinley
8 2016-12-26 07:36:08.000000 [email protected] Xia
24 2016-12-25 07:36:08.000000 [email protected] Ganesh
31 2016-12-24 16:02:28.945809 [email protected] Rich
4 2016-12-24 07:36:08.000000 [email protected] Jing
10 2016-12-23 14:30:55.206581 [email protected] Kailee
9 2016-12-23 07:36:08.000000 [email protected] Kourtney
18 2016-12-22 10:55:03.474683 [email protected] Fen
3 2016-12-22 07:36:08.000000 [email protected] Colby
5 2016-12-22 07:36:08.000000 [email protected] Chen
6 2016-12-22 07:36:08.000000 [email protected] Guanting
13 2016-12-22 07:36:08.000000 [email protected] Beverly
2 2016-12-22 07:36:08.000000 [email protected] Minerva
11 2016-12-22 07:36:08.000000 [email protected] Shirley
0 2016-12-22 07:36:08.000000 [email protected] Aditya
14 2016-12-22 07:36:08.000000 [email protected] Arlo
15 2016-12-22 07:36:08.000000 [email protected] Janine
...
...
If you need to sort by two columns, the first ascending and the second descending:
df=df.sort_values(by=['date_of_last_hoorah_given', 'email'],ascending=[True, False])
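To make the length rule concrete, here is a small sketch on made-up data (the example.com addresses are placeholders, not the real emails). It first reproduces the ValueError from the question, then shows the two-column form where the list of flags matches the list of columns.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the question.
df = pd.DataFrame({
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-22", "2016-12-19", "2016-12-22"]
    ),
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

# One column but two flags: the lengths differ, so pandas raises.
try:
    df.sort_values(by="date_of_last_hoorah_given", ascending=[True, False])
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Two columns, two flags: dates ascending, emails descending within ties.
two_col = df.sort_values(
    by=["date_of_last_hoorah_given", "email"], ascending=[True, False]
)
print(two_col)
```

Each entry in ascending applies to the column at the same position in by, which is why the two lists have to be the same length.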
If you need an ascending sort with the NaN (here NaT) values first, one possible solution is to split the DataFrame and concat the two parts:
df.date_of_last_hoorah_given = pd.to_datetime(df.date_of_last_hoorah_given)
df=df.sort_values(by='date_of_last_hoorah_given')
mask = df.date_of_last_hoorah_given.isnull()
print (pd.concat([df[mask], df[~mask]]))
date_of_last_hoorah_given email first_name
1 NaT [email protected] Flitterbeam
16 2016-12-19 07:36:08.000000 [email protected] Mindy
29 2016-12-19 07:36:08.000000 [email protected] Judi
7 2016-12-19 07:36:08.000000 [email protected] Chao
21 2016-12-19 07:36:08.000000 [email protected] Bala
12 2016-12-19 07:36:08.000000 [email protected] Pushpa
5 2016-12-22 07:36:08.000000 [email protected] Chen
6 2016-12-22 07:36:08.000000 [email protected] Guanting
13 2016-12-22 07:36:08.000000 [email protected] Beverly
3 2016-12-22 07:36:08.000000 [email protected] Colby
11 2016-12-22 07:36:08.000000 [email protected] Shirley
0 2016-12-22 07:36:08.000000 [email protected] Aditya
14 2016-12-22 07:36:08.000000 [email protected] Arlo
2 2016-12-22 07:36:08.000000 [email protected] Minerva
19 2016-12-22 07:36:08.000000 [email protected] Jiahao
23 2016-12-22 07:36:08.000000 [email protected] Shreya
15 2016-12-22 07:36:08.000000 [email protected] Janine
25 2016-12-22 07:36:08.000000 [email protected] Sweetgaze
27 2016-12-22 07:36:08.000000 [email protected] Twinklenose
28 2016-12-22 07:36:08.000000 [email protected] Sanjeev
30 2016-12-22 07:36:08.000000 [email protected] Sparrow
18 2016-12-22 10:55:03.474683 [email protected] Fen
9 2016-12-23 07:36:08.000000 [email protected] Kourtney
10 2016-12-23 14:30:55.206581 [email protected] Kailee
4 2016-12-24 07:36:08.000000 [email protected] Jing
...
...
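The split-and-concat idea can be checked on a toy frame. This is only a sketch under the assumption of a small hand-written dataset; the names sorted_df and nat_first are mine and do not appear in the answer above.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({
    "first_name": ["Kailee", "Flitterbeam", "Mindy", "Jing"],
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-23 14:30:55", None, "2016-12-19 07:36:08", "2016-12-24 07:36:08"]
    ),
})

# Ordinary ascending sort: NaT rows land at the bottom by default.
sorted_df = df.sort_values(by="date_of_last_hoorah_given")

# Split on the null mask and glue the NaT part back on top.
mask = sorted_df["date_of_last_hoorah_given"].isnull()
nat_first = pd.concat([sorted_df[mask], sorted_df[~mask]])

print(nat_first)
```

Because the frame is sorted before it is split, the non-null part keeps its ascending order after the concat.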
If you are arriving here from Google: there is now a kwarg that handles exactly this, called na_position:
df.sort_values(
['column1', 'column2'], # columns you want to sort by
na_position='first'
)
This puts your null values first (instead of last, which is the default).
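Here is a short runnable sketch of na_position on invented data, in case it helps to see both settings side by side. The DataFrame contents and the variable names nat_first and nat_last are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({
    "first_name": ["Ragnar", "Flitterbeam", "Mindy"],
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-29 15:38:02", None, "2016-12-19 07:36:08"]
    ),
})

# Ascending dates with the missing values moved to the top.
nat_first = df.sort_values("date_of_last_hoorah_given", na_position="first")

# Default behaviour (na_position="last") for comparison.
nat_last = df.sort_values("date_of_last_hoorah_given")

print(nat_first)
print(nat_last)
```

The two results contain the same rows in the same date order; only the position of the NaT row changes.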