Sorting a Pandas DataFrame with NaT values at the top

Question  Votes: 0  Answers: 3

I am trying to sort a Pandas DataFrame so that the NaT values end up at the top. I am using the df.sort_values function:

df=df.sort_values(by='date_of_last_hoorah_given')

This works, and I get a sorted DataFrame with the NaT values at the bottom:

    date_of_last_hoorah_given                              email   first_name  \
16 2016-12-19 07:36:08.000000              [email protected]        Mindy   
29 2016-12-19 07:36:08.000000              [email protected]         Judi   
7  2016-12-19 07:36:08.000000                  [email protected]         Chao   
21 2016-12-19 07:36:08.000000              [email protected]         Bala   
12 2016-12-19 07:36:08.000000            [email protected]       Pushpa   
30 2016-12-22 07:36:08.000000       [email protected]      Sparrow   
28 2016-12-22 07:36:08.000000         [email protected]      Sanjeev   
27 2016-12-22 07:36:08.000000     [email protected]  Twinklenose   
25 2016-12-22 07:36:08.000000       [email protected]    Sweetgaze   
23 2016-12-22 07:36:08.000000            [email protected]       Shreya   
19 2016-12-22 07:36:08.000000              [email protected]       Jiahao   
15 2016-12-22 07:36:08.000000            [email protected]       Janine   
14 2016-12-22 07:36:08.000000                [email protected]         Arlo   
0  2016-12-22 07:36:08.000000         [email protected]       Aditya   
11 2016-12-22 07:36:08.000000        [email protected]      Shirley   
2  2016-12-22 07:36:08.000000             [email protected]     Minerva    
3  2016-12-22 07:36:08.000000             [email protected]        Colby   
13 2016-12-22 07:36:08.000000            [email protected]      Beverly   
6  2016-12-22 07:36:08.000000             [email protected]     Guanting   
5  2016-12-22 07:36:08.000000                  [email protected]         Chen   
18 2016-12-22 10:55:03.474683                  [email protected]          Fen   
9  2016-12-23 07:36:08.000000             [email protected]     Kourtney   
10 2016-12-23 14:30:55.206581             [email protected]       Kailee   
4  2016-12-24 07:36:08.000000                [email protected]        Jing    
31 2016-12-24 16:02:28.945809               [email protected]         Rich   
24 2016-12-25 07:36:08.000000           [email protected]       Ganesh   
8  2016-12-26 07:36:08.000000               [email protected]          Xia   
20 2016-12-27 07:36:08.000000              [email protected]       Kinley   
22 2016-12-28 07:36:08.000000   [email protected]   Honeygleam   
26 2016-12-28 15:29:48.629929             [email protected]       Indira   
17 2016-12-29 02:27:11.125078             [email protected]        Ileen   
32 2016-12-29 15:38:02.335296            [email protected]       Ragnar   
1                         NaT  [email protected]  Flitterbeam   

But when I try to move them to the top with the following code:

df=df.sort_values(by='date_of_last_hoorah_given',ascending=[1,0])

I get a ValueError: Length of ascending (2) != length of by (1). The full stack trace is:

ValueError                                Traceback (most recent call last)
<ipython-input-107-948a8354aeeb> in <module>()
      1 cd = ClientData(1)
----> 2 cd.get_inactive_users()

<ipython-input-106-ed230054ea86> in get_inactive_users(self)
    346             inactive_users_result.append(user_dict)
    347         df=pd.DataFrame(inactive_users_result)
--> 348         df=df.sort_values(by='date_of_last_hoorah_given',ascending=[1,0])
    349         print(df)

C:\Users\aditya\Anaconda3\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
   3126         if com.is_sequence(ascending) and len(by) != len(ascending):
   3127             raise ValueError('Length of ascending (%d) != length of by (%d)' %
-> 3128                              (len(ascending), len(by)))
   3129         if len(by) > 1:
   3130             from pandas.core.groupby import _lexsort_indexer

ValueError: Length of ascending (2) != length of by (1)
python sorting pandas numpy
3 Answers
6
votes

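The problem is that NaT is placed last by default when sorting, as if it were the largest value. To sort the dates in ascending order while putting NaT first, at the top, you need to sort on two criteria.

np.lexsort sorts an array by any number of keys and returns an array of indices, much like np.argsort.

Also note that I put the notnull condition last in the list of keys passed to np.lexsort, because np.lexsort treats the last key as the primary one... I don't know why, but that's how it works.

So we effectively sort by df.date_of_last_hoorah_given.notnull() first: the non-null values are True, which sorts after False. Within each group, the rows are then ordered by the dates themselves.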

import numpy as np

dates = df.date_of_last_hoorah_given
# The last key (notnull) is the primary sort key, so NaT rows come first
sort_slice = np.lexsort([dates.values, dates.notnull().values])
df.iloc[sort_slice]
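For anyone who wants to try the np.lexsort approach without the original data, here is a minimal sketch. The three-row frame is made up for illustration; only the column name date_of_last_hoorah_given comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
df = pd.DataFrame({
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-22 07:36:08", None, "2016-12-19 07:36:08"]
    ),
    "first_name": ["Aditya", "Flitterbeam", "Mindy"],
})

dates = df["date_of_last_hoorah_given"]

# np.lexsort uses the LAST key as the primary key.
# notnull() is False for NaT, and False sorts before True,
# so the NaT row comes first; ties are then broken by the dates.
order = np.lexsort([dates.values, dates.notnull().values])
result = df.iloc[order]
print(result)
```

The returned order is an array of integer positions, which is why it is passed to df.iloc rather than df.loc.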

Or! As the OP pointed out in the comments, this gives the same result and is more direct:

df.sort_values('date_of_last_hoorah_given', na_position='first')

     date_of_last_hoorah_given                              email   first_name
1                          NaT  [email protected]  Flitterbeam
16  2016-12-19 07:36:08.000000              [email protected]        Mindy
29  2016-12-19 07:36:08.000000              [email protected]         Judi
7   2016-12-19 07:36:08.000000                  [email protected]         Chao
21  2016-12-19 07:36:08.000000              [email protected]         Bala
12  2016-12-19 07:36:08.000000            [email protected]       Pushpa
30  2016-12-22 07:36:08.000000       [email protected]      Sparrow
28  2016-12-22 07:36:08.000000         [email protected]      Sanjeev
27  2016-12-22 07:36:08.000000     [email protected]  Twinklenose
25  2016-12-22 07:36:08.000000       [email protected]    Sweetgaze
23  2016-12-22 07:36:08.000000            [email protected]       Shreya
19  2016-12-22 07:36:08.000000              [email protected]       Jiahao
15  2016-12-22 07:36:08.000000            [email protected]       Janine
14  2016-12-22 07:36:08.000000                [email protected]         Arlo
0   2016-12-22 07:36:08.000000         [email protected]       Aditya
11  2016-12-22 07:36:08.000000        [email protected]      Shirley
2   2016-12-22 07:36:08.000000             [email protected]      Minerva
3   2016-12-22 07:36:08.000000             [email protected]        Colby
13  2016-12-22 07:36:08.000000            [email protected]      Beverly
6   2016-12-22 07:36:08.000000             [email protected]     Guanting
5   2016-12-22 07:36:08.000000                  [email protected]         Chen
18  2016-12-22 10:55:03.474683                  [email protected]          Fen
9   2016-12-23 07:36:08.000000             [email protected]     Kourtney
10  2016-12-23 14:30:55.206581             [email protected]       Kailee
4   2016-12-24 07:36:08.000000                [email protected]         Jing
31  2016-12-24 16:02:28.945809               [email protected]         Rich
24  2016-12-25 07:36:08.000000           [email protected]       Ganesh
8   2016-12-26 07:36:08.000000               [email protected]          Xia
20  2016-12-27 07:36:08.000000              [email protected]       Kinley
22  2016-12-28 07:36:08.000000   [email protected]   Honeygleam
26  2016-12-28 15:29:48.629929             [email protected]       Indira
17  2016-12-29 02:27:11.125078             [email protected]        Ileen
32  2016-12-29 15:38:02.335296            [email protected]       Ragnar
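A small self-contained sketch of the na_position approach, again on made-up sample rows (only the column name is from the question). It also suggests that na_position can be combined with ascending=False, so the direction of the sort and the placement of NaT are controlled separately.

```python
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
df = pd.DataFrame({
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-22 07:36:08", None, "2016-12-19 07:36:08"]
    ),
    "first_name": ["Aditya", "Flitterbeam", "Mindy"],
})

# Ascending dates, with the NaT row moved to the top.
asc_first = df.sort_values("date_of_last_hoorah_given", na_position="first")

# Descending dates, still with the NaT row at the top.
desc_first = df.sort_values(
    "date_of_last_hoorah_given", ascending=False, na_position="first"
)

print(asc_first)
print(desc_first)
```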

4
votes

You cannot pass two values in ascending=[1,0], because you are sorting by only one column.

If you need a descending sort, use ascending=False (True is the default):

df=df.sort_values(by='date_of_last_hoorah_given',ascending=False)
print (df)
     date_of_last_hoorah_given                              email   first_name
1                          NaT  [email protected]  Flitterbeam
32  2016-12-29 15:38:02.335296            [email protected]       Ragnar
17  2016-12-29 02:27:11.125078             [email protected]        Ileen
26  2016-12-28 15:29:48.629929             [email protected]       Indira
22  2016-12-28 07:36:08.000000   [email protected]   Honeygleam
20  2016-12-27 07:36:08.000000              [email protected]       Kinley
8   2016-12-26 07:36:08.000000               [email protected]          Xia
24  2016-12-25 07:36:08.000000           [email protected]       Ganesh
31  2016-12-24 16:02:28.945809               [email protected]         Rich
4   2016-12-24 07:36:08.000000                [email protected]         Jing
10  2016-12-23 14:30:55.206581             [email protected]       Kailee
9   2016-12-23 07:36:08.000000             [email protected]     Kourtney
18  2016-12-22 10:55:03.474683                  [email protected]          Fen
3   2016-12-22 07:36:08.000000             [email protected]        Colby
5   2016-12-22 07:36:08.000000                  [email protected]         Chen
6   2016-12-22 07:36:08.000000             [email protected]     Guanting
13  2016-12-22 07:36:08.000000            [email protected]      Beverly
2   2016-12-22 07:36:08.000000             [email protected]      Minerva
11  2016-12-22 07:36:08.000000        [email protected]      Shirley
0   2016-12-22 07:36:08.000000         [email protected]       Aditya
14  2016-12-22 07:36:08.000000                [email protected]         Arlo
15  2016-12-22 07:36:08.000000            [email protected]       Janine
...
...

If you need to sort by two columns, the first ascending and the second descending:

df=df.sort_values(by=['date_of_last_hoorah_given', 'email'],ascending=[True, False])
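To make the length rule concrete, here is a rough sketch on invented rows (the email addresses are placeholders, not the ones from the question). It shows the mismatch raising a ValueError and the two-column call succeeding once ascending has one entry per column in by.

```python
import pandas as pd

# Hypothetical sample data; the example.com addresses are placeholders.
df = pd.DataFrame({
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-22 07:36:08", "2016-12-19 07:36:08", "2016-12-22 07:36:08"]
    ),
    "email": ["b@example.com", "c@example.com", "a@example.com"],
})

# One column in `by`, two flags in `ascending`: pandas rejects the call.
try:
    df.sort_values(by="date_of_last_hoorah_given", ascending=[True, False])
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Two columns in `by`, two flags in `ascending`: dates ascending,
# emails descending within equal dates.
result = df.sort_values(
    by=["date_of_last_hoorah_given", "email"], ascending=[True, False]
)
print(result)
```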

If you need an ascending sort with the NaN (NaT) rows first, one possible solution is to split the DataFrame and concat the parts:

df.date_of_last_hoorah_given = pd.to_datetime(df.date_of_last_hoorah_given)
df=df.sort_values(by='date_of_last_hoorah_given')
mask = df.date_of_last_hoorah_given.isnull()
print (pd.concat([df[mask], df[~mask]]))
    date_of_last_hoorah_given                              email   first_name
1                         NaT  [email protected]  Flitterbeam
16 2016-12-19 07:36:08.000000              [email protected]        Mindy
29 2016-12-19 07:36:08.000000              [email protected]         Judi
7  2016-12-19 07:36:08.000000                  [email protected]         Chao
21 2016-12-19 07:36:08.000000              [email protected]         Bala
12 2016-12-19 07:36:08.000000            [email protected]       Pushpa
5  2016-12-22 07:36:08.000000                  [email protected]         Chen
6  2016-12-22 07:36:08.000000             [email protected]     Guanting
13 2016-12-22 07:36:08.000000            [email protected]      Beverly
3  2016-12-22 07:36:08.000000             [email protected]        Colby
11 2016-12-22 07:36:08.000000        [email protected]      Shirley
0  2016-12-22 07:36:08.000000         [email protected]       Aditya
14 2016-12-22 07:36:08.000000                [email protected]         Arlo
2  2016-12-22 07:36:08.000000             [email protected]      Minerva
19 2016-12-22 07:36:08.000000              [email protected]       Jiahao
23 2016-12-22 07:36:08.000000            [email protected]       Shreya
15 2016-12-22 07:36:08.000000            [email protected]       Janine
25 2016-12-22 07:36:08.000000       [email protected]    Sweetgaze
27 2016-12-22 07:36:08.000000     [email protected]  Twinklenose
28 2016-12-22 07:36:08.000000         [email protected]      Sanjeev
30 2016-12-22 07:36:08.000000       [email protected]      Sparrow
18 2016-12-22 10:55:03.474683                  [email protected]          Fen
9  2016-12-23 07:36:08.000000             [email protected]     Kourtney
10 2016-12-23 14:30:55.206581             [email protected]       Kailee
4  2016-12-24 07:36:08.000000                [email protected]         Jing
...
...
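The split-and-concat idea can be reproduced on a tiny frame as follows. This is only a sketch on made-up rows; the column name is the one used in the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
df = pd.DataFrame({
    "date_of_last_hoorah_given": pd.to_datetime(
        ["2016-12-22 07:36:08", None, "2016-12-19 07:36:08"]
    ),
    "first_name": ["Aditya", "Flitterbeam", "Mindy"],
})

# Sort normally first (NaT lands at the bottom by default) ...
sorted_df = df.sort_values(by="date_of_last_hoorah_given")

# ... then stack the NaT rows on top of the non-null rows.
mask = sorted_df["date_of_last_hoorah_given"].isnull()
result = pd.concat([sorted_df[mask], sorted_df[~mask]])
print(result)
```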

0
votes

If you're coming from Google: there is now a kwarg that handles exactly this, called na_position:

df.sort_values(
    ['column1', 'column2'], # columns you want to sort by
    na_position='first'
)

This puts your null values first (rather than last, which is the default).
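As a quick check of how this behaves with more than one sort column, here is a minimal sketch. The names column1 and column2 follow the placeholder names in the snippet above, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the placeholder column names from the snippet above.
df = pd.DataFrame({
    "column1": [2.0, np.nan, 1.0, np.nan],
    "column2": ["b", "a", "c", "d"],
})

# Rows with a null in column1 come first; ties are broken by column2.
result = df.sort_values(["column1", "column2"], na_position="first")
print(result)
```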
