Memory problems when finding permutations in a Pandas DataFrame

Question

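I am trying to build a new DataFrame from the permutations of another DataFrame's index. Here is the original DataFrame. Price is the index.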

df1
Price     Bid   Ask
1          .01   .05
2          .04   .08
3          .1    .15  
.           .      .
130        2.50  3.00

The second DataFrame (df2) should take the index from df1 and contain the permutations of that index taken 4 prices at a time, as in the sample output below.

df2
 #     price1   price2   price3   price4
 1       1        2         3       4
 2       1        2         3       5
 3       1        2         3       6
 ..       ..       ..        ..      ..

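To do this I have been using itertools.permutations, but I run into memory problems and cannot generate such a large number of permutations. This is the code I use to build the permutations.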

price_combos = list(x for x in itertools.permutations(df1.index, 4))
df2 = pd.DataFrame(price_combos , columns=('price1', 'price2', 'price3', 'price4'))

Answer 1

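The dtypes may be what inflates the memory allocation.
- For your case, the best approach I found is to load the permutations into an np.array with the smallest suitable dtype (e.g. np.int8).
- If no dtype is specified when the DataFrame is created, the dtype will be int64.
- Keeping price_combos as a list and casting when the DataFrame is created, with pd.DataFrame(price_combos, dtype='int8'), takes 2.7 times longer than setting the dtype in np.array, as shown below.
If you are using Jupyter, its memory management is terrible.
- If you have the option of using an IDE (e.g. Spyder, PyCharm, VSCode), do so.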
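
To make the dtype point concrete, here is a rough back-of-the-envelope sketch of how large the result is for 130 index values taken 4 at a time. It counts only the raw array buffer and ignores the temporary Python list of tuples, which costs considerably more. It assumes Python 3.8+ for `math.perm`.

```python
import math
import numpy as np

n_items, k = 130, 4
n_rows = math.perm(n_items, k)  # 130 * 129 * 128 * 127 ordered selections
print(f"rows: {n_rows:,}")      # rows: 272,613,120

# raw buffer size of an (n_rows, k) array for a few integer dtypes
for dtype in (np.int64, np.int16, np.uint8):
    n_bytes = n_rows * k * np.dtype(dtype).itemsize
    print(f"{np.dtype(dtype).name:>6}: {n_bytes / 1e9:.2f} GB")
```

With the default int64 the four columns need about 8.7 GB, while a one-byte dtype needs about 1.1 GB, which matches the memory usage reported by df2.info() in the answer.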

import numpy as np
import pandas as pd
from itertools import permutations

# synthetic data set and create dataframe
np.random.seed(365)
data = {'Price': list(range(1, 131)),
        'Bid': [np.random.randint(1, 10)*0.1 for _ in range(130)]}

df1 = pd.DataFrame(data)
df1['Ask'] = df1.Bid + 0.15
df1.set_index('Price', inplace=True)

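# create price_combos
# note: int8 covers only -128..127, so index values 128-130 wrap to negative numbers
# (visible in the df2.tail() output); np.uint8 (0..255) uses the same 1 byte per value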
%%time
price_combos = np.array(list(permutations(df1.index, 4)), dtype=np.int8)
>>> Wall time: 2min 33s

# memory usage for price_combos is 1_090_452_592

price_combos[:6]
>>> array([[1, 2, 3, 4],
           [1, 2, 3, 5],
           [1, 2, 3, 6],
           [1, 2, 3, 7],
           [1, 2, 3, 8],
           [1, 2, 3, 9]], dtype=int8)
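
The `np.array(list(permutations(...)))` call above first materializes a Python list of roughly 272 million tuples, which by itself can need far more memory than the final array. The following is a minimal sketch of an alternative that fills a preallocated array chunk by chunk. It is not the answer's method: `permutations_array` and its `chunk_rows` parameter are hypothetical names introduced for illustration, and `np.uint8` is an assumed dtype chosen so that values up to 130 fit.

```python
import math
from itertools import islice, permutations

import numpy as np


def permutations_array(values, k, dtype=np.uint8, chunk_rows=1_000_000):
    """Fill an (n_perms, k) array chunk by chunk (hypothetical helper)."""
    values = list(values)
    info = np.iinfo(dtype)
    # refuse values the chosen dtype cannot represent instead of wrapping silently
    if min(values) < info.min or max(values) > info.max:
        raise ValueError(f"values do not fit in {np.dtype(dtype).name}")

    n_rows = math.perm(len(values), k)
    out = np.empty((n_rows, k), dtype=dtype)  # final array, allocated once
    perms = permutations(values, k)
    start = 0
    while start < n_rows:
        chunk = list(islice(perms, chunk_rows))  # at most chunk_rows tuples alive
        out[start:start + len(chunk)] = chunk
        start += len(chunk)
    return out


# small demo: 5 values taken 2 at a time gives 20 rows
demo = permutations_array(range(1, 6), 2, chunk_rows=7)
print(demo.shape)
print(demo[:3])
```

For the data in this answer the call would be `permutations_array(df1.index, 4)`. The loop still runs at Python speed, so it is unlikely to be faster (it has not been timed here). The intended benefit is a lower peak memory footprint.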

# create df2
%%time
df2 = pd.DataFrame(price_combos , columns=('price1', 'price2', 'price3', 'price4')) 
>>> Wall time: 2.03 ms

print(df2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272613120 entries, 0 to 272613119
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -----
 0   price1  int8 
 1   price2  int8 
 2   price3  int8 
 3   price4  int8 
dtypes: int8(4)
memory usage: 1.0 GB
df2.head()

   price1  price2  price3  price4
0       1       2       3       4
1       1       2       3       5
2       1       2       3       6
3       1       2       3       7
4       1       2       3       8
df2.tail()

           price1  price2  price3  price4
272613115    -126    -127    -128     123
272613116    -126    -127    -128     124
272613117    -126    -127    -128     125
272613118    -126    -127    -128     126
272613119    -126    -127    -128     127
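
The negative numbers in the df2.tail() output above are not real prices. They are the index values 128, 129 and 130 wrapped around, because int8 only covers -128 to 127. A small sketch of the effect, using `astype` on a few sample values, with `np.uint8` shown as an assumed replacement that keeps the one-byte footprint:

```python
import numpy as np

idx = np.array([126, 127, 128, 129, 130])

as_int8 = idx.astype(np.int8)    # signed: values above 127 wrap around
as_uint8 = idx.astype(np.uint8)  # unsigned: 0..255, still 1 byte per value

print(as_int8)   # [ 126  127 -128 -127 -126]
print(as_uint8)  # [126 127 128 129 130]
print(np.iinfo(np.int8).max, np.iinfo(np.uint8).max)  # 127 255
```

Passing `dtype=np.uint8` instead of `dtype=np.int8` when building price_combos should therefore keep the memory usage at about 1 GB while preserving the index values 128 to 130.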
