Hi. I'm collecting large-scale data from the web and storing it in pandas DataFrames. Memory usage turned out to be significant, so I tried to optimize it.

The optimization works by changing each column's dtype to match the data it actually holds. For example, the rating_*** columns, which only contain integers from 1 to 5, are converted to int8. However, after applying my memory-optimization functions to the collected data (I wrote two variants because of some edge cases), the results were unexpected.

The functions are designed not to operate in place, and I even used copy(deep=True) just in case. Nevertheless, applying these functions to the original DataFrame (df_test), which should leave it untouched and return a new, memory-optimized DataFrame, changes the reported memory usage of df_test itself.

To investigate, I created a deep copy of df_test, called df_backup, before running the code, and repeated the experiment. This time the memory usage of df_backup changed as well.

Suspecting an environment issue, I tested in two different setups. The same problem appears on Windows 10 with Python 3.11 and pandas 2.1.3, and on Linux with Python 3.10 and pandas 1.5.3 (Google Colab).

Whether I measure memory with df.memory_usage(deep=True) or with sys.getsizeof(df), I observe the same behavior.

I've spent 4 hours trying to solve this without success. What could the problem be? Why does the reported memory usage increase after optimization?
import pandas as pd
import numpy as np
import sys  # only needed when measuring with sys.getsizeof

# ----- Define memory optimize functions -----
def squeeze_memory(raw_df):
    df = raw_df.copy()
    # Change rating columns to int8
    for col in ['rating_overall', 'rating_promotion_opportunities', 'rating_welfare_and_salary', 'rating_work_life_balance', 'rating_corporate_culture', 'rating_management']:
        df[col] = df[col].astype('int8')
    # Convert the 'date_written' to datetime
    df['date_written'] = pd.to_datetime(df['date_written'], errors='coerce')
    # Convert columns with a small number of unique values to category
    category_cols = ['company_name', 'industry', 'job', 'employment_status', 'location', 'one_year_later', 'recommendation']
    for col in category_cols:
        # print(col, len(df[col].unique()))
        df[col] = df[col].astype('category')
    return df

def optimize_memory(df):
    # Internally define the optimal types for columns
    cat_cols = ['company_name', 'industry', 'job', 'employment_status', 'location', 'one_year_later', 'recommendation']
    int_cols = ['rating_overall', 'rating_promotion_opportunities', 'rating_welfare_and_salary', 'rating_work_life_balance', 'rating_corporate_culture', 'rating_management']
    dt_cols = ['date_written']
    # Create an empty dataframe
    my_df = pd.DataFrame()
    for col in df.columns:
        # Optimize column types
        if col in cat_cols:
            my_df[col] = df[col].copy().astype('category')
        elif col in int_cols:
            # Rating columns are sufficient with int8
            my_df[col] = df[col].copy().astype('int8')
        elif col in dt_cols:
            # Convert date columns to datetime, specify format
            my_df[col] = pd.to_datetime(df[col].copy(), format='%Y. %m', errors='coerce')
        else:
            # Check if it's possible to reduce memory usage for the other columns
            if df[col].copy().dtype == 'float64':
                # Change float64 columns to float32
                my_df[col] = df[col].copy().astype('float32')
            elif df[col].copy().dtype == 'object':
                # Optimize object type columns considering the length of strings and the number of unique values
                num_unique_values = df[col].copy().nunique()
                num_total_values = len(df[col].copy())
                if num_unique_values / num_total_values < 0.5:
                    # If the ratio of unique values to total values is low, convert to category type
                    my_df[col] = df[col].copy().astype('category')
                else:
                    # Otherwise, maintain the existing type
                    my_df[col] = df[col].copy()
            else:
                # In other cases, maintain the existing type
                my_df[col] = df[col].copy()
    return my_df
# ----- Set dataframe for test -----
# Constructing a random df_test DataFrame
np.random.seed(0)
df_test = pd.DataFrame({
    'company_name': np.random.choice(['Company A', 'Company B', 'Company C', 'Company D'], size=2000),
    'industry': np.random.choice(['IT', 'Service', 'Manufacturing', 'Finance'], size=2000),
    'job': np.random.choice(['Development', 'Marketing', 'HR', 'Sales'], size=2000),
    'employment_status': np.random.choice(['Employed', 'Resigned'], size=2000),
    'location': np.random.choice(['Seoul', 'Busan', 'Daegu', 'Gwangju'], size=2000),
    'date_written': np.random.choice(pd.date_range(start='2005-01-01', periods=7000, freq='D').astype(str), size=2000),
    'rating_overall': np.random.randint(1, 6, size=2000),
    'rating_promotion_opportunities': np.random.randint(1, 6, size=2000),
    'rating_welfare_and_salary': np.random.randint(1, 6, size=2000),
    'rating_work_life_balance': np.random.randint(1, 6, size=2000),
    'rating_corporate_culture': np.random.randint(1, 6, size=2000),
    'rating_management': np.random.randint(1, 6, size=2000),
    'title': [f'Title_{i}'*250 for i in range(2000)],
    'pros': [f'Pros_{i}'*250 for i in range(2000)],
    'cons': [f'Cons_{i}'*250 for i in range(2000)],
    'advice_to_management': [f'Advice_{i}'*200 for i in range(2000)],
    'one_year_later': np.random.choice(['Yes', 'No'], size=2000),
    'recommendation': np.random.choice(['Recommend', 'Do Not Recommend'], size=2000)
})
# ----- Run optimization and check memory usages -----
# Code to check memory usage
def memory_usage_of_dataframe(my_df):
    mem_usage = my_df.memory_usage(deep=True).sum()
    # mem_usage = sys.getsizeof(my_df)
    return mem_usage
df_backup = df_test.copy(deep=True)
print('df_test:', memory_usage_of_dataframe(df_test))
print('df_backup:', memory_usage_of_dataframe(df_backup))
# Original DataFrame's memory usage
original_memory = memory_usage_of_dataframe(df_test)
# Applying the optimize_memory function
optimized_memory = memory_usage_of_dataframe(optimize_memory(df_test))
# Applying the squeeze_memory function
squeezed_memory = memory_usage_of_dataframe(squeeze_memory(df_test))
print( f"1st measure: {original_memory:,} | {optimized_memory:,} | {squeezed_memory:,}" )
# Original DataFrame's memory usage
original_memory = memory_usage_of_dataframe(df_test)
# Applying the optimize_memory function
optimized_memory = memory_usage_of_dataframe(optimize_memory(df_test))
# Applying the squeeze_memory function
squeezed_memory = memory_usage_of_dataframe(squeeze_memory(df_test))
print( f"2nd measure: {original_memory:,} | {optimized_memory:,} | {squeezed_memory:,}" )
print('df_test:', memory_usage_of_dataframe(df_test))
print('df_backup:', memory_usage_of_dataframe(df_backup))
# ----- Result -----
df_test: 26576634
df_backup: 26576634
1st measure: 26,576,634 | 45,045,679 | 45,045,679
2nd measure: 46,533,004 | 45,045,679 | 45,045,679
df_test: 46533004
df_backup: 46533004
I've tried everything I can think of.
Your code already reduces memory usage. There is still some room for improvement by casting the long string columns to the string[pyarrow] dtype. Here is the improved version:
def squeeze_memory(raw_df):
    df = raw_df.copy()
    # Change rating columns to int8
    for col in ['rating_overall', 'rating_promotion_opportunities', 'rating_welfare_and_salary', 'rating_work_life_balance', 'rating_corporate_culture', 'rating_management']:
        df[col] = df[col].astype('int8')
    # Convert the 'date_written' to datetime
    df['date_written'] = pd.to_datetime(df['date_written'], errors='coerce')
    # Convert columns with a small number of unique values to category
    category_cols = ['company_name', 'industry', 'job', 'employment_status', 'location', 'one_year_later', 'recommendation']
    for col in category_cols:
        df[col] = df[col].astype('category')
    # Store long free-text columns as Arrow-backed strings
    for col in ['title', 'pros', 'cons', 'advice_to_management']:
        df[col] = df[col].astype('string[pyarrow]')
    return df
You can benchmark the reduction column by column with a single loop:
sm_df = squeeze_memory(df_test)
benchmark = []
for col in sm_df.columns:
    col_mem = memory_usage_of_dataframe(sm_df.loc[:, [col]])
    orig_mem = memory_usage_of_dataframe(df_test.loc[:, [col]])
    benchmark.append([col, sm_df[col].dtype, col_mem, df_test[col].dtype, orig_mem, orig_mem - col_mem])
pd.DataFrame(benchmark, columns=["column name", "new type", "new mem size", "old type", "old mem size", "mem improvement"])
Every column does shrink. It can't work miracles, though, since the strings you store are very long and that is what takes up most of the space. If this is still not enough, I'd suggest trying something like polars to see whether it gives you the performance you're after.