Hi. I'm collecting large-scale data from the web and storing it in pandas DataFrames. Memory usage turned out to be significant, so I tried to optimize it.

The optimization works by changing each column's dtype to match the data it actually holds. For example, the rating_*** columns, which only contain integers from 1 to 5, are converted to int8. However, after applying my memory-optimization functions to the collected data (I wrote two variants because of some edge cases), the results were unexpected.

The functions are designed not to operate in place, and I even used copy(deep=True) just in case. Nevertheless, applying these functions to the original DataFrame (df_test), which should leave it untouched and return a new, memory-optimized DataFrame, changes the reported memory usage of df_test itself.

To investigate, I created a deep copy of df_test, called df_backup, before running the code, and repeated the experiment. This time the memory usage of df_backup changed as well.

Suspecting an environment issue, I tested in two different setups. The same problem appears on Windows 10 with Python 3.11 and pandas 2.1.3, and on Linux with Python 3.10 and pandas 1.5.3 (Google Colab).

Whether I measure memory with df.memory_usage(deep=True) or with sys.getsizeof(df), I observe the same behavior.

I've spent 4 hours trying to solve this without success. What could the problem be? Why does the reported memory usage increase after optimization?
import pandas as pd
import numpy as np
import sys  # only needed when measuring with sys.getsizeof

# ----- Define memory optimize functions -----
def squeeze_memory(raw_df):
    df = raw_df.copy()
    # Change rating columns to int8
    for col in ['rating_overall', 'rating_promotion_opportunities', 'rating_welfare_and_salary', 'rating_work_life_balance', 'rating_corporate_culture', 'rating_management']:
        df[col] = df[col].astype('int8')
    # Convert the 'date_written' to datetime
    df['date_written'] = pd.to_datetime(df['date_written'], errors='coerce')
    # Convert columns with a small number of unique values to category
    category_cols = ['company_name', 'industry', 'job', 'employment_status', 'location', 'one_year_later', 'recommendation']
    for col in category_cols:
        # print(col, len(df[col].unique()))
        df[col] = df[col].astype('category')
    return df

def optimize_memory(df):
    # Internally define the optimal types for columns
    cat_cols = ['company_name', 'industry', 'job', 'employment_status', 'location', 'one_year_later', 'recommendation']
    int_cols = ['rating_overall', 'rating_promotion_opportunities', 'rating_welfare_and_salary', 'rating_work_life_balance', 'rating_corporate_culture', 'rating_management']
    dt_cols = ['date_written']
    # Create an empty dataframe
    my_df = pd.DataFrame()
    for col in df.columns:
        # Optimize column types
        if col in cat_cols:
            my_df[col] = df[col].copy().astype('category')
        elif col in int_cols:
            # Rating columns are sufficient with int8
            my_df[col] = df[col].copy().astype('int8')
        elif col in dt_cols:
            # Convert date columns to datetime, specify format
            my_df[col] = pd.to_datetime(df[col].copy(), format='%Y. %m', errors='coerce')
        else:
            # Check if it's possible to reduce memory usage for the other columns
            if df[col].copy().dtype == 'float64':
                # Change float64 columns to float32
                my_df[col] = df[col].copy().astype('float32')
            elif df[col].copy().dtype == 'object':
                # Optimize object type columns considering the length of strings and the number of unique values
                num_unique_values = df[col].copy().nunique()
                num_total_values = len(df[col].copy())
                if num_unique_values / num_total_values < 0.5:
                    # If the ratio of unique values to total values is low, convert to category type
                    my_df[col] = df[col].copy().astype('category')
                else:
                    # Otherwise, maintain the existing type
                    my_df[col] = df[col].copy()
            else:
                # In other cases, maintain the existing type
                my_df[col] = df[col].copy()
    return my_df
# ----- Set dataframe for test -----
# Constructing a random df_test DataFrame
np.random.seed(0)
df_test = pd.DataFrame({
    'company_name': np.random.choice(['Company A', 'Company B', 'Company C', 'Company D'], size=2000),
    'industry': np.random.choice(['IT', 'Service', 'Manufacturing', 'Finance'], size=2000),
    'job': np.random.choice(['Development', 'Marketing', 'HR', 'Sales'], size=2000),
    'employment_status': np.random.choice(['Employed', 'Resigned'], size=2000),
    'location': np.random.choice(['Seoul', 'Busan', 'Daegu', 'Gwangju'], size=2000),
    'date_written': np.random.choice(pd.date_range(start='2005-01-01', periods=7000, freq='D').astype(str), size=2000),
    'rating_overall': np.random.randint(1, 6, size=2000),
    'rating_promotion_opportunities': np.random.randint(1, 6, size=2000),
    'rating_welfare_and_salary': np.random.randint(1, 6, size=2000),
    'rating_work_life_balance': np.random.randint(1, 6, size=2000),
    'rating_corporate_culture': np.random.randint(1, 6, size=2000),
    'rating_management': np.random.randint(1, 6, size=2000),
    'title': [f'Title_{i}'*250 for i in range(2000)],
    'pros': [f'Pros_{i}'*250 for i in range(2000)],
    'cons': [f'Cons_{i}'*250 for i in range(2000)],
    'advice_to_management': [f'Advice_{i}'*200 for i in range(2000)],
    'one_year_later': np.random.choice(['Yes', 'No'], size=2000),
    'recommendation': np.random.choice(['Recommend', 'Do Not Recommend'], size=2000)
})
# ----- Run optimization and check memory usages -----
# Code to check memory usage
def memory_usage_of_dataframe(my_df):
    mem_usage = my_df.memory_usage(deep=True).sum()
    # mem_usage = sys.getsizeof(my_df)
    return mem_usage
df_backup = df_test.copy(deep=True)
print('df_test:', memory_usage_of_dataframe(df_test))
print('df_backup:', memory_usage_of_dataframe(df_backup))
# Original DataFrame's memory usage
original_memory = memory_usage_of_dataframe(df_test)
# Applying the optimize_memory function
optimized_memory = memory_usage_of_dataframe(optimize_memory(df_test))
# Applying the squeeze_memory function
squeezed_memory = memory_usage_of_dataframe(squeeze_memory(df_test))
print( f"1st measure: {original_memory:,} | {optimized_memory:,} | {squeezed_memory:,}" )
# Original DataFrame's memory usage
original_memory = memory_usage_of_dataframe(df_test)
# Applying the optimize_memory function
optimized_memory = memory_usage_of_dataframe(optimize_memory(df_test))
# Applying the squeeze_memory function
squeezed_memory = memory_usage_of_dataframe(squeeze_memory(df_test))
print( f"2nd measure: {original_memory:,} | {optimized_memory:,} | {squeezed_memory:,}" )
print('df_test:', memory_usage_of_dataframe(df_test))
print('df_backup:', memory_usage_of_dataframe(df_backup))
# ----- Result -----
df_test: 26576634
df_backup: 26576634
1st measure: 26,576,634 | 45,045,679 | 45,045,679
2nd measure: 46,533,004 | 45,045,679 | 45,045,679
df_test: 46533004
df_backup: 46533004
I've tried everything I can think of.
Your code already reduces memory usage. There is still some room for improvement by casting the long string columns to the string[pyarrow] dtype. Here is the improved version:
def squeeze_memory(raw_df):
    df = raw_df.copy()
    # Change rating columns to int8
    for col in ['rating_overall', 'rating_promotion_opportunities', 'rating_welfare_and_salary', 'rating_work_life_balance', 'rating_corporate_culture', 'rating_management']:
        df[col] = df[col].astype('int8')
    # Convert the 'date_written' to datetime
    df['date_written'] = pd.to_datetime(df['date_written'], errors='coerce')
    # Convert columns with a small number of unique values to category
    category_cols = ['company_name', 'industry', 'job', 'employment_status', 'location', 'one_year_later', 'recommendation']
    for col in category_cols:
        df[col] = df[col].astype('category')
    # Store long free-text columns as Arrow-backed strings
    for col in ['title', 'pros', 'cons', 'advice_to_management']:
        df[col] = df[col].astype('string[pyarrow]')
    return df
You can benchmark the reduction column by column with a single loop:
sm_df = squeeze_memory(df_test)
benchmark = []
for col in sm_df.columns:
    col_mem = memory_usage_of_dataframe(sm_df.loc[:, [col]])
    orig_mem = memory_usage_of_dataframe(df_test.loc[:, [col]])
    benchmark.append([col, sm_df[col].dtype, col_mem, df_test[col].dtype, orig_mem, orig_mem - col_mem])
pd.DataFrame(benchmark, columns=["column name", "new type", "new mem size", "old type", "old mem size", "mem improvement"])
Every column does shrink. It can't work miracles, though, since the strings you store are very long and that is what takes up most of the space. If this is still not enough, I'd suggest trying something like polars to see whether it gives you the performance you're after.