Python matplotlib histogram is very slow

Question  Votes: -1  Answers: 2

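I am trying to plot a histogram of the data in a .csv file, but when I run the code it is extremely slow. I waited 20 minutes and still did not get a plot. What is the problem?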

My code is below.

import pandas as pd
import matplotlib.pyplot as plt

spy = pd.read_csv( 'SPY.csv' )
stock_price_spy = spy.values[ :, 5 ]

n, bins, patches = plt.hist( stock_price_spy, 50 )
plt.show()
python pandas matplotlib histogram
2 Answers
0
Votes

I did the following, and it seems to solve the problem.

It seems that "stock_price_spy = spy['Adj Close'].values" gives a true numpy ndarray.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

spy = pd.read_csv( 'SPY.csv' )
stock_price_spy = spy[ 'Adj Close' ].values

plt.hist( stock_price_spy, bins = 100, label = 'S&P 500 ETF', alpha = 0.8 )
plt.show()
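
A likely reason this helps is the dtype of the resulting array rather than its type: both expressions return a numpy ndarray, but slicing `.values` of a frame that mixes text and numbers yields an `object` array, while a single numeric column keeps `float64`. The sketch below illustrates this with a two-row sample; the column layout is an assumption modelled on a typical price CSV, not the asker's actual SPY.csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the layout of a typical price CSV (assumed columns).
csv_text = (
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2018-01-02,267.84,268.81,267.40,268.77,268.77,86655700\n"
    "2018-01-03,268.96,270.64,268.96,270.47,270.47,90070400\n"
)
spy = pd.read_csv(io.StringIO(csv_text))

# The frame mixes text (Date) with numbers, so the 2-D array is object dtype.
by_position = spy.values[:, 5]
# Selecting one numeric column by name keeps its native float dtype.
by_name = spy["Adj Close"].values
# An explicit cast also turns the positional slice into a float array.
converted = spy.values[:, 5].astype(float)

print(by_position.dtype)  # object
print(by_name.dtype)      # float64
print(converted.dtype)    # float64
```

If selecting by position is preferred, the explicit `astype(float)` cast gives the same numeric array as selecting the column by name.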

-6
Votes

In fact, you are using a very inefficient way to reach your goal; you need to use numpy to improve performance.

import numpy as np
import matplotlib.pyplot as plt

stock_price_spy = np.loadtxt('SPY.csv', dtype=float, delimiter=',', skiprows=1, usecols=4)

# Only the 5th column of the CSV is loaded here, which reduces memory use.

n, bins, patches = plt.hist( stock_price_spy, 50 )
plt.show()

I haven't tested it, but it should work.
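
The idea of loading only the column that gets plotted can be sketched with either library. The example below is a minimal illustration on an in-memory sample; the column names are assumptions, and "Adj Close" is taken from the other answer. Note that index 5 matches the question's `spy.values[:, 5]`, whereas `usecols=4` selects the column before it.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the layout of a typical price CSV (assumed columns).
csv_text = (
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2018-01-02,267.84,268.81,267.40,268.77,268.77,86655700\n"
    "2018-01-03,268.96,270.64,268.96,270.47,270.47,90070400\n"
)

# pandas: parse only the named column, then take it as a float array.
prices_pd = pd.read_csv(io.StringIO(csv_text), usecols=["Adj Close"])["Adj Close"].to_numpy()

# numpy: skip the header row and read only the column at index 5.
prices_np = np.loadtxt(io.StringIO(csv_text), dtype=float,
                       delimiter=",", skiprows=1, usecols=5)

print(prices_pd.dtype, prices_np.dtype)
print(np.array_equal(prices_pd, prices_np))
```

Both calls return a one-dimensional float array, so either result can be passed straight to `plt.hist`.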

I suggest you use Intel's optimized Python distribution (Intel python distribution). It handles this kind of workload better.

Adding code for testing, because some people are trying to misinform and are missing the real arguments: pandas uses DataFrames, which are dictionaries, not numpy arrays, and numpy arrays are almost twice as fast.

import numpy as np
import pandas as pd
import random
import csv
import matplotlib.pyplot as plt
import time

# Create a random CSV file, 6 x 4871, simulating the problem.
rows = 4871
columns = 6
fields = ['one', 'two', 'three', 'four', 'five', 'six']

# The with block closes the file, so every row is flushed before it is read back.
with open("random.csv", "w", newline="") as csv_file:
    write_a_csv = csv.DictWriter(csv_file, fieldnames=fields)
    # Write the header row that read_csv expects and that skiprows=1 skips.
    write_a_csv.writeheader()
    for i in range(rows):
        write_a_csv.writerow({field: random.random() for field in fields})

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
# Note: plt.show() normally blocks until the window is closed, so that wait is timed too.
start_old = time.perf_counter()
spy = pd.read_csv( 'random.csv' )
print(type(spy))
stock_price_spy = spy.values[ :, 5 ]
n, bins, patches = plt.hist( stock_price_spy, 50 )

plt.show()
end_old = time.perf_counter()
total_time_old = end_old - start_old
print(total_time_old)

start_new = time.perf_counter()

stock_price_spy_new = np.loadtxt('random.csv', dtype=float,
                                 delimiter=',', skiprows=1, usecols=4)
print(type(stock_price_spy_new))
# Only the 5th column of the CSV is loaded here, which reduces memory use.

n, bins, patches = plt.hist( stock_price_spy_new, 50 )
plt.show()
end_new = time.perf_counter()

total_time_new = end_new - start_new
print(total_time_new)
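
Because the timings above include the time the plot window stays open, they may say little about loading speed. As a rough sketch only, the version below times just the loading step for each library on the same in-memory data and bins the result with `np.histogram` instead of opening a window. The helper `time_call` and the synthetic data are illustrative assumptions, and an all-numeric file like this one does not reproduce the mixed text and number columns of a real price CSV.

```python
import io
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV: 4871 rows x 6 columns of random floats.
rng = np.random.default_rng(0)
data = rng.random((4871, 6))
header = "one,two,three,four,five,six"
csv_text = header + "\n" + "\n".join(
    ",".join(map(repr, row)) for row in data.tolist()
) + "\n"


def time_call(func):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Load the same column (index 4) with each library so the comparison is like for like.
via_pandas, t_pandas = time_call(
    lambda: pd.read_csv(io.StringIO(csv_text)).values[:, 4]
)
via_numpy, t_numpy = time_call(
    lambda: np.loadtxt(io.StringIO(csv_text), dtype=float,
                       delimiter=",", skiprows=1, usecols=4)
)

# Bin the values without any plotting, so no GUI time is involved.
counts, edges = np.histogram(via_numpy, bins=50)

print(f"pandas load: {t_pandas:.4f} s")
print(f"numpy load:  {t_numpy:.4f} s")
print(counts.sum(), len(edges))
```

A single call is a noisy measurement; repeating each load several times (for example with the standard `timeit` module) would give steadier numbers.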