How do I resolve a kernel error or memory error?

Question

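I have an array of 50,000 strings. I am trying to create a similarity matrix of dimension 50000 * 50000. To build it, I tried to form a list of tuples with the following code: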

terms = [element for element in itertools.product(array1,array1)]

But I get a memory error or a kernel error, and the computation never gets any further.

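I am also following this Stack Overflow question: Spark Unique pair in cartesian product, which is very similar to my implementation for computing distances (because of symmetry, I only need the upper or lower triangle of the matrix). Is there a way to do this with Spark, or by any other means, using partitioning or some other approach? Any ideas would be appreciated.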

Toy example with a small array:

import itertools
import numpy as np
array1 = np.array(['hello', 'world', 'thankyou'])
terms = [element for element in itertools.product(array1,array1)]

Output of terms:

[('hello', 'hello'),
 ('hello', 'world'),
 ('hello', 'thankyou'),
 ('world', 'hello'),
 ('world', 'world'),
 ('world', 'thankyou'),
 ('thankyou', 'hello'),
 ('thankyou', 'world'),
 ('thankyou', 'thankyou')]

Answer 1

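50000 * 50000 is more than 2 billion (about 2.5 billion) elements in the list. Each list element takes 4 bytes (plus 36 bytes of overhead for the list itself). Multiply that by the average string length (6 in your example) plus 21 (the number of overhead bytes per string). This means you need well over 216 GB of RAM for this single statement, and that is on top of the memory used by your operating system, other programs, and so on. I think you are hitting real-world limits and need to find a better algorithm.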
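One possible direction, sketched here as an illustration and not as the answerer's own method: since the question notes the matrix is symmetric, the pairs can be generated lazily and restricted to the upper triangle instead of being stored in a list. The helper name `pair_chunks` and the `chunk_size` parameter are hypothetical, and a plain list stands in for the NumPy array from the toy example.

```python
import itertools


def pair_chunks(items, chunk_size):
    """Lazily yield lists of at most chunk_size unique pairs (i < j).

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    # combinations() walks the strict upper triangle (diagonal excluded).
    pairs = itertools.combinations(items, 2)
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# Pair counts for the real problem size, computed without building anything.
n = 50000
full_pairs = n * n                 # every ordered pair, as itertools.product gives
unique_pairs = n * (n - 1) // 2    # strict upper triangle only

# Same values as the toy example in the question.
array1 = ['hello', 'world', 'thankyou']
chunks = list(pair_chunks(array1, chunk_size=2))
for chunk in chunks:
    print(chunk)
```

Even the upper triangle still holds about 1.25 billion pairs for 50,000 strings, so each chunk would need to be scored and then reduced or written out before the next one is requested, never collected into a single list.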
