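When I try to detect anomalies with the isolation forest method, I get an error on the dataset I actually need. A completely different dataset works fine, though. What could be causing this?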
isolationforest Model Build progress: | (failed) | 0%
Traceback (most recent call last):
  File "h2o_test.py", line 149, in <module>
    isoforest.train(x=iso_forest.col_names[0:65], training_frame=iso_forest)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/estimators/estimator_base.py", line 107, in train
    self._train(parms, verbose=verbose)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/estimators/estimator_base.py", line 199, in _train
    job.poll(poll_updates=self._print_model_scoring_history if verbose else None)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 89, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
OSError: Job with key $03017f00000132d4ffffffff$_92ee3e892f7bc86460e80153eaec4b70 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
    at hex.tree.DHistogram.init(DHistogram.java:350)
    at hex.tree.DHistogram.init(DHistogram.java:343)
    at hex.tree.ScoreBuildHistogram2$ComputeHistoThread.computeChunk(ScoreBuildHistogram2.java:427)
    at hex.tree.ScoreBuildHistogram2$ComputeHistoThread.map(ScoreBuildHistogram2.java:408)
    at water.LocalMR.compute2(LocalMR.java:89)
    at water.LocalMR.compute2(LocalMR.java:81)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1704)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.popAndExecAll(ForkJoinPool.java:906)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
import h2o

# text_stream is defined earlier in the script (not shown here)
with open('/home/webapp/flask-api/tmp_rows/temp_file2.csv', 'w+') as tmp_file:
    temp_name = "/tmp_rows/temp_file2.csv"
    tmp_file.write(text_stream.getvalue())
# no explicit close() is needed: the with block closes the file on exit

h2o.init()
print("TEMP_nAME", temp_name)
iso_forest = h2o.import_file('/home/webapp/flask-api/{0}'.format(temp_name))

seed = 12345
ntrees = 100
isoforest = h2o.estimators.H2OIsolationForestEstimator(
    ntrees=ntrees, seed=seed)
# train on the first 65 columns of the imported frame
isoforest.train(x=iso_forest.col_names[0:65], training_frame=iso_forest)
predictions = isoforest.predict(iso_forest)
print(predictions)
h2o.cluster().shutdown()
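The CSV is created correctly, so that does not seem to be the problem. What is causing this Java error? I even increased the EC2 instance size to get more RAM, but that did not fix it either.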
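I'm guessing this will get close votes, because it is going to be the data that causes the problem, and no data was given. But maybe your data can't be shared, or there is too much of it.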
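So I'd suggest trying with just the first half or just the second half of the data. If only one of the halves triggers the error, keep repeating the split on that half and see if you can narrow it down to a single row.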
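As a rough sketch of that halving procedure: the helper names below (`bisect_failing_rows`, `h2o_training_fails`) are made up for illustration and are not part of H2O. It assumes the failure comes from one contiguous block of rows, and that the wrapped exception text still contains "AssertionError" as in the traceback in the question.

```python
from typing import Callable, Tuple


def bisect_failing_rows(n_rows: int,
                        fails: Callable[[int, int], bool]) -> Tuple[int, int]:
    """Shrink the row range [lo, hi) for as long as one half alone still fails.

    fails(lo, hi) should return True when training on rows lo..hi-1
    reproduces the error.
    """
    lo, hi = 0, n_rows
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(lo, mid):        # first half alone reproduces the error
            hi = mid
        elif fails(mid, hi):      # second half alone reproduces the error
            lo = mid
        else:
            # Neither half fails on its own, so the error depends on the
            # combination of rows; stop and report the current range.
            break
    return lo, hi


def h2o_training_fails(frame, x_cols, lo: int, hi: int) -> bool:
    """Hypothetical hook: train an isolation forest on rows lo..hi-1."""
    # Imported here so the bisection helper itself runs without H2O installed.
    from h2o.estimators import H2OIsolationForestEstimator

    model = H2OIsolationForestEstimator(ntrees=100, seed=12345)
    try:
        model.train(x=x_cols, training_frame=frame[lo:hi, :])
        return False
    except Exception as exc:
        # Assumption: only count the same Java assertion as "the" failure.
        return "AssertionError" in str(exc)
```

With the frame from the question, this might be called as `bisect_failing_rows(iso_forest.nrows, lambda lo, hi: h2o_training_fails(iso_forest, iso_forest.col_names[0:65], lo, hi))`. Each probe retrains the model, so on a large dataset it may be worth lowering `ntrees` while searching.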
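The same goes for columns: for example, try 10-15 columns at a time and see whether just one column, or perhaps a certain type of column, triggers it.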
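The column check could be sketched along the same lines. Again, `chunked`, `failing_column_groups` and `h2o_columns_fail` are hypothetical names, the group size of 10 is arbitrary, and matching on "AssertionError" in the exception text is an assumption based on the traceback in the question.

```python
from typing import Callable, List, Sequence


def chunked(cols: Sequence[str], size: int) -> List[List[str]]:
    """Split a column list into consecutive groups of at most `size` names."""
    return [list(cols[i:i + size]) for i in range(0, len(cols), size)]


def failing_column_groups(cols: Sequence[str],
                          fails: Callable[[List[str]], bool],
                          size: int = 10) -> List[List[str]]:
    """Return every group of columns for which fails(group) is True."""
    return [group for group in chunked(cols, size) if fails(group)]


def h2o_columns_fail(frame, group: List[str]) -> bool:
    """Hypothetical hook: train an isolation forest on one group of columns."""
    # Imported here so the helpers above run without H2O installed.
    from h2o.estimators import H2OIsolationForestEstimator

    model = H2OIsolationForestEstimator(ntrees=100, seed=12345)
    try:
        model.train(x=group, training_frame=frame)
        return False
    except Exception as exc:
        # Assumption: only count the same Java assertion as "the" failure.
        return "AssertionError" in str(exc)
```

For the frame in the question, `failing_column_groups(iso_forest.col_names[0:65], lambda g: h2o_columns_fail(iso_forest, g))` would list the suspicious groups; rerunning with `size=1` on those columns narrows it to individual columns. Comparing the culprits against `iso_forest.types` may show whether one column type is responsible.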
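Of course, once you have that, you also have your workaround: exclude the troublesome columns or rows. But you will then also have enough information to file a bug report with H2O (it looks like that can be done at https://github.com/h2oai/h2o-3/issues).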