H2O cluster resources not shared during XGBoost model training


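I am using XGBoost for my model. I noticed that the H2O cluster does not share memory while the model is being processed: RAM utilization on the master A server is very high, while RAM utilization on master B is very low. I checked the H2O logs on both servers and found that the master A log file is updated continuously during model processing, but the master B log file is not. It only shows the entries from cluster creation.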

Sometimes, while the model is being processed, the H2O jar on master A goes down because of high memory usage.

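I am using H2O version 3.36.1.1 and created a two-node cluster. The cluster was created successfully, and the cluster details were recorded in the log files.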

I checked connectivity between master A and master B by running curl from both sides. Everything works, and the cluster is running fine.
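
The curl check described above could also be scripted so that both nodes are asked the same question. This is a minimal sketch, assuming each node answers `GET /3/Cloud` on port 54321 and that the JSON reply contains `cloud_size` and `cloud_healthy` (field names as I understand the H2O REST schema; verify them against your version). The host names come from the logs further down; `cloud_url`, `fetch_cloud` and `nodes_agree` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Host names as they appear in the cluster logs; adjust to your environment.
NODES = ["master01.user.com", "master02.user.com"]
PORT = 54321


def cloud_url(host, port=PORT):
    """Build the URL of the cluster-status endpoint for one node."""
    return f"http://{host}:{port}/3/Cloud"


def fetch_cloud(host, port=PORT, timeout=5):
    """Query one node and return its decoded JSON status (needs network access)."""
    with urlopen(cloud_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def nodes_agree(statuses, expected_size):
    """Return True if every node reports a healthy cloud of the expected size.

    `statuses` maps host name to a decoded /3/Cloud reply. The keys
    `cloud_size` and `cloud_healthy` are assumed field names.
    """
    return all(
        s.get("cloud_size") == expected_size and s.get("cloud_healthy") is True
        for s in statuses.values()
    )
```

On a machine that can reach both servers, `nodes_agree({h: fetch_cloud(h) for h in NODES}, expected_size=2)` would show whether both nodes see the same two-member cloud.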

  • H2O_cluster_uptime: 15 mins 14 secs
  • H2O_cluster_timezone: Asia/Colombo
  • H2O_data_parsing_timezone: UTC
  • H2O_cluster_version: 3.36.1.1
  • H2O_cluster_version_age: 11 months and 28 days !!!
  • H2O_cluster_name: XXXXXX
  • H2O_cluster_total_nodes: 2
  • H2O_cluster_free_memory: 43.36 Gb
  • H2O_cluster_total_cores: 30
  • H2O_cluster_allowed_cores: 30
  • H2O_cluster_status: locked, healthy
  • H2O_connection_url: http://localhost:54321
  • H2O_connection_proxy: {"http": null, "https": null}
  • H2O_internal_security: False
  • Python_version: 3.7.11 final
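
The summary above reports only cluster-wide totals (43.36 Gb free over 2 nodes), so it cannot show an imbalance between master A and master B. The sketch below breaks a cluster status reply down per node. It assumes a decoded `/3/Cloud` reply whose `nodes` entries carry `ip_port`, `free_mem` and `max_mem` in bytes; those field names are an assumption, and `node_memory_report` is a hypothetical helper.

```python
def node_memory_report(cloud):
    """Summarize memory usage per node from a decoded /3/Cloud reply.

    Assumes each entry of cloud["nodes"] has `ip_port`, `free_mem` and
    `max_mem` (all in bytes); these field names are an assumption.
    """
    report = []
    for node in cloud.get("nodes", []):
        max_mem = node["max_mem"]
        used = max_mem - node["free_mem"]
        report.append({
            "node": node["ip_port"],
            "used_gb": round(used / 1024**3, 2),
            "used_pct": round(100.0 * used / max_mem, 1),
        })
    return report
```

These figures describe what H2O itself reports, so RAM usage observed at the operating-system level may differ. The Python client's `h2o.cluster().show_status(detailed=True)` prints a per-node table as well.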

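Can anyone help me resolve these issues?

Why do the two servers not share server resources while the model is being processed?

Why is the master B H2O log not updated?

Why does the master A H2O jar go down with high memory usage?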

Master A log

            main  INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
        main  INFO water.default: 
   FJ-126-15  INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
  058452-166  INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
  058452-166  INFO water.default: Locking cloud to new members, because water.api.schemas3.MetadataV3
  4058452-14  INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
  4058452-15  INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
  4058452-18  INFO water.default: POST /4/sessions, parms: {}
  4058452-16  INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_a391}
  4058452-13  INFO water.default: DELETE /3/DKV, parms: {}
  4058452-13  INFO water.default: Removing all objects
  4058452-13  INFO water.default: Finished removing objects
  4058452-12  INFO water.default: DELETE /3/DKV, parms: {}
  4058452-12  INFO water.default: Removing all objects
  4058452-12  INFO water.default: Finished removing objects
  058452-170  INFO water.default: DELETE /3/DKV, parms: {}
  058452-170  INFO water.default: Removing all objects
  058452-170  INFO water.default: Finished removing objects
  4058452-14  INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
  058452-169  INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
  058452-166  INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
  4058452-19  INFO water.default: POST /4/sessions, parms: {}
  4058452-18  INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_bfac}
  058452-170  INFO water.default: Reading byte InputStream into Frame:
  058452-170  INFO water.default:     frameKey:    upload_bbcd4f6aeb3c1095e63f66a89cdd4756
  058452-170  INFO water.default:     totalChunks: 2
  058452-170  INFO water.default:     totalBytes:  4404663
  058452-170  INFO water.default:     Success.
  058452-167  INFO water.default: POST /3/ParseSetup, parms: {single_quotes=False, source_frames=["upload_bbcd4f6aeb3c1095e63f66a89cdd4756"], check_header=0}
  058452-169  INFO water.default: Total file size: 4.2 MB
  058452-169  INFO water.default: Parse chunk size 4194304
     FJ-1-15  INFO water.default: Parse result for Key_Frame__upload_bbcd4f6aeb3c1095e63f66a89cdd4756.hex (2023 rows, 436 columns):
     FJ-1-15  INFO water.default:                               ColV2    type          min          max         mean        sigma         NAs constant cardinality
     FJ-1-15  INFO water.default:                                COL1:  factor    011022232    YA9854024                                                  1334
     FJ-1-15  INFO water.default:                      COL2: numeric      2019.00      2020.00      2019.70     0.457960                            
     FJ-1-15  INFO water.default:                     COL3: numeric      1.00000      12.0000      6.07860      2.82287                            
     FJ-1-15  INFO water.default:                         COL4:  factor |00011000813 |09988000074                                                  1334
     FJ-1-15  INFO water.default:                       COL5:  factor    CUST NAME     CUSTOMER                                                     2
     FJ-1-15  INFO water.default:                         COL6: numeric  1.14005e+08  4.10024e+08  2.96146e+08  4.57328e+07                            
     FJ-1-15  INFO water.default:                    COL7: numeric      10000.0      30000.0      28294.6      5573.93           3                
     FJ-1-15  INFO water.default:                     COL8:  factor                       USD                                                     4
     FJ-1-15  INFO water.default:                              COL9:  factor          927         RM17                                                    20
     FJ-1-15  INFO water.default:               COL10:  factor           NO          YES                                                     2
     FJ-1-15  INFO water.default: Additional column information only sent to log file...
     FJ-1-15  INFO water.default:                COL11: numeric     -1.00000      175.250      1.07602      5.07740                            
     FJ-1-15  INFO water.default:                COL12: numeric     -1.00000      97.2262     0.447662      3.19167                            
     FJ-1-15  INFO water.default:                COL13: numeric     -1.00000      124.206      1.03933      3.94221                            
     FJ-1-15  INFO water.default:                      response_class:  factor           1A to_be_filled                                                     5
     FJ-1-15  INFO water.default:                    response_class_5:  factor           1B          1B1                                                     2
     FJ-1-15  INFO water.default:                    response_class_4:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_3:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_2:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_1:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                              subset:  factor         test        train                                                     2
     FJ-1-15  INFO water.default: Chunk compression summary:
     FJ-1-15  INFO water.default:   Chunk Type                 Chunk Name       Count  Count Percentage        Size  Size Percentage
     FJ-1-15  INFO water.default:          C0L              Constant long          74           8.486 %      5.8 KB          0.207 %
     FJ-1-15  INFO water.default:          CBS                     Binary          19           2.179 %      4.4 KB          0.159 %
     FJ-1-15  INFO water.default:          CXI            Sparse Integers          80           9.174 %     25.0 KB          0.897 %
     FJ-1-15  INFO water.default:          CXF               Sparse Reals          50           5.734 %     48.9 KB          1.753 %
     FJ-1-15  INFO water.default:           C1            1-Byte Integers           7           0.803 %     11.8 KB          0.423 %
     FJ-1-15  INFO water.default:          C1N  1-Byte Integers (w/o NAs)          92          10.550 %    104.0 KB          3.731 %
     FJ-1-15  INFO water.default:          C1S           1-Byte Fractions         142          16.284 %    118.4 KB          4.245 %
     FJ-1-15  INFO water.default:           C2            2-Byte Integers          72           8.257 %    231.7 KB          8.309 %
     FJ-1-15  INFO water.default:          C2S           2-Byte Fractions          18           2.064 %     22.9 KB          0.822 %
     FJ-1-15  INFO water.default:           C4            4-Byte Integers          50           5.734 %    109.1 KB          3.913 %
     FJ-1-15  INFO water.default:          C4S           4-Byte Fractions         127          14.564 %    360.5 KB         12.925 %
     FJ-1-15  INFO water.default:           C8            8-byte Integers           1           0.115 %     15.0 KB          0.539 %
     FJ-1-15  INFO water.default:          CUD               Unique Reals           5           0.573 %     13.2 KB          0.472 %
     FJ-1-15  INFO water.default:          C8D               64-bit Reals         135          15.482 %      1.7 MB         61.606 %
     FJ-1-15  INFO water.default: Frame distribution summary:
     FJ-1-15  INFO water.default:                             Size  Number of Rows  Number of Chunks per Column  Number of Chunks
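
The numbers in the Master A log can be cross-checked with a little arithmetic. The sketch assumes that the number of raw chunks is the file size divided by the chunk size, rounded up. That matches the logged `totalChunks: 2`, but it is an inference from the numbers, not something the log states.

```python
import math

total_bytes = 4404663   # "totalBytes" from the upload log lines
chunk_size = 4194304    # "Parse chunk size" from the log (4 MiB)
raw_chunks = math.ceil(total_bytes / chunk_size)

# "Count" column of the chunk compression summary, top to bottom.
chunk_counts = [74, 19, 80, 50, 7, 92, 142, 72, 18, 50, 127, 1, 5, 135]
columns = 436           # from the "Parse result" line
chunks_per_column = sum(chunk_counts) / columns

print(raw_chunks)          # 2
print(sum(chunk_counts))   # 872
print(chunks_per_column)   # 2.0
```

The parsed frame therefore has 2023 rows held in two chunks per column, which may leave little data to spread across the two nodes.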

Master B log

    main  INFO water.default: H2O started in 4906ms
     main  INFO water.default: 
     main  INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
     main  INFO water.default: 
FJ-126-15  INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
FJ-123-15  INFO water.default: Locking cloud to new members, because Class Id=56
  FJ-2-15  INFO water.default: Key upload_bbcd4f6aeb3c1095e63f66a89cdd4756 will be parsed using method DistributedParse.
  FJ-2-21  INFO water.default: Key upload_902bcdd31a4aea9f65690f1bc6074886 will be parsed using method DistributedParse.
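
To compare what each node logs, the two files can be parsed and tallied. This is a rough sketch that assumes every line follows the `thread LEVEL logger: message` layout seen in the excerpts above; `parse_line` and `rest_calls` are hypothetical helpers, not part of H2O.

```python
import re
from collections import Counter

# Thread name, log level, logger name, then the message text.
LINE_RE = re.compile(
    r"^\s*(?P<thread>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+):\s?(?P<msg>.*)$"
)

REST_VERBS = {"GET", "POST", "PUT", "DELETE"}


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


def rest_calls(lines):
    """Count the REST requests (by HTTP verb) that a node has logged."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec:
            verb = rec["msg"].split(" ", 1)[0]
            if verb in REST_VERBS:
                counts[verb] += 1
    return counts
```

In the excerpts, every REST request line appears in the Master A log and none in the Master B log, which is consistent with the client being connected to `http://localhost:54321` on one node.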
linux data-science xgboost h2o h2o.ai