Configuring log4j from an IDE / the pyspark shell to log to console and file using a properties file


In my development environment on my laptop (an Ubuntu box), I write "pyspark code" and run it via pytest for testing. I inject a Spark session created with SparkSession.builder.getOrCreate() into my tests. This "pyspark code" has a bunch of info-level log statements, but none of them show up anywhere, so I end up writing print() statements that I later have to remove.

All I want is:

  • the ability to set the log level for all of my loggers (My.*)
  • to write to the console and to a file (appenders)
  • some documentation describing the format specification: %t, %m, %n, %d{YYYY-MM-DD}, ...

What I have read:

  • Configuring Logging in the official documentation. It just says to add a log4j2.properties; I could not find any documentation on the format of log4j2.properties. The sample log4j2.properties.template is very limited. Note that it says nothing about using xml or other formats, just properties.
  • All the official log4j2 documentation points to the same place, which has no documentation on the contents of log4j2.properties. It provides some samples, but they are not useful, e.g. on where to specify the file name of a file appender (not a rolling file appender).
  • I even tried the log4j (not log4j2) documentation, which obviously does not work (ConfigurationException: No type attribute provided for Appender my_console).

I don't want to figure this out by trial and error. I'm just looking for documentation.

my_spark_module.py

class MySparkModule:
    def __init__(self, spark):
        self.log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.SparkModule')
        self.df = spark.createDataFrame([('data1', '2022-01-01T00:00:00'), ('data2', '2022-02-01T00:00:00')],
            schema=['data', 'ts_column'])
        self.log.info('exiting init()')

    def get_count(self):
        return self.df.count()

test_my_spark_module.py

import pytest
from pyspark.sql import SparkSession
from my_spark_module import MySparkModule

@pytest.fixture(autouse=True, scope='module')
def spark_session():
    return SparkSession.builder.getOrCreate()


class TestMySparkModule:
    def test_tables(self):
        spark_session = SparkSession.builder.getOrCreate()
        log = spark_session.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.TestMySparkModule')
        log.info('something I wanted to log')
        
        assert MySparkModule(spark_session).get_count() == 2, 'expected count to be 2'
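
(A side note on running this: pytest captures stdout by default, so console log output is only visible when capture is disabled, e.g. with pytest -s. The module-scoped fixture can also be injected by name instead of calling getOrCreate() again; a minimal sketch of that variant:)

class TestMySparkModule:
    def test_tables(self, spark_session):
        # spark_session is the fixture above; getOrCreate() would return the
        # same session anyway, but requesting the fixture makes that explicit
        log = spark_session.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.TestMySparkModule')
        log.info('something I wanted to log')
        assert MySparkModule(spark_session).get_count() == 2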
1 Answer


To get the log4j2 logs from your Spark session written to /tmp/my-spark.log and to stdout, put the file shown below at $SPARK_HOME/conf/log4j2.properties.

We use a poetry virtual environment, so SPARK_HOME is <project-root>/.venv/lib/python3.10/site-packages/pyspark/ and it has no conf folder. So I created the conf folder and then created <project-root>/.venv/lib/python3.10/site-packages/pyspark/conf/log4j2.properties.
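
If you are not sure where the poetry/pip-installed pyspark lives, a quick way to locate it and create the conf folder is sketched below (a minimal sketch; the exact path will differ with your Python version and environment):

# Locate the installed pyspark package (for a pip/poetry install this is what
# SPARK_HOME resolves to) and create a conf/ folder for log4j2.properties.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
conf_dir = os.path.join(spark_home, 'conf')
os.makedirs(conf_dir, exist_ok=True)
print('put log4j2.properties in:', conf_dir)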

Any Spark session started with <project-root>/.venv/bin/pyspark will use this log4j2 config.
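
This should also hold for sessions created programmatically (e.g. from pytest), since pip-installed pyspark starts the JVM through its bundled spark-submit, which puts $SPARK_HOME/conf on the driver classpath. A quick smoke test (a sketch; the logger name is arbitrary, it just has to fall under a prefix that the config below routes to an appender):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Same log4j bridge as in the question; 'com.kash.Smoke' falls under the
# logger.my_loggers prefix in the config below, so this line should land
# in /tmp/my-spark.log via the my_file appender.
log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('com.kash.Smoke')
log.info('log4j2 config smoke test')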

Change logger.spark_loggers.level = trace|debug|info to change the log level of the Spark loggers and get very verbose logs.

Note the comments inside the file.

# Reload config every X seconds, for testing out changes to
# this file without quitting your spark console session
# monitorInterval = 3  # seconds

property.log_pattern = %d{MM-dd HH:mm:ss.SSS} %5p [%t-%X{tid}] %c{*.1} - %m%n%exception{full}

appender.my_console.name = my_console
appender.my_console.type = Console
appender.my_console.target = SYSTEM_OUT
appender.my_console.layout.type = PatternLayout
appender.my_console.layout.pattern = ${log_pattern}

appender.my_file.name = my_file
appender.my_file.type = File
appender.my_file.filename = /tmp/my-spark.log
appender.my_file.layout.type = PatternLayout
appender.my_file.layout.pattern = ${log_pattern}
# Uncomment if you want to overwrite the log file every time your program runs
# appender.my_file.Append=false

# For deploying Spark ThriftServer
# SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
appender.my_console.filter.1.type = RegexFilter
appender.my_console.filter.1.regex = .*Thrift error occurred during processing of message.*
appender.my_console.filter.1.onMatch = deny
appender.my_console.filter.1.onMismatch = neutral


# Set everything to go to the file
rootLogger.level = warn
# If you want to log everything to console, will be VERY noisy
# rootLogger.appenderRef.my_console.ref = my_console
rootLogger.appenderRef.my_file.ref = my_file

# update to change spark log level (all loggers with org.apache.spark prefix)
logger.spark_loggers.name = org.apache.spark
logger.spark_loggers.level = warn
logger.spark_loggers.additivity = false

# control loggers with com.kash.* prefix
logger.my_loggers.name = com.kash
logger.my_loggers.level = trace
# logger.my_loggers.appenderRef.my_console.ref = my_console
logger.my_loggers.appenderRef.my_file.ref = my_file
logger.my_loggers.additivity = false

# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = warn

logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
logger.thriftserver.level = warn

# Settings to quiet third party logs that are too verbose
logger.jetty1.name = org.sparkproject.jetty
logger.jetty1.level = warn
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = error
logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.replexprTyper.level = info
logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.replSparkILoopInterpreter.level = info
logger.parquet1.name = org.apache.parquet
logger.parquet1.level = error
logger.parquet2.name = parquet
logger.parquet2.level = error

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.RetryingHMSHandler.level = fatal
logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.FunctionRegistry.level = error
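
One mismatch with the question worth noting: this sample config routes loggers under the com.kash prefix, while the question's code names its loggers My.SparkModule / My.TestMySparkModule. The prefix passed to LogManager.getLogger has to match logger.my_loggers.name, so either change that property to My or name your loggers under com.kash. A sketch of the second option:

class MySparkModule:
    def __init__(self, spark):
        # 'com.kash.MySparkModule' falls under logger.my_loggers.name = com.kash
        # (level trace, my_file appender), so info-level lines reach /tmp/my-spark.log
        self.log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('com.kash.MySparkModule')
        self.log.info('exiting init()')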