In my dev environment on my laptop (an Ubuntu box), I write "pyspark code" and test it by running pytest. I inject a Spark session created with SparkSession.builder.getOrCreate() into my tests. This pyspark code has a bunch of info-level log statements, but none of them show up anywhere, so I end up adding print() calls that I later have to remove.
All I want is control over my own loggers (names matching My.*): their log level and an output pattern built from %t, %m, %n, %d{YYYY-MM-DD}, and so on.
What I have read so far: the Spark docs mention log4j2.properties, but I could not find any documentation on the log4j2.properties format. The sample log4j2.properties.template shipped with Spark is very limited. Note that Spark says nothing about using xml or any other Log4j2 configuration format, only properties. I also could not find documentation for the contents of log4j2.properties itself. What I did find offers a few samples, but they did not help with basics such as where to specify the file name for a File appender (not a RollingFile appender); my attempts ended in errors like ConfigurationException: No type attribute provided for Appender my_console. I don't want to figure this out by trial and error; I'm just looking for documentation.
my_spark_module.py

class MySparkModule:
    def __init__(self, spark):
        self.log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.SparkModule')
        self.df = spark.createDataFrame(
            [('data1', '2022-01-01T00:00:00'), ('data2', '2022-02-01T00:00:00')],
            schema=['data', 'ts_column'])
        self.log.info('exiting init()')

    def get_count(self):
        return self.df.count()
test_my_spark_module.py

import pytest
from pyspark.sql import SparkSession

from my_spark_module import MySparkModule


@pytest.fixture(scope='module')
def spark_session():
    return SparkSession.builder.getOrCreate()


class TestMySparkModule:
    def test_tables(self, spark_session):
        log = spark_session.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.TestMySparkModule')
        log.info('something I wanted to log')
        assert MySparkModule(spark_session).get_count() == 2, 'expected count to be 2'
To get log4j2 logs from your Spark session into /tmp/my-spark.log and to stdout, place the file below at $SPARK_HOME/conf/log4j2.properties.
We use a Poetry virtualenv, so SPARK_HOME is <project-root>/.venv/lib/python3.10/site-packages/pyspark/, which has no conf folder. So I created the conf folder and added <project-root>/.venv/lib/python3.10/site-packages/pyspark/conf/log4j2.properties.
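The step above can be sketched in a few lines. This is only an illustration: a temp directory stands in for the real pyspark install, and the pyspark-locating one-liner in the comment is how you would find the actual path in your own venv.

```python
import os
import tempfile

# Stand-in for the pyspark package directory. In a Poetry venv the real one is
#   <project-root>/.venv/lib/python3.10/site-packages/pyspark
# and can be located with:
#   python -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))"
spark_home = tempfile.mkdtemp()

conf_dir = os.path.join(spark_home, 'conf')
os.makedirs(conf_dir, exist_ok=True)  # pip-installed pyspark ships without conf/
config_path = os.path.join(conf_dir, 'log4j2.properties')
with open(config_path, 'w') as f:
    f.write('rootLogger.level = warn\n')  # the full config from below goes here

print(config_path)
```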
Any Spark session started with <project-root>/.venv/bin/pyspark will then pick up this log4j2 configuration.
Change logger.spark_loggers.level = trace|debug|info to adjust the log level of Spark's own loggers; trace and debug are very verbose. Note the comments inside the file.
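For example, to see Spark's internal debug output, only one line of the file needs to change:

```properties
# all loggers under org.apache.spark, at debug instead of warn
logger.spark_loggers.level = debug
```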
# Reload config every X seconds, for testing out changes to
# this file without quitting your spark console session
# monitorInterval = 3 # seconds
property.log_pattern = %d{MM-dd HH:mm:ss.SSS} %5p [%t-%X{tid}] %c{*.1} - %m%n%exception{full}
appender.my_console.name = my_console
appender.my_console.type = Console
appender.my_console.target = SYSTEM_OUT
appender.my_console.layout.type = PatternLayout
appender.my_console.layout.pattern = ${log_pattern}
appender.my_file.name = my_file
appender.my_file.type = File
appender.my_file.filename = /tmp/my-spark.log
appender.my_file.layout.type = PatternLayout
appender.my_file.layout.pattern = ${log_pattern}
# Uncomment if you want to overwrite the log file every time your program runs
# appender.my_file.Append=false
# For deploying Spark ThriftServer
# SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
appender.my_console.filter.1.type = RegexFilter
appender.my_console.filter.1.regex = .*Thrift error occurred during processing of message.*
appender.my_console.filter.1.onMatch = deny
appender.my_console.filter.1.onMismatch = neutral
# Send everything to the file by default
rootLogger.level = warn
# If you want to log everything to console, will be VERY noisy
# rootLogger.appenderRef.my_console.ref = my_console
rootLogger.appenderRef.my_file.ref = my_file
# update to change spark log level (all loggers with org.apache.spark prefix)
logger.spark_loggers.name = org.apache.spark
logger.spark_loggers.level = warn
logger.spark_loggers.additivity = false
# control loggers with com.kash.* prefix
logger.my_loggers.name = com.kash
logger.my_loggers.level = trace
# logger.my_loggers.appenderRef.my_console.ref = my_console
logger.my_loggers.appenderRef.my_file.ref = my_file
logger.my_loggers.additivity = false
# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = warn
logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
logger.thriftserver.level = warn
# Settings to quiet third party logs that are too verbose
logger.jetty1.name = org.sparkproject.jetty
logger.jetty1.level = warn
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = error
logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.replexprTyper.level = info
logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.replSparkILoopInterpreter.level = info
logger.parquet1.name = org.apache.parquet
logger.parquet1.level = error
logger.parquet2.name = parquet
logger.parquet2.level = error
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.RetryingHMSHandler.level = fatal
logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.FunctionRegistry.level = error
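The properties syntax above follows a single naming scheme: every non-comment line is <kind>.<id>.<attribute> = <value>, where <id> is a label you choose (my_console, spark_loggers, ...) and attributes such as type and name are what Log4j2 actually requires — omitting type is exactly what triggers the "No type attribute provided for Appender" error from the question. A small stdlib sketch (not a real Log4j2 parser) that groups a fragment of the file by component, just to make the scheme visible:

```python
from collections import defaultdict

# Fragment of the log4j2.properties above; each non-comment line follows
# <kind>.<id>.<attribute> = <value>.
fragment = """
appender.my_file.name = my_file
appender.my_file.type = File
appender.my_file.filename = /tmp/my-spark.log
logger.my_loggers.name = com.kash
logger.my_loggers.level = trace
"""

def group_components(text):
    """Group properties lines by (kind, id), collecting their attributes."""
    components = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        kind, ident, attr = key.strip().split('.', 2)
        components[(kind, ident)][attr] = value.strip()
    return dict(components)

for component, attrs in group_components(fragment).items():
    print(component, attrs)
```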