I ran into this error:
Py4JJavaError: An error occurred while calling o124.save. : org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
It happens when I run the PySpark code below in a Jupyter notebook. Everything runs in Docker except PostgreSQL, which is installed on my local machine (Windows).
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, explode
import pyspark.sql.functions as f
spark = SparkSession.builder.appName("ETL Pipeline").config("spark.jars", "./postgresql-42.7.1.jar").getOrCreate()
df = spark.read.text("./Data/WordData.txt")
df2 = df.withColumn("splitedData", f.split("value"," "))
df3 = df2.withColumn("words", explode("splitedData"))
wordsDF = df3.select("words")
wordCount = wordsDF.groupBy("words").count()
driver = "org.postgresql.Driver"
url = "jdbc:postgresql://localhost:5432/local_database"
table = "word_count"
user = "postgres"
password = "12345"
wordCount.write.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("dbtable", table) \
.mode("append") \
.option("user", user) \
.option("password", password) \
.save()
spark.stop()
I tried editing postgresql.conf to add "listen_addresses = 'localhost'" and editing pg_hba.conf to add "host all all 0.0.0.0/0 md5", but neither change helped, so I don't know what else to try.
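Inside a container, `localhost` refers to the container itself, not to the Windows host, so it can help to check which host name is reachable from the notebook before involving Spark at all. The snippet below is only a minimal sketch: `can_connect` is a hypothetical helper, and `host.docker.internal` is the name Docker Desktop provides for the host machine, so it assumes Docker Desktop is in use.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example usage from a notebook cell (host names are assumptions):
# can_connect("localhost", 5432)             # the container itself
# can_connect("host.docker.internal", 5432)  # the Windows host under Docker Desktop
```

If the probe fails for every host name, the problem is at the network or PostgreSQL configuration level rather than in the Spark code.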
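I eventually solved the problem by running PostgreSQL in Docker as well (I created a postgres container from this image: https://hub.docker.com/_/postgres/) and putting the PySpark container and the PostgreSQL container on the same network. This command creates the network: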
docker network create my_network
This command starts the postgres container:
docker run --name postgres_container --network my_network -e POSTGRES_PASSWORD=12345 -d -p 5432:5432 postgres:latest
And this one starts the Jupyter-pyspark container:
docker run --name jupyter_container --network my_network -it -p 8888:8888 -v C:\home\work\path:/home/jovyan/work jupyter/pyspark-notebook:latest
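With both containers attached to `my_network`, the notebook can reach PostgreSQL by container name, because Docker resolves container names on user-defined networks. The sketch below shows how the JDBC settings might look in that setup. The helpers `jdbc_url`, `jdbc_options`, and `write_word_count` are hypothetical names introduced for illustration, and the database defaults to `postgres`, which is what the image creates when `POSTGRES_DB` is not set; a database such as `local_database` would have to be created first.

```python
def jdbc_url(host, port=5432, database="postgres"):
    """Build a PostgreSQL JDBC URL for the given host, port, and database."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def jdbc_options(host, table, user, password, database="postgres"):
    """Collect the JDBC writer options in one dictionary."""
    return {
        "driver": "org.postgresql.Driver",
        "url": jdbc_url(host, database=database),
        "dbtable": table,
        "user": user,
        "password": password,
    }


def write_word_count(df, options):
    """Append a Spark DataFrame to the table described by options."""
    # The save mode is set with mode(), not with option("mode", ...).
    df.write.format("jdbc").options(**options).mode("append").save()


# The host is the postgres container's name instead of localhost.
options = jdbc_options("postgres_container", "word_count", "postgres", "12345")
```

In the notebook, `write_word_count(wordCount, options)` would then replace the original write call.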