我在虚拟机中工作,在其中设置了整个 Spark 工作区并将其连接到 Jupyter Notebook。 这个问题不是关于如何连接Databricks中的数据湖。我只在我的虚拟机中工作。 现在我想连接到 Azure Data Lake Gen2 来读取我的文件。我安装了以下版本:
据我所知,版本是相互兼容的,所以问题不在于这里。
所以我的问题是为什么这不起作用:
from pyspark.sql import SparkSession`
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
# Get Sas Token
key_vault_url = "https://<<keyvault>>.vault.azure.net/"
credential = DefaultAzureCredential()
client = SecretClient(vault_url=key_vault_url, credential=credential)
sastoken = client.get_secret(<<SAStoken>>)
# Paths to your JAR files
path_to_hadoop_azure_jar = "/opt/spark/jars/hadoop-azure-3.3.4.jar"
path_to_azure_storage_jar = "/opt/spark/jars/azure-storage-8.6.6.jar"
path_to_jetty_util_ajax_jar = "/opt/spark/jars/jetty-util-ajax-11.0.18.jar"
path_to_jetty_util_jar = "/opt/spark/jars/jetty-util-11.0.18.jar"
path_to_azure_datalake_jar = "/opt/spark/jars/hadoop-azure-datalake-3.3.6.jar"
spark = SparkSession.builder.appName("AzureDataRead") \
.config("spark.driver.extraClassPath", path_to_hadoop_azure_jar) \
.config("spark.executor.extraClassPath", path_to_hadoop_azure_jar) \
.config("spark.jars", f"{path_to_hadoop_azure_jar},{path_to_azure_storage_jar},{path_to_jetty_util_ajax_jar},{path_to_jetty_util_jar},{path_to_azure_datalake_jar}") \
.config("fs.azure.sas.<<container>>.<<datalake>>.dfs.core.windows.net", sastoken) \
.getOrCreate()
file_path = "/data/<<file>>"
# Read example file
df = spark.read.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(f"wasbs://<<container>>.<<datalake>>.dfs.core.windows.net/{file_path}")
# Show the DataFrame
df.show()
我收到以下错误,据我了解,主要问题在于 AzureNativeFileSystemStore 类:
Py4JJavaError: An error occurred while calling o158.load.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.createDefaultStore(NativeAzureFileSystem.java:1485)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1410)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
我已经检查了 jar 文件的路径和权限,它们应该都是正确的。但显然它仍然不起作用,因为无论我的 jar 文件如何,代码都会抛出此错误。
有人可以帮忙吗?
将您的罐子更改为以下罐子:
jetty-util-ajax-9.4.48.v20220622.jar
jetty-util-9.4.48.v20220622.jar
jetty-server-9.4.48.v20220622.jar
代替:
jetty-util-ajax-11.0.18.jar
jetty-util-11.0.18.jar
用途:
blob.core.windows.net
对于配置 blob 存储,不是:
dfs.core.windows.net
加载时,使用以下路径:
wasbs://<container_name>@<account_name>.blob.core.windows.net/path_to_file/
输出: