Connecting Apache Spark to Azure Data Lake (Gen 2)

Question · votes: 0 · answers: 1

I'm working in a virtual machine where I have set up a complete Spark workspace and connected it to a Jupyter Notebook. This question is not about connecting to a data lake from Databricks; I'm working only inside my VM. I now want to connect to Azure Data Lake Gen2 to read my files. I have the following versions installed:

  • JDK 11.0.20.1
  • Python 2.7.18
  • Spark 3.5.0

As far as I know these versions are compatible with each other, so the problem should not lie there.
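(Side note worth double-checking: Spark 3.x dropped Python 2 support, and PySpark 3.5 documents Python 3.8+ as its minimum, so the Python 2.7.18 listed above cannot drive PySpark 3.5 on its own. A minimal, hypothetical check:)

```python
# Sketch: verify the interpreter meets PySpark 3.5's documented minimum (Python 3.8+).
import sys

def python_ok_for_pyspark35(version_info=None):
    """Return True if the given (major, minor, ...) tuple is at least Python 3.8."""
    vi = version_info if version_info is not None else sys.version_info
    return (vi[0], vi[1]) >= (3, 8)
```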

So my question is: why does the following not work?

from pyspark.sql import SparkSession
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Get Sas Token
key_vault_url = "https://<<keyvault>>.vault.azure.net/"
credential = DefaultAzureCredential() 
client = SecretClient(vault_url=key_vault_url, credential=credential)
sastoken = client.get_secret("<<SAStoken>>").value

# Paths to your JAR files
path_to_hadoop_azure_jar = "/opt/spark/jars/hadoop-azure-3.3.4.jar"
path_to_azure_storage_jar = "/opt/spark/jars/azure-storage-8.6.6.jar"
path_to_jetty_util_ajax_jar = "/opt/spark/jars/jetty-util-ajax-11.0.18.jar"
path_to_jetty_util_jar = "/opt/spark/jars/jetty-util-11.0.18.jar"
path_to_azure_datalake_jar = "/opt/spark/jars/hadoop-azure-datalake-3.3.6.jar"

spark = SparkSession.builder.appName("AzureDataRead") \
    .config("spark.driver.extraClassPath", path_to_hadoop_azure_jar) \
    .config("spark.executor.extraClassPath", path_to_hadoop_azure_jar) \
    .config("spark.jars", f"{path_to_hadoop_azure_jar},{path_to_azure_storage_jar},{path_to_jetty_util_ajax_jar},{path_to_jetty_util_jar},{path_to_azure_datalake_jar}") \
    .config("fs.azure.sas.<<container>>.<<datalake>>.dfs.core.windows.net", sastoken) \
    .getOrCreate()

file_path = "/data/<<file>>" 

# Read example file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(f"wasbs://<<container>>.<<datalake>>.dfs.core.windows.net/{file_path}")

# Show the DataFrame
df.show()

I get the following error; as far as I understand it, the core problem lies in the AzureNativeFileSystemStore class:

Py4JJavaError: An error occurred while calling o158.load.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.createDefaultStore(NativeAzureFileSystem.java:1485)
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1410)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)

I have already checked the paths to the jar files and their permissions, and they should all be correct. Yet the code keeps throwing this error no matter what I do with my jars.
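(For what it's worth, `NoClassDefFoundError: Could not initialize class` usually means the class *was* found on the classpath but its static initializer failed, so correct jar paths alone don't rule the problem out. Still, a small stdlib sketch, with a hypothetical helper name, can at least eliminate path typos:)

```python
# Sketch: confirm every jar handed to spark.jars actually exists on disk.
from pathlib import Path

def missing_jars(jar_paths):
    """Return the subset of jar paths that do not point at an existing file."""
    return [p for p in jar_paths if not Path(p).is_file()]
```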

Can anyone help?

azure apache-spark pyspark virtual-machine azure-data-lake
1 Answer (score: 0)

Change your jars to the following:

jetty-util-ajax-9.4.48.v20220622.jar
jetty-util-9.4.48.v20220622.jar 
jetty-server-9.4.48.v20220622.jar

instead of:

jetty-util-ajax-11.0.18.jar
jetty-util-11.0.18.jar

And use:

blob.core.windows.net

for the Blob Storage configuration, not:

dfs.core.windows.net

When loading, use a path of the following form:

wasbs://<container_name>@<account_name>.blob.core.windows.net/path_to_file/
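Putting these corrections together, here is a sketch of how the fixed URL and SAS config key would be built (container, account, and file names are placeholders, not the asker's real values):

```python
# Sketch, per the answer above: the wasb/wasbs driver targets the *blob* endpoint,
# and the path joins container and account with '@', not '.'.
def wasbs_url(container: str, account: str, path: str) -> str:
    """Build a wasbs:// URL in the form the wasbs driver expects."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

def sas_conf_key(container: str, account: str) -> str:
    """Spark/Hadoop config key under which the wasbs driver looks up the SAS token."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"

# These strings would then be wired into the SparkSession from the question, e.g.:
#   .config(sas_conf_key("<container>", "<account>"), sastoken)
# and the file read as:
#   spark.read.csv(wasbs_url("<container>", "<account>", "data/<file>"), header=True)
```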

Output: (screenshot of the successfully read DataFrame; image omitted)
