I'm trying to connect to an Oracle database from Spark using PySpark, but I'm running into a driver error. Can anyone help? I'm new to Spark and have just started learning. Below is my code:
import pyspark
sc = pyspark.SparkContext('local[*]')
SqlContext = pyspark.SQLContext(sc)
Driver = 'C:\Hadoop\drivers\ojdbc14.jar'
OracleConnection = 'jdbc:oracle:thin:hr/hr@localhost:1521/xe'
Query = 'select * from employees'
OrcDb = SqlContext.read.format('jdbc') \
.option('url', OracleConnection) \
.option('dbtable', Query) \
.option('driver', Driver) \
.load()
OrcDb.printSchema()
Below is the error:
File "C:/Users/Macaulay/PycharmProjects/Spark/SparkSqlOracle.py", line 8, in <module>
    OrcDb = SqlContext.read.format('jdbc') \
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 166, in load
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\java_gateway.py", in __call__
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 98, in deco
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o29.load.
: java.lang.ClassNotFoundException: C:\Hadoop\drivers\ojdbc14.jar
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:99)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:99)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$Lambda$729/1345147223.apply(Unknown Source)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:240)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader$$Lambda$719/1893144191.apply(Unknown Source)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:179)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)
The JDBC driver jar should be placed in Spark's jars directory. And instead of passing a path to the jar in the driver option, we must pass the driver's fully qualified class name. This approach solved the problem.
Below is the code:

import pyspark
from pyspark.sql.session import SparkSession
sc = pyspark.SparkContext('local[*]')
SqlContext = pyspark.SQLContext(sc)
spark = SparkSession(sc)
Driver = 'oracle.jdbc.driver.OracleDriver'  # the driver's class name, not a path to the jar
OracleConnection = 'jdbc:oracle:thin:@//localhost:1521/xe'
User = 'hr'
Password = 'hr'
# dbtable must be a table name or a parenthesized subquery with an alias;
# Spark wraps it in "SELECT * FROM <dbtable>", so a bare SELECT statement would fail
Query = '(select * from employees) employees'
OrcDb = spark.read.format('jdbc') \
.option('url', OracleConnection) \
.option('dbtable', Query) \
.option('user', User) \
.option('password', Password) \
.option('driver', Driver) \
.load()
OrcDb.printSchema()
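If you'd rather not copy the jar into Spark's installation, the jar can also be handed to Spark at launch time with the --jars flag of spark-submit. A minimal sketch, reusing the driver path and script name from the question (the copy destination assumes the Spark home shown in the traceback):

# Option 1: place the driver jar in Spark's jars directory, as the answer above does
copy C:\Hadoop\drivers\ojdbc14.jar C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\jars\

# Option 2: pass the driver jar to Spark at submit time instead
spark-submit --jars C:\Hadoop\drivers\ojdbc14.jar SparkSqlOracle.py

With --jars, the jar is added to the classpath of both the driver and the executors, so the class-name driver option in the code above resolves without touching the Spark installation.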