deltaTable update throws NoSuchMethodError - scala

I started to look into Delta Lake and got this exception when trying to update a table.
I'm using:
AWS EMR 5.29
Spark 2.4.4
Scala 2.11.12 with io.delta:delta-core_2.11:0.5.0
import io.delta.tables._
import org.apache.spark.sql.functions._
import spark.implicits._
val deltaTable = DeltaTable.forPath(spark, "s3://path/")
deltaTable.update(col("col1") === "val1", Map("col2" -> lit("val2")));
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;
at org.apache.spark.sql.delta.util.AnalysisHelper$class.tryResolveReferences(AnalysisHelper.scala:33)
at io.delta.tables.DeltaTable.tryResolveReferences(DeltaTable.scala:42)
at io.delta.tables.execution.DeltaTableOperations$$anonfun$5.apply(DeltaTableOperations.scala:93)
at io.delta.tables.execution.DeltaTableOperations$$anonfun$5.apply(DeltaTableOperations.scala:93)
at org.apache.spark.sql.catalyst.plans.logical.UpdateTable$$anonfun$1.apply(UpdateTable.scala:57)
at org.apache.spark.sql.catalyst.plans.logical.UpdateTable$$anonfun$1.apply(UpdateTable.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.logical.UpdateTable$.resolveReferences(UpdateTable.scala:52)
at io.delta.tables.execution.DeltaTableOperations$class.executeUpdate(DeltaTableOperations.scala:93)
at io.delta.tables.DeltaTable.executeUpdate(DeltaTable.scala:42)
at io.delta.tables.DeltaTable.updateExpr(DeltaTable.scala:361)
... 51 elided
Any idea why?
Thanks!

Sorry for the inconvenience, but this is a bug in the version of Spark bundled with emr-5.29.0. It will be fixed in emr-5.30.0, but in the meantime you can use emr-5.28.0, which does not contain this bug.

This is usually because you are using an incompatible Spark version. You can print sc.version to check your Spark version.
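If it helps, a quick sanity check you can run in spark-shell to confirm what the cluster is actually running (these are standard Spark and Scala APIs, nothing EMR-specific):
// Print the Spark and Scala versions the shell is actually using
println(spark.version)                       // e.g. 2.4.4
println(scala.util.Properties.versionString) // e.g. version 2.11.12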

Related

Spark-Snowflake Connection Errors

Here are the versions I am using:
Spark - 3.0.1
Scala - 2.12.13
Python - 3.7.6
I am having issues running the below code. This is the basic connection to Snowflake via PySpark.
Here is my code:
# Spark imports
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
#
spark = SparkSession \
    .builder \
    .appName("Pyspark-Snowflake") \
    .config('spark.jars', '/Users/hana/spark-sf/snowflake-jdbc-3.12.1.jar,/Users/hana/spark-sf/spark-snowflake_2.11-2.8.1-spark_2.4.jar') \
    .getOrCreate()
# Set options below
sfOptions = {
    "sfURL": "XXX",
    "sfUser": "XXX",
    "sfPassword": "XXX",
    "sfRole": "XXX",
    "sfDatabase": "XXX",
    "sfSchema": "XXX",
    "sfWarehouse": "XXX"
}
# Set Snowflake source
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
# Read from Snowflake
#import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select * from TABLE limit 100") \
    .load()
df.show()
And here is the error I am getting (in Spyder):
Py4JJavaError: An error occurred while calling o40.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at net.snowflake.spark.snowflake.Parameters$MergedParameters.<init>(Parameters.scala:294)
at net.snowflake.spark.snowflake.Parameters$.mergeParameters(Parameters.scala:288)
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:59)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 20 more
What is wrong in my code / versions? I've tried multiple JDBC versions and continue to get errors. Thank you in advance!
I can see from your spark.jars config that you are using the Spark-Snowflake connector built for Spark 2.4. Either re-run with Spark 2.4 installed:
pip install pyspark==2.4.4
or use the jar file that is built specifically for Spark 3.0 Snowflake connections.
The naming convention for which jar to download can be found here: https://docs.snowflake.com/en/user-guide/spark-connector-install.html
It seems like you are using an incompatible spark-snowflake jar version.
The naming convention of the spark-snowflake jar encodes exactly what it supports.
For example, spark-snowflake_2.11-2.8.1-spark_2.4.jar is built for Spark 2.4 and Scala 2.11.
Please check the Spark and Scala versions present on your system and download the matching spark-snowflake jar from the Maven repository.
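For illustration, a hedged alternative to hand-picking jar files is to let Spark resolve a matching connector from Maven at launch time; the <version> placeholders below must be replaced with a Spark 3.0 / Scala 2.12 build listed in the Snowflake documentation:
pyspark --packages net.snowflake:spark-snowflake_2.12:<version>,net.snowflake:snowflake-jdbc:<version>
The same coordinates can also be supplied through the spark.jars.packages config in the SparkSession builder instead of spark.jars.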

Spark Mongodb: Error - java.lang.NoClassDefFoundError: com/mongodb/MongoDriverInformation

I am working in spark-shell with the mongo-spark-connector to read/write data into MongoDB. Even though I have placed the required JARs as listed below, I am hitting the error that follows. Can someone find what the problem is and help me out?
Thank you in advance.
Jars:
mongodb-driver-3.4.2.jar;
mongodb-driver-sync-3.11.0.jar;
mongodb-driver-core-3.4.2.jar;
mongo-java-driver-3.4.2.jar;
mongo-spark-connector_2.11-2.2.0.jar;
mongo-spark-connector_2.11-2.2.7.jar
Error:
scala> MongoSpark.save(dfRestaurants.write.option("spark.mongodb.output.uri", "mongodb://username:password#server_name").option("spark.mongodb.output.database", "admin").option("spark.mongodb.output.collection", "myCollection").mode("overwrite"));
java.lang.NoClassDefFoundError: com/mongodb/MongoDriverInformation
at com.mongodb.spark.connection.DefaultMongoClientFactory.mongoDriverInformation$lzycompute(DefaultMongoClientFactory.scala:40)
at com.mongodb.spark.connection.DefaultMongoClientFactory.mongoDriverInformation(DefaultMongoClientFactory.scala:40)
at com.mongodb.spark.connection.DefaultMongoClientFactory.create(DefaultMongoClientFactory.scala:49)
at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:55)
at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:242)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:155)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.withCollectionDo(MongoConnector.scala:187)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:72)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at com.mongodb.spark.MongoSpark$.save(MongoSpark.scala:192)
... 59 elided
This is a typical problem when you have incorrect dependencies. In your case:
Mongo Spark Connector 2.2.7 was built with driver 3.10/3.11, so it could be incompatible with driver 3.4
you have 2 different versions of the Mongo Spark Connector - 2.2.0 & 2.2.7 - and this could also lead to problems
The better solution is to pass Maven coordinates in the --packages option when starting the spark shell, and let Spark pull the package with all the necessary and correct dependencies:
spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:<version>
Please make sure that you're using the Scala version that matches your Spark version (2.12 for Spark 3.0, 2.11 for previous versions). See the documentation for more details.
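For instance, a hedged sketch of the same save after relaunching with a single connector (the option keys and names mirror the question; only the --packages coordinates change, so no hand-picked mongodb-driver jars end up on the classpath):
// started with: spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:<version>
import com.mongodb.spark.MongoSpark

MongoSpark.save(
  dfRestaurants.write
    .option("spark.mongodb.output.uri", "mongodb://username:password#server_name")
    .option("spark.mongodb.output.database", "admin")
    .option("spark.mongodb.output.collection", "myCollection")
    .mode("overwrite"))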

HiveException when running a sql example in Spark shell

A newbie in Apache Spark here! I am using Spark 2.4.0 and Scala 2.11.12, and I'm trying to run the following code in my spark shell:
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
And I get the following error -
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I also saw this: Issues trying out example in Spark-shell, and as per the accepted answer I tried to start my spark shell like so:
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse
However, it did not help and the issue persists.

Apache Kudu with Apache Spark NoSuchMethodError: exportAuthenticationCredentials

I have this function with Spark and Scala:
import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.kudu.spark.kudu._
def save(df: DataFrame): Unit = {
  val kuduContext: KuduContext = new KuduContext("quickstart.cloudera:7051")
  kuduContext.createTable(
    "test_table", df.schema, Seq("anotheKey", "id", "date"),
    new CreateTableOptions().setNumReplicas(1))
  kuduContext.upsertRows(df, "test_table")
}
But creating the kuduContext raises an exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kudu.client.KuduClient.exportAuthenticationCredentials()[B
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:63)
at com.mypackge.myObject$.save(myObject.scala:24)
at com.mypackge.myObject$$anonfun$main$1.apply$mcV$sp(myObject.scala:59)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$.time(myObject.scala:17)
at com.mypackge.myObject$.main(myObject.scala:57)
at com.mypackge.myObject.main(myObject.scala)
Spark works without any problem. I have installed the Kudu VM as described in the official docs, and I can log in from bash to the Impala instance without a problem.
Does anyone have any idea what I am doing wrong?
The problem was a dependency of the project pulling in an old version of kudu-client (1.2.0) while I was using kudu-spark 1.3.0 (which includes kudu-client 1.3.0). Excluding kudu-client in pom.xml was the solution.
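For anyone hitting the same thing, a hedged pom.xml sketch of that exclusion (the dependency shown is a placeholder for whichever artifact drags in the old kudu-client):
<dependency>
  <groupId>some.group</groupId>
  <artifactId>artifact-pulling-in-old-kudu-client</artifactId>
  <version>x.y.z</version>
  <exclusions>
    <!-- keep kudu-spark's own kudu-client 1.3.0 on the classpath -->
    <exclusion>
      <groupId>org.apache.kudu</groupId>
      <artifactId>kudu-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>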

java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL

I'm hitting a very strange problem when trying to load a JDBC DataFrame into Spark SQL.
I've tried several Spark clusters - YARN, standalone cluster and pseudo distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs in both spark-shell and when executing the code with spark-submit. I've tried MySQL & MS SQL JDBC drivers without success.
Consider the following sample:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"
val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}
So far so good, the schema gets resolved properly:
t1: org.apache.spark.sql.DataFrame = [id: int, name: string]
But when I evaluate the DataFrame:
t1.take(1)
the following exception occurs:
15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I try to open a JDBC connection on an executor:
import java.sql.DriverManager
sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()
it works perfectly:
res1: Array[Int] = Array(0, 1)
When I run the same code on local Spark, it works perfectly too:
scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])
I'm using Spark pre-built with Hadoop 2.4 support.
The easiest way to reproduce the problem is to start Spark in pseudo distributed mode with the start-all.sh script and run the following command:
/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar
Is there a way to work around this? It looks like a severe problem, so it's strange that googling doesn't help here.
Apparently this issue has been recently reported:
https://issues.apache.org/jira/browse/SPARK-6913
The problem is that java.sql.DriverManager doesn't see drivers loaded by ClassLoaders other than the bootstrap ClassLoader.
As a temporary workaround, it's possible to add the required drivers to the boot classpath of the executors.
UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782
UPDATE 2: The fix was merged into Spark 1.4.
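One common way to do that, sketched here under the assumption that the jar has been copied to the same path on every node, is to add the driver jar to the executors' JVM classpath via spark.executor.extraClassPath in addition to --jars:
/path/to/spark-shell --master spark://<hostname>:7077 \
  --jars /path/to/mysql-connector-java-5.1.35.jar \
  --driver-class-path /path/to/mysql-connector-java-5.1.35.jar \
  --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.35.jar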
For writing data to MySQL:
In Spark 1.4.0, you have to load the MySQL driver before writing, because Spark loads drivers in the load function but not in the write function.
We have to put the jar on every worker node and set its path in the spark-defaults.conf file on each node.
This issue has been fixed in Spark 1.5.0:
https://issues.apache.org/jira/browse/SPARK-10036
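As a hedged illustration (the target table and credentials are placeholders, and the MySQL jar is assumed to be on each node's classpath as described above), explicitly loading the driver before the write looks like this:
import java.util.Properties

Class.forName("com.mysql.jdbc.Driver")               // force the driver class to load before writing
val props = new Properties()
props.setProperty("user", "XXX")
props.setProperty("password", "XXX")
props.setProperty("driver", "com.mysql.jdbc.Driver") // tell Spark which driver class to use
t1.write.jdbc("jdbc:mysql://localhost:3306/test", "t1_copy", props)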
We are stuck on Spark 1.3 (Cloudera 5.4) and so I found this question and Wildfire's answer helpful since it allowed me to stop banging my head against the wall.
Thought I would share how we got the driver into the boot classpath: we simply copied it into /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib on all the nodes.
I am using Spark 1.6.1 with SQL Server and still faced the same issue. I had to add the library (sqljdbc-4.0.jar) to the lib directory on the instance and the line below to the conf/spark-defaults.conf file:
spark.driver.extraClassPath lib/sqljdbc-4.0.jar