Different behaviour when reading from Parquet in standalone/master-slave spark-shell - scala

Here is a snippet from a larger code I'm using to read a dataframe from Parquet in Scala.
case class COOMatrix(row: Seq[Long], col: Seq[Long], data: Seq[Double])
def buildMatrix(cooMatrixFields: DataFrame) = {
val cooMatrices = cooMatrixFields map {
case Row(r,c,d) => COOMatrix(r.asInstanceOf[Seq[Long]], c.asInstanceOf[Seq[Long]], d.asInstanceOf[Seq[Double]])
}
val matEntries = cooMatrices.zipWithIndex.flatMap {
case (cooMat, matIndex) =>
val rowOffset = cooMat.row.distinct.size
val colOffset = cooMat.col.distinct.size
val cooMatRowShifted = cooMat.row.map(rowEntry => rowEntry + rowOffset * matIndex)
val cooMatColShifted = cooMat.col.map(colEntry => colEntry + colOffset * matIndex)
(cooMatRowShifted, cooMatColShifted, cooMat.data).zipped.map {
case (i, j, value) => MatrixEntry(i, j, value)
}
}
new CoordinateMatrix(matEntries)
}
val C_entries = sqlContext.read.load(s"${dataBaseDir}/C.parquet")
val C = buildMatrix(C_entries)
My code executes successfully when running in a local spark context.
On a standalone cluster, the very same code fails as soon as it reaches an action that forces it to actually read from Parquet.
The dataframe's schema is retrieved correctly:
C_entries: org.apache.spark.sql.DataFrame = [C_row: array<bigint>, C_col: array<bigint>, C_data: array<double>]
But the executors crash when executing this line val C = buildMatrix(C_entries), with this exception:
java.lang.ExceptionInInitializerError
at $line39.$read$$iwC.<init>(<console>:7)
at $line39.$read.<init>(<console>:61)
at $line39.$read$.<init>(<console>:65)
at $line39.$read$.<clinit>(<console>)
at $line67.$read$$iwC.<init>(<console>:7)
at $line67.$read.<init>(<console>:24)
at $line67.$read$.<init>(<console>:28)
at $line67.$read$.<clinit>(<console>)
at $line68.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$3.apply(<console>:63)
at $line68.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$3.apply(<console>:62)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at $line4.$read$$iwC$$iwC.<init>(<console>:15)
at $line4.$read$$iwC.<init>(<console>:24)
at $line4.$read.<init>(<console>:26)
at $line4.$read$.<init>(<console>:30)
at $line4.$read$.<clinit>(<console>)
... 22 more
Not sure it's related, but while increasing the log verbosity, i've noticed this exception:
16/03/07 20:59:38 INFO GenerateUnsafeProjection: Code generated in 157.285464 ms
16/03/07 20:59:38 DEBUG ExecutorClassLoader: Did not load class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection from REPL class server at http://155.198.193.158:32862
java.lang.ClassNotFoundException: Class file not found at URL http://155.198.193.158:32862/org/apache/spark/sql/catalyst/expressions/GeneratedClass%24SpecificUnsafeProjection.class
I've tried different configurations for the standalone cluster:
master, 1 slave and spark-shell running on my laptop
master and 1 slave each running on separate machines, spark-shell on my laptop
master and spark-shell on one machine, 1 slave on another one
I've started with the default properties and evolved to a more convoluted properties file without more success:
spark.driver.memory 4g
spark.rpc=netty
spark.eventLog.enabled true
spark.eventLog.dir file:///mnt/fastmp/spark_workdir/logs
spark.driver.extraJavaOptions -Xmx20480m -XX:MaxPermSize=2048m -XX:ReservedCodeCacheSize=2048m
spark.shuffle.service.enabled true
spark.shuffle.consolidateFiles true
spark.sql.parquet.binaryAsString true
spark.speculation false
spark.rpc.timeout 1000
spark.rdd.compress true
spark.core.connection.ack.wait.timeout 600
spark.driver.maxResultSize 0
spark.task.maxFailures 3
spark.shuffle.io.maxRetries 3
I'm running the pre-built version of spark-1.6.0-bin-hadoop2.6.
There's no HDFS involved in this deployment, all Parquet files are stored on a shared mount (CephFS) available to all the machines.
I doubt this is related to the underlying file system, as another part of my code reads a different Parquet file fine in both local and standalone mode.

TL;DR: package your code as a jar
For record purpose, the problem seemed to be linked to the use of a standalone cluster.
The exact same code works fine with these setups:
spark-shell and master on the same machine
running on YARN (AWS EMR cluster) and reading the parquet files from S3
With a bit more digging in the logs of the standalone setup, the problem seems to be linked to this exception with the class server:
INFO GenerateUnsafeProjection: Code generated in 157.285464 ms
DEBUG ExecutorClassLoader: Did not load class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection from REPL class server at http://155.198.193.158:32862
java.lang.ClassNotFoundException: Class file not found at URL http://155.198.193.158:32862/org/apache/spark/sql/catalyst/expressions/GeneratedClass%24SpecificUnsafeProjection.class
My understanding is that the spark-shell starts an HTTP server (jetty) in order to serve the classes it generates from the code in the REPL to the workers.
In my case, lots of classes are served successfully (i've even managed to retrieve some through telnet). However the class GeneratedClass (and all its inner classes) can't be found by the class server.
The typical error message appearing in the log is:
DEBUG Server: RESPONSE /org/apache/spark/sql/catalyst/expressions/GeneratedClass.class 404 handled=true
My idea is that it works with master and spark-shell on the same server as they run in the same JVM so the class can be found even though the HTTP transfer fails.
The only successful solution I've found so far is to build a jar package and use the --jars option of spark-shell or pass it as a parameter to spark-submit.

Related

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX(class where field validation exists) Exception when I am trying to do field validations on Spark Dataframe. Here is my code
And all classes and object used are serialized. Fails on AWS EMR spark job (works fine in local Machine.)
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
.add("fieldName" , StringType)
.add("value" , StringType)
.add("message" , StringType)))
//Validators is a Sequence of validations on columns in a Row.
// Validator method signature
// def checkForErrors(row: Row): (fieldName, value, message) ={
// logic to validate the field in a row }
val validateRow: Row => Row = (row: Row)=>{
val errorList = validators.map(validator => validator.checkForErrors(row)
Row.merge(row, Row(errorList))
}
val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions : Spark 2.4.7 and Scala 2.11.8
Any ideas on why this might happen or if someone had the same issue.
I faced a very similar problem with EMR release 6.8.0 - in particular, the spark.jars configuration was not respected for me on EMR (I pointed it at a location of a JAR in S3), even though it seems to be normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the "java.lang.ClassNotFoundException" in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definitino is), I set up an EMR step to be executed immediately after cluster creation the rewrite the spark.driver.extraClassPath and spark.executor.extraClassPath to also contain the location of my additional JAR (in my case, the JAR physically comes in a Docker image, but you could also set up a boostrap action to copy it on the cluster from S3), as per their code in the article under "For Amazon EMR release version 6.0.0 and later,". The reason you have to do this "rewriting" is because EMR already populates these spark.*.extraClassPath with a bunch of its own JAR location, e.g. for JARs that contain the S3 drivers, so you effectively have to append your own JAR location, rather than just straight up setting the spark.*.extraClassPath to your location. If you do the latter (I tried it), then you will lose lot of the EMR functionality such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

Spark Dataframe writes part files to _temporary in instead directly creating partFiles in output directory [duplicate]

We are running spark 2.3.0 on AWS EMR. The following DataFrame "df" is non empty and of modest size:
scala> df.count
res0: Long = 4067
The following code works fine for writing df to hdfs:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However using the same code to write to a local parquet or csv file end up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times and for both csv and parquet and on two different EMR Servers: this same behavior is exhibited in all cases.
Is this an EMR specific bug? A more general EC2 bug? Something else? This code works on spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug and it is the expected behavior. Spark does not really support writes to non-distributed storage (it will work in local mode, just because you have shared file system).
Local path is not interpreted (only) as a path on the driver (this would require collecting the data) but local path on each executor. Therefore each executor will write its own chunk to its own local file system.
Not only output is no readable back (to load data each executor and the driver should see the same state of the file system), but depending on the commit algorithm, might not be even finalized (move from the temporary directory).
This error usually occurs when you try to read an empty directory as parquet.
You could check
1. if the DataFrame is empty with outcome.rdd.isEmpty() before write it.
2. Check the if the path you are giving is correct
Also in what mode you are running your application? Try running it in client mode if you are running in cluster mode.

Spark and Amazon S3 not setting credentials in executors

Im doing a Spark program that reads and writes from Amazon S3.My problem is that It works if I execute in local mode (--master local[6]) but if i execute in the cluster (in other machines) I got an error with the credentials:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 33, mmdev02.stratio.com): com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
My code is as follows:
val conf = new SparkConf().setAppName("BackupS3")
val sc = SparkContext.getOrCreate(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/var/tmp/spark")
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true");
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
I can write to Amazon S3 but cannot read! I also had to send some properties when I do spark-submit because my region is Frankfurt and I had to enable V4:
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
I tried passing the credentials this way too. If i put them in the hdfs-site.xml in every machine it works.
My question, is how can I do it from code? Why are the executors not getting the config i pass them from the code?
I'm using Spark 1.5.2, hadoop-aws 2.7.1 and aws-java-sdk 1.7.4.
Thanks
Don't put secrets the keys, that leads to loss of secrets
If you are running in EC2, your secrets will be picked up automatically from the IAM feature; the client asks a magic web server for session secrets.
...which means: it may be that spark's automatic credential propagation is getting in the way. Unset your AWS_ env vars before submitting the work.
If you set these properties explicitly in your code, the values will only be visible to the driver process. The executors will not have a chance to pick up those credentials.
If you had set them in actual config file like core-site.xml, they will propagate.
Your code would work in local mode because all operations are happening in a single process.
Why it works on a cluster on small files but not large ones (*): The code could also work on unpartitioned files, where read operations are performed in the driver and partitions are then broadcast to executors. On partitioned files, where executors read individual partitions, the credentials won't be set on the executors so it fails.
Best to use standard mechanisms for passing credentials, or better yet, use EC2 roles and IAM policies in your cluster as EricJ's answer suggests. By default, if you do not provide credentials, EMRFS will look up temporary credentials via EC2 instance metadata service.
(*) I am still learning about this myself, and I may need to revise this answer as I learn more

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write ahead logs with this basic Spark streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C) - it would refuse to start, for what looks like some data corruption in the checkpoint directoty. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._
object ProtoDemo {
def createContext(dirName: String) = {
val conf = new SparkConf().setAppName("mything")
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(dirName)
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
val runningCounts = wordCounts.updateStateByKey[Int] {
(values: Seq[Int], oldValue: Option[Int]) =>
val s = values.sum
Some(oldValue.fold(s)(_ + s))
}
// Print the first ten elements of each RDD generated in this DStream to the console
runningCounts.print()
ssc
}
def main(args: Array[String]) = {
val hadoopConf = new Configuration()
val dirName = "/tmp/chkp"
val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
ssc.start()
ssc.awaitTermination()
}
}
Basically what you are trying to do is a driver failure scenario , for this to work , based on the cluster you are running you have to follow the below instructions to monitor the driver process and relaunch the driver if it fails
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to
run within the Spark Standalone cluster (see cluster deploy
mode), that is, the application driver itself runs on one of the
worker nodes. Furthermore, the Standalone cluster manager can be
instructed to supervise the driver, and relaunch it if the driver
fails either due to non-zero exit code, or due to failure of the
node running the driver. See cluster mode and supervise in the Spark
Standalone guide for more details.
YARN - Yarn supports a similar mechanism for automatically restarting an application. Please refer to YARN documentation for
more details.
Mesos - Marathon has been used to achieve this with Mesos.
You need to configure write ahead logs as below ,there are special instructions for S3 which you need to follow.
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite.
See Spark Streaming Configuration for more details.
The issue looks rather Kryo Serializer issue than checkpoint corruption.
At code example (including GitHub project), Kryo Serialization is not configured.
Since it is not configured KryoException exception could not happen.
When using "write ahead logs", and restoring from a directory, all Spark config is getting from there.
At your example, createContext method does not call when starting from the checkpoint.
I assume the issue is another application were tested before with the same checkpoint directory, where Kryo Serializer where configured.
And current application fails to be restored from that checkpoint.

java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL

I'm hitting very strange problem when trying to load JDBC DataFrame into Spark SQL.
I've tried several Spark clusters - YARN, standalone cluster and pseudo distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs in both spark-shell and when executing the code with spark-submit. I've tried MySQL & MS SQL JDBC drivers without success.
Consider following sample:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"
val t1 = {
sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> "t1",
"partitionColumn" -> "id",
"lowerBound" -> "0",
"upperBound" -> "100",
"numPartitions" -> "50"
))
}
So far so good, the schema gets resolved properly:
t1: org.apache.spark.sql.DataFrame = [id: int, name: string]
But when I evaluate DataFrame:
t1.take(1)
Following exception occurs:
15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I try to open JDBC connection on executor:
import java.sql.DriverManager
sc.parallelize(0 until 2, 2).map { i =>
Class.forName(driver)
val conn = DriverManager.getConnection(url)
conn.close()
i
}.collect()
it works perfectly:
res1: Array[Int] = Array(0, 1)
When I run the same code on local Spark, it works perfectly too:
scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])
I'm using Spark pre-built with Hadoop 2.4 support.
The easiest way to reproduce the problem is to start Spark in pseudo distributed mode with start-all.sh script and run following command:
/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar
Is there a way to work this around? It looks like a severe problem, so it's strange that googling doesn't help here.
Apparently this issue has been recently reported:
https://issues.apache.org/jira/browse/SPARK-6913
The problem is in java.sql.DriverManager that doesn't see the drivers loaded by ClassLoaders other than bootstrap ClassLoader.
As a temporary workaround it's possible to add required drivers to boot classpath of executors.
UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782
UPDATE 2: The fix merged to Spark 1.4
For writing data to MySQL
In spark 1.4.0, you have to load MySQL before writing into it because it loads drivers on load function but not on write function.
We have to put jar on every worker node and set the path in spark-defaults.conf file on each node.
This issue has been fixed in spark 1.5.0
https://issues.apache.org/jira/browse/SPARK-10036
We are stuck on Spark 1.3 (Cloudera 5.4) and so I found this question and Wildfire's answer helpful since it allowed me to stop banging my head against the wall.
Thought I would share how we got the driver into the boot classpath: we simply copied it into /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib on all the nodes.
I am using spark-1.6.1 with SQL server, still faced the same issue. I had to add the library(sqljdbc-4.0.jar) to the lib in the instance and below line in conf/spark-dfault.conf file.
spark.driver.extraClassPath lib/sqljdbc-4.0.jar