Spark write to PostgreSQL: BatchUpdateException?

I have a simple Spark job that reads large log files, filters them, and writes results to a new table. The simplified Scala driver app code is:
import java.util.Properties

// Read, parse, and filter the raw log lines
val sourceRdd = sc.textFile(sourcePath)
val parsedRdd = sourceRdd.flatMap(parseRow)
val filteredRdd = parsedRdd.filter(l => filterLogEntry(l, beginDateTime, endDateTime))

// Write the result to PostgreSQL over JDBC
val dataFrame = sqlContext.createDataFrame(filteredRdd)
val writer = dataFrame.write
val properties = new Properties()
properties.setProperty("user", "my_user")
properties.setProperty("password", "my_password")
writer.jdbc("jdbc:postgresql://ip_address/database_name", "my_table", properties)
This works perfectly on smaller batches. On a large batch, after two hours of execution, I see about 8 million records in the target table and the Spark job has failed with the following error:
Caused by: java.sql.BatchUpdateException: Batch entry 524 INSERT INTO my_table <snip> was aborted. Call getNextException to see the cause.
at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:136)
at org.postgresql.core.v3.QueryExecutorImpl$ErrorTrackingResultHandler.handleError(QueryExecutorImpl.java:308)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2004)
at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1187)
at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1212)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:351)
at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:1019)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:277)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:276)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If I copy-paste the given SQL INSERT statement into a SQL console, it works fine. In the PostgreSQL server log I see:
(this is unmodified/unanonymized log)
2016-04-26 22:38:09 GMT [3769-12] nginxlogs_admin#nginxlogs ERROR: syntax error at or near "was" at character 544
2016-04-26 22:38:09 GMT [3769-13] nginxlogs_admin#nginxlogs STATEMENT: INSERT INTO log_entries2 (client,host,req_t,request,seg,server,timestamp_value) VALUES ('68.67.161.5','"204.13.197.104"','0.000s','"GET /bid?apnx_id=&ip=67.221.130.195&idfa=&dmd5=&daid=&lt=32.90630&lg=-95.57920&u=branovate.com&ua=Mozilla%2F5.0+%28Linux%3B+Android+5.1%3B+XT1254+Build%2FSU3TL-39%3B+wv%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Version%2F4.0+Chrome%2F44.0.2403.90+Mobile+Safari%2F537.36+%5BFB_IAB%2FFB4A%3BFBAV%2F39.0.0.36.238%3B%5D&ap=&c=1&dmdl=&dmk= HTTP/1.1"','samba_info_has_geo','','2015-08-02T20:24:30.482000112') was aborted. Call getNextException to see the cause.
It seems like Spark sent the text "was aborted. Call getNextException..." to PostgreSQL as part of the SQL statement, which triggered this specific syntax error. That seems like a legitimate Spark bug. The second question is why Spark aborted the batch in the first place.
So, as far as I know, I can't call getNextException, because I'm not using JDBC directly but going through Spark.
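If I were using JDBC directly, I could walk the exception chain myself. A minimal sketch of what that would look like (the connection details are the placeholders from above, and the single-column INSERT is made up for illustration):

import java.sql.{BatchUpdateException, DriverManager, SQLException}

val conn = DriverManager.getConnection(
  "jdbc:postgresql://ip_address/database_name", "my_user", "my_password")
try {
  val stmt = conn.prepareStatement("INSERT INTO my_table (client) VALUES (?)")
  stmt.setString(1, "test")
  stmt.addBatch()
  stmt.executeBatch()
} catch {
  case e: BatchUpdateException =>
    // Walk the chained exceptions; the underlying one carries the real
    // server-side error.
    var next: SQLException = e.getNextException
    while (next != null) {
      println(next.getMessage)
      next = next.getNextException
    }
} finally {
  conn.close()
}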
FYI, this is with Spark 1.6.1 and Scala 2.11.

If anyone else is searching and hits this: my database server (running in a VM) hit its disk space limit. Spark seemed to get confused by that error, failed to log the real cause, triggered a different internal error, and logged the result of that instead. Technically, this is probably an internal Spark bug in how it responds to an uncommon database disk-full error.
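When Spark's error is this misleading, it's worth checking the database host directly rather than trusting the Spark-side message. A rough sketch (paths vary by installation):

df -h /var/lib/postgresql                          # disk usage on the DB data volume
tail -n 100 /var/log/postgresql/postgresql-*.log   # the real server-side error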

Related

Spark streaming Redis Read Time Out with Scala

While reading a table from Redis I get the error below. This code normally works well:
val readDF = spark.sparkContext.fromRedisKeyPattern(tableName,5).getHash().toDS()
It works for fewer than 2 million rows, but when I read a big table I get this error:
18/10/11 17:08:25 ERROR Executor: Exception in task 37.0 in stage 3.0 (TID 338)
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:202)
at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:40)
I also tried increasing the number of partitions:
val redis = spark.sparkContext.fromRedisKeyPattern(tableName,100).getHash().toDS()
I also changed some settings on Redis (sketched below), but I don't think the problem is there.
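A sketch of the kind of change I tried (spark.redis.timeout is my assumption about the key spark-redis reads for the Jedis socket timeout, so verify it for your connector version; host and port are placeholders):

import org.apache.spark.sql.SparkSession
import com.redislabs.provider.redis._ // spark-redis implicits for fromRedisKeyPattern

// Raise the socket timeout so long scans don't hit the default Jedis read
// timeout, and keep the higher partition count.
val spark = SparkSession.builder()
  .appName("redis-read")
  .config("spark.redis.host", "redis-host") // placeholder
  .config("spark.redis.port", "6379")
  .config("spark.redis.timeout", "60000")   // socket timeout in ms (assumed key)
  .getOrCreate()

val readDF = spark.sparkContext.fromRedisKeyPattern(tableName, 100).getHash().toDS()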
Do you know how I can solve this problem?

Spark is crashing when computing big files

I have a program in Scala that reads a CSV file, adds a new column to the DataFrame and saves the result as a Parquet file. It works perfectly on small files (< 5 GB) but when I try to use bigger files (~80 GB) it always fails when it should write the Parquet file, with this stack trace:
16/10/20 10:03:37 WARN scheduler.TaskSetManager: Lost task 14.0 in stage 4.0 (TID 886, 10.0.0.10): java.io.EOFException: reached end of stream after reading 136445 bytes; 1245184 bytes expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If anyone knows what could cause this, that would help me a lot!
System used
Spark 2.0.1
Scala 2.11
Hadoop HDFS 2.7.3
All running in Docker in a 6-machine cluster (each with 4 cores and 16 GB of RAM)
Example code
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName)))
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
Here are a few points that might help you:
I think you should check the distribution of your ipix column data: it may be that you have data skew, so one or a few partitions are much bigger than the others, and a single task working on such a fat partition can fail. It probably has something to do with the output of your function a2p. I'd first try to run this job without the repartitioning (just remove that call and see if it succeeds; without the repartition call it will use default partitions, probably split by the size of the input CSV file). A quick skew check is sketched below.
I also hope that your input CSV is not gzipped, since gzipped data is not splittable, so all the data would end up in one partition.
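For reference, a minimal skew check (df and the ipix column come from the question's code):

import org.apache.spark.sql.functions.desc

// Count rows per ipix value and inspect the heaviest keys.
df.groupBy("ipix")
  .count()
  .orderBy(desc("count"))
  .show(20, truncate = false)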
Can you provide code?
Perhaps the code you wrote is running on the driver? How do you process the file?
Spark has special functionality for handling big data, for example RDDs. Once you do:
someRdd.collect()
you bring the RDD into the driver's memory, and are hence not using Spark's distributed abilities. Code that handles big data should run on the workers.
Please check this: differentiate driver code and work code in Apache Spark
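To illustrate the difference (a sketch; someRdd and the per-row processing are placeholders):

// Runs on the driver: pulls every element across the network into one JVM.
// val allRows = someRdd.collect()

// Runs on the executors: each partition is processed where the data lives.
someRdd.foreachPartition { partition =>
  partition.foreach { row =>
    // process each row here, on the executor
    println(row)
  }
}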
The problem looks like a read failure while decompressing a stream of shuffled data in YARN mode.
Try the following code and see how it goes.
import org.apache.spark.storage.StorageLevel

var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
There is also a similar issue: Spark job failing in YARN mode

PySpark - Reading data from MongoDB using mongo-spark connector results in MongoQueryException for exceeding document size

I'm trying MongoDB's new Spark connector to read data from MongoDB. I supplied the DB and collection details to the Spark conf object when starting the application, and then use the following piece of code to read into a DataFrame:
reader = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
df = reader.options(partitioner='MongoSplitVectorPartitioner').load()
Then I write this DataFrame to a Parquet file:
df.write.parquet("/destination/path")
It starts a job with a number of tasks, all of which succeed except the last one, which fails the whole write. I'm getting an exception about a document size greater than 16 MB.
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:269)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.mongodb.MongoQueryException: Query failed with error code 16493 and error message 'Tried to create string longer than 16MB' on server xx.xxx.xx.xx:27018
at com.mongodb.connection.ProtocolHelper.getQueryFailureException(ProtocolHelper.java:131)
at com.mongodb.connection.GetMoreProtocol.execute(GetMoreProtocol.java:96)
at com.mongodb.connection.GetMoreProtocol.execute(GetMoreProtocol.java:49)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.getMore(DefaultServerConnection.java:251)
at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:218)
at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:118)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
I do not maintain the Mongo database, so I'm not sure how a document greater than 16 MB exists in the first place. It could possibly be due to the use of GridFS.
Is there a way I can skip processing the bad records?
I tried using a UDF to filter, but it also failed with the same error:
import sys
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

size_filter_udf = udf(lambda entry: sys.getsizeof(entry), IntegerType())
filtered_df = df.where(size_filter_udf(col("caseNotes")) < 16000000)
filtered_df.write.parquet("/destination/path")

Different behaviour when reading from Parquet in standalone/master-slave spark-shell

Here is a snippet from a larger program I'm using to read a DataFrame from Parquet in Scala.
case class COOMatrix(row: Seq[Long], col: Seq[Long], data: Seq[Double])

def buildMatrix(cooMatrixFields: DataFrame) = {
  val cooMatrices = cooMatrixFields map {
    case Row(r, c, d) => COOMatrix(r.asInstanceOf[Seq[Long]], c.asInstanceOf[Seq[Long]], d.asInstanceOf[Seq[Double]])
  }

  val matEntries = cooMatrices.zipWithIndex.flatMap {
    case (cooMat, matIndex) =>
      val rowOffset = cooMat.row.distinct.size
      val colOffset = cooMat.col.distinct.size
      val cooMatRowShifted = cooMat.row.map(rowEntry => rowEntry + rowOffset * matIndex)
      val cooMatColShifted = cooMat.col.map(colEntry => colEntry + colOffset * matIndex)
      (cooMatRowShifted, cooMatColShifted, cooMat.data).zipped.map {
        case (i, j, value) => MatrixEntry(i, j, value)
      }
  }

  new CoordinateMatrix(matEntries)
}

val C_entries = sqlContext.read.load(s"${dataBaseDir}/C.parquet")
val C = buildMatrix(C_entries)
My code executes successfully when running in a local spark context.
On a standalone cluster, the very same code fails as soon as it reaches an action that forces it to actually read from Parquet.
The dataframe's schema is retrieved correctly:
C_entries: org.apache.spark.sql.DataFrame = [C_row: array<bigint>, C_col: array<bigint>, C_data: array<double>]
But the executors crash when executing the line val C = buildMatrix(C_entries), with this exception:
java.lang.ExceptionInInitializerError
at $line39.$read$$iwC.<init>(<console>:7)
at $line39.$read.<init>(<console>:61)
at $line39.$read$.<init>(<console>:65)
at $line39.$read$.<clinit>(<console>)
at $line67.$read$$iwC.<init>(<console>:7)
at $line67.$read.<init>(<console>:24)
at $line67.$read$.<init>(<console>:28)
at $line67.$read$.<clinit>(<console>)
at $line68.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$3.apply(<console>:63)
at $line68.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$3.apply(<console>:62)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at $line4.$read$$iwC$$iwC.<init>(<console>:15)
at $line4.$read$$iwC.<init>(<console>:24)
at $line4.$read.<init>(<console>:26)
at $line4.$read$.<init>(<console>:30)
at $line4.$read$.<clinit>(<console>)
... 22 more
Not sure it's related, but while increasing the log verbosity, I've noticed this exception:
16/03/07 20:59:38 INFO GenerateUnsafeProjection: Code generated in 157.285464 ms
16/03/07 20:59:38 DEBUG ExecutorClassLoader: Did not load class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection from REPL class server at http://155.198.193.158:32862
java.lang.ClassNotFoundException: Class file not found at URL http://155.198.193.158:32862/org/apache/spark/sql/catalyst/expressions/GeneratedClass%24SpecificUnsafeProjection.class
I've tried different configurations for the standalone cluster:
master, 1 slave and spark-shell running on my laptop
master and 1 slave each running on separate machines, spark-shell on my laptop
master and spark-shell on one machine, 1 slave on another one
I've started with the default properties and evolved to a more elaborate properties file, without more success:
spark.driver.memory 4g
spark.rpc=netty
spark.eventLog.enabled true
spark.eventLog.dir file:///mnt/fastmp/spark_workdir/logs
spark.driver.extraJavaOptions -Xmx20480m -XX:MaxPermSize=2048m -XX:ReservedCodeCacheSize=2048m
spark.shuffle.service.enabled true
spark.shuffle.consolidateFiles true
spark.sql.parquet.binaryAsString true
spark.speculation false
spark.rpc.timeout 1000
spark.rdd.compress true
spark.core.connection.ack.wait.timeout 600
spark.driver.maxResultSize 0
spark.task.maxFailures 3
spark.shuffle.io.maxRetries 3
I'm running the pre-built version of spark-1.6.0-bin-hadoop2.6.
There's no HDFS involved in this deployment, all Parquet files are stored on a shared mount (CephFS) available to all the machines.
I doubt this is related to the underlying file system, as another part of my code reads a different Parquet file fine in both local and standalone mode.
TL;DR: package your code as a jar
For the record, the problem seemed to be linked to the use of a standalone cluster.
The exact same code works fine with these setups:
spark-shell and master on the same machine
running on YARN (AWS EMR cluster) and reading the parquet files from S3
With a bit more digging in the logs of the standalone setup, the problem seems to be linked to this exception with the class server:
INFO GenerateUnsafeProjection: Code generated in 157.285464 ms
DEBUG ExecutorClassLoader: Did not load class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection from REPL class server at http://155.198.193.158:32862
java.lang.ClassNotFoundException: Class file not found at URL http://155.198.193.158:32862/org/apache/spark/sql/catalyst/expressions/GeneratedClass%24SpecificUnsafeProjection.class
My understanding is that spark-shell starts an HTTP server (Jetty) in order to serve the classes it generates from the REPL code to the workers.
In my case, lots of classes are served successfully (I've even managed to retrieve some through telnet). However, the class GeneratedClass (and all its inner classes) can't be found by the class server.
The typical error message appearing in the log is:
DEBUG Server: RESPONSE /org/apache/spark/sql/catalyst/expressions/GeneratedClass.class 404 handled=true
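You can reproduce the 404 from a shell (a sketch; the address comes from the log lines above):

curl -s -o /dev/null -w "%{http_code}\n" \
  "http://155.198.193.158:32862/org/apache/spark/sql/catalyst/expressions/GeneratedClass.class"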
My idea is that it works with master and spark-shell on the same server, as they run in the same JVM, so the class can be found even though the HTTP transfer fails.
The only successful solution I've found so far is to build a jar package and use the --jars option of spark-shell or pass it as a parameter to spark-submit.
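Concretely, the workaround looks something like this (a sketch; the jar path, app name, and master URL are placeholders):

sbt package   # or your build tool of choice
spark-shell --master spark://<master-host>:7077 \
  --jars target/scala-2.10/my-app_2.10-1.0.jar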

java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL

I'm hitting a very strange problem when trying to load a JDBC DataFrame into Spark SQL.
I've tried several Spark clusters - YARN, standalone cluster and pseudo distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs in both spark-shell and when executing the code with spark-submit. I've tried MySQL & MS SQL JDBC drivers without success.
Consider the following sample:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"

val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}
So far so good, the schema gets resolved properly:
t1: org.apache.spark.sql.DataFrame = [id: int, name: string]
But when I evaluate the DataFrame:
t1.take(1)
the following exception occurs:
15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I try to open a JDBC connection on an executor:
import java.sql.DriverManager

sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()
it works perfectly:
res1: Array[Int] = Array(0, 1)
When I run the same code on local Spark, it works perfectly too:
scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])
I'm using Spark pre-built with Hadoop 2.4 support.
The easiest way to reproduce the problem is to start Spark in pseudo-distributed mode with the start-all.sh script and run the following command:
/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar
Is there a way to work around this? It looks like a severe problem, so it's strange that googling doesn't help here.
Apparently this issue has been recently reported:
https://issues.apache.org/jira/browse/SPARK-6913
The problem is that java.sql.DriverManager doesn't see drivers loaded by ClassLoaders other than the bootstrap ClassLoader.
As a temporary workaround, it's possible to add the required drivers to the boot classpath of the executors, as sketched below.
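A sketch of what that could look like in spark-defaults.conf (the driver path is a placeholder; -Xbootclasspath/a is the standard JVM flag for appending to the boot classpath):

spark.executor.extraJavaOptions -Xbootclasspath/a:/path/to/mysql-connector-java-5.1.35.jar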
UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782
UPDATE 2: The fix merged to Spark 1.4
For writing data to MySQL:
In Spark 1.4.0, you have to load from MySQL before writing into it, because Spark loads drivers in the load function but not in the write function.
We have to put the jar on every worker node and set the path in the spark-defaults.conf file on each node, along the lines of the sketch below.
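For example, something along these lines in spark-defaults.conf (the path is a placeholder):

spark.driver.extraClassPath /path/to/mysql-connector-java-5.1.35.jar
spark.executor.extraClassPath /path/to/mysql-connector-java-5.1.35.jar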
This issue has been fixed in spark 1.5.0
https://issues.apache.org/jira/browse/SPARK-10036
We are stuck on Spark 1.3 (Cloudera 5.4) and so I found this question and Wildfire's answer helpful since it allowed me to stop banging my head against the wall.
Thought I would share how we got the driver into the boot classpath: we simply copied it into /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib on all the nodes.
I am using spark-1.6.1 with SQL Server and still faced the same issue. I had to add the library (sqljdbc-4.0.jar) to the lib directory on the instance, and the line below to the conf/spark-defaults.conf file.
spark.driver.extraClassPath lib/sqljdbc-4.0.jar