How to run Spark on Zeppelin to analyze xml files - scala

I am able to run Spark shell by bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.0 to analize xml files, for example:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load("books.xml")
but how can I run Zeppelin to do it so. Does Zeppelin need some parameter at start to import com.databricks.spark.xml?
Now I am getting:
java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.xml at
scala.sys.package$.error(package.scala:27) at
org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
at
org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) at
$iwC$$iwC$$iwC$$iwC$$iwC.(:35) at
$iwC$$iwC$$iwC$$iwC.(:37) at
$iwC$$iwC$$iwC.(:39) at $iwC$$iwC.(:41)
at $iwC.(:43) at (:45) at
.(:49) at .() at
.(:7) at .() at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497) at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:709)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:674)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:667)
at
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:300)
at org.apache.zeppelin.scheduler.Job.run(Job.java:169) at
org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:134)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

In Zeppelin, you need to call for those dependencies before creating the SparkContext.
In a separate cell you add and run the following
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-xml_2.11:0.3.0")
If this gives you an error from the type : "You have to add dependencies before starting your SparkContext" just restart the interpreter or Zeppelin.

The above solution is valid for older versions of zeppelin. For version >0.9 the %dep has been removed. Please refer to the latest doc

Related

SparkNLP NerDLModel load throws NoSuchMethodException

I am currently using John Snow labs SparkNLP library to train a custom NER Model. I am able to successfully complete the training and the model is getting saved to the disk. When I try to load the model for next step to actually tag some sample data I run into the following error. I using the Ubuntu on Windows 10. spark-nlp_2.12:3.1.2 with Spark 3.1.2, scala 2.12.10 OpenJDK 8
I have also tried the same on PySpark I get same the exact error
pyspark error:
java.lang.NoSuchMethodException:
org.apache.spark.ml.PipelineModel.(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:468)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Any help is appreciated. Scala version error
java.lang.NoSuchMethodException:
org.apache.spark.ml.PipelineModel.(java.lang.String) at
java.lang.Class.getConstructor0(Class.java:3082) at
java.lang.Class.getConstructor(Class.java:1825) at
org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:468)
at
com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
at
com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
val loaded_ner_model = NerDLModel.load("./NER_bert_prod_20200221")
.setInputCols("sentence", "token", "bert")
.setOutputCol("ner")

Create MongoDB External View using SparkSQL Only

I'm trying to query MongoDB using Spark SQL shell. I have a limitation that I can only use SQL: no Scala, Python, etc. I intend to use Thrift, but for proof of concept, I am using spark-sql. I'm using EMR with Spark version 2.4.4. More info:
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_242
Branch HEAD
Compiled by user ec2-user on 2019-12-14T00:54:30Z
Revision 5f788d5e8f90539ee331702c753fa250727128f4
Url git#aws157git.com:/pkg/Aws157BigTop
Type --help for more information.
I start my shell with a pointer to MongoDB Spark maven coordinates:
spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.12:2.4.1 --conf spark.mongodb.input.uri=mongodb://something.real/development?readPreference=secondary
Spark SQL seems to recognise the jar, via the logs:
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
Then I run
CREATE TEMPORARY VIEW mongo
USING com.mongodb.spark.sql.DefaultSource
OPTIONS (
collection 'accounts'
);
And I get the following error:
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<init>(DefaultMongoPartitioner.scala:64)
at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<clinit>(DefaultMongoPartitioner.scala)
at com.mongodb.spark.config.ReadConfig$.<init>(ReadConfig.scala:48)
at com.mongodb.spark.config.ReadConfig$.<clinit>(ReadConfig.scala)
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:91)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:93)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<init>(DefaultMongoPartitioner.scala:64)
at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<clinit>(DefaultMongoPartitioner.scala)
at com.mongodb.spark.config.ReadConfig$.<init>(ReadConfig.scala:48)
at com.mongodb.spark.config.ReadConfig$.<clinit>(ReadConfig.scala)
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:91)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:93)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any idea how to set up this view using SQL only, ideally without any startup command line options except the Maven coordinates would also be ace.
Looks like a mismatch between Scala versions.
Starting spark-sql with spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 worked alright.
To create the Mongo view using SQL only:
CREATE TEMPORARY VIEW source
USING mongo
OPTIONS (
uri 'mongodb://…',
collection 'my_collection'
);

Spark GPUenabler ClassNotFoundException: CacheGPU

I am using the package IBMSparkGPU/GPUenabler package. I use sbt assembly to package all the dependency into one single jar file and submit it to the spark standalone cluster manager. However, the following error message appear:
org.apache.spark.SparkException: Error sending message [message = CacheGPUDS(a99176e95cf37ba4e5e46b9b172369ac_-1728716590,false)]
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at com.ibm.gpuenabler.GPUMemoryManagerMasterEndPoint.com$ibm$gpuenabler$GPUMemoryManagerMasterEndPoint$$tell(GPUMemoryManager.scala:172)
at com.ibm.gpuenabler.GPUMemoryManagerMasterEndPoint$$anonfun$registerGPUMemoryManager$2.apply(GPUMemoryManager.scala:64)
at com.ibm.gpuenabler.GPUMemoryManagerMasterEndPoint$$anonfun$registerGPUMemoryManager$2.apply(GPUMemoryManager.scala:64)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at com.ibm.gpuenabler.GPUMemoryManagerMasterEndPoint.registerGPUMemoryManager(GPUMemoryManager.scala:64)
at com.ibm.gpuenabler.GPUMemoryManagerMasterEndPoint$$anonfun$receiveAndReply$1.applyOrElse(GPUMemoryManager.scala:143)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
... 16 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.ibm.gpuenabler.CacheGPUDS
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1866)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1749)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2040)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1571)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2285)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2209)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1571)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257)
at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:577)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:562)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:159)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
... more
There won't be this error message and the program can run if I submit to local[*] as master. I can also eliminate the error when disable the spark.gpuenabler.autocache. However, is there other way to properly fix the issue?
I am using Ubuntu 17.04, JRE 1.8.0, Scala 2.11 and Spark 2.1.0.
It turned out that adding the option
--conf "spark.executor.extraClassPath=file://path/to/jar" will solve the problem. Another thing is that I need to paste the jar file to all the machine with same path. Other wise the worker will not be able to get the jar file.

how to include jdbc jar in spark using maven

I have a spark (2.1.0) job that uses the postgres jdbc driver as described here: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
I'm using the dataframe writer like
val jdbcURL = s"jdbc:postgresql://${config.pgHost}:${config.pgPort}/${config.pgDatabase}?user=${config.pgUser}&password=${config.pgPassword}"
val connectionProperties = new Properties()
connectionProperties.put("user", config.pgUser)
connectionProperties.put("password", config.pgPassword)
dataFrame.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, tableName, connectionProperties)
I'm successfully including the jdbc driver from https://jdbc.postgresql.org/download/postgresql-42.1.1.jar downloading it manually and using --jars postgresql-42.1.1.jar --driver-class-path postgresql-42.1.1.jar
However, I'd prefer to not have to download it first.
I've unsuccessfully tried --jars https://jdbc.postgresql.org/download/postgresql-42.1.1.jar, but that fails from
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:364)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:480)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:600)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1154)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have also tried:
including "org.postgresql" % "postgresql" % "42.1.1" in my build.sbt file
spark-submit options: --repositories https://mvnrepository.com/artifact --packages org.postgresql:postgresql:42.1.1
spark-submit options: --repositories https://mvnrepository.com/artifact --conf "spark.jars.packages=org.postgresql:postgresql:42.1.1
these each fail the same way:
17/08/01 13:14:49 ERROR yarn.ApplicationMaster: User class threw exception: java.sql.SQLException: No suitable driver
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:446)
You can copy JDBC jar file to jars folder in spark directory and deploy your application with spark-submit without --jars option.
Specify the driver option like you do user and password with the JDBC class.

custom datasource in spark-sql with mist

I have custom data source which I use in sqlContext.read.format(....sparkjobs.fileloader.DataSource). This works well via spark-submit in local & yarn mode.
But, when I invoke the same job via Mist, it throws exception:
Failed to find data source. Please find packages at http://spark-packages.org
My fat jar has the DataSource class. And also, at the beginning of the code, I logged all the jars available at classpath and I can see my jar.
Error Log:
java.lang.ClassNotFoundException: Failed to find data source: c.p.b.f.sparkjobs.fileloader.DataSource. Pleas
e find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at c.p.b.f.sparkjobs.fileloader.m.McParser.processFile(McParser.scala:44)
at c.p.b.f.sparkjobs.fileloader.FileProcessor$$anonfun$processFilesSingle$1.apply(FileProcessor.scala:7
3)
at c.p.b.f.sparkjobs.fileloader.FileProcessor$$anonfun$processFilesSingle$1.apply(FileProcessor.scala:7
0)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at c.p.b.f.sparkjobs.fileloader.FileProcessor.processFilesSingle(FileProcessor.scala:70)
at c.p.b.f.sparkjobs.fileloader.FileProcessor$$anonfun$processFilesBulk$1.apply(FileProcessor.scala:58)
at c.p.b.f.sparkjobs.fileloader.FileProcessor$$anonfun$processFilesBulk$1.apply(FileProcessor.scala:49)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.MapLike$DefaultKeySet.foreach(MapLike.scala:174)
at c.p.b.f.sparkjobs.fileloader.FileProcessor.processFilesBulk(FileProcessor.scala:49)
at c.p.b.f.sparkjobs.fileloader.FileProcessor.init(FileProcessor.scala:33)
at c.p.b.f.sparkjobs.fileloader.Loader.execute(Loader.scala:162)
at c.p.b.f.util.LoaderApp$.execute(LoaderApp.scala:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at io.hydrosphere.mist.jobs.jar.JobInstance.io$hydrosphere$mist$jobs$jar$JobInstance$$invokeMethod(JobInstance.scala:40)
at io.hydrosphere.mist.jobs.jar.JobInstance$$anonfun$run$1$$anonfun$apply$3$$anonfun$apply$4.apply(JobInstance.scala:24)
at io.hydrosphere.mist.jobs.jar.JobInstance$$anonfun$run$1$$anonfun$apply$3$$anonfun$apply$4.apply(JobInstance.scala:24)
at cats.syntax.CatchOnlyPartiallyApplied.apply(either.scala:294)
at io.hydrosphere.mist.jobs.jar.JobInstance$$anonfun$run$1$$anonfun$apply$3.apply(JobInstance.scala:24)
at io.hydrosphere.mist.jobs.jar.JobInstance$$anonfun$run$1$$anonfun$apply$3.apply(JobInstance.scala:23)
at cats.syntax.EitherOps$.flatMap$extension(either.scala:129)
at io.hydrosphere.mist.jobs.jar.JobInstance$$anonfun$run$1.apply(JobInstance.scala:23)
at io.hydrosphere.mist.jobs.jar.JobInstance$$anonfun$run$1.apply(JobInstance.scala:22)
at cats.syntax.EitherOps$.flatMap$extension(either.scala:129)
at io.hydrosphere.mist.jobs.jar.JobInstance.run(JobInstance.scala:22)
at io.hydrosphere.mist.worker.runners.scala.ScalaRunner$$anonfun$run$1.apply(ScalaRunner.scala:26)
at io.hydrosphere.mist.worker.runners.scala.ScalaRunner$$anonfun$run$1.apply(ScalaRunner.scala:25)
at cats.syntax.EitherOps$.flatMap$extension(either.scala:129)
at io.hydrosphere.mist.worker.runners.scala.ScalaRunner.run(ScalaRunner.scala:25)
at io.hydrosphere.mist.worker.runners.MistJobRunner$.run(MistJobRunner.scala:18)
at io.hydrosphere.mist.worker.WorkerActor$$anonfun$2.apply(WorkerActor.scala:78)
at io.hydrosphere.mist.worker.WorkerActor$$anonfun$2.apply(WorkerActor.scala:76)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: c.p.b.f.sparkjobs.fileloader.DataSource.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:82)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 47 more