I need to add an external dependent library (spark-streaming-mqtt_2.10-1.5.2.jar in my case) to my PySpark word count code. I know we can add external jars with the --jars option of spark-submit or when running the pyspark shell, but I want to add this jar in my code or in a Spark config file. I found that there is a SparkContext.addJar() method, which can be called in code.
sc.addJar("spark-streaming-mqtt_2.10-1.5.2.jar")
But the above command gives me the error: AttributeError: 'SparkContext' object has no attribute 'addJar'.
I have tried adding this jar to the spark-defaults.conf file as:
spark.driver.extraClassPath spark-streaming-mqtt_2.10-1.5.2.jar
spark.executor.extraClassPath spark-streaming-mqtt_2.10-1.5.2.jar
But this is also not working for me. I have searched the internet but have not found any useful link.
I am using Spark 1.5.2 with 1 namenode and 3 datanodes in an HDP cluster.
Can you please help me solve this issue? I would be really thankful.
spark.driver.extraClassPath and spark.executor.extraClassPath will work, but these should be paths that exist on your Hadoop nodes: the files are not uploaded, they are just added to the classpath of the Spark containers.
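For illustration (a hedged sketch, assuming the jar has been copied to /opt/jars on every node, a hypothetical location), the entries would look like:
spark.driver.extraClassPath /opt/jars/spark-streaming-mqtt_2.10-1.5.2.jar
spark.executor.extraClassPath /opt/jars/spark-streaming-mqtt_2.10-1.5.2.jar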
It worked for me by adding the external dependency in spark-defaults.conf as:
spark.jars spark-streaming-mqtt_2.10-1.5.2.jar
Now my job picks up the external dependency.
Related
I am running Databricks 7.3 LTS and getting errors while trying to use the Scala bulk copy.
The error is:
object sqldb is not a member of package com.microsoft.
I have installed the correct SQL connector drivers but I am not sure how to fix this error.
The installed drivers are:
com.microsoft.azure:spark-mssql-connector_2.12:1.1.0.
Also, I have installed the JAR dependency as below:
spark_mssql_connector_2_12_1_1_0.jar
I couldn't find any Scala code example for the above configuration on the internet.
My Scala code sample is as below:
%scala
import com.microsoft.azure.sqldb.spark.config.Config
As soon as I run this command I get the error:
Object sqldb is not a member of package com.microsoft.azure
Any help, please?
The import com.microsoft.azure.sqldb.spark.config.Config belongs to the old azure-sqldb-spark connector, which is not the jar you installed; spark-mssql-connector exposes its classes under com.microsoft.sqlserver.jdbc.spark instead. In the new connector you need to use the com.microsoft.sqlserver.jdbc.spark.SQLServerBulkJdbcOptions class to specify bulk copy options.
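For illustration, a hedged sketch of a bulk write with the new connector (the connection values and table name are placeholders, and df is assumed to be the DataFrame being written):
%scala
// Bulk write using the com.microsoft.sqlserver.jdbc.spark data source
df.write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .mode("append")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb")
  .option("dbtable", "dbo.MyTable")
  .option("user", "myuser")
  .option("password", "mypassword")
  .option("tableLock", "true")  // bulk copy option: take a table lock for faster inserts
  .option("batchsize", "10000") // bulk copy option: rows per round trip
  .save()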
Please see the screenshots below; I am getting the same issue while running a Spark program. Can you please help? java.io.IOException: Could not locate executable C:\hadoop\bin\bin\winutils.exe in the Hadoop binaries. I have added the necessary paths.
The doubled bin\bin in your error means hadoop.home.dir is already pointing at the bin directory; it must point at its parent, because Hadoop appends \bin\winutils.exe to it. I downloaded winutils.exe, placed it at C:\winutils\bin\winutils.exe, and then added the following line to my project at the start of the main function:
System.setProperty("hadoop.home.dir", "C:\\winutils\\")
We can't figure out the following issue: we are trying to use Apache Hudi to save data to the storage. The problem is that when we upload a fat jar which includes the org.json package in its dependencies, the df.save() call fails with
java.lang.NoClassDefFoundError: org/json/JSONException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:384)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:367)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:357)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:262)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:176)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:130)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:321)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:363)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:359)
Even if I go to the cluster libraries and explicitly add this dependency, it still fails on save. On the other hand, when I just create new JSONException("hello") in my notebook, everything seems to work fine. What could cause this behaviour? Thanks
This is probably because the jar is not making its way to the executor nodes; try addJar (https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar-java.lang.String-).
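For example, from a notebook (the jar path is a hypothetical DBFS location, not from the original post):
sc.addJar("dbfs:/FileStore/jars/json-20180813.jar")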
What version of Hudi are you using? There is a problem with JSON in version 0.6.0 and there is an open issue for it. I suggest you use version 0.5.2 for now.
It turns out that the problem was a classpath difference between the metastore service and the Spark process, because they run in separate JVMs. The problem was fixed with an init script that downloads the jar to the classpath folder.
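For illustration, a hedged sketch of creating such an init script from a Databricks notebook (the jar name and DBFS paths are placeholders; /databricks/jars is the folder Databricks puts on the cluster classpath):
dbutils.fs.put(
  "dbfs:/databricks/scripts/copy-json-jar.sh",
  """#!/bin/bash
    |cp /dbfs/FileStore/jars/json-20180813.jar /databricks/jars/
    |""".stripMargin,
  true)
The script then has to be registered as a cluster-scoped init script in the cluster settings so it runs on every node at startup.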
I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble my job into a jar file (I create an uber JAR using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the jar file, so it should be available for the job. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of them worked for me.
Also, when running on EMR I can see in the log that the JAR was added (not sure whether it is loaded onto the classpath, but it should be, because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and I see a few tickets on the Spark JIRA board, but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
I think when launching your job on EMR you need to provide the S3 location of your jar dependencies, as per the manual, e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when running Spark.
It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class throws a ClassNotFoundException even if the class is on the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. In the end, the issue was not related to Spark.
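If dropping the library is not an option, a possible workaround (a hedged sketch; all names here are illustrative, not from the original post) is to deserialize with an ObjectInputStream that resolves classes through the thread context classloader, which is the behaviour LANG-1049 asks for:
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// Resolves classes via the thread context classloader instead of the
// default one, which is what fails in multi-classloader environments.
class ContextClassLoaderObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false, Thread.currentThread().getContextClassLoader)
}

def deserialize[T](bytes: Array[Byte]): T = {
  val in = new ContextClassLoaderObjectInputStream(new ByteArrayInputStream(bytes))
  try in.readObject().asInstanceOf[T]
  finally in.close()
}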
When developing an IntelliJ IDEA plugin, I want to use Akka, but I have some problems.
I created a demo project here: https://github.com/freewind/idea-plugin-akka-demo
You can just clone it and reproduce the problem on your computer (note the Setup section).
I copy the problems here:
Problems
1. Can't use the default Akka configuration
If I remove:
src/main/resources/application.conf
src/main/scala/freewind/MyAkkaConfig
and run this plugin, it reports this error on startup:
com.intellij.ide.plugins.PluginManager$StartupAbortedException:
com.intellij.diagnostic.PluginException: No configuration setting found for key 'akka'
[Plugin: com.yourcompany.unique.plugin.id]
2. Can't load the configuration from a file
Then I copied reference.conf from the Akka jar to src/main/resources/application.conf, but it still reports the same error. It seems Akka in an IDEA plugin can't find this file automatically.
3. ClassNotFoundException: akka.actor.LightArrayRevolverScheduler
So I had to use MyAkkaConfig.scala to hardcode the configuration in Scala code, but this time it reports another error:
com.intellij.ide.plugins.PluginManager$StartupAbortedException:
com.intellij.diagnostic.PluginException: ClassNotFoundException: akka.actor.LightArrayRevolverScheduler
[Plugin: com.yourcompany.unique.plugin.id]
akka.actor.LightArrayRevolverScheduler is used in MyAkkaConfig.scala and is included in akka-actor_2.11:2.3.12.jar. But why can't IDEA load it?
For the 3rd problem, it can be fixed by passing the classloader:
val system = ActorSystem("my-actor", MyAkkaConfig.config, this.getClass.getClassLoader)
But we can also remove MyAkkaConfig.config and use the application.conf file under resources; passing the plugin's classloader also addresses problems 1 and 2, because by default Akka loads its configuration through a classloader that cannot see the plugin's resources.
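A hedged sketch of that combined approach, from inside a plugin class (Akka 2.3-era APIs, matching the question):
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Use the plugin's classloader both to locate application.conf/reference.conf
// and to let Akka load its own classes such as LightArrayRevolverScheduler.
val pluginClassLoader = this.getClass.getClassLoader
val system = ActorSystem("my-actor", ConfigFactory.load(pluginClassLoader), pluginClassLoader)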