How to add databricks avro jar to hdinsight - scala

I'm currently trying to run a Spark Scala job on our HDInsight cluster with the external library spark-avro, without success. Could someone help me out with this? The goal is to find the necessary steps to read Avro files residing on Azure Blob storage from HDInsight clusters.
Current specs:
Spark 2.0 on Linux (HDI 3.5) cluster type
Scala 2.11.8
spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar
spark-avro_2.11:3.2.0
tutorial used: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-intellij-tool-plugin
Spark scala code:
based on the example on: https://github.com/databricks/spark-avro
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession
object AvroReader {
  def main(arg: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    // WASB URLs have the form wasb://<container>@<storageaccount>.blob.core.windows.net/<path>
    val df = spark.read.avro("wasb://container@storageaccount.blob.core.windows.net/directory")
    df.head(5)
  }
}
Error received:
java.lang.NoClassDefFoundError: com/databricks/spark/avro/package$
at MediahuisHDInsight.AvroReader$.main(AvroReader.scala:14)
at MediahuisHDInsight.AvroReader.main(AvroReader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.avro.package$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

By default your default_artifact.jar only contains your classes, not classes from the libraries you reference. You can presumably use the "Referenced Jars" input field for this.
Another way is to add your libraries, unpacked, to your artifact. Go to File -> Project Structure. Under Available Elements, right-click the spark-avro library and select Extract Into Output Root. Click OK, then Build -> Build Artifacts and resubmit.
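To confirm whether the library classes actually made it into the submitted artifact, a small diagnostic can help. The following is a minimal sketch, not part of the original answer: the object name ClasspathCheck and its helper are hypothetical, and it only uses the JDK's Class.forName to test whether the class named in the stack trace can be loaded.

```scala
// Diagnostic sketch (hypothetical helper): reports whether classes can be loaded at runtime.
object ClasspathCheck {
  /** Returns true if `className` can be located by this object's class loader. */
  def isOnClasspath(className: String): Boolean =
    try {
      // initialize = false: only locate the class, do not run its static initializers
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
      case _: NoClassDefFoundError   => false
    }

  def main(args: Array[String]): Unit = {
    // Defaults to the class named in the stack trace; other names can be passed as arguments.
    val names = if (args.nonEmpty) args.toSeq else Seq("com.databricks.spark.avro.package$")
    names.foreach { n =>
      println(s"$n -> ${if (isOnClasspath(n)) "found" else "MISSING"}")
    }
  }
}
```

If this reports MISSING when run from the same artifact on the cluster, the spark-avro classes were not packaged or referenced, which matches the NoClassDefFoundError above.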


Netezza connection with Spark / Scala JDBC

I've set up Spark 2.2.0 on my Windows machine using Scala 2.11.8 on IntelliJ IDE. I'm trying to make Spark connect to Netezza using JDBC drivers.
I've read through this link and added the com.ibm.spark.netezza jars to my project through Maven. I then tried to run the Scala script below just to test the connection:
package jdbc
object SimpleScalaSpark {
  def main(args: Array[String]) {
    import org.apache.spark.sql.{SparkSession, SQLContext}
    import com.ibm.spark.netezza
    val spark = SparkSession.builder
      .master("local")
      .appName("SimpleScalaSpark")
      .getOrCreate()
    val sqlContext = SparkSession.builder()
      .appName("SimpleScalaSpark")
      .master("local")
      .getOrCreate()
    val nzoptions = Map("url" -> "jdbc:netezza://SERVER:5480/DATABASE",
      "user" -> "USER",
      "password" -> "PASSWORD",
      "dbtable" -> "ADMIN.TABLENAME")
    val df = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load()
  }
}
However I get the following error:
17/07/27 16:28:17 ERROR NetezzaJdbcUtils$: Couldn't find class org.netezza.Driver
java.lang.ClassNotFoundException: org.netezza.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:49)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:netezza://SERVER:5480/DATABASE
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:54)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
I have two ideas:
1) I don't believe I actually installed any Netezza JDBC driver, though I thought the jars I brought into my project from the link above were sufficient. Am I just missing a driver, or am I missing something in my Scala script?
2) In the same link, the author makes mention of starting the Netezza Spark package:
For example, to use the Spark Netezza package with Spark’s interactive
shell, start it as shown below:
$SPARK_HOME/bin/spark-shell --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path ~/nzjdbc.jar
I don't believe I'm invoking any package apart from jdbc in my script. Do I have to add that to my script?
Thanks!
Your 1st idea is right, I think. You almost certainly need to install the Netezza JDBC driver if you have not done this already.
From the link you posted:
This package can be deployed as part of an application program or from
Spark tools such as spark-shell, spark-sql. To use the package in the
application, you have to specify it in your application’s build
dependency. When using from Spark tools, add the package using
--packages command line option. Netezza JDBC driver also should be
added to the application dependencies.
The Netezza driver is something you have to download yourself, and you need support entitlement to get access to it (via IBM's Fix Central or Passport Advantage). It is included in either the Windows driver/client support package, or the linux driver package.
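Once the driver jar has been added to the project, it can be useful to verify that it is really visible before involving Spark. The sketch below is an illustration under assumptions: the object name JdbcDriverCheck and its methods are hypothetical, and only the class name org.netezza.Driver comes from the error message above. It relies solely on the JDK's java.sql.DriverManager.

```scala
import java.sql.DriverManager

// Diagnostic sketch (hypothetical helper): checks JDBC driver visibility without Spark.
object JdbcDriverCheck {
  /** Class names of all JDBC drivers currently registered with DriverManager. */
  def registeredDrivers(): List[String] = {
    val drivers = DriverManager.getDrivers
    val names = List.newBuilder[String]
    while (drivers.hasMoreElements) names += drivers.nextElement().getClass.getName
    names.result()
  }

  /** True if the driver class can be loaded, i.e. its jar is on the classpath. */
  def driverAvailable(className: String): Boolean =
    try { Class.forName(className); true }
    catch { case _: ClassNotFoundException => false }

  def main(args: Array[String]): Unit = {
    // org.netezza.Driver is the class the connector failed to find in the log above.
    println(s"org.netezza.Driver loadable: ${driverAvailable("org.netezza.Driver")}")
    println(s"Registered JDBC drivers: ${registeredDrivers().mkString(", ")}")
  }
}
```

If the first line prints false, the Netezza driver jar is still missing from the application's classpath, regardless of which Spark packages are present.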

Zeppelin 6.5 + Apache Kafka connector for Structured Streaming 2.0.2

I'm trying to run a zeppelin notebook that contains spark's Structured Streaming example with Kafka connector.
>kafka is up and running on localhost port 9092
>from zeppelin notebook, sc.version returns String = 2.0.2
Here is my environment:
kafka: kafka_2.10-0.10.1.0
zeppelin: zeppelin-0.6.2-bin-all
spark: spark-2.0.2-bin-hadoop2.7
Here is the code in my zeppelin notebook:
import org.apache.spark.sql.functions.{explode, split}
// Setup connection to Kafka
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // comma separated list of broker:host
  .option("subscribe", "twitter") // comma separated list of topics
  .option("startingOffsets", "latest") // read data from the end of the stream
  .load()
Here is the error I'm getting when I run the notebook:
import org.apache.spark.sql.functions.{explode, split}
java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:218)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 86 elided
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at scala.util.Try$.apply(Try.scala:192)
Any help or advice would be greatly appreciated.
Thanks
You have probably figured this out already, but I'm posting the answer for others: you have to add the following to zeppelin-env.sh.j2:
SPARK_SUBMIT_OPTIONS=--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0
along with potentially other dependencies if you are using the kafka client:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,org.apache.spark:spark-sql_2.11:2.1.0,org.apache.kafka:kafka_2.11:0.10.0.1,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0,org.apache.kafka:kafka-clients:0.10.0.1
This solution has been tested in Zeppelin version 0.10.1.
You need to add the dependencies of your code. This can be done in the Zeppelin UI: go to the Interpreter panel (http://localhost:8080/#/interpreter) and, in the spark section, add the artifact for each dependency under Dependencies. If adding spark-sql-kafka leads to other dependency issues, add all the packages that spark-sql-kafka needs. You can find them in the Compile Dependencies section of its Maven repository page.
I'm working with Spark version 3.0.0 and Scala version 2.12, and I was trying to integrate Spark with Kafka. I managed to get past this issue by adding all of the artifacts below:
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0
com.google.code.findbugs:jsr305:3.0.0
org.apache.spark:spark-tags_2.12:3.3.0
org.apache.spark:spark-token-provider-kafka-0-10_2.12:3.3.0
org.apache.kafka:kafka-clients:3.0.0
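The artifact names above encode two versions: the Scala binary version as a suffix (_2.11, _2.12) and the Spark version at the end. As a rough sketch, and assuming the usual convention that the connector should match the Scala and Spark versions of the running interpreter, the coordinate for the Structured Streaming Kafka source can be derived as below. The object and method names are hypothetical, and "2.0.2" is simply the value sc.version returned in the question.

```scala
// Sketch (hypothetical helper): builds the Maven coordinate for the Kafka source.
object KafkaConnectorCoordinates {
  /** "2.11.8" -> "2.11": artifact suffixes use the Scala binary version. */
  def scalaBinaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  /** Coordinate of spark-sql-kafka-0-10 for the given Scala and Spark versions. */
  def sparkSqlKafka(scalaVersion: String, sparkVersion: String): String =
    s"org.apache.spark:spark-sql-kafka-0-10_${scalaBinaryVersion(scalaVersion)}:$sparkVersion"

  def main(args: Array[String]): Unit = {
    // Scala version of the running JVM; in a notebook, sc.version would supply the Spark version.
    val scalaVersion = scala.util.Properties.versionNumberString
    println(sparkSqlKafka(scalaVersion, "2.0.2"))
  }
}
```

The resulting string can be pasted into the interpreter's Dependencies field or passed to --packages.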

Trying to run Scala test, getting java.lang.ClassNotFoundException: org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner

I have a Gradle project in IntelliJ IDEA 2016.2. Every time I run the Scala tests in the project, I get the following exception:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:48)
Caused by: java.lang.ClassNotFoundException: org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:123)
... 5 more
I checked the versions of the dependencies and I have added the Scala SDK to the project module as well. I also added the Scala plugin to the Gradle file and installed the Scala plugin in IntelliJ IDEA. Also, the tests run without an error on my colleague's computer so we have no idea what the error could be.
Found the cause: I have an accented letter in my user directory's name, and IDEA always tries to use some file from AppData under that directory. I have already changed the idea.properties file, but that has no effect on that file.
A possible workaround is to run the tests through Gradle (or Maven, sbt, etc.). In my case, with Gradle, I just add @RunWith(classOf[JUnitRunner]) to the Scala class I want to test, then execute Gradle's test task.
For me, the key was the command-line length limitation. IDEA offers about three ways to shorten an overly long command line. Try a different one and check whether the tests run.
The setting is in the run configuration.

FSDataInputStream ClassNotFoundException in Spark

I am new to Spark application programming and am struggling with this basic example.
I am using Scala IDE and have attached the relevant jar files from the latest Hadoop and Spark distributions. There is just one basic Scala object that I am working with:
hadoop - 2.7
spark - 2.0.0
I have tried this in both scenarios, with the Hadoop processes running on my laptop and without them running, and the behaviour is the same. By the way, spark-shell does not complain about anything.
import org.apache.spark.SparkConf
object SparkAppTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark test")
    conf.setMaster("spark://master:7077")
    conf.setSparkHome("/hadoop/spark")
    conf.set("spark.driver.host", "localhost")
  }
}
When I try to run this using Eclipse -> Run As -> Scala Application, it fails with the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:65)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:60)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at SparkAppTest$.main(SparkAppTest.scala:6)
at SparkAppTest.main(SparkAppTest.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more

NoSuchMethodError for Scala Seq line in Spark

I am getting an error when trying to run plain Scala code in Spark, similar to these posts: this and this
Their problem was that they were using the wrong Scala version to compile their Spark project. However, mine is the correct version.
I have Spark 1.6.0 installed on an AWS EMR cluster to run the program. The project is compiled on my local machine with Scala 2.11 installed and 2.11 listed in all dependencies and build files without any references to 2.10.
This is the exact line that throws the error:
var fieldsSeq: Seq[StructField] = Seq()
And this is the exact error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
at com.myproject.MyJob$.main(MyJob.scala:39)
at com.myproject.MyJob.main(MyJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark 1.6 on EMR is still built with Scala 2.10, so yes, you are having the same issue as in the posts you linked. In order to use Spark on EMR, you currently must compile your application with Scala 2.10.
Spark has upgraded their default Scala version to 2.11 as of Spark 2.0 (to be released within the next several months), so once EMR supports Spark 2.0, we will likely follow this new default and compile Spark with Scala 2.11.
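One way to see which Scala version the cluster actually runs is to print it from the driver, either by evaluating scala.util.Properties.versionNumberString in spark-shell or with a tiny job like the sketch below. This is an illustration, not part of the original answer: the object name ScalaVersionCheck is hypothetical, and the "2.11.8" value stands in for whatever version the project was compiled with.

```scala
// Sketch (hypothetical helper): compares the runtime Scala version with the build's.
object ScalaVersionCheck {
  /** "2.10.5" -> "2.10": Scala is binary compatible only within the same major.minor line. */
  def binaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  /** True if code compiled for `compiledFor` can be expected to link against `runtime`. */
  def compatible(compiledFor: String, runtime: String): Boolean =
    binaryVersion(compiledFor) == binaryVersion(runtime)

  def main(args: Array[String]): Unit = {
    val runtime = scala.util.Properties.versionNumberString // Scala library on the classpath
    val compiledFor = "2.11.8" // assumption: the version used in the local build
    println(s"Runtime Scala: $runtime, compiled for: $compiledFor")
    if (!compatible(compiledFor, runtime))
      println("Binary versions differ; expect errors such as NoSuchMethodError.")
  }
}
```

A mismatch here, such as 2.10 at runtime against a 2.11 build, is consistent with the NoSuchMethodError on scala.runtime.ObjectRef.create shown above.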