I have a Spark app written in Scala that is writing and reading from Parquet files.
The app exposes an HTTP API and, when it receives requests, sends work to a Spark cluster through a long-lived context that persists for the app's lifetime.
It then returns the results to the HTTP client.
This all works fine when I run in local mode, with local[*] as the master.
However, as soon as I'm trying to connect to a Spark cluster, I'm running into serialization issues.
With Spark's default serializer, I get the following:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.FilterExec.otherPreds of type scala.collection.Seq in instance of org.apache.spark.sql.execution.FilterExec.
If I enable Kryo serializer, I get java.lang.IllegalStateException: unread block data.
This happens when trying to read from the Parquet files. However, I don't believe it has anything to do with the Parquet files themselves, only with the serialization of the code being sent over to the Spark cluster.
From a lot of internet searches, I've gathered that this could be caused by incompatibilities between Spark versions, or even Java versions.
But the versions being used are identical.
The app is written in Scala 2.12.8 and ships with Spark 2.4.3.
The Spark cluster is running Spark 2.4.3 (the version compiled with Scala 2.12).
And the machine on which both the Spark cluster and the app are running is using openJDK 1.8.0_212.
According to another internet search, the problem could be caused by a mismatch in the spark.master URL.
So I've set spark.master in spark-defaults.conf to the same value I'm using within the app to connect to it.
However, this hasn't solved the issue and I am now running out of ideas.
I am not entirely sure what the underlying explanation is, but I fixed it by copying my application's jar into Spark's jars directory. After that I was still encountering an error, but a different one: a missing cats/kernel/Eq class. So I added the cats-kernel jar to Spark's jars directory as well.
And now everything works. Something I read in another Stack Overflow thread may explain it:
I think that whenever you do any kind of map operation using a lambda which refers to methods/classes of your project, you need to supply them as an additional jar. Spark does serialize the lambda itself, but it does not pull in its dependencies. Not sure why the error message is not informative at all.
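As a sketch of the same idea in code: instead of (or in addition to) copying jars into Spark's jars directory, you can tell the session which jars to ship to the executors when you build it. The paths and master URL below are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical paths: replace with your assembled application jar and its
// extra dependencies (e.g. cats-kernel in my case).
val spark = SparkSession.builder()
  .appName("http-api-app")
  .master("spark://spark-master:7077")
  // spark.jars ships these jars to the executors, so lambdas that refer to
  // your application's own classes can be deserialized there.
  .config("spark.jars",
    "/path/to/app-assembly.jar,/path/to/cats-kernel_2.12.jar")
  .getOrCreate()
```

This is the programmatic equivalent of passing `--jars` to spark-submit; it avoids having to touch the Spark installation itself.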
Related
We have a third-party jar file on which our Spark application depends. This jar file is ~15 MB. Since we want to deploy our Spark application on a large-scale cluster (~500 workers), we are concerned about distributing the third-party jar file. According to the Apache Spark documentation (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management), our options include using HDFS, an HTTP server, the driver's HTTP server, and a local path for distributing the file.
We would prefer not to use a local path because it requires copying the jar file into every worker's Spark libs directory. On the other hand, if we use HDFS or an HTTP server, when the Spark workers try to fetch the jar file they may effectively mount a DoS attack against our Spark driver server. So, what is the best way to address this challenge?
If you put the third-party jar in HDFS, why would it affect the Spark driver server?
Each node should fetch the additional jar directly from HDFS, not from the Spark driver.
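To make this concrete, a minimal sketch of that approach (paths, class name, and master URL are examples): upload the jar to HDFS once, then reference the hdfs:// path at submit time so workers fetch it from HDFS rather than from the driver's file server.

```shell
# Upload the third-party jar to HDFS once (path is an example).
hdfs dfs -mkdir -p /deps
hdfs dfs -put third-party-lib.jar /deps/

# At submit time, reference the HDFS path; executors pull the jar
# from HDFS, not from the driver.
spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.Main \
  --jars hdfs:///deps/third-party-lib.jar \
  app.jar
```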
After examining the different proposed methods, we found that the best way to distribute the jar files is to copy them to all nodes (as @egor mentioned, too). Based on our deployment tools, we released a new Spark Ansible role that supports external jars. At the moment, Apache Spark does not provide an optimized solution: if you give a remote URL (HTTP, HTTPS, HDFS, FTP) as an external jar path, Spark workers fetch the jar file every time a new job is submitted, so it is not optimal from a network perspective.
I am currently using Spark to build my dimensional data model, and we are currently uploading the jar to an AWS EMR cluster to test. However, this is tedious and time-consuming for testing and building tables.
I would like to know what others are doing to speed up their development. One possibility I came across in my research is running Spark jobs directly from the IDE with IntelliJ IDEA, and I would like to know about other development processes that make it faster to iterate.
The approaches I have tried so far are:
Installing Spark and HDFS on two or three commodity PCs and testing the code before submitting it to the cluster.
Running the code on a single node to catch trivial mistakes.
Submitting the jar file to the cluster.
The first and third methods share the step of building the jar file, which can take a lot of time. The second is not suitable for finding and fixing the bugs and problems that arise in distributed running environments.
I'm trying to build a Kafka consumer in Scala using IntelliJ to read messages from a Kafka topic and save them to HDFS. I'm using Spark 1.6.2, kafka_2.10-0.10, and Scala 2.10.5 with HDP 2.5.3. I'm getting the error below:
Exception in thread "main" java.lang.NoSuchMethodError: kafka.consumer.SimpleConsumer.<init>(Ljava/lang/String;IIILjava/lang/String;Lorg/apache/kafka/common/protocol/SecurityProtocol;)V
From my research here, I've learned that it's a jar/dependency issue, but I'm still not able to resolve it.
You have to make sure that the Kafka libraries are available to the Spark runtime. There are several ways to do this:
Invoking spark-shell or spark-submit with --jars "/location/of/your/kafka-jar"
Copy the Kafka-related jars into your Spark installation's "jars" folder. (Note: if you are running on a cluster, you have to copy these jars to all the nodes, so I recommend the method above, where Spark does this for you internally.)
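For example, with the first option (jar paths, versions, and the main class are illustrative, matching the Spark 1.6.2 / Scala 2.10 setup mentioned in the question):

```shell
# Make the Kafka client libraries visible to driver and executors;
# Spark distributes everything in --jars to the nodes for you.
spark-submit \
  --class com.example.KafkaToHdfs \
  --master yarn \
  --jars /path/to/spark-streaming-kafka_2.10-1.6.2.jar,/path/to/kafka_2.10-0.10.0.0.jar \
  my-consumer.jar
```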
How many ways are there to run Spark? If I just declare the dependencies in build.sbt, is Spark supposed to be downloaded and work?
But if I want to run Spark locally (downloading the Spark tar file, winutils, ...), how can I specify in Scala code that I want to run my code against that local Spark installation and not against the dependencies downloaded by IntelliJ?
In order to connect Spark to Cassandra, do I need a local installation of Spark? I read somewhere that it's not possible to connect from a programmatically created Spark session to a local Cassandra database.
1) Spark runs in a slightly unusual way: there is your application (the Spark driver and executors), and there is the resource manager (Spark master/workers, YARN, Mesos, or local).
In your code you can run against the in-process manager by specifying the master as local or local[n]. Local mode requires no installation of Spark, as it is set up automatically inside the process you are running. This uses the dependencies you downloaded.
To run against a Spark master that is running locally, you use a spark:// URL that points at your particular local Spark master instance. Note that this will start executor JVMs separate from your application, which necessitates distributing your application code and dependencies. (Other resource managers have their own identifying URLs.)
2) You do not need a "resource manager" to connect to C* from Spark, but this ability is basically for debugging and testing. To do this, you would use the local master URL. Normal Spark usage should have an external resource manager, because without one the system cannot be distributed.
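As a minimal sketch of that debugging setup, using the DataStax spark-cassandra-connector (the host, keyspace, and table names are examples): the session is created programmatically, runs in-process with a local master, and points at a local Cassandra.

```scala
import org.apache.spark.sql.SparkSession

// In-process local master: no separate Spark installation needed.
val spark = SparkSession.builder()
  .appName("cassandra-test")
  .master("local[*]")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// Read a table through the connector's data source
// (keyspace and table names are examples).
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "users"))
  .load()
```

This assumes the spark-cassandra-connector artifact is on the classpath via your build definition.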
For some more Spark Cassandra examples see
https://github.com/datastax/SparkBuildExamples
I actually want to know the underlying mechanism of how it happens that when I execute sbt run, the Spark application starts!
What is the difference between this and running Spark in standalone mode and then deploying the application to it using spark-submit?
If someone could explain how the jar is submitted, and who creates and assigns the tasks in each case, that would be great.
Please help me out with this, or point me to something I can read to clear up my doubts!
First, read this.
Once you are familiar with the terminology, the different roles, and their responsibilities, read the paragraphs below for a summary.
There are different ways to run a Spark application (a Spark app is nothing but a bunch of class files with an entry point).
You can run the Spark application as a single Java process (usually for development purposes). This is what happens when you run sbt run.
In this mode, all the services, like the driver and workers, run inside a single JVM.
But the above way of running is only for development and testing purposes, because it won't scale: you won't be able to process a huge amount of data. This is where the other ways of running a Spark app come into the picture (standalone, Mesos, YARN, etc.).
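The difference also shows up in the build definition. A sketch of a build.sbt under these assumptions (names and versions are examples): for sbt run, Spark must be on the compile classpath so the single JVM can start it; for cluster deployment via spark-submit, the dependency is typically marked "provided" because the cluster supplies Spark at runtime.

```scala
// build.sbt (name and versions are examples)
name := "my-spark-app"
scalaVersion := "2.12.8"

// For `sbt run` (single-JVM development): plain compile dependency,
// so sbt puts Spark on the classpath and your main() starts it.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"

// For cluster deployment via spark-submit, you would instead use:
// libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided"
// and ship only your own code in the jar.
```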
Now read this.
In these modes, there are dedicated JVMs for the different roles. The driver runs as a separate JVM, and there can be tens to thousands of executor JVMs running on different machines (crazy, right?).
The interesting part is that the same application that runs inside a single JVM is distributed to run across thousands of JVMs. The distribution of the application, the life cycle of these JVMs, making them fault-tolerant, and so on are taken care of by Spark and the underlying cluster frameworks.