How to save Spark Streaming data in Cassandra - Scala

build.sbt
Below are the contents of the build.sbt file:
val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
Command to initialize the shell:
The command below is the shell initialization procedure I followed:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
Note:
Here I passed the jar explicitly because SBT could not fetch the spark-streaming-kafka libraries required to create the kafkaStream in the later sections.
Import required libraries:
This section lists the libraries imported and used at various points of the REPL session:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Setting up Spark Streaming Configuration:
Here I am setting up the configuration required for the Spark Streaming context:
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
conf.set("spark.driver.allowMultipleContexts", "true"); // Required to set this to true because during // shell initialization or starting we a spark context is created with configurations of highlighted
conf.setMaster("local"); // then we are assigning those cofigurations locally
Creation of SparkStreamingContext using above configurations:
Using the configuration defined above, we create a Spark Streaming context as follows:
val ssc = new StreamingContext(conf, Seconds(1)) // Seconds(1) is the batch interval
Creating a Kafka stream using the above Spark Streaming Context, aka SSC:
Here ssc is the Spark Streaming context created above,
"localhost:2181" is the ZooKeeper quorum,
"spark-streaming-consumer-group" is the consumer group, and
Map("test3" -> 5) is Map("topic" -> number of partitions).
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("test3" -> 5)).map(_._2)
Note
The values fetched when the kafkaStream object is printed using kafkaStream.print() are shown below:
85052,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85053,19,167.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85054,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85055,19,167.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85056,19,960.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85057,19,167.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85058,19,960.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
17/09/02 18:25:25 INFO JobScheduler: Finished job streaming job 1504376716000 ms.0 from job set of time 1504376716000 ms
17/09/02 18:25:25 INFO JobScheduler: Total delay: 9.661 s for time 1504376716000 ms (execution: 0.021 s)
17/09/02 18:25:25 INFO JobScheduler: Starting job streaming job 1504376717000 ms.0 from job set of time 1504376717000 ms
Transforming the kafkaStream and saving in Cassandra:
kafkaStream.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    rdd.map(line => {
      val arr = line.split(",")
      (arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9), arr(10), arr(11))
    }).saveToCassandra("test", "sensorfeedVals", SomeColumns(
      "tableid", "ccid", "paramval", "batVal", "time", "gwid", "gwhName", "snid", "snhName", "snStatus", "sd", "MId"))
  } else {
    println("No records to save")
  }
})
Start ssc:
Streaming is started with ssc.start().
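As a minimal sketch of this step (in the REPL the second line is optional, as explained in the answer below):
ssc.start()              // begin consuming from Kafka and scheduling micro-batches
// ssc.awaitTermination() // only needed in a standalone application; the REPL stays alive on its own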
Issues I am facing here are:
1. Printing of the stream's contents happens only after I enter exit or press Ctrl+C.
2. When I call ssc.start, does streaming start immediately in the REPL, without giving me time to enter ssc.awaitTermination?
3. The main issue: when I save directly using the procedure below,
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
I am able to save to Cassandra, but whenever I try to save using the logic shown in "Transforming the kafkaStream and saving in Cassandra" I cannot extract each value from the string and save it into the respective columns of the Cassandra table!

java.lang.NoClassDefFoundError: Could not initialize class com.datastax.spark.connector.cql.CassandraConnector
This means the classpath has not been set up correctly for your application. Make sure you are using the --packages option when launching your application, as noted in the SCC docs.
For your other issues:
You don't need awaitTermination in the REPL because the REPL will not quit immediately after starting the streaming context. That call is there for an application which may have no further instructions, to prevent the main thread from exiting.
start will begin the streaming immediately.
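For contrast, here is a minimal sketch (the object and app names are illustrative, not from the question) of what a standalone driver would look like, where awaitTermination is what keeps the main thread alive:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative standalone application; in the REPL the call to awaitTermination is unnecessary.
object KafkaToCassandraApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToCassandraApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... define the Kafka stream and the foreachRDD / saveToCassandra logic here ...
    ssc.start()
    ssc.awaitTermination() // blocks so the driver does not exit while streaming runs
  }
}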

A line or two of code related to contexts was causing the issue here.
I found the solution when I walked through how the contexts work:
I was running multiple contexts, but they were independent of each other.
I had initialized the shell with the command below:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
So when the shell starts, a SparkContext is initialized with the properties of the DataStax connector.
Later I created a separate configuration and used it to create a Spark Streaming context, and from that context I created the kafkaStream. This kafkaStream carried only the properties of that streaming context, not of the original SparkContext, and that is what caused the failure when storing into Cassandra.
I resolved it as shown below, and it worked:
val sc = new SparkContext(new SparkConf().setAppName("Spark-Kafka-Streaming").setMaster("local[*]").set("spark.cassandra.connection.host", "127.0.0.1"))
val ssc = new StreamingContext(sc, Seconds(10))
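For completeness, a sketch that combines the pieces shown above into one flow (same topic, keyspace, table and column names as earlier; in the shell you may need to stop the default context first with sc.stop()):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._

// One SparkContext carries the Cassandra connection host; the StreamingContext reuses it.
val sc = new SparkContext(new SparkConf()
  .setAppName("Spark-Kafka-Streaming")
  .setMaster("local[*]")
  .set("spark.cassandra.connection.host", "127.0.0.1"))
val ssc = new StreamingContext(sc, Seconds(10))

val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181",
  "spark-streaming-consumer-group", Map("test3" -> 5)).map(_._2)

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.map(_.split(","))
      .map(arr => (arr(0), arr(1), arr(2), arr(3), arr(4), arr(5),
                   arr(6), arr(7), arr(8), arr(9), arr(10), arr(11)))
      .saveToCassandra("test", "sensorfeedVals", SomeColumns(
        "tableid", "ccid", "paramval", "batVal", "time", "gwid",
        "gwhName", "snid", "snhName", "snStatus", "sd", "MId"))
  }
}

ssc.start()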
Thanks to everyone who came forward to help!
Let me know if there are better ways to achieve this!

A very simple approach is to use the foreachRDD API to convert each RDD of the stream to a DataFrame and save it to Cassandra using the Spark SQL Cassandra data source API. Below is a simple code snippet where I save Twitter tweets to a Cassandra table.
stream.foreachRDD(rdd => {
  if (rdd.count() > 0) {
    val data = rdd.filter(status => status.getLang.equals("en")).map(status => TweetsClass(status.getId,
      status.getCreatedAt.toGMTString(),
      status.getUser.getLocation,
      status.getText)).toDF()
    //Save the data to Cassandra
    data.write.
      format("org.apache.spark.sql.cassandra").
      options(Map(
        "table" -> "sentiment_tweets",
        "keyspace" -> "My Keyspace",
        "cluster" -> "My Cluster")).mode(SaveMode.Append).save()
  }
})
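For reference, a sketch of the prerequisites the snippet above assumes but does not show; the field types of TweetsClass are guesses, and sc is the existing SparkContext (Spark 1.x style):
import org.apache.spark.sql.{SQLContext, SaveMode}

// Assumed shape of the tweet record; adjust field types to your needs.
case class TweetsClass(id: Long, createdAt: String, location: String, text: String)

// toDF() on an RDD requires the SQLContext implicits to be in scope.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._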

Related

Spark read job from gcs object stuck

I'm trying to read an object with a Spark job locally; I previously created it with another Spark job, also run locally.
Looking at the logs I see nothing unusual, and in the Spark UI the job is just stuck.
Before I kick off the read job I update the Spark config as follows:
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hc.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hc.set("fs.gs.project.id", credential.projectId)
hc.set("fs.gs.auth.service.account.enable", "true")
hc.set("fs.gs.auth.service.account.email", credential.email)
hc.set("fs.gs.auth.service.account.private.key.id", credential.keyId)
hc.set("fs.gs.auth.service.account.private.key", credential.key)
Then I simply read like this
val path = "gs://mybucket/data.csv"
val options = Map("credentials" -> credential.base64ServiceAccount, "parentProject" -> credential.projectId)
spark.read.format("csv")
.options(options)
.load(path)
My service account has the permissions below; I literally added every permission I could find for object storage:
Storage Admin
Storage Object Admin
Storage Object Creator
Storage Object Viewer
This is how I previously wrote the object
val path = "gs://mybucket/data.csv"
val options = Map("credentials" -> credential.base64ServiceAccount, "parentProject" -> credential.projectId, "header" -> "true")
var writer = df.write.format("csv").options(options)
writer.save(path)
Those are my dependencies
Seq(
"org.apache.spark" %% "spark-core" % "3.1.1",
"org.apache.hadoop" % "hadoop-client" % "3.3.1",
"com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.23.0",
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.4",
"com.google.cloud" % "google-cloud-storage" % "2.2.1"
)
Any idea why the write would succeed but the read get stuck like this?
I was using dependency versions that were not the latest. Once I updated the Google connector dependencies to the latest versions (December 2021), the read from Google Storage worked as well as the write.
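As a rough sketch, the fix amounts to bumping the Google connector artifacts in build.sbt; the exact versions are not given in the answer, so check Maven Central for the current releases:
// Versions below are the ones from the question; replace the Google artifacts
// with the latest releases from Maven Central (December 2021 or later).
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.hadoop" % "hadoop-client" % "3.3.1",
  "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.23.0", // bump
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.4",         // bump
  "com.google.cloud" % "google-cloud-storage" % "2.2.1"                      // bump
)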

Hdinsight Spark Session issue with Parquet

Using HDInsight to run Spark and a Scala script.
I'm using the example scripts provided by the Azure plugin in intellij.
It provides me with the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
Fair enough. And I can do things like:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
and I can save files:
rdd1.saveAsTextFile("wasb:///HVACout2")
However, I am looking to load a Parquet file. The code I have found (elsewhere) for reading Parquet is:
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")
The line above gives an error in HDInsight (when I submit the jar via IntelliJ).
Why don't you use the following?
val spark = SparkSession.builder
.master("local[*]") // adjust accordingly
.config("spark.sql.warehouse.dir", "E:/Exp/") //change accordingly
.appName("MySparkSession") //change accordingly
.getOrCreate()
When I put in a SparkSession and get rid of the SparkContext, HDInsight breaks.
What am I doing wrong?
With HDInsight, how do I go about creating either a Spark session or a Spark context that allows me to read in text files, Parquet and all the rest? How do I get the best of both worlds?
My understanding is that SparkSession is the better and more recent way, and what we should be using. So how do I get it running in HDInsight?
Thanks in advance
It turns out that if I add
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()
after the SparkContext line and before the Parquet read, it works.
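Putting that fix together, a sketch of the combined setup (the wasb:// paths are only examples):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)

// Created after the SparkContext; it reuses the existing context under the hood.
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()

val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
val df = spark.read.parquet("wasb:///path/to/MyFile.parquet") // adjust the path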

Spark on HDInsights - No FileSystem for scheme: adl

I am writing an application that processes files from ADLS. When I read the files from the cluster by running the code within spark-shell, it has no problem accessing the files. However, when I attempt to sbt run the project on the cluster, it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._
val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is Spark 2.2 on HDI 3.6.
In your build.sbt add:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided
I use Spark 2.3.1 instead of 2.2. That version works well with hadoop-azure-datalake 2.8.0.
Then, configure your spark context:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hadoopConf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", clientId)
hadoopConf.set("dfs.adls.oauth2.credential", clientSecret)
hadoopConf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
TL;DR:
If you are using RDDs through the Spark context, you can tell the Hadoop configuration where to find the implementation of org.apache.hadoop.fs.adl.AdlFileSystem.
The key comes in the format fs.<fs-prefix>.impl, and the value is the full name of a class that implements org.apache.hadoop.fs.FileSystem.
In your case, you need fs.adl.impl, which is implemented by org.apache.hadoop.fs.adl.AdlFileSystem.
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
I usually work with Spark SQL, so I need to configure spark session too:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", clientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
Well, I found that if I package the jar and spark-submit it, it works fine, so that will do for the time being. I'm still surprised it would not work in local[*] mode though.

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, due to what looks like data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)

    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }

    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are testing is a driver failure scenario. For this to work, depending on the cluster you are running on, you have to follow the instructions below to monitor the driver process and relaunch the driver if it fails.
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - YARN supports a similar mechanism for automatically restarting an application. Please refer to the YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
You need to configure write ahead logs as below; there are special instructions for S3 which you need to follow.
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
See Spark Streaming Configuration for more details.
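A minimal sketch of how those two settings could be applied on the SparkConf (only relevant when the write ahead log lives on S3 or another file system that does not support flushing):
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("mything")
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
// Needed only when the WAL is on S3 or a similar file system without flush support:
conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")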
The issue looks more like a Kryo serializer issue than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured, so with this configuration a KryoException should not happen.
When using write ahead logs and restoring from a checkpoint directory, all Spark configuration is read from there.
In your example, the createContext method is not called when starting from the checkpoint.
I assume another application was tested before with the same checkpoint directory, with the Kryo serializer configured,
and the current application fails to restore from that checkpoint.
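For context, a purely illustrative sketch of how an earlier application could have enabled Kryo (this is not shown in the original project), which would then be recovered from the checkpoint and clash with the current application:
import org.apache.spark.SparkConf

// Hypothetical earlier application that wrote to the same /tmp/chkp checkpoint directory.
val conf = new SparkConf()
  .setAppName("someOtherStreamingApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// On restore, this serializer setting comes back from the checkpoint; clearing the
// checkpoint directory (or keeping the configuration consistent) avoids the clash.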

Running Apache Spark Example Application in IntelliJ Idea

I am trying to run the SparkPi.scala example program in NetBeans. Unfortunately I am quite new to Spark and have not been able to execute it successfully.
My preference is to work in NetBeans only and execute from there. I know Spark also allows executing from the Spark console; however, I prefer not to take that approach.
This is my build.sbt file contents:
name := "SBTScalaSparkPi"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
This is my plugins.sbt file contents:
logLevel := Level.Warn
This is the program I am trying to execute:
import scala.math.random

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
JDK version: 1.8.
The error I get when trying to execute the code is given below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/25 07:50:25 INFO SparkContext: Running Spark version 1.6.1
16/03/25 07:50:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/25 07:50:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:401)
at SparkPi.main(SparkPi.scala)
16/03/25 07:50:26 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>
at SparkPi$.main(SparkPi.scala:28)
at SparkPi.main(SparkPi.scala)
Process finished with exit code 1
Thanks in advance for any help.
A master URL must be set in your configuration
You must set spark.master in your SparkConf. There are only two mandatory parameters: the master and the app name, which you have already set. For more details, see the Initializing Spark section in the docs.
Which master should you use? See the Master URLs section for all options. The simplest option for testing is local, which runs an entire Spark system (driver, master, worker) on your local machine with no extra configuration.
To set the master through the Scala API:
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
The start of your program just lacks the URL that points to the Spark master endpoint. You can specify this as a command line parameter in IntelliJ. The master URL is the URL and port where the Spark master of your cluster is running. An example command line parameter looks like this:
-Dspark.master=spark://myhost:7077
See the answer to this question for details:
How to set Master address for Spark examples from command line
Perhaps for your first runs you want to just start a local Spark standalone environment. How to get that running is well documented here: http://spark.apache.org/docs/latest/spark-standalone.html
If you get this running, you can set up your Spark master config like this:
-Dspark.master=spark://localhost:7077
The master URL needs to be set. Using the setMaster("local") method solved the issue.
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
As a matter of fact, both @Matthias and @Tzach are right. You should choose your solution based on what is easier for you (maybe prefer the first option for now). As soon as you start running your Spark job on a real cluster, it is far better not to hardcode the "master" parameter, so that you can run your Spark job in multiple cluster modes (YARN, Mesos, Standalone with spark-submit) and still keep running it locally from NetBeans (-Dspark.master=local[*]).
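A sketch of that pattern: leave the master out of the code and supply it at launch time (the spark-submit line is only an example invocation):
import org.apache.spark.{SparkConf, SparkContext}

// No .setMaster(...) here, so the master can be chosen at launch time.
val conf = new SparkConf().setAppName("Spark Pi")
val sc = new SparkContext(conf)

// From the IDE:       add the VM option -Dspark.master=local[*]
// On a cluster, e.g.: spark-submit --master yarn --class SparkPi myapp.jar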