spark streaming save base64 rdd to json on s3 - scala

The scala application below cannot save an rdd in json format onto S3
I have :-
a kinesis stream that has complex objects placed on the stream. This object has had JSON.stringify() applied to it before being placed on the stream as part of the Kinesis PutRecord method.
A scala spark stream job reads these items off the stream,
I cannot seem to save the rdd record that comes off the stream into json file onto an S3 bucket.
In the code i've attempted to convert the RDD[Bytes] to RDD[String] then load with spark.read.json but no luck. I've tried various other combinations and can't seem to output the onto S3 in it's raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
object ScalaStream {
def main(args: Array[String]): Unit = {
val appName = "ScalaStreamExample"
val batchInterval = Milliseconds(2000)
val outPath = "s3://xxx-xx--xxx/xxxx/"
val spark = SparkSession
.builder()
.appName(appName)
.getOrCreate()
val sparkContext = spark.sparkContext
val streamingContext = new StreamingContext(sparkContext, batchInterval)
// Populate the appropriate variables from the given args
val checkpointAppName = "xxx-xx-xx--xx"
val streamName = "cc-cc-c--c--cc"
val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
val regionName = "cc-xxxx-xxx"
val initialPosition = new Latest()
val checkpointInterval = batchInterval
val storageLevel = StorageLevel.MEMORY_AND_DISK_2
val kinesisStream = KinesisInputDStream.builder
.streamingContext(streamingContext)
.endpointUrl(endpointUrl)
.regionName(regionName)
.streamName(streamName)
.initialPosition(initialPosition)
.checkpointAppName(checkpointAppName)
.checkpointInterval(checkpointInterval)
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
kinesisStream.foreachRDD { rdd =>
if (!rdd.isEmpty()){
//**************** . <---------------
// This is where i'm trying to save the raw json object to s3 as json file
// tried various combinations here but no luck.
val dataFrame = rdd.map(record=>new String(record)) // convert bytes to string
dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
//**************** <----------------
}
}
// Start the streaming context and await termination
streamingContext.start()
streamingContext.awaitTermination()
}
}
Any ideas what i'm missing?

So it was complete red herring why it failed to work. Turns out it was a scala version conflict with what is available on EMR.
Many similar questions asked on SO that suggested this may be the issue but whilst the spark documentation lists 2.12.4 is compatible with spark 2.4.4, the EMR instance does not appear to support scala version 2.12.4. So i've updated my build.sbt and deploy script from
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
ibraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE

Related

Java Class not Found Exception while doing Spark-submit Scala using sbt

Here is my code that i wrote in scala
package normalisation
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs.{FileSystem,Path}
object Seasonality {
val amplitude_list_c1: Array[Nothing] = Array()
val amplitude_list_c2: Array[Nothing] = Array()
def main(args: Array[String]){
val conf = new SparkConf().setAppName("Normalization")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val line = "MP"
val ps = "Test"
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
val files = FileSystem.get(sc.hadoopConfiguration ).listStatus(new Path(location))
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
}
The error I received when compiled using sbt package is hereimage
Here is my build.sbt file
name := "OV"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
in Spark Versions > 2 you should generally use SparkSession.
See https://spark.apache.org/docs/2.3.1/api/scala/#org.apache.spark.sql.SparkSession
also then you should be able to do
val spark:SparkSession = ???
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
spark.read.json(location)
to read all your json files in the directory.
Also I think you'd also get another compile error at
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
for ps_data being out of scope.
If you need to use SparkContext for some reason it should indeed be in spark-core. Have you tried restarting your IDE, cleaned caches, etc?
EDIT: I just notices that build.sbt is probably not in the directory where you call sbt package from so sbt won't pick it up

File not found exception while loading a properties file on a Scala SBT project

I am trying to learn a Scala-Spark JDBC program on IntelliJ IDEA. In order to do that, I have created a Scala SBT Project and the project structure looks like:
Before writing the JDBC connection parameters in the class, I first tried loading a properties file which contain all my connection properties and trying to display if they are loading properly as below:
connection.properties content:
devUserName=username
devPassword=password
gpDriverClass=org.postgresql.Driver
gpDevUrl=jdbc:url
Code:
package com.yearpartition.obj
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
object PartitionRetrieval {
var conf = new SparkConf().setAppName("Spark-JDBC")
val properties = new Properties()
properties.load(new FileInputStream("connection.properties"))
val connectionUrl = properties.getProperty("gpDevUrl")
val devUserName=properties.getProperty("devUserName")
val devPassword=properties.getProperty("devPassword")
val gpDriverClass=properties.getProperty("gpDriverClass")
println("connectionUrl: " + connectionUrl)
Class.forName(gpDriverClass).newInstance()
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().enableHiveSupport().config(conf).master("local[2]").getOrCreate()
println("connectionUrl: " + connectionUrl)
}
}
Content of build.sbt:
name := "YearPartition"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= {
val sparkCoreVer = "2.2.0"
val sparkSqlVer = "2.2.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkCoreVer % "provided" withSources(),
"org.apache.spark" %% "spark-sql" % sparkSqlVer % "provided" withSources(),
"org.json4s" %% "json4s-jackson" % "3.2.11" % "provided",
"org.apache.httpcomponents" % "httpclient" % "4.5.3"
)
}
Since I am not writing or saving data into any file and trying to display the values of properties file, I executed the code using following:
SPARK_MAJOR_VERSION=2 spark-submit --class com.yearpartition.obj.PartitionRetrieval yearpartition_2.11-0.1.jar
But I am getting file not found exception as below:
Caused by: java.io.FileNotFoundException: connection.properties (No such file or directory)
I tried to fix it in vain. Could anyone let me know what is the mistake I am doing here and how can I correct it ?
You must write to full path of your connection.properties file (file:///full_path/connection.properties) and in this option when you submit a job in cluster if you want to read file the local disk you must save connection.properties file on the all server in the cluster to same path. But in other option, you can read the files from HDFS. Here is a little example for reading files on HDFS:
#throws[IOException]
def readFileFromHdfs(file: String): org.apache.hadoop.fs.FSDataInputStream = {
val conf = new org.apache.hadoop.conf.Configuration
conf.set("fs.default.name", "HDFS_HOST")
val fileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val path = new org.apache.hadoop.fs.Path(file)
if (!fileSystem.exists(path)) {
println("File (" + path + ") does not exists.")
null
} else {
val in = fileSystem.open(path)
in
}
}

Setup Scala and Apache Spark with SBT in Intellij

I am trying to run Spark Scala project in IntelliJ Idea on Windows 10 machine.
My build.sbt:
name := "SbtIntellSpark1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
project/build.properties:
sbt.version = 1.0.3
Main.scala:
package example
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
object Main {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
val session = SparkSession
.builder()
.appName("StackOverflowSurvey")
.master("local[1]")
.getOrCreate()
val df = session.read
val responses = df
.option("header", true)
.option("inferSchema", true)
.csv("2016-stack-overflow-survey-responses.csv")
responses.printSchema()
}
}
The code runs perfectly (the schema is properly printed) when I run the Main object as shown in the following image:
My Run Configuration is as follows:
The problem is when I run "Run the program", it shows a huge stack of error which is too large to show here. Please see this gist.
How can I solve this issue?

java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterUtils$

I was building this small demo code for Spark streaming using twitter. I have added the required dependencies as shown by http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ and I am using sbt to build jars. The project build successfully and only problem seems to be is- it is not able to find the TwitterUtils class.
The scala code is given below
build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
"org.twitter4j" % "twitter4j-core" % "4.0.4",
"org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
The main scala file is
TwitterCount.scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status
object TwitterCount {
def main(args: Array[String]): Unit = {
val consumerKey = "abc"
val consumerSecret ="abc"
val accessToken = "abc"
val accessTokenSecret = "abc"
val lang ="english"
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret",consumerSecret)
System.setProperty("twitter4j.oauth.accessToken",accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
val conf = new SparkConf().setAppName("TwitterHashTags")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = TwitterUtils.createStream(ssc,None)
val tweetsFilteredByLang = tweets.filter{tweet => tweet.getLang() == lang}
val statuses = tweetsFilteredByLang.map{ tweet => tweet.getText()}
val words = statuses.map{status => status.split("""\s+""")}
val hashTags = words.filter{ word => word.startsWith("#StarWarsDay")}
val hashcounts = hashTags.count()
hashcounts.print()
ssc.start
ssc.awaitTermination()
}
Then I am building the project using
sbt package
and I submitting the generated jars using
spark-submit --class "TwitterCount" --master local[*] target/scala-2.11/twitterexample_2.11-1.0.jar
Please help me with this.
Thanks
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
You are missing package name in your code. Your spark submit command should be like this.
--class com.spark.examples.TwitterCount
I found the solution at last.
java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags
I have to build the jars using
sbt assembly
but I'm still wondering what's the difference in jars that I make using
sbt package
anyone knows? plz share

Apache spark-streaming application output not forwarded to the master

I'm trying to run the FlumeEvent example which is the following
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import org.apache.spark.streaming.flume.FlumeUtils
object FlumeEventCount {
def main(args: Array[String]) {
val batchInterval = Milliseconds(2000)
// Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumeEventCount")
.set("spark.cleaner.ttl","3")
val ssc = new StreamingContext(sparkConf, batchInterval)
// Create a flume stream
var stream = FlumeUtils.createStream(ssc, "192.168.1.5",3564, StorageLevel.MEMORY_ONLY_SER_2)
// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received flume events." + cnt ).print()
stream.count.print()
stream.print()
ssc.start()
ssc.awaitTermination()
}
}
My sbt file is the following
import AssemblyKeys._
assemblySettings
name := "flume-test"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.0.0" exclude("org.apache.spark","spark-core") exclude("org.apache.spark", "spark-streaming_2.10")
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
I run the programm with the following command
/tmp/spark-1.0.0-bin-hadoop2/bin/spark-submit --class FlumeEventCount --master local --deploy-mode client /tmp/fooproj/target/scala-2.10/cert-log-manager-assembly-1.0.jar
On the other side, the flume application is sending everything correctly and I can see in the logs that it's received.
I haven't made any changes to spark's configuration nor setup any environment variables, I just downloaded and unpacked the program.
Can someone tell me what am I doing wrong?
//edit: When I execute spark's FlumeEventCount example, it works
//edit2: If I remove the awaiTermination and add an ssc.stop it prints everything one single time, I guess this happens because something is getting flushed
....I should have learned to rtfm more carefully by now,
quoting from this page: https://spark.apache.org/docs/latest/streaming-programming-guide.html
// Spark Streaming needs at least two working thread
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
I've been launching spark with only one thread
also the following works fine
stream.map(event=>"Event: header:"+ event.event.get(0).toString+" body:"+ new String(event.event.getBody.array) ).print