Network Spark Streaming from multiple remote hosts - Scala

I wrote a Spark Streaming program in Scala. In the program, I pass a remote host and remote port to socketTextStream.
On the remote machine, I have a Perl script that invokes this system command:
echo 'data_str' | nc <remote_host> 9999
This way, my Spark program is able to get data, but it gets confusing because I have multiple remote machines that need to send data to the Spark machine.
I wanted to know the right way of doing this. In fact, how will I deal with data coming from multiple hosts?
For reference, my current program:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HBaseStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HBaseStream")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val inputStream = ssc.socketTextStream(<remote-host>, 9999)
    // -------------------
    // -------------------
    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()
  }
}
Thanks in advance.

You can find more information under "Level of Parallelism in Data Receiving" in the Spark Streaming Programming Guide.
Summary:
Receiving multiple data streams can therefore be achieved by creating multiple input DStreams and configuring them to receive different partitions of the data stream from the source(s). These multiple DStreams can be unioned together to create a single DStream. Then the transformations that were being applied on a single input DStream can be applied on the unified stream.
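
A minimal sketch of that approach, assuming three hypothetical remote hosts (host1, host2, host3) that each serve lines on port 9999 (e.g. via nc -lk 9999): create one socketTextStream per host and union them into a single DStream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiHostStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MultiHostStream")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical host names; each is assumed to be serving lines on port 9999
    val hosts = Seq("host1", "host2", "host3")

    // One socket receiver per remote host
    val streams = hosts.map(host => ssc.socketTextStream(host, 9999))

    // Union into a single DStream and apply the usual transformations to it
    val unified = ssc.union(streams)
    unified.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that each socket receiver occupies one core for its entire lifetime, so the application needs more cores than receivers (e.g. at least 4 cores for 3 receivers) or no processing will happen.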

Related

Transform and persist Spark DStream into several separate locations, in parallel?

I have a use case with a DStream that contains data with several levels of nesting, and I have a requirement to persist different elements from that data into separate HDFS locations. I managed to work this out by using Spark SQL, as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val context = new StreamingContext(sparkConf, Seconds(duration))
val stream = context.receiverStream(receiver)

// Persist the parent records, dropping the nested child records
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.drop("childRecords").write.parquet("ParentTable")
}

// Persist the child records, exploded into rows of their own
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.select(explode(col("childRecords")).as("children"))
    .select("children.*").write.parquet("ChildTable")
}
// repeat as necessary if the parent table has more kinds of child records,
// or if the child table itself also has child records
The code works, but the only issue I have with it is that the persistence runs sequentially: the first stream.foreachRDD has to complete before the second one starts, and so on. Ideally, I'd like the persistence job for ChildTable to start without waiting for ParentTable to finish, since they write to different locations and would not conflict. In reality, I have about 10 different jobs all waiting to complete sequentially, and would probably see a big improvement in execution time if I could run them all in parallel.
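
One possible workaround, sketched here and not taken from the original post: issue both writes as Spark actions from separate threads inside a single foreachRDD, so the scheduler can run the two jobs concurrently against a shared, cached DataFrame.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  val df = rdd.toDF.cache() // share one DataFrame across both writes

  // Kick off both writes as separate Spark jobs from separate threads
  val parentWrite = Future { df.drop("childRecords").write.parquet("ParentTable") }
  val childWrite = Future {
    df.select(explode(col("childRecords")).as("children"))
      .select("children.*").write.parquet("ChildTable")
  }

  // Wait for both jobs before the batch is considered complete
  Await.result(Future.sequence(Seq(parentWrite, childWrite)), Duration.Inf)
  df.unpersist()
}

Another knob sometimes mentioned for this is the undocumented spark.streaming.concurrentJobs setting, which lets the streaming scheduler run output operations of the same batch concurrently, though it comes with ordering caveats.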

Spark sliding window performance

I've set up a pipeline for incoming events from a stream in Apache Kafka.
Spark connects to Kafka, gets the stream from a topic, and runs some "simple" aggregation tasks.
As I'm trying to build a service with a low-latency refresh (below 1 second), I've built a simple Spark Streaming app in Scala.
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds

val windowing = events.window(Seconds(30), Seconds(1))

val spark = SparkSession
  .builder()
  .appName("Main Processor")
  .getOrCreate()
import spark.implicits._

// Go over each RDD of the windowed DStream
windowing.foreachRDD { rdd =>
  // Convert the RDD of JSON strings into a DataFrame
  val df = spark.read.json(rdd)
  // Process only if the received DataFrame is not empty
  if (!df.head(1).isEmpty) {
    // Create a view for Spark SQL
    val rdf = df.select("user_id", "page_url")
    rdf.createOrReplaceTempView("currentView")
    val countDF = spark.sql("select count(distinct user_id) as sessions from currentView")
    countDF.show()
  }
}
It works as expected. My concern at this point is performance. Spark is running on a 4-CPU Ubuntu server for testing purposes.
The CPU usage is about 35% all the time. I'm wondering: if the incoming data from the stream is, say, 500 msg/s, how will the CPU usage evolve? Will it grow exponentially or linearly?
If you can share your experience with Apache Spark in that kind of situation, I'd appreciate it.
My last open question: if I set the sliding window interval to 500 ms (as I'd like), will this blow up? It seems that Spark Streaming's features are fresh, and the batch processing architecture may be a limitation for truly real-time data processing, isn't it?
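
On the 500 ms question: Spark Streaming requires both the window length and the slide interval to be multiples of the batch interval, so a 500 ms slide implies a batch interval of 500 ms or less. A minimal sketch, with sparkConf and events assumed from the setup above:

import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

// 500 ms batch interval; the 30 s window and 500 ms slide are both
// exact multiples of it, as Spark Streaming requires
val ssc = new StreamingContext(sparkConf, Milliseconds(500))
val windowing = events.window(Seconds(30), Milliseconds(500))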

Stream files from HDFS using Apache Spark Streaming

How can I stream files already present in HDFS using Apache Spark?
I have a very specific use case: I have millions of customer records and want to process them at the customer level using Spark Streaming. Currently I take the entire customer dataset, repartition it on customerId into 100 partitions, ensuring that all records for a given customer are passed in a single stream.
All the data is present in the HDFS location
hdfs:///tmp/dataset
Using that HDFS location, I want to stream the files, reading the parquet files to get the dataset. I have tried the following, with no luck.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// start stream
val sparkConf = new SparkConf().setAppName("StreamApp")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(60))
// Note: sparkContext.textFile returns a plain RDD, not a DStream
val dstream = ssc.sparkContext.textFile("hdfs:///tmp/dataset")
println("dstream: " + dstream)
println("dstream count: " + dstream.count())
println("dstream context: " + dstream.context)
ssc.start()
ssc.awaitTermination()
NOTE: this doesn't stream data; it just reads it from HDFS once.
and
// start stream
val sparkConf = new SparkConf().setAppName("StreamApp")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(60))
// textFileStream only picks up files added to the directory after the context starts
val dstream = ssc.textFileStream("hdfs:///tmp/dataset")
println("dstream: " + dstream)
println("dstream count: " + dstream.count())
println("dstream context: " + dstream.context)
dstream.print()
ssc.start()
ssc.awaitTermination()
I always get 0 results. Is it possible to stream files that are already present in HDFS, when no new files are being published?
TL;DR This functionality is not supported in Spark as of now. The closest you can get is to move the files into hdfs:///tmp/dataset after starting the streaming context.
textFileStream internally uses FileInputDStream, which has an option newFilesOnly. But this does not process all existing files, only files modified within one minute (set by the config value spark.streaming.fileStream.minRememberDuration) before the streaming context started. As described in the JIRA issue:
When you set newFilesOnly to false, it means this FileInputDStream would not only handle incoming files, but would also include files which came in the past 1 minute (not all of the old files). The length of time is defined in FileInputDStream.MIN_REMEMBER_DURATION.
Alternatively, you could create a (normal) RDD out of the existing files before you start the streaming context, and use it along with the stream's RDDs later, as sketched below.
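
A minimal sketch of that second option, assuming text data under hdfs:///tmp/dataset (for parquet you would read through a SparkSession instead):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("StreamApp")
val ssc = new StreamingContext(sparkConf, Seconds(60))

// Plain RDD over the files that already exist in the directory
val existing = ssc.sparkContext.textFile("hdfs:///tmp/dataset")

// DStream over files that appear after the context starts
val incoming = ssc.textFileStream("hdfs:///tmp/dataset")

// Combine: each batch sees the new files plus the pre-existing data
// (process `existing` once separately if re-reading it per batch is too costly)
val combined = incoming.transform(rdd => rdd.union(existing))
combined.print()

ssc.start()
ssc.awaitTermination()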

How to implement SparkContext in Play for Scala

I have the following Play for Scala controller that wraps Spark. At the end of the method I close the context to avoid the problem of having more than one context active in the same JVM:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import play.api.mvc._
import org.apache.spark.{SparkConf, SparkContext}

class Test4 extends Controller {

  def test4 = Action.async { request =>
    val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val rawData = sc.textFile("c:\\spark\\data.csv")
    val data = rawData.map(line => line.split(',').map(_.toDouble))
    val str = "count: " + data.count()
    sc.stop() // only one context may be active per JVM
    Future { Ok(str) }
  }
}
The problem I have is that I don't know how to make this code safe for concurrent use, as two users may access the same controller method at the same time.
UPDATE
What I'm thinking is to have N Scala programs receive messages through JMS (using ActiveMQ). Each Scala program would have a Spark session and receive messages from Play. The Scala programs would process requests sequentially as they read the queues. Does this make sense? Are there any other best practices for integrating Play and Spark?
It's better to move the Spark context into a singleton object:
import org.apache.spark.{SparkConf, SparkContext}

// Named to avoid shadowing org.apache.spark.SparkContext
object SparkContextHolder {
  val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
}
Otherwise, with your current design, a new Spark context is created for every request, which is expensive and runs into the restriction that only one context can be active per JVM.
If we talk about best practices, it's really not a good idea to use Spark inside a Play project. A better way is to create a microservice that hosts the Spark application and have the Play application call this microservice; that kind of architecture is more flexible, scalable, and robust.
I don't think it's a good idea to execute Spark jobs from a REST API. If you just want to parallelize within your local JVM, it doesn't make sense to use Spark, since it is designed for distributed computing. It is also not designed to be an operational database, and it won't scale well when you execute several concurrent queries in the same cluster.
Anyway, if you still want to execute concurrent Spark queries from the same JVM, you should probably use client mode to run the queries in an external cluster. It is not possible to have more than one active context per JVM, so I would suggest sharing the session in your service and closing it only when the service shuts down.
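
A minimal sketch of that suggestion (class and method names here are hypothetical): a singleton service owning one shared SparkSession for the lifetime of the Play application, injected into controllers instead of creating a context per request.

import javax.inject.Singleton
import org.apache.spark.sql.SparkSession

// Hypothetical names throughout; the session is created once and shared
@Singleton
class SparkService {
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("AppTest")
    .master("local[2]")
    .config("spark.executor.memory", "1g")
    .getOrCreate()

  def countCsv(path: String): Long = spark.read.csv(path).count()

  // Call from the application's shutdown hook, not per request
  def stop(): Unit = spark.stop()
}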

Spark Streaming: get data from Twitter and save to Cassandra

As the title mentions, I have a problem.
Specifically, my problem is connecting Cassandra with data arriving as a stream. I have already connected Cassandra and Spark, and I'm also getting data from Twitter, but I did these separately. Now, when I get data from Twitter, I want to write it to a table in a keyspace. How can I do this?
My code is below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import com.datastax.spark.connector.streaming._

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cleaner.ttl", "5000")
  .setMaster("local[2]").setAppName("myapp")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint(checkpointDir) // must be set before the context starts

val tweetsA = TwitterUtils.createStream(ssc, None, Array("searchword"))
val tweets_mystring = tweetsA.filter(_.getText.contains("searchword2")).map(ttext => ttext.getText)
// I can't figure out what to write in the map function here
tweets_mystring.map(??????).saveToCassandra("mykeyspace", "mytable")
ssc.start()
ssc.awaitTermination(60000)
}
}
Hey, a little late on the response, but I would look into DataStax. It supports the combination of Spark Streaming and Cassandra really well, is easy-to-use software for streaming data into Cassandra, and will continue to be supported, as they currently have around $190 million in investment money. Below is a quick example of its use.
https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSave.html
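
To fill in the map from the question, a minimal sketch using the Spark Cassandra Connector; the column name and table schema are assumptions, e.g. CREATE TABLE mykeyspace.mytable (tweet text PRIMARY KEY):

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

// Map each tweet's text to a one-field tuple matching the assumed "tweet" column
tweets_mystring
  .map(text => Tuple1(text))
  .saveToCassandra("mykeyspace", "mytable", SomeColumns("tweet"))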