I've set up a pipeline for incoming events from a stream in Apache Kafka.
Spark connects to Kafka, gets the stream from a topic and runs some "simple" aggregation tasks.
Since I'm trying to build a service with a low-latency refresh (below 1 second), I've built a simple Spark Streaming app in Scala.
val windowing = events.window(Seconds(30), Seconds(1))
val spark = SparkSession
  .builder()
  .appName("Main Processor")
  .getOrCreate()
import spark.implicits._
// Go into RDD of DStream
windowing.foreachRDD(rdd => {
  // Convert RDD of JSON into DataFrame
  val df = spark.read.json(rdd)
  // Process only if received DataFrame is not empty
  if (!df.head(1).isEmpty) {
    // Create a view for Spark SQL
    val rdf = df.select("user_id", "page_url")
    rdf.createOrReplaceTempView("currentView")
    val countDF = spark.sql("select count(distinct user_id) as sessions from currentView")
    countDF.show()
  }
})
It works as expected. My concerns at this point are about performance. Spark is running on a 4-CPU Ubuntu server for testing purposes.
CPU usage is about 35% all the time. If the incoming stream carries, let's say, 500 msg/s, how will the CPU usage evolve? Will it grow exponentially or linearly?
If you can share your experience with Apache Spark in that kind of situation, I'd appreciate it.
One last open question: if I set the sliding window interval to 500 ms (as I'd like), will this blow up? It seems that the Spark Streaming features are fairly new and that the batch processing architecture may be a limitation for truly real-time data processing. Is that right?
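To make the windowing semantics concrete, here is a small plain-Scala sketch (no Spark involved) of what window(Seconds(30), Seconds(1)) followed by count(distinct user_id) computes: at every slide, the number of distinct users seen in the trailing 30 seconds. Event, distinctUsersAt and the field names are hypothetical, made up for illustration only.

```scala
object SlidingWindowSketch {
  // Hypothetical event shape: arrival time in milliseconds plus the two
  // fields selected in the question (user_id, page_url).
  final case class Event(timestampMs: Long, userId: String, pageUrl: String)

  // Distinct users among events whose timestamp falls in
  // [nowMs - windowMs, nowMs).
  def distinctUsersAt(events: Seq[Event], nowMs: Long, windowMs: Long): Int =
    events.iterator
      .filter(e => e.timestampMs >= nowMs - windowMs && e.timestampMs < nowMs)
      .map(_.userId)
      .toSet
      .size
}
```

In this sketch every evaluation rescans the whole window, so the work per slide grows with the number of events the window holds. Whether the real job scales the same way is something to measure rather than assume.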
Related
I do stream processing from Event Hub using Spark and have run into the following problem. For each incoming message I need to do some (stateless) calculations. The calculation algorithm is written in Scala and is extremely efficient, but it needs some data structures constructed in advance. The object size is about 50 MB, and it could be larger in the future. In order not to send the object to the workers each time, I broadcast it and then register a UDF.
But it doesn't help: the batch duration grows significantly beyond the latency we can tolerate. I figured out that the batch duration depends solely on the object size. For testing purposes I made the object smaller while keeping the computational complexity the same, and the batch duration decreased. Also, when the object is large, the Spark UI marks GC in red (more than 10% of the work is garbage collection). This contradicts my understanding of broadcasting: when an object is broadcast, it should be downloaded into the workers' memory and stay there without additional overhead.
I managed to write a business-domain-agnostic example. Here, when n is small, the batch duration is about 0.3 seconds, but when n = 6000 (144 MB) the batch duration becomes 1.5 seconds (5x), and it reaches 4 seconds when n = 10000. The computational complexity doesn't depend on the size of the object, so it seems that using a broadcast object carries a huge overhead. Please help me find a solution.
// emulate large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray
// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)
// register UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
  // emulate very efficient algorithm that requires large data structure
  val i = (num+1)/(num+1)
  objBd.value(i)(i)
})
// do stream processing
spark.readStream
  .format("rate")
  .option("rowsPerSecond", 300)
  .load()
  .withColumn("result", myUdf($"value"))
  .writeStream
  .format("memory")
  .queryName("locations")
  .start()
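As a sanity check on the sizes quoted above, a back-of-the-envelope estimate for an n x n array of Int is n * n * 4 bytes. This is only a sketch: it ignores JVM array headers and the outer array of row references, so real heap usage is somewhat higher, and approxIntMatrixBytes and toMegabytes are made-up helper names.

```scala
object BroadcastSizeEstimate {
  // Payload only: n rows of n 4-byte Ints. JVM object headers and the
  // outer array of row references are deliberately ignored (assumption).
  def approxIntMatrixBytes(n: Int): Long = n.toLong * n.toLong * 4L

  // Decimal megabytes (1 MB = 1,000,000 bytes).
  def toMegabytes(bytes: Long): Double = bytes / 1e6
}
```

With n = 6000 this gives 144 MB, which matches the figure in the question. With n = 10000 it gives about 400 MB.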
I would like to fetch one day of data from Azure Event Hub, apply some logic and copy the result to Cosmos DB. I am able to fetch the data from the Event Hub, but the data is streaming. I need to fetch data only for a time window (let's say only for one day, or for 5 hours).
Below is the code I tried for fetching data from Azure Event Hub.
import java.time.{Duration, Instant}
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, TimestampType}
object FromEventHub {
  val spark = SparkSession
    .builder
    .appName("FromEventHubToCosmos")
    .getOrCreate()
  import spark.implicits._
  val connectionString = ConnectionStringBuilder()
    .setNamespaceName("NAMESPACE_NAME")
    .setEventHubName("EVENTHUB_NAME")
    .setSasKeyName("KEY_NAME")
    .setSasKey("KEY")
    .build
  val currTime = Instant.now
  // Read only events enqueued during the last 5 hours
  val ehConf = EventHubsConf(connectionString)
    .setStartingPosition(EventPosition.fromEnqueuedTime(currTime.minus(Duration.ofHours(5))))
    .setEndingPosition(EventPosition.fromEnqueuedTime(currTime))
  val reader = spark
    .read
    .format("eventhubs")
    .options(ehConf.toMap)
    .load()
  val newDF = reader
    .withColumn("Offset", $"offset".cast(LongType))
    .withColumn("Time (readable)", $"enqueuedTime".cast(TimestampType))
    .withColumn("Timestamp", $"enqueuedTime".cast(LongType))
    .withColumn("Body", $"body".cast(StringType))
    .select("Offset", "Time (readable)", "Timestamp", "Body")
  newDF.show()
}
I have set the starting position to 5 hours ago with setStartingPosition, but in Scala the data keeps streaming from the Event Hub. I only need the data that is in the Event Hub up to the time the code is executed.
Is there any way to limit the data read from the Event Hub by a time window, or by any other means?
How can I work with the data once it is in the DataFrame in order to apply some logic?
I also encountered this issue with a Spark job that kept streaming and never completed. If this helps, the problem was solved by running the code in an Azure Databricks notebook. Strange, but the job completes there. Databricks Community Edition, which is free, would also do.
You can remove the streaming part and do it as a batch query (setEndingPosition only works on batch queries like the one below).
val reader = spark
.read
.format("eventhubs")
.options(ehConf.toMap)
.load()
I developed a Spark Streaming job (receiver approach) that reads data from Kafka, processes it and writes it into Elasticsearch.
The same code was originally developed in Java (we are now rewriting it in Spark with Scala), and when we compare it with the Java version's performance, Spark is not doing well.
What I have observed is that the writes to ES are what take the time.
Here is my high-level code:
val kafkaStreams: util.List[DStream[String]] = new util.ArrayList[DStream[String]]
for (i <- 0 until topic_threads) {
  val data = KafkaUtils.createStream(ssc, kafkaConf, topic).map(line => line._2)
  kafkaStreams.add(data)
}
// below union improves the performance as per spark 1.6.2 documentation
val unifiedStream = ssc.union(kafkaStreams)
unifiedStream.persist(StorageLevel.MEMORY_ONLY)
if (flagY) {
  // parse each record read from Kafka
  val dataES = unifiedStream.map(record => processData(record))
  dataES.foreachRDD(rdd => {
    ElasticUtils.saveToEs(rdd, index_Name, index_Type)
  })
}
In the processData method, I am just parsing the data that we have read from Kafka.
Could anyone please share their experience or suggestions for improving the performance of the Spark Streaming (Scala) receiver approach?
Because of this low performance, batches are piling up and the batch scheduling delay keeps increasing.
I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._
case class TestEntity(id:UUID, category:String, name:String,value:Double, createDate:DateTime, tag:Long)
object SparkConnectorContext {
val conf = new SparkConf(true).setMaster("local")
.set("spark.cassandra.connection.host", "192.168.xxx.xxx")
val sc = new SparkContext(conf)
}
object TestRepo {
def insertList(list: List[TestEntity]) = {
SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
}
}
object TestApp extends App {
val start = System.currentTimeMillis()
TestRepo.insertList(Utility.generateRandomData())
val end = System.currentTimeMillis()
val timeDiff = end-start
println("Difference (in millis)= "+timeDiff)
}
When I insert using the above method (a list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library. It takes only 20-40 milliseconds.
Can anyone tell me why the Spark connector takes this much time for the insert? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
There are some serious problems with your "benchmark":
Your data set is so small that you're mostly measuring the job setup time. Saving 100 entities should take on the order of a few milliseconds on a single node, not seconds. Also, saving 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included Spark context initialization in your measurement. The JVM loads classes lazily, so the code for Spark initialization is actually called after the measurement has started. This is an extremely costly step, typically performed only once per Spark application, not even once per job.
You're performing the measurement only once per launch. This means you're not even measuring the Spark context setup and job setup time correctly, because the JVM has to load all the classes for the first time and HotSpot probably has no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which depends on the size and number of classes loaded. Spark is quite a large thing to load, and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use a larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same Spark context and discard the first few, until you reach steady-state performance.
BTW, if you enable the debug logging level, the connector logs the insert times for every partition in the executor logs.
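The measurement advice above can be sketched as a tiny timing harness in plain Scala. This is a minimal sketch, not a replacement for a proper benchmarking tool; InsertBenchmark, measure and median are hypothetical names, and the operation being timed is passed in as a function.

```scala
object InsertBenchmark {
  // Runs `op` warmupRuns + measuredRuns times and returns the elapsed
  // milliseconds of the measured runs only; warm-up runs are executed
  // but their timings are discarded.
  def measure(warmupRuns: Int, measuredRuns: Int)(op: () => Unit): Seq[Double] = {
    (1 to warmupRuns).foreach(_ => op())
    (1 to measuredRuns).map { _ =>
      val start = System.nanoTime()
      op()
      (System.nanoTime() - start) / 1e6
    }
  }

  // Median of the samples, which is less sensitive to outliers than the mean.
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one sample")
    val sorted = xs.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid) else (sorted(mid - 1) + sorted(mid)) / 2.0
  }
}
```

With the code from the question this could be called roughly as InsertBenchmark.measure(5, 20)(() => TestRepo.insertList(data)), where data is generated once outside the timed region. The warm-up count of 5 is an arbitrary choice; the Spark context initialization then happens during the warm-up runs and stays out of the reported numbers.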
I have a Spark Streaming context reading event data from Kafka at 10-second intervals. I would like to complement this event data with the existing data in a Postgres table.
I can load the Postgres table with something like:
val sqlContext = new SQLContext(sc)
val data = sqlContext.load("jdbc", Map(
"url" -> url,
"dbtable" -> query))
...
val broadcasted = sc.broadcast(data.collect())
And later I can cross it like this:
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform{ rdd => rdd.leftOuterJoin(db)}
I would like to keep my current data stream running and still reload this table every 6 hours. Since Apache Spark at the moment doesn't support multiple running contexts, how can I accomplish this? Is there any workaround? Or will I need to restart the server each time I want to reload the data? This seems such a simple use case... :/
In my humble opinion, reloading another data source during the transformations on DStreams is not recommended by design.
Compared to traditional stateful stream processing models, D-Streams are designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.
The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing the data would introduce side effects into recovery and recomputation.
One workaround is to postpone the query to an output operation, for example foreachRDD(func).
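One possible shape for that workaround, sketched in plain Scala under stated assumptions: a driver-side holder that reloads the reference data once it is older than a given interval. Refreshable, load, ttlMs and now are hypothetical names; load stands for whatever reads the Postgres table, and now is injectable only so the class can be exercised without waiting.

```scala
// Driver-side holder that reloads a value once it is older than `ttlMs`.
// `load` and `now` are hypothetical hooks supplied by the caller.
final class Refreshable[T](
    load: () => T,
    ttlMs: Long,
    now: () => Long = () => System.currentTimeMillis()) {

  private var value: Option[T] = None
  private var loadedAt: Long = 0L

  // Returns the cached value, reloading it first if it is missing or stale.
  def get: T = synchronized {
    val t = now()
    value match {
      case Some(v) if t - loadedAt < ttlMs => v
      case _ =>
        val fresh = load()
        value = Some(fresh)
        loadedAt = t
        fresh
    }
  }
}
```

The function passed to foreachRDD runs on the driver once per batch, so calling holder.get there with ttlMs set to six hours would re-read the table at most once every six hours while the streaming context keeps running. How the reloaded data is then shipped to the executors (a new broadcast, or a join against a fresh RDD) is left open in this sketch.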