Spark Streaming - use previously calculated DataFrame in next iteration - Scala

I have a streaming app that takes a DStream, runs a SQL manipulation over it, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to take the previous calculation (from an earlier batch) into account in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculated result in memory somehow and use it as an input to the next calculation?

Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
Note that, by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
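For illustration, a minimal, hedged sketch of one way to combine that idea with a driver-side reference: cache the current batch's DataFrame (materialized with count() so it survives past the batch) and let the next batch join against it. Column names follow the question; for long-running jobs, checkpointing or persisting the result to an external store is more robust.
import org.apache.spark.sql.DataFrame

var prevDF: Option[DataFrame] = None

dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)

  prevDF match {
    case Some(prev) =>
      df.join(prev, "id")
        .select(df("col"))               // disambiguate in case both sides have a "col" column
        .filter(prev("value").equalTo(1))
        .write.csv("s3://..")
    case None =>
      df.select("col").filter("value = 1").write.csv("s3://..")
  }

  // Swap the current batch in as "previous" for the next iteration.
  prevDF.foreach(_.unpersist())
  val cached = df.cache()
  cached.count()                         // force materialization of the cache
  prevDF = Some(cached)
}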

If you are looking only for one or two previously calculated DataFrames, you should look into Spark Streaming window operations.
The snippet below is from the Spark documentation.
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or, even simpler: if we want to generate a word count over the last 20 seconds of data, every 10 seconds, we apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the reduceByKeyAndWindow operation.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
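For context, pairs in the snippet is the usual (word, 1) DStream from the documentation's word-count example, with lines standing in for the input DStream:
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))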
More details and examples:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Related

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and ideally pure. Otherwise the results are simply non-deterministic. While the latest versions provide some escape hatch (it is possible to hint to Spark that the function shouldn't be re-executed), that won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the udf) is simply a no-go - Spark doesn't provide global mutable state.
Overall your code:
modifies some local copy of some object,
makes decisions based on that object.
Unfortunately both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
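For illustration only, one deterministic way to express "scale at most N rows" without any mutable counter is a window rank; this is a sketch, and the ordering column (a hypothetical id) stands in for whatever column defines which rows count as "first":
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number, when}

def scaleFirstN(df: DataFrame, n: Int): DataFrame = {
  // Deterministic rank instead of a mutable counter: re-execution gives the same result.
  val rn = row_number().over(Window.orderBy(col("id")))
  df.withColumn("rn", rn)
    .withColumn("colA", when(col("rn") <= n, col("colA") * lit(1000)).otherwise(col("colA")))
    .drop("rn")
}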
Your DataFrame is a distributed dataset, and calling count() on it returns unpredictable results here because the computation runs separately on each node, each with its own copy of the closure. Read the documentation about RDDs below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd
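The classic closure pitfall from those docs, as a quick sketch (sc being the SparkContext): each task mutates its own copy of the variable, so the driver-side value is not updated.
var counter = 0
val data = sc.parallelize(1 to 100)
data.foreach(x => counter += x)
println(counter) // typically still 0 in cluster mode: each task updated its own copy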

In Spark structured streaming how do I output complete aggregations to an external source like a REST service

The task I am trying to perform is to aggregate the count of values from a dimension (field) in a DataFrame, compute some statistics like average, max, min, etc., and then output the aggregates to an external system by making an API call. I am using a watermark of, say, 30 seconds with a window size of 10 seconds. I made these sizes small to make it easier to test and debug the system.
The only method I have found for making API calls is to use a ForeachWriter. My problem is that the ForeachWriter executes at the partition level and only produces an aggregate per partition. So far I haven't found a way to get the rolled-up aggregates other than coalescing to 1, which is way too slow for my streaming application.
I have found that if I use a file-based sink, such as the Parquet writer to HDFS, the code produces real aggregations. It also performs very well. What I really need is to achieve the same result, but calling an API rather than writing to a file system.
Does anyone know how to do this?
I have tried this with Spark 2.2.2 and Spark 2.3 and get the same behavior.
Here is a simplified code fragment to illustrate what I am trying to do:
val valStream = streamingDF
  .select(
    $"event.name".alias("eventName"),
    expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
    $"asset.assetClass".alias("assetClass"))
  .where($"eventName" === "MyEvent")
  .withWatermark("eventTime", "30 seconds")
  .groupBy(window($"eventTime", "10 seconds"), $"assetClass", $"eventName")
  .agg(count($"eventName").as("eventCount"))
  .select($"window.start".as("windowStart"), $"window.end".as("windowEnd"), $"assetClass".as("metric"), $"eventCount")
  .as[DimAggregateRecord]
  .writeStream
  .option("checkpointLocation", config.checkpointPath)
  .outputMode(config.outputMode)

val session = (if (config.writeStreamType == AbacusStreamWriterFactory.S3) {
  valStream.format(config.outputFormat)
    .option("path", config.outputPath)
} else {
  valStream.foreach(--- this is my DimAggregateRecord ForEachWriter ---)
}).start()
I answered my own question. I found that repartitioning by the window start time did the trick. It shuffles the data so that all rows with the same group and windowStart time are on the same executor. The code below produces a file for each group window interval. It also performs quite well. I don't have exact numbers but it produces aggregates in less time than the window interval of 10 seconds.
val valStream = streamingDF
  .select(
    $"event.name".alias("eventName"),
    expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
    $"asset.assetClass".alias("assetClass"))
  .where($"eventName" === "MyEvent")
  .withWatermark("eventTime", "30 seconds")
  .groupBy(window($"eventTime", "10 seconds"), $"assetClass", $"eventName")
  .agg(count($"eventName").as("eventCount"))
  .select($"window.start".as("windowStart"), $"window.end".as("windowEnd"), $"assetClass".as("metric"), $"eventCount")
  .as[DimAggregateRecord]
  .repartition($"windowStart") // <-------- this line produces the desired result
  .writeStream
  .option("checkpointLocation", config.checkpointPath)
  .outputMode(config.outputMode)

val session = (if (config.writeStreamType == AbacusStreamWriterFactory.S3) {
  valStream.format(config.outputFormat)
    .option("path", config.outputPath)
} else {
  valStream.foreach(--- this is my DimAggregateRecord ForEachWriter ---)
}).start()
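For the REST-call variant, a hedged ForeachWriter sketch (not the writer referenced above): it buffers a partition's rows and POSTs them once in close(). After the repartition by windowStart, a partition holds complete window groups, so each POST carries whole aggregates. The field names come from the select aliases above; the endpoint and JSON shape are assumptions.
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.ForeachWriter

class RestSinkWriter(endpoint: String) extends ForeachWriter[DimAggregateRecord] {
  private var buffer: ArrayBuffer[DimAggregateRecord] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    buffer = ArrayBuffer.empty[DimAggregateRecord]
    true // process every partition
  }

  override def process(record: DimAggregateRecord): Unit = buffer += record

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null && buffer.nonEmpty) {
      // Hand-rolled JSON to keep the sketch dependency-free.
      val body = buffer
        .map(r => s"""{"metric":"${r.metric}","eventCount":${r.eventCount}}""")
        .mkString("[", ",", "]")
      val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      val out = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
      try out.write(body) finally out.close()
      conn.getResponseCode // trigger the request
      conn.disconnect()
    }
  }
}
It would plug into the placeholder above as valStream.foreach(new RestSinkWriter(yourEndpointUrl)).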

Reading Mongo data from Spark

I am reading data from MongoDB in a Spark job, using the com.mongodb.spark.sql connector (v2.0.0).
It works fine for most databases, but for one specific database the stage takes a very long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which works fine after some testing we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are pushed down to the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/).
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName
  ), None)

val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .schema(impressionSchema)
  .load()

println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions

val filteredDF = df.coalesce(128).filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)

println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce

filteredDF.select(
  $"iid",
  $"instance_id".as("instanceId"),
  $"_global_visitor_key".as("globalVisitorKey"),
  $"_timestamp".as("timestamp"),
  $"_timestamp".cast(DataTypes.DateType).as("date")
)
The data is not very big (Shuffle Write is 20 MB for this stage), and even if I filter down to only one document, the run time is the same (only the Shuffle Write is much smaller).
How can I solve this?
Thanks
Nir

How to use existing trained model using LinearRegressionModel to work with SparkStreaming and predict data with it [duplicate]

I have the following code:
val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString())
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)
output.saveAsTextFile("myOutput")
Then blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!
This is a conceptual question...
Imagine you have a big cluster composed of many workers, say n workers, each storing a partition of an RDD or DataFrame. Now imagine you start a map task across that data, and inside that map you have a print statement. First of all:
Where will that data be printed out?
Which node has priority, and which partition?
If all nodes are running in parallel, whose output will be printed first?
How would this print queue be created?
Those are too many questions, so the designers/maintainers of Apache Spark logically decided to drop any support for print statements inside any map-reduce operation (this includes accumulators and even broadcast variables).
This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD, because they are built to hold millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?
To prove this, you can run the following Scala code, for example:
// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// It doesn't print anything! because of a logic design limitation!
rdd.map(printStuff)

// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
I was able to work around it by making a utility function:
object PrintUtiltity {
  def print(data: String) = {
    println(data)
  }
}
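For completeness, a hypothetical usage sketch inside a map (rdd standing in for any RDD); note that the output still goes to the stdout of whichever executor runs the task, not to the driver console.
rdd.map { x =>
  PrintUtiltity.print(x.toString)
  x
}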

Memory-efficient way to union a sequence of RDDs from files in Apache Spark

I'm currently trying to train a set of Word2Vec vectors on the UMBC Webbase Corpus (around 30 GB of text in 400 files).
I often run into out-of-memory situations, even on machines with 100+ GB of RAM. I run Spark embedded in the application itself. I tried to tweak things a bit, but I am not able to perform this operation on more than 10 GB of textual data. The clear bottleneck of my implementation is the union of the previously computed RDDs; that is where the out-of-memory exception comes from.
Maybe one of you has the experience to come up with a more memory-efficient implementation than this:
object SparkJobs {
  val conf = new SparkConf()
    .setAppName("TestApp")
    .setMaster("local[*]")
    .set("spark.executor.memory", "100g")
    .set("spark.rdd.compress", "true")
  val sc = new SparkContext(conf)

  def trainBasedOnWebBaseFiles(path: String): Unit = {
    val folder: File = new File(path)
    val files: ParSeq[File] = folder.listFiles(new TxtFileFilter).toIndexedSeq.par
    var i = 0
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit")
    props.setProperty("nthreads", "2")
    val pipeline = new StanfordCoreNLP(props)

    // preprocess files in parallel
    val training_data_raw: ParSeq[RDD[Seq[String]]] = files.map(file => {
      // preprocess each line of the file
      println(file.getName() + "-" + file.getTotalSpace())
      val rdd_lines: Iterator[Option[Seq[String]]] = for (line <- Source.fromFile(file, "utf-8").getLines) yield {
        // performs some preprocessing like tokenization, stop word filtering etc.
        processWebBaseLine(pipeline, line)
      }
      val filtered_rdd_lines = rdd_lines.filter(line => line.isDefined).map(line => line.get).toList
      println(s"File $i done")
      i = i + 1
      sc.parallelize(filtered_rdd_lines).persist(StorageLevel.MEMORY_ONLY_SER)
    })

    val rdd_file = sc.union(training_data_raw.seq)
    val starttime = System.currentTimeMillis()
    println("Start Training")
    val word2vec = new Word2Vec()
    word2vec.setVectorSize(100)
    val model: Word2VecModel = word2vec.fit(rdd_file)
    println("Training time: " + (System.currentTimeMillis() - starttime))
    ModelUtil.storeWord2VecModel(model, Config.WORD2VEC_MODEL_PATH)
  }
}
Like Sarvesh points out in the comments, it is probably too much data for a single machine. Use more machines. We typically see the need for 20–30 GB of memory to work with a file of 1 GB. By this (extremely rough) estimate you'd need 600–800 GB of memory for the 30 GB input. (You can get a more accurate estimate by loading a part of the data.)
As a more general comment, I'd suggest you avoid using rdd.union and sc.parallelize. Instead, use sc.textFile with a wildcard to load all the files into a single RDD, as in the sketch below.
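A minimal sketch of that suggestion, assuming the per-line preprocessing stays as in the question (processWebBaseLine) and can run on the executors; the wildcard path and the per-partition pipeline construction are the only changes:
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.Word2Vec

// One RDD over all files; Spark handles the splitting, so no per-file parallelize/union.
val lines = sc.textFile(path + "/*.txt")

// StanfordCoreNLP is not serializable, so build it once per partition on the executors.
val training: RDD[Seq[String]] = lines.mapPartitions { it =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit")
  val pipeline = new StanfordCoreNLP(props)
  it.flatMap(line => processWebBaseLine(pipeline, line)) // None results are dropped
}

val model = new Word2Vec().setVectorSize(100).fit(training)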
Have you tried getting word2vec vectors from a smaller corpus? I mention this because I was running the word2vec Spark implementation on a much smaller corpus and still had problems, because of this issue: http://mail-archives.apache.org/mod_mbox/spark-issues/201412.mbox/%3CJIRA.12761684.1418621192000.36769.1418759475999#Atlassian.JIRA%3E
So for my use case that issue made the word2vec Spark implementation a bit useless. I therefore used Spark for massaging my corpus, but not for actually getting the vectors.
As others suggested, stay away from calling rdd.union.
Also, I think .toList will gather every line of the file and collect it in your driver machine (the one used to submit the task), which is probably why you are getting out-of-memory errors. You should avoid turning the whole dataset into a list!