Apache Spark Streaming - cache dataset for joining

I'm considering using Apache Spark Streaming for some real-time work, but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be JSON records coming from Kafka that contain an id, and I want to translate that id into a name using a lookup dataset. The lookup dataset resides in MongoDB, but I want to be able to cache it inside the Spark process because the dataset changes very rarely (once every couple of hours). I don't want to hit Mongo for every input record or reload all the records in every Spark batch, but I do need to be able to update the data held in Spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.

I've thought long and hard about this myself. In particular I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well, the answer is kind of yes. You want a program that first caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update it. Apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny by taking the periodic bi-hourly dataset, turning it into a HashMap, and broadcasting it as a broadcast variable. What this means is that the HashMap will be copied only once per node (compare this with just referencing the Map - it would be copied once per task - a much greater cost). Then you take your main dataset and update it with the new records using the broadcast map. You can then periodically (nightly) save to HDFS or something.
So here is some scruffy pseudo code to elucidate:
import org.apache.spark.rdd.RDD

var mainDataSet: RDD[(KeyType, DataType)] =
  sc.textFile("/path/to/main/dataset")
    .map(parseJsonAndGetTheKey)
    .cache()

everyTwoHoursDo {
  // build the small bi-hourly update set on the driver, then broadcast it
  // so it is copied only once per node rather than once per task
  val newData: Map[KeyType, DataType] =
    sc.textFile("/path/to/last/two/hours")
      .map(parseJsonAndGetTheKey)
      .collect()
      .toMap

  val newDataBC = sc.broadcast(newData)

  val mainDataSetNew =
    mainDataSet.map { case (key, oldValue) =>
      val value = newDataBC.value.get(key)
        .map(newValue => update(oldValue, newValue))
        .getOrElse(oldValue)
      (key, value)
    }.cache()

  mainDataSetNew.count() // an action to force execution
  mainDataSet.unpersist()
  mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
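If it helps, here is a rough, hypothetical sketch of that partition-skipping idea. It reuses the names from the pseudo code above (mainDataSet, newData, newDataBC, update); numPartitions is a placeholder and the choice of HashPartitioner is an assumption:
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(numPartitions)
// which partitions actually contain keys that appear in the update?
val dirtyPartitions: Set[Int] = newData.keys.map(partitioner.getPartition).toSet

val mainDataSetNew = mainDataSet
  .partitionBy(partitioner) // a no-op once mainDataSet is already partitioned this way
  .mapPartitionsWithIndex({ (idx, iter) =>
    if (!dirtyPartitions.contains(idx)) iter // nothing to update in this partition
    else iter.map { case (key, oldValue) =>
      (key, newDataBC.value.get(key).map(update(oldValue, _)).getOrElse(oldValue))
    }
  }, preservesPartitioning = true)
  .cache()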
I personally think this is a really cool idea, and the nice thing is your dataset is already sitting in memory, ready for some analysis / machine learning. The downside is that you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting something like this out of the box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system

Here is a fairly simple workflow (a sketch follows after the list):
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df), then cache it: m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.
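For what it's worth, here is a minimal sketch of those four steps, assuming the MongoDB Spark connector is on the classpath; processBatch and its Dataset[String] parameter are a stand-in for however the micro-batch arrives, and the connector format string, the Mongo options, "join_field", the aggregation column and the output path are all placeholders:
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("lookup-join").getOrCreate()

def processBatch(batchJson: Dataset[String]): Unit = {
  // 1. batch of JSON records -> DataFrame
  val b_df = spark.read.json(batchJson)

  // 2. lookup data from MongoDB, cached for the joins in this batch
  val m_df = spark.read
    .format("mongodb") // "mongo" for older connector versions
    .option("database", "mydb")
    .option("collection", "lookup")
    .load()
    .cache()

  // 3. join on the shared key
  val joined = b_df.join(m_df, Seq("join_field"))

  // 4. aggregate, then write to a data source
  joined.groupBy("name").count()
    .write.mode("append").parquet("/path/to/output")
}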

Related

What happens to a Spark DataFrame used in Structured Streaming when its underlying data is updated at the source?

I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet; I am just exploring what the possibilities are. I am working with Spark 2.3.2.
A big (set of) question(s).
I have not implemented all aspects myself (yet), but this is my understanding, together with input from colleagues who implemented one aspect in a way I found compelling and logical. I note that there is not enough info out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard coding practices as per Databricks are applied and .cache is used, the Spark Structured Streaming program will read the static source only once; no changes are seen on subsequent processing cycles, and there is no program failure.
If standard coding practices as per Databricks are applied and caching is NOT used, the Spark Structured Streaming program will read the static source on every loop, and all changes will be seen on subsequent processing cycles.
But JOINing against a LARGE static source is not a good idea. If the dataset is large, use HBase or some other key-value store, with mapPartitions, whether the data is volatile or not. This is more difficult though. It was done at an airline company I worked at, and the data engineer/designer told me it was no easy task. Indeed, it is not that easy.
So, we can say that updates to the static source will not cause any crash.
"...Would it be possible to force the DataFrame to update itself once a day in any way..." I have not seen any approach like this in the docs or here on SO. You could hold the static source in a DataFrame var and use a counter on the driver to re-read it periodically (a rough sketch of this follows below). Since the micro-batch physical plan is evaluated and generated every time, my take is that there is no issue with broadcast-join aspects or optimization. Whether this is the most elegant approach is debatable, and it is not my preference.
If your data is small enough, an alternative is to perform the lookup via a JOIN on the primary key augmented with some max-value technical column, making it a compound primary key, with the data updated in the background as a new set of rows rather than overwritten. This is easiest in my view if you know the data is volatile and small. Versioning means others may still read older data; that is why I state this, as it may be a shared resource.
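Here is a rough, hypothetical sketch of the var-plus-counter idea mentioned above, assuming Spark 2.4+ where foreachBatch is available (on 2.3.2 you would need a StreamingQueryListener or a separate refresh thread instead); streamingDf, the join key, the paths and the refresh interval are placeholders:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// static side of the join, held in a var and re-read periodically on the driver
var staticDf: DataFrame = spark.read.parquet("/path/to/static").cache()
var batchCount = 0L

val query = streamingDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchCount += 1
    if (batchCount % 1000 == 0) { // roughly "once a day", tune to your trigger interval
      staticDf.unpersist()
      staticDf = spark.read.parquet("/path/to/static").cache()
    }
    batchDf.join(staticDf, Seq("key"))
      .write.mode("append").parquet("/path/to/out")
  }
  .start()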
The final say for me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In that case I would use a KV store as the lookup, via mapPartitions, as opposed to a JOIN. See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc, which provides some insights. Also, this is an older but still applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience to see the forest for the trees.
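To give a flavour of the mapPartitions pattern those posts describe, here is a minimal, hypothetical sketch; KVClient and its connect/get/close calls, plus the record fields, are made-up stand-ins for your HBase/Redis client and data:
val enriched = inputRdd.mapPartitions { records =>
  // one connection per partition, not per record (KVClient is a made-up stand-in)
  val client = KVClient.connect("kv-host:2181")
  // materialize the partition so the connection can be closed safely afterwards
  val lookedUp = records.map { record =>
    val name = client.get(record.id).getOrElse("unknown")
    (record.id, name)
  }.toList
  client.close()
  lookedUp.iterator
}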

How to iteratively bring a Dataframe back to the driver in Spark

Main question: How do you safely (without risking a crash due to OOM) iterate over every row (guaranteed every row) of a dataframe from the driver node in Spark? I need to control how big the data is as it comes back, operate on it, and discard it to retrieve the next batch (say 1000 rows at a time or something).
I am trying to safely and iteratively bring the data in a potentially large Dataframe back to the driver program so that I can use the data to perform HTTP calls. I have been attempting to use someDf.foreachPartition{makeApiCall(_)} and allowing the Executors to handle the calls. It works - but debugging and handling errors has proven to be pretty difficult when launching in prod envs, especially on failed calls.
I know there is someDf.collect() action, which brings ALL the data back to the driver all at once. However, this solution is not suggested, because if you have a very large DF, you risk crashing the driver.
Any suggestions?
If the data does not fit into memory, you could use something like:
df.toLocalIterator().forEachRemaining(row => makeAPICall(row))
but toLocalIterator has considerable overhead compared to collect
Or you can collect your dataframe batch-wise (which does essentially the same as toLocalIterator):
import org.apache.spark.sql.functions.{lit, spark_partition_id}

val partitions = df.rdd.partitions.map(_.index)
partitions.toStream.foreach { i =>
  df.where(spark_partition_id() === lit(i))
    .collect()
    .foreach(row => makeAPICall(row))
}
It is a bad idea to bring all that data back to the driver, because the driver is just one node and it will become the bottleneck; scalability will be lost. If you have to do this, think twice about whether you really need a big data application. Probably not.
dataframe.collect() is the simplest way to bring the data to the driver, but it brings all of it at once. The alternative is toLocalIterator, which only needs driver memory for the largest single partition, and that can still be big. So it should be used rarely, and only for small amounts of data.
If you insist, you can write the output to a file or a queue and read it back in a controlled manner (a rough sketch follows below). That would be a partially scalable solution, which I would not prefer.
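If you do go the file route, here is a rough sketch of the idea; the paths, the partition count and makeApiCallFromJson are placeholders:
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// write the data out as many small JSON part files
df.repartition(200).write.mode("overwrite").json("/tmp/api_input")

// then pull the part files back through the driver one at a time
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parts = fs.listStatus(new Path("/tmp/api_input"))
  .map(_.getPath)
  .filter(_.getName.startsWith("part-"))

parts.foreach { path =>
  val in = fs.open(path)
  Source.fromInputStream(in).getLines().foreach(line => makeApiCallFromJson(line))
  in.close()
}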

How to access broadcasted DataFrame in Spark

I have created two dataframes from Hive tables (PC_ITM and ITEM_SELL) which are big in size, and I am using them frequently in SQL queries by registering them as tables. But as they are big, it is taking much time to get the query results. So I have saved them as parquet files, then read them and registered them as temporary tables. But I am still not getting good performance, so I have broadcast those dataframes and then registered them as tables, as below.
val PC_ITM_DF = sqlContext.parquetFile("path")
val PC_ITM_BC = sc.broadcast(PC_ITM_DF)
val PC_ITM_DF1 = PC_ITM_BC.value
PC_ITM_DF1.registerTempTable("PC_ITM")

val ITM_SELL_DF = sqlContext.parquetFile("path")
val ITM_SELL_BC = sc.broadcast(ITM_SELL_DF)
val ITM_SELL_DF1 = ITM_SELL_BC.value
ITM_SELL_DF1.registerTempTable("ITM_SELL")

sqlContext.sql("JOIN Query").show
But I still can't achieve better performance; it is taking the same time as when those dataframes are not broadcast.
Can anyone tell me if this is the right approach to broadcasting and using the dataframes?
You don't really need to 'access' the broadcast dataframe - you just use it, and Spark will implement the broadcast under the hood. The broadcast function works nicely, and makes more sense than the sc.broadcast approach.
It can be hard to understand where the time is being spent if you evaluate everything at once.
You can break your code into steps. The key here will be performing an action and persisting the dataframes you want to broadcast before you use them in your join.
// needed for the broadcast hint
import org.apache.spark.sql.functions.broadcast

// load your dataframe
val PC_ITM_DF = sqlContext.parquetFile("path")
// mark this dataframe to be stored in memory once evaluated
PC_ITM_DF.persist()
// mark it to be broadcast, and register it as the temp table used in your SQL query
broadcast(PC_ITM_DF).registerTempTable("PC_ITM")
// perform an action to force the evaluation
PC_ITM_DF.count()
Doing this will ensure that the dataframe is
loaded in memory (persist)
registered as temp table for use in your SQL query
marked as broadcast, so will be shipped to all executors
When you then run sqlContext.sql("JOIN Query").show you should see a 'broadcast hash join' in the SQL tab of your Spark UI.
I would cache the RDDs in memory. The next time they are needed, Spark will read them from memory rather than regenerating them from scratch each time; see the Spark quick start docs.
val PC_ITM_DF = sqlContext.parquetFile("path")
PC_ITM_DF.cache()
PC_ITM_DF.registerTempTable("PC_ITM")

val ITM_SELL_DF = sqlContext.parquetFile("path")
ITM_SELL_DF.cache()
ITM_SELL_DF.registerTempTable("ITM_SELL")

sqlContext.sql("JOIN Query").show
rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). There are a few levels of persistence you can choose from in case your data is too big for memory-only persistence; see the list of persistence options in the Spark programming guide. If you want to manually remove the RDD from the cache you can call rdd.unpersist().
If you prefer to broadcast the data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executors).
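For illustration, here is a small sketch of both alternatives (use one or the other in place of the plain .cache() above); the column names are placeholders:
import org.apache.spark.storage.StorageLevel

// a persistence level that spills to disk if the data does not fit in memory
PC_ITM_DF.persist(StorageLevel.MEMORY_AND_DISK)

// broadcasting instead: collect a small lookup to the driver first
val lookup: Map[String, String] = PC_ITM_DF
  .select("item_id", "item_name") // placeholder column names
  .rdd
  .map(r => r.getString(0) -> r.getString(1))
  .collectAsMap()
  .toMap
val lookupBC = sc.broadcast(lookup)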
At the moment you cannot access a broadcast dataframe from a SQL query. You can use a broadcast dataframe only through the DataFrame API.
Refer: https://issues.apache.org/jira/browse/SPARK-16475
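For example, here is a minimal sketch of using the broadcast hint through the DataFrame API (the join column is a placeholder):
import org.apache.spark.sql.functions.broadcast

// hint that PC_ITM_DF is small enough to be shipped to every executor
val joined = ITM_SELL_DF.join(broadcast(PC_ITM_DF), "item_id") // placeholder join column
joined.show()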

Modelling a mutable collection in Spark

Our existing application loads approximately ten million rows from a database into a collection of objects on startup. The collection is stored in a GigaSpaces cache.
As new messages are received by the application, the cache is checked to see if an entry for that message already exists. If not, a new entity is added to the cache based on the data in the message. (At the same time, the new entity is persisted to a database).
We are investigating the feasibility and value add of re-architecting the application using Spark and Scala. The question is, what would be the correct way to model this in Spark.
My first thought is to load from the database into a Spark RDD. Looking up existing entries would obviously be simple. However, because an RDD is immutable, adding new entries to the cache would require a transformation. Given the large set of data, my presumption is that this would not perform well.
The other idea is to create the cache as a mutable Scala collection. However, how would we then integrate this with Spark, given that Spark works with RDD's?
Thanks
This is more of a design question. Spark is not great for fast lookups. It is optimized for batch jobs that need to touch almost the entire dataset, potentially multiple times.
If you want something that has fast search-like capabilities you should look into Elastic Search.
Other technologies that are often used for storing large in-memory/lookup tables are Redis and Memcached.
Since RDDs are immutable, every single cache update would require producing an entirely new RDD from your previous RDD. This is clearly inefficient (you have to manipulate the entire RDD just to update a tiny part of it). As for the other idea of having a mutable Scala collection of RDD elements - well, that won't be distributable across machines/CPUs, so what's the point?
If your goal is to have in-memory, distributable/partitionable operations on your cache, what you're looking for is an operational in-memory data grid, not Apache Spark. For example: Hazelcast, ScaleOut software, etc.
Apache Spark is notoriously bad at fine-grained transformations like the ones you would need for an in-memory distributed cache.
Sorry if I'm not directly answering the technical question, instead I'm answering your question behind your question...

Custom scalding tap (or Spark equivalent)

I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, to a custom file format.
What I would like to do is more or less the following:
start from a distributed list of records, such as a Scalding pipe or similar
group items by some computed function
make so that items belonging to the same group reside on the same server
on each group, apply a transformation - that involves sorting - and write the result on disk. In fact I need to write a bunch of MapFiles - which are essentially sorted SequenceFiles, plus an index.
I would like to implement the above with Scalding, but I am not sure how to do the last step.
While of course one cannot write sorted data in a distributed fashion, it should still be doable to split data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
I recognize it is a bad idea to sort very large data, and this is the reason even on a single server I plan to split data into chunks.
Is there any way to do something like that with Scalding? Possibly I would be OK with using Cascading directly, or really any other pipeline framework, such as Spark.
Using Scalding (and the underlying Map/Reduce) you will need to use the TotalOrderPartitioner, which does pre-sampling to create appropriate buckets/splits of the input data.
Using Spark will speed things up thanks to its faster access paths to the data on disk. However, it will still require shuffles to disk/HDFS, so it will not be orders of magnitude better.
In Spark you would use a RangePartitioner, which takes the number of partitions and an RDD:
// load the keyed records, e.g. via sc.sequenceFile or sc.hadoopRDD for a custom InputFormat
val allData = sc.sequenceFile[KeyType, ValueType](paths)
val partitionedRdd = allData.partitionBy(new RangePartitioner(numPartitions, allData))
val groupedRdd = partitionedRdd.groupByKey()
// apply further transforms..