Rank a Spark stream dataset column - scala

I'm using Spark 2.3.1's Structured Streaming API. Is it possible to rank the values in a column of a streaming DataFrame? I tried the code below and then realized, from the exception message, that a streaming query cannot iterate over the entire (non-time-based) window.
.withColumn("rank", row_number().over(Window.orderBy($"transactionTime")))
throws
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets
Can anyone help me with an idea to calculate rank/percentile?

So it seems that non-time-based window operations are not supported in the Spark Structured Streaming API yet.
Look out for them in upcoming Apache Spark releases.

Yes, unfortunately there is no built-in API that does exactly what you need, but I tried a workaround using groupByKey and mapGroupsWithState, e.g.:
val stream = ...
stream
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(<function>)
Here, <function> receives an Iterator over the group's data; you may sort it and implement rank, dense_rank, etc.
However, you asked for a window without any partition key (which will lead to OOM issues for huge data volumes); in that case you can add the same constant value to all records with withColumn and use it as the key.
Note: you don't need to keep any state in GroupState; you only need this API to get at each group's rows.
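For illustration, here is a minimal sketch of that workaround. It assumes a hypothetical case class Transaction(id, transactionTime, amount) and that stream is a Dataset[Transaction]; each group's rows for the current micro-batch are sorted in memory and given a 1-based rank, and the state itself is never used:
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Transaction(id: String, transactionTime: Timestamp, amount: Double)
case class Ranked(id: String, transactionTime: Timestamp, amount: Double, rank: Int)

// stream: Dataset[Transaction] built from your streaming source
val ranked = stream
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout()) {
    (id: String, rows: Iterator[Transaction], state: GroupState[Int]) =>
      // sort this group's rows by time and assign a 1-based rank; the state is unused
      rows.toSeq
        .sortBy(_.transactionTime.getTime)
        .zipWithIndex
        .map { case (t, i) => Ranked(t.id, t.transactionTime, t.amount, i + 1) }
  }                                       // Dataset[Seq[Ranked]]
The result is one sorted, ranked Seq per key and micro-batch; flatMapGroupsWithState can return the rows individually instead. Note the ranking is only within a micro-batch, not across the whole stream.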
Hope it helps!

Related

How to use a non-time-based window with Spark Structured Streaming?

I am trying to use window on structured streaming with spark and kafka.
I use a window over non-time-based data, so I get this error:
'Non-time-based windows are not supported on streaming DataFrames/Datasets'
Here is my code:
window = Window.partitionBy("input_id").orderBy("similarity")
outputDf = inputDf \
    .crossJoin(ticketDf.withColumnRenamed("IDF", "old_IDF")) \
    .withColumn("similarity", cosine_similarity_udf(col("IDF"), col("old_IDF"))) \
    .withColumn("rank", rank().over(window)) \
    .filter(col("rank") < 10)
So I am looking for a tip or a reference to use window on non-time-based data...
Traditional SQL windowing with over() is not supported in Spark Structured Streaming (the only windowing it supports is time-based windowing). If you think about it, this is probably to avoid confusion: some might falsely assume that Spark Structured Streaming can partition the whole dataset by a column, which is impossible because a stream is unbounded input data.
You can instead use groupBy().
groupBy() is also a stateful operation, which is impossible to implement in append mode unless we include a timestamp column in the list of columns we group on. For example:
df_result = df.withWatermark("createdAt", "10 minutes") \
    .groupBy(F.col("Id"), window(F.col("createdAt"), self.acceptable_time_difference)) \
    .agg(F.max(F.col("createdAt")).alias("maxCreatedAt"))
In this example, createdAt is a timestamp-typed column. Please note that in this case we have to call withWatermark on the timestamp column beforehand, because Spark cannot store state boundlessly.
P.S.: I know groupBy does not behave exactly like windowing, but with a simple join, or a custom function with mapGroupsWithState, you may be able to implement the desired functionality.
Window functions always need time-based data, but Spark Structured Streaming itself does not.
You can run a Structured Streaming query with an "as soon as possible" trigger and group the data by a time window; the grouping is then based on time.
Reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
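For example, a minimal sketch of that suggestion, with hypothetical column names (eventTime, input_id); the default trigger already fires each micro-batch as soon as possible, and Trigger.ProcessingTime(0) states that explicitly:
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

val counts = inputDf
  .groupBy(window($"eventTime", "10 minutes"), $"input_id")   // time-based window
  .count()

counts.writeStream
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(0))   // process new data as soon as possible
  .format("console")
  .start()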
Indeed the window is only based on time...
For the application I avoid the use of Flask. I looked for a streaming system for a long time... and now I am using Kafka, and it rocks for my application! :)
And I have this resource to share with you about the unsupported operations in Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations

Do Spark DataFrames load Parquet data lazily?

I want to run SQL on my Parquet data in Spark using the following code:
val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name")
println(df.count())
My question is: does this code read only the required columns from disk?
Theoretically the answer should be yes, but I need an expert opinion, because with JDBC queries (MySQL)
the read phase (spark.read) takes more time compared to the actions (it may be related to the connection, but I am not sure). The JDBC code follows:
spark.read.format("jdbc").jdbc(jdbcUrl, query, props).createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name")
df.show()
println(df.count())
If someone can explain the framework's flow in both cases, it would be very helpful.
Spark version 2.3.0
Scala version 2.11.11
In both cases Spark will do its best to limit traffic to only the required data (the exact behavior depends on the format and version; depending on the context some optimizations might not be applied, typically with deeply nested data). In fact the spark.sql("select ...") part is not even relevant here: because of the count(), the actual query should be reduced to something equivalent to SELECT 1 FROM table for a given format.
This stays true as long as you don't use cache / persist. If you do, all these optimizations go away and Spark will load the data eagerly (see my answers to "Any performance issues forcing eager evaluation using count in spark?" and "Caching dataframes while keeping partitions"; here is also an example of how the execution plan changes when cache is used).
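A quick way to verify this yourself is to print the physical plan and look at the ReadSchema of the Parquet scan (a sketch reusing the query from the question):
val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")

val df = spark.sql("select column_1, column_4, column_10 from table_name")
df.explain()   // the FileScan's ReadSchema lists only the three selected columns

// After cache(), the plan switches to an InMemoryTableScan and the selected data
// is materialized in full on the first action.
df.cache()
df.explain()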

How to access broadcasted DataFrame in Spark

I have created two DataFrames from Hive tables (PC_ITM and ITEM_SELL), both large, and I use them frequently in SQL queries by registering them as tables. Because they are big, the queries take a long time to return results. So I saved them as Parquet files, read them back, and registered them as temporary tables. I am still not getting good performance, so I broadcast those DataFrames and then registered them as tables, as below.
val PC_ITM_DF = sqlContext.parquetFile("path")
val PC_ITM_BC = sc.broadcast(PC_ITM_DF)
val PC_ITM_DF1 = PC_ITM_BC.value
PC_ITM_DF1.registerAsTempTable("PC_ITM")
val ITM_SELL_DF = sqlContext.parquetFile("path")
val ITM_SELL_BC = sc.broadcast(ITM_SELL_DF)
val ITM_SELL_DF1 = ITM_SELL_BC.value
ITM_SELL_DF1.registerAsTempTable("ITM_SELL")
sqlContext.sql("JOIN Query").show
But still I cannot get better performance; it takes the same time as when those DataFrames are not broadcast.
Can anyone tell me if this is the right approach to broadcasting and using them?
You don't really need to 'access' the broadcast DataFrame - you just use it, and Spark will implement the broadcast under the hood. The broadcast function works nicely, and makes more sense than the sc.broadcast approach.
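For example (a sketch with a made-up fact table factDF and join column "item_id"), you apply the broadcast hint directly inside the join instead of broadcasting the DataFrame yourself:
import org.apache.spark.sql.functions.broadcast

// PC_ITM_DF is the small lookup side; factDF and "item_id" are placeholder names
val joined = factDF.join(broadcast(PC_ITM_DF), Seq("item_id"))
joined.show()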
It can be hard to understand where the time is being spent if you evaluate everything at once.
You can break your code into steps. The key here will be performing an action and persisting the dataframes you want to broadcast before you use them in your join.
// load your dataframe
val PC_ITM_DF = sqlContext.parquetFile("path")
// mark this dataframe to be stored in memory once evaluated
PC_ITM_DF.persist()
// register it as a temp table for use in your SQL query
PC_ITM_DF.registerAsTempTable("PC_ITM")
// mark this dataframe to be broadcast
broadcast(PC_ITM_DF)
// perform an action to force the evaluation
PC_ITM_DF.count()
Doing this will ensure that the dataframe is
loaded in memory (persist)
registered as temp table for use in your SQL query
marked as broadcast, so it will be shipped to all executors
When you now run sqlContext.sql("JOIN Query").show, you should see a 'broadcast hash join' in the SQL tab of your Spark UI.
I would cache the RDDs in memory. The next time they are needed, Spark will read the RDD from memory rather than regenerating it from scratch each time. Here is a link to the quick start docs.
val PC_ITM_DF = sqlContext.parquetFile("path")
PC_ITM_DF.cache()
PC_ITM_DF.registerAsTempTable("PC_ITM")
val ITM_SELL_DF = sqlContext.parquetFile("path")
ITM_SELL_DF.cache()
ITM_SELL_DF.registerAsTempTable("ITM_SELL")
sqlContext.sql("JOIN Query").show
rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). There are a few persistence levels you can choose from in case your data is too big for memory-only persistence. Here is a list of persistence options. If you want to manually remove the RDD from the cache, you can call rdd.unpersist().
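For example, if the data might not fit in memory alone (a short sketch):
import org.apache.spark.storage.StorageLevel

PC_ITM_DF.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk when memory is full
// ... run your queries ...
PC_ITM_DF.unpersist()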
If you prefer to broadcast the data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and on the executors).
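A minimal sketch of that approach, with made-up column positions for the key and value and a made-up someLargeRdd to look things up against:
// collect the small lookup table to the driver as a plain Map, then broadcast it
val lookupMap = PC_ITM_DF.collect()
  .map(row => row.getString(0) -> row.getString(1))
  .toMap
val lookupBC = sc.broadcast(lookupMap)

// use it inside transformations on the large dataset
val resolved = someLargeRdd.map(key => (key, lookupBC.value.getOrElse(key, "unknown")))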
At this moment you cannot access a broadcast DataFrame in a SQL query; you can use a broadcast DataFrame only through the DataFrame API.
Refer: https://issues.apache.org/jira/browse/SPARK-16475

Group data based on multiple columns in Spark using Scala's API

I have an RDD and want to group the data based on multiple columns. For a large dataset, Spark fails with combineByKey, groupByKey, reduceByKey and aggregateByKey; these give a heap space error. Can you suggest another way to solve this using Scala's API?
You may want to use treeReduce() for doing an incremental reduce in Spark. However, your hypothesis that Spark cannot work on a large dataset is not true, and I suspect you simply don't have enough partitions in your data, so maybe a repartition() is what you need.
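For instance, a sketch (with a hypothetical Record case class and an assumed rdd: RDD[Record]) that groups by multiple columns via a composite key and repartitions first so no single task holds too much data:
case class Record(country: String, city: String, amount: Double)

val totals = rdd                                       // RDD[Record]
  .repartition(400)                                    // increase parallelism; tune to your cluster
  .map(r => ((r.country, r.city), r.amount))           // composite key from several columns
  .reduceByKey(_ + _)                                  // combines map-side, unlike groupByKey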

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be JSON records coming from Kafka that contain an Id, and I want to translate that Id into a name using a lookup dataset. The lookup dataset resides in MongoDB, but I want to cache it inside the Spark process, since it changes very rarely (once every couple of hours). I don't want to hit Mongo for every input record or reload all the records in every Spark batch, but I do need to be able to update the data held in Spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.
I've thought long and hard about this myself. In particular, I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well, the answer is kind of yes. First you want a program that caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny by taking the periodic bi-hourly dataset, turning it into a HashMap and broadcasting it as a broadcast variable. This means the HashMap will be copied, but only once per node (compare this with just referencing the Map - it would be copied once per task, a much greater cost). Then you take your main dataset and add on the new records using the broadcast map. You can then periodically (e.g. nightly) save to HDFS or similar.
So here is some scruffy pseudo code to elucidate:
var mainDataSet: RDD[(KeyType, DataType)] = sc.textFile("/path/to/main/dataset")
  .map(parseJsonAndGetTheKey)
  .cache()

everyTwoHoursDo {  // pseudocode: however you schedule the periodic refresh
  val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
    .map(parseJsonAndGetTheKey)
    .collect()
    .toMap

  val newDataBC = sc.broadcast(newData)

  val mainDataSetNew = mainDataSet.map { case (key, oldValue) =>
    (key, newDataBC.value.get(key)
      .map(newDataValue => update(oldValue, newDataValue))
      .getOrElse(oldValue))
  }.cache()

  mainDataSetNew.someAction() // to force execution, e.g. count()
  mainDataSet.unpersist()
  mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is your dataset is already in memory, ready for some analysis / machine learning. The downside is that you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting something like this out of the box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system
Here is a fairly simple workflow (a code sketch follows the steps):
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df), then cache it: m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.
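A rough sketch of these steps in Scala; kafkaStream, readLookupFromMongo, "join_field" and the output path are placeholders, and the periodic Mongo refresh logic is up to you:
// lookup dataset from MongoDB, cached and refreshed on your own schedule (e.g. every 2 hours)
var m_df = readLookupFromMongo(sqlContext).cache()

kafkaStream.foreachRDD { rdd =>                       // rdd: RDD[String] of JSON records
  val b_df = sqlContext.read.json(rdd)                // 1. batch of JSON data as a DataFrame
  val joined = b_df.join(m_df, "join_field")          // 2./3. join against the cached lookup
  joined.groupBy("name").count()                      // 4. your required aggregation (count per name)
    .write.mode("append").parquet("/output/path")     //    then write to a data source
}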