How to get the top-1 row in Spark Structured Streaming? - scala

I have an issue with Spark Streaming (Spark 2.2.1). I am developing a real-time pipeline where I first get data from Kafka, then join the result with another table, then send the DataFrame to an ALS model (Spark ML), which returns a streaming DataFrame with one additional column, prediction. The problem is that when I tried to get the row with the highest score, I couldn't find a way to do it.
I tried:
Applying SQL functions like limit, take, and sort
The dense_rank() function
Searching on Stack Overflow
I read Unsupported Operations, but there doesn't seem to be much there.
Additionally, I would send the row with the highest score to a Kafka queue.
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", mySchema).as("data"))
//.select("data.*")
.selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")
result.printSchema()
val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)
// fullPredictions is a streaming DataFrame with an extra column "prediction"; here I need the code to get the row with the highest prediction
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
query.awaitTermination()
Update
Maybe I was not clear, so I'm attaching an image of my problem. I also wrote a simpler piece of code to complement it, detailing what I need: https://gist.github.com/.../9193c8a983c9007e8a1b6ec280d8df25
I would appreciate any help :)

TL;DR Use stream-stream inner joins (Spark 2.3.0) or use the memory sink (or a Hive table) as temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside (it simply gives you a streaming Dataset with predictions), the real problem here is finding the maximum value in a column of a streaming Dataset.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (which was released a few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
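For the memory-sink alternative mentioned in the TL;DR, here is a minimal sketch (it reuses the fullPredictions streaming DataFrame and the prediction column from the question; the query name "predictions" is just an arbitrary choice):
// park the streaming predictions in an in-memory table named "predictions"
val memQuery = fullPredictions.writeStream
  .format("memory")
  .queryName("predictions")
  .outputMode("append")
  .start()
// the in-memory table can then be queried with ordinary batch operations,
// e.g. picking the row with the highest prediction seen so far
val top = spark.table("predictions")
  .orderBy($"prediction".desc)
  .limit(1)
top.show(false)
Note that the memory sink collects the whole output in the driver's memory, so it is only suitable for low data volumes or debugging.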

Try this:
Implement a function that extracts the maximum value of the column and then filters your DataFrame by that maximum.
def getDataFrameMaxRow(df: DataFrame, col: String): DataFrame = {
  val gson = new Gson()
  // get the maximum value of the column (collects the column to the driver)
  val list_prediction = df.select(col).toJSON.rdd
    .collect()
    .toList
    .map { x => gson.fromJson(x, classOf[JsonObject]) }
    .map { x => x.get(col).getAsString.toInt }
  val max = getMaxFromList(list_prediction)
  // filter the dataframe by the maximum value
  df.filter(df(col) === max.toString)
}
def getMaxFromList(xs: List[Int]): Int = xs match {
  case List(x) => x
  case x :: y :: rest => getMaxFromList((if (x > y) x else y) :: rest)
}
And in the body of your code add:
import com.google.gson.JsonObject
import com.google.gson.Gson
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!

Related

Joining large RDDs in Scala Spark

I want to join a large (1TB) data RDD with a medium (10GB) data RDD. There was an earlier processing job on the large data that was completing in 8 hours. I then joined the medium-sized data to get some info that needs to be added to the output (it's a simple join, which takes the value of the second column and adds it to the final output along with the processed large-data output). But this job now runs for more than 1 day. How do I optimize it? I tried to refer to some solutions like Refer, but the solutions are for Spark DataFrames. How do I optimize it for RDDs?
Large dataset
1,large_blob_of_info
2,large_blob_of_info
3,large_blob_of_info
4,large_blob_of_info
5,large_blob_of_info
6,large_blob_of_info
Medium size data
3,23
2,45
1,67
4,89
Code that I have:
rdd1.join(rdd2).map(a => a.x)
val result = input
  .map(x => {
    val row = x.split(",")
    (row(0), row(2))
  }).filter(x => x._1 != null && !x._1.isEmpty)
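Purely as a hedged illustration of how the DataFrame-style solutions (broadcast joins) translate to RDDs: if the medium side's key/value pairs actually fit in memory, a map-side join over a broadcast map avoids shuffling the large RDD at all. The sample data below mirrors the question's; rddLarge and rddMedium are placeholder names.
// placeholder RDDs shaped like the question's data
val rddLarge = sc.parallelize(Seq(
  ("1", "large_blob_of_info"), ("2", "large_blob_of_info"), ("5", "large_blob_of_info")))
val rddMedium = sc.parallelize(Seq(("3", "23"), ("2", "45"), ("1", "67"), ("4", "89")))
// collect the medium side to the driver and broadcast it
// (only viable if those pairs really fit in driver/executor memory)
val lookup = sc.broadcast(rddMedium.collectAsMap())
// map-side join: the large RDD is never shuffled
val joined = rddLarge.mapPartitions { iter =>
  val m = lookup.value
  iter.flatMap { case (key, blob) => m.get(key).map(extra => (key, blob, extra)) }
}
joined.collect().foreach(println)
If the medium side is too big to broadcast, partitioning both RDDs with the same HashPartitioner before the join at least keeps the shuffle of the large side to a single pass.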

Why clustering doesn't seem to work in Spark's cogroup function

I have two hive clustered tables t1 and t2
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar, with the same number of buckets
The code looks like the following:
val t1 = spark.table("t1").as[T1].rdd.map(v => (v.t1_req_id, v))
val t2 = spark.table("t2").as[T2].rdd.map(v => (v.t2_req_id, v))
val outRdd = t1.cogroup(t2)
  .flatMap { coGroupRes =>
    val key = coGroupRes._1
    val value: (Iterable[T1], Iterable[T2]) = coGroupRes._2
    val t3List = // create a list with some logic on Iterable[T1] and Iterable[T2]
    t3List
  }
outRdd.write....
I made sure that both the t1 and t2 tables have the same number of partitions, and on spark-submit the
spark.sql.sources.bucketing.enabled=true and spark.sessionState.conf.bucketingEnabled=true flags are set.
But the Spark DAG doesn't show any impact of the clustering; it seems there is still a full data shuffle.
What am I missing? Any other configurations or tunings? How can I make sure that there is no full data shuffle?
My Spark version is 2.3.1.
And it shouldn't show any impact.
Logical optimizations are limited to the DataFrame API. Once you push data to the black-box functional Dataset API (see Spark 2.0 Dataset vs DataFrame) and later to the RDD API, no more information is pushed back to the optimizer.
You could partially utilize bucketing by performing the join first, getting something along these lines:
spark.table("t1")
.join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
.groupBy($"t1.v.t1_req_id", $"t2.t2_req_id")
.agg(...) // For example collect_set($"t1.v"), collect_set($"t2.v")
However, unlike cogroup, this will generate full Cartesian products within groups, and might not be applicable in your case.

Spark-Scala: Incremental Data load in Spark Scala along with generation of Unique Id

I am using zipWithIndex to generate sequence_number and add it as a separate column.
I am using code similar to the following:
val file = sparkSession.createDataFrame(lexusmasterrdd, structSchema)
val filerdd = file.rdd.zipWithIndex().map(indexedRow =>
  Row.fromSeq((indexedRow._2 + 1L) +: indexedRow._1.toSeq))
val newSchema = StructType(Array(StructField("Sequence_number", LongType, true)) ++ file.schema.fields)
val finalDF = sparkSession.createDataFrame(filerdd, newSchema)
I am now trying to come up with a logic for an incremental load of the same data.
A simple load where new data is appended to the existing data and sequence numbers are generated starting from the last generated number.
One way to achieve this is by getting the max(Sequence_number) and then using it together with a row_number() function over the new data.
But is there any other way in which I can make use of zipWithIndex for the incremental load?
Some code would be helpful.
I am using Spark 2.3 with Scala
One way to achieve this is by getting the max(Sequence_number) and then using it together with a row_number() function over the new data.
This would work, but it does not scale, because row_number() would need to shuffle all records into a single partition. I would rather use monotonically_increasing_id():
//get max from "old" data
val prevMaxId = oldDf.select(max($"Sequence_number")).as[Long].head()
// start after the previous maximum so the first new id does not collide with it
val addUniqueID: Column = monotonically_increasing_id() + prevMaxId + 1
val finalDF = newDF.withColumn("Sequence_number", addUniqueID)
If you want to use zipWithIndex, you could do something similar (applied to the new data):
//get max from "old" data
val prevMaxId = oldDf.select(max($"Sequence_number")).as[Long].head()
// zipWithIndex starts at 0, so shift past the previous maximum
val finalRDD = newRdd.zipWithIndex().map { case (data, id) => (data, id + prevMaxId + 1) }
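To get back to a DataFrame shaped like the question's finalDF, the same Row.fromSeq / StructType pattern from the question applies. A sketch, assuming newDF is the new batch as a DataFrame, newDF.rdd plays the role of newRdd above, and prevMaxId is taken from the snippets above (all of these are placeholder names):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
// prepend the shifted sequence number to every new row
val indexedRdd = newDF.rdd.zipWithIndex().map { case (row, id) =>
  Row.fromSeq((id + prevMaxId + 1) +: row.toSeq)
}
val incrementalSchema = StructType(StructField("Sequence_number", LongType, true) +: newDF.schema.fields)
val incrementalDF = sparkSession.createDataFrame(indexedRdd, incrementalSchema)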

Apache Spark Count by Group Method

I want to get a listing of values and counts for a specific column (column "a") in a Cassandra table using DataStax and Spark, but I'm having trouble determining the correct method of performing that request.
I'm essentially trying to do the equivalent of this T-SQL:
SELECT a, COUNT(a)
FROM mytable
GROUP BY a
I've tried the following using DataStax and Spark on Cassandra:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").select("a")
rdd.groupBy(row => row.getString("a")).count()
This looks to just give me the count of distinct values in the a column, but I was more after a listing of the values and the counts of those values (so val1:10 ... val2:5 ... val3:12 ... and so forth). I've tried .collect and similar; I'm just not sure how to get the listing there. Any help would be appreciated.
The code snippet below will fetch the partition key named "a", get the column named "column_name", and find the count for it.
val cassandraPartitionKeys = List("a")
val partitionKeyRdd = sc.parallelize(cassandraPartitionKeys)
val cassandraRdd = partitionKeyRdd.joinWithCassandraTable(keyspace,table).map(x => x._2)
cassandraRdd.map(row => (row.getString("column_name"), 1)).countByKey().foreach(println)
It seems like this could be a partial answer (it provides the correct data, but there is likely a better solution):
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").groupBy(row => row.getString("a"))
rdd.foreach { row => println(row._1 + " " + row._2.size) }
I'm assuming there is a better solution, but this looks to work in terms of getting results.
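A hedged sketch of one such better-scaling approach: mapping each row to a (value, 1) pair and summing with reduceByKey gives the per-value counts without grouping whole rows (same keyspace, table, and column names as above):
import com.datastax.spark.connector._
// (value, 1) pairs reduced per key: the RDD equivalent of SELECT a, COUNT(a) ... GROUP BY a
val counts = sc.cassandraTable("mykeyspace", "mytable")
  .map(row => (row.getString("a"), 1L))
  .reduceByKey(_ + _)
counts.collect().foreach { case (value, count) => println(s"$value:$count") }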

How to find the keywords in a text table with Spark?

I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table, composed of some text, which could be a few words or a sentence. Another table (table 2) has a text column. Every row of table 2 could contain more than one keyword from table 1. My task is to find all the keywords from table 1 that match the text column of table 2, and to output the keyword list for every row in table 2.
The problem is that I have to iterate over every row in table 2 and table 1. If I produce a big list for table 1 and use a map function over table 2, I still have to use a loop to iterate over the list inside the map function, and the driver shows a JVM memory limit error, even if the loop is not large (10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}
val matched = result.map(b => ourMap(b, tagList))
Any suggestions for finishing this task, with or without Spark?
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result :
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table may actually contain sentences and not just simple keywords:
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you have provided:
val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq("row1##Spark is a fast and general engine. RDD supports two types of operations.", "row2##All transformations in Spark are lazy.")
val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2).map(_.split("##").toList).map(l => (l.head, l.tail(0))).cache
We'll build an inverted index of the second data table, which we will join to the first table:
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")
val df3: DataFrame = rdd2.flatMap { case (row, text) =>
  text.trim.split("""[^\p{IsAlphabetic}]+""").map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")
import org.apache.spark.sql.functions.explode
val results: RDD[(String, String)] = df3.join(df1, df1("value") === df3("word"))
  .drop("key").drop("value")
  .withColumn("index", explode($"index"))
  .rdd.map { case r: Row => (r.getAs[String]("index"), r.getAs[String]("word")) }
  .groupByKey.mapValues(i => i.toList.mkString(","))
results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comments: the specifications of the issue changed. Keywords are no longer simple keywords; they might be sentences. In that case, this approach wouldn't work; it's a different kind of problem. One way to do it is to use a Locality-Sensitive Hashing (LSH) algorithm for nearest-neighbor search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
From what I could gather from your problem statement, you are trying to tag the data in Table 2 with the keywords that are present in Table 1. For this, instead of loading Table 1 as a list and then pattern matching every keyword against every row of Table 2, do this (a short sketch follows this list):
Load Table 1 as a hashSet.
Traverse Table 2 and, for each word in a phrase, do a lookup in the above hashSet. I assume the number of words you have to look up this way is smaller than the number of pattern matches you would otherwise do per keyword. Remember, a hashSet lookup is an O(1) operation, whereas pattern matching is not.
Also, in this process, you can filter out words like "is, are, when, if", etc., as they will never be used for tagging. That reduces the number of words you need to look up in the hashSet.
The hashSet can be loaded into memory (I think 10K keywords should not take more than a few MBs). This variable can be shared across executors through a broadcast variable.
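A minimal sketch of that idea, using the sample data from the question (the stop-word list here is just an illustrative subset, and all variable names are placeholders):
import org.apache.spark.rdd.RDD
// keywords from table 1 and rows of table 2, taken from the question's example
val keywords = Set("Spark", "RDD")
val stopWords = Set("is", "are", "when", "if", "a", "and", "for", "in", "it", "i", "am", "all", "of", "two")
// broadcast the keyword set so every executor gets a single read-only copy
val keywordsBc = sc.broadcast(keywords)
val table2Rdd: RDD[(String, String)] = sc.parallelize(Seq(
  ("row1", "Spark is a fast and general engine. RDD supports two types of operations."),
  ("row2", "All transformations in Spark are lazy."),
  ("row3", "It is for test. I am a sentence.")))
val tagged = table2Rdd.map { case (rowId, text) =>
  val words = text.split("""[^\p{IsAlphabetic}]+""")
    .filterNot(w => stopWords.contains(w.toLowerCase))          // drop stop words before looking up
  val tags = words.filter(w => keywordsBc.value.contains(w)).distinct  // O(1) set lookups
  (rowId, tags.mkString(","))
}
tagged.collect().foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
// (row3,)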