How to create key-value pairs DStream in Spark Streaming - scala

I'm new to Spark Streaming. I'm working on a project that uses Spark Streaming, where the input is a key-value pair string like "productid,price".
The requirement is to process each line as a separate transaction, with an RDD produced every 1 second.
In each interval I have to calculate the total price for each individual product, like
select productid, sum(price) from T group by productid
My current thought is that I have to do the following steps:
1) split the whole input on "\n":
val lineMap = lines.map{x=>x.split("\n")}
2) split each line on ",":
val recordMap=lineMap.map{x=>x.map{y=>y.split(",")}}
Now I'm confused about how to make the first column the key and the second column the value, and how to use the reduceByKey function to get the total sum.
Please advise.
Thanks

Once you have split each row, you can do something like this:
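rowItems.map { case Array(product, price) => product -> price }
This way you obtain a DStream[(String, String)] on which you can apply pair transformations like reduceByKey (don't forget to import the required implicits). The pattern uses Array because String.split returns an Array; convert price to a number (for example with price.toDouble) before calling reduceByKey if you want a sum rather than string concatenation.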
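A minimal sketch of the parse-then-sum step, using a plain Scala collection as a stand-in for one 1-second batch so that it runs without a cluster. parseRecord is a hypothetical helper (it is not part of the answer above), and the sample lines are made up:

```scala
import scala.util.Try

// Hypothetical helper: turn "productid,price" into a (key, value) pair.
// Lines that do not have exactly two fields, or whose price is not a number, yield None.
def parseRecord(line: String): Option[(String, Double)] =
  line.split(",") match {
    case Array(product, price) =>
      Try(price.trim.toDouble).toOption.map(p => (product.trim, p))
    case _ => None
  }

// Made-up lines standing in for the records of one batch interval.
val batch = Seq("p1,10.5", "p2,3.0", "p1,4.5", "bad line")

// Local equivalent of reduceByKey(_ + _): sum the prices per product id.
val totals: Map[String, Double] =
  batch.flatMap(line => parseRecord(line).toList)
    .groupBy(_._1)
    .map { case (product, rows) => (product, rows.map(_._2).sum) }
```

On a real stream the same helper could be applied as lines.flatMap(line => parseRecord(line).toList).reduceByKey(_ + _), assuming lines is a DStream[String] that holds one record per element.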

Related

Getting Original Column names after Aggregation in Spark with scala Code

I would like to know if there is an easy way to remove aggregate function names from DataFrame column names after an aggregation.
For example: I have a DataFrame on which I have calculated the average of the columns of Int or Double type. However, after the aggregation the columns are named like avg(column_name). I would like to remove avg and the parentheses from the column names.
Do you have any idea how to do it with a simple regex in Spark 2?
Here is how I do it today:
val aggregate =df.groupBy("column_name").agg(aggExprs.head,aggExprs.tail:_*).toDF()
val rename_column = aggregate.columns.foldLeft(aggregate){(newdf, colname) => newdf.withColumnRenamed(colname,colname.replace(",", "").replace(" ", "").replace("last", "").replace(")", "").replace("avg(",""))}
Any help would be welcome
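One possible approach, sketched under the assumption that every aggregated column is named like function(column): a single regex can capture whatever sits inside the outer parentheses. stripAggregate is a hypothetical helper name, and the regex is only an illustration, not something taken from the Spark API:

```scala
// Matches names of the form function(inner), e.g. "avg(price)", and captures "inner".
val AggPattern = """^\w+\((.*)\)$""".r

// Hypothetical helper: "avg(price)" becomes "price"; other names are returned unchanged.
def stripAggregate(name: String): String = name match {
  case AggPattern(inner) => inner
  case other             => other
}

// Made-up column names as they might look after groupBy(...).agg(...).
val renamed = Seq("column_name", "avg(price)", "avg(quantity)").map(stripAggregate)
```

With a DataFrame, the helper could then be applied to all columns at once with aggregate.toDF(aggregate.columns.map(stripAggregate): _*).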

Spark-Scala: Incremental Data load in Spark Scala along with generation of Unique Id

I am using zipWithIndex to generate sequence_number and add it as a separate column.
I am using code similar to below:
val file = sparkSession.createDataFrame(lexusmasterrdd,structSchema)
val filerdd=file.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((((indexedRow._2.toLong+1)).toLong) +: indexedRow._1.toSeq))
val newSchema=StructType(Array(StructField("Sequence_number",LongType,true)).++(file.schema.fields))
val finalDF=sparkSession.createDataFrame(filerdd,newSchema)
I am now trying to come up with logic for an incremental load of the same data:
a simple load where new data is appended to the existing data and sequence numbers continue from the last generated number.
One way to achieve this is by getting the max(Sequence_number) and then adding it to a row_number() function for the new data.
But is there any other way in which I can make use of zipWithIndex in an incremental load?
Some code would be helpful.
I am using Spark 2.3 with Scala.
"One way to achieve this is by getting the max(Sequence_number) and then adding it to a row_number() function for the new data."
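This would work, but it does not scale, because row_number() over an unpartitioned window has to shuffle all records into 1 partition. I would rather use monotonically_increasing_id(), which yields unique, increasing (but not consecutive) ids:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{max, monotonically_increasing_id}
// get max from "old" data
val prevMaxId = oldDf.select(max($"Sequence_number")).as[Long].head()
// monotonically_increasing_id() starts at 0, so add 1 to avoid reusing prevMaxId
val addUniqueID : Column = monotonically_increasing_id() + prevMaxId + 1
val finalDF = newDF.withColumn("Sequence_number",addUniqueID)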
If you want to use zipWithIndex, you could do something similar:
// get max from "old" data
val prevMaxId = oldDf.select(max($"Sequence_number")).as[Long].head()
// index the new rows (newRdd, e.g. newDF.rdd); zipWithIndex starts at 0, hence the + 1
val finalRDD = newRdd.zipWithIndex().map{case (data,id) => (data, id+prevMaxId+1)}
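The offset arithmetic is easy to get wrong by one, so here is a small sketch of it on plain Scala collections (the rows and ids are made up). It assumes, as in the question, that existing sequence numbers start at 1 and that new numbers should continue right after the current maximum:

```scala
// Made-up "old" data: rows that already carry a sequence number.
val existing = Seq(("a", 1L), ("b", 2L), ("c", 3L))
val prevMaxId: Long = existing.map(_._2).max

// Made-up "new" data that still needs sequence numbers.
val incoming = Seq("d", "e")

// zipWithIndex starts at 0, so the first new row gets prevMaxId + 1.
// (On an RDD, zipWithIndex already returns Long indices; on a Seq they are Int.)
val numbered: Seq[(String, Long)] =
  incoming.zipWithIndex.map { case (row, idx) => (row, idx + prevMaxId + 1) }

// Appending keeps the sequence gap-free and without duplicates.
val combined = existing ++ numbered
```

Unlike monotonically_increasing_id(), this keeps the numbers consecutive, which matters only if downstream consumers rely on a gap-free sequence.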

Best way to gain performance when doing a join count using spark and scala

I have a requirement to validate an ingest operation. Basically, I have two big files in HDFS: one is Avro formatted (the ingested files), the other is Parquet formatted (the consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
fileName,anotherField1,anotherField1,anotherField2,anotherFiel3,anotherField14,...,anotherFieldN
If I load both files into DataFrames and then use a naive join-where, the job on my local machine takes more than 24 hours, which is unacceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
What is the best way to achieve this? Dropping columns from the DataFrames before doing the join-where-count? Calculating the counts per DataFrame and then joining and summing?
P.S.
I was reading about the map-side join technique, but it looks like it would only work for me if one of the files were small enough to fit in RAM, and I can't guarantee that. So I would like to know the community's preferred way to achieve this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping down the data to only the field I'm interested in (filename), making a unique set of the filename with the source it comes from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
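import org.apache.spark.sql.functions.lit
import sparkSession.implicits._ // for the $"..." column syntax
// prepare some random dataset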
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count
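To make the meaning of the final number concrete, here is the same tag, union, group and count logic as a sketch on plain Scala collections. The filenames are made up and stand in for the filename column of each dataset:

```scala
// Made-up filenames; "file2" is deliberately repeated in the ingested data.
val ingested     = Seq("file1", "file2", "file2", "file3")
val consolidated = Seq("file2", "file3", "file4")

// Distinct filenames tagged with their source, then concatenated (the "union").
val tagged = ingested.distinct.map(f => (f, "ingested")) ++
  consolidated.distinct.map(f => (f, "consolidated"))

// Occurrences per filename: a count of 2 means the file is present in both sources.
val occurrenceCount = tagged.groupBy(_._1).map { case (file, rows) => (file, rows.size) }
val inBoth = occurrenceCount.count { case (_, n) => n == 2 }
```

Because of the distinct step, this counts the filenames that appear in both datasets. That is not the same number as the row count returned by the original join, which counts every matching pair of rows.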

How to find out the keywords in a text table with Spark?

I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table, composed of some text, which could be single words or a sentence. The other table (table 2) has a text column. Every row of table 2 could contain more than one keyword from table 1. My task is to find all the keywords from table 1 that match the text column in table 2, and output the keyword list for every row in table 2.
The problem is that I have to iterate over every row in table 2 and table 1. If I build a big list from table 1 and use a map function over table 2, I still have to loop over the list inside the map function. The driver then reports a JVM memory limit error, even though the loop is not large (about 10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}
val matched = result.map(b => ourMap(b, tagList))
Any suggestion to finish this task? With or without Spark
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result :
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table actually may contain sentences and not just simple keywords :
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you have provided :
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq("row1##Spark is a fast and general engine. RDD supports two types of operations.", "row2##All transformations in Spark are lazy.")
val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2).map(_.split("##").toList).map(l => (l.head, l.tail(0))).cache
We'll build an inverted index of the second data table which we will join to the first table :
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")
val df3: DataFrame = rdd2.flatMap { case (row, text) =>
  text.trim.split("""[^\p{IsAlphabetic}]+""").map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")
import org.apache.spark.sql.functions.explode
val results: RDD[(String, String)] = df3.join(df1, df1("value") === df3("word"))
  .drop("key").drop("value")
  .withColumn("index", explode($"index"))
  .rdd.map { case r: Row => (r.getAs[String]("index"), r.getAs[String]("word")) }
  .groupByKey.mapValues(i => i.toList.mkString(","))
results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comments, the specification of the problem changed: keywords are no longer single words, they may be sentences. In that case this approach won't work; it's a different kind of problem. One way to tackle it is to use a locality-sensitive hashing (LSH) algorithm for nearest neighbor search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
From what I gather from your problem statement, you are trying to tag the data in Table 2 with the keywords that are present in Table 1. For this, instead of loading Table 1 as a list and pattern matching every keyword against every row of Table 2, do this:
1) Load Table 1 as a hash set.
2) Traverse Table 2 and, for each word in the phrase, look it up in the hash set. I assume the words in a row are far fewer than the keywords you would otherwise pattern match. Remember, a hash set lookup is O(1) on average, whereas pattern matching is not.
In this process you can also filter out words like "is", "are", "when", "if", etc., as they will never be used for tagging. That reduces the number of words you need to look up in the hash set.
The hash set can be loaded into memory (I think 10K keywords should not take more than a few MB) and shared across executors through a broadcast variable.
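The steps above can be sketched on plain Scala collections as follows. This is only an illustration under stated assumptions: matchedTags is a hypothetical helper, the stop-word list is made up, and only single-word tags are handled (the multi-word tags from the question's edit would need a different approach):

```scala
// Tags from table 1 (single words only) and an illustrative stop-word list.
val tags: Set[String] = Set("Spark", "RDD")
val stopWords: Set[String] = Set("is", "are", "when", "if")

// Hypothetical helper: split the text into words, drop stop words,
// and keep the words that are present in the tag set.
def matchedTags(text: String, tags: Set[String]): List[String] =
  text.split("""[^\p{IsAlphabetic}]+""").toList
    .filter(word => word.nonEmpty && !stopWords.contains(word.toLowerCase))
    .filter(word => tags.contains(word))
    .distinct

// Table 2 rows from the question's example.
val table2 = Seq(
  ("row1", "Spark is a fast and general engine. RDD supports two types of operations."),
  ("row2", "All transformations in Spark are lazy."),
  ("row3", "It is for test. I am a sentence.")
)

// Keep only the rows that matched at least one tag.
val result = table2
  .map { case (row, text) => (row, matchedTags(text, tags).mkString(",")) }
  .filter { case (_, found) => found.nonEmpty }
```

In a Spark job the tag set could be shipped once per executor with val bcTags = sc.broadcast(tags), and the helper called inside a map as matchedTags(text, bcTags.value).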

ScalaSpark - Create a pair RDD with a key and a list of values

I have a log file with data like the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
I need to create a pair RDD with the postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.
I need to use mapValues and I did the following:
val namesByPCode = accountsdata.keyBy(line => line.split(',')(8)).mapValues(fields => (fields(0), (fields(4), fields(5)))).collect()
But I'm getting an error. Can someone tell me what is wrong with my statement?
keyBy doesn't change the value, so the value stays a single "unsplit" string. You want to first use map to perform the split (to get an RDD[Array[String]]), and then use keyBy and mapValues as you did on the split result:
val namesByPCode = accountsdata.map(_.split(","))
  .keyBy(_(8))
  .mapValues(fields => (fields(0), (fields(4), fields(5))))
  .collect()
BTW - per your description, sounds like you'd also want to call groupByKey on this result (before calling collect), if you want each zipcode to evaluate into a single record with a list of names. keyBy doesn't perform the grouping, it just turns an RDD[V] into an RDD[(K, V)] leaving each record a single record (with potentially many records with same "key").
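A minimal sketch of the split, key and group sequence on a plain Scala collection, using the two sample lines from the question. It assumes that the first name is field 3, the last name is field 4 and the postal code is field 8, which is how the sample data reads; the question's own code picks different indices:

```scala
// The two sample log lines; triple quotes keep the \N marker literal.
val lines = Seq(
  """1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0""",
  """2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0"""
)

// Split first, then key by postal code with (last name, first name) as the value,
// and finally group so that each postal code maps to a list of names.
val namesByPCode: Map[String, Seq[(String, String)]] =
  lines.map(_.split(","))
    .map(fields => (fields(8), (fields(4), fields(3))))
    .groupBy(_._1)
    .map { case (pcode, pairs) => (pcode, pairs.map(_._2)) }
```

On the RDD itself the equivalent chain would be accountsdata.map(_.split(",")).keyBy(_(8)).mapValues(fields => (fields(4), fields(3))).groupByKey().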