Calculating a variable inside RDD after full outer join in Scala

What I want to do is simple, but I struggle with Scala and RDDs.
The concept is this:
rdd1            rdd2
id  count       id  count
a   2           a   1
b   1           c   5
                d   3
And the result I am searching for is this:
rdd2
id  count
a   3
b   1
c   5
d   3
What I intend to do is perform a full outer join to get the common and non-common records, identified by the id field. For now, rdd2 is empty.
rdd1 and rdd2 are:
RDD[(String, org.apache.spark.sql.Row)]
For now, I have the following code:
var rdd3 = rdd1.fullOuterJoin(rdd2).map {
  case (id, left, right) =>
    // TODO
}
How can I calculate that sum between RDDs?

If you are doing a fullOuterJoin you get the key and two Options passed into the closure (one Option represents the left side, the other one the right side). So the closure could look like this:
val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    (id, left.getOrElse(0) + right.getOrElse(0))
}
This applies if your RDDs are of type RDD[(String, Int)].
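Since your RDDs are actually RDD[(String, Row)], a minimal sketch for that case could first pull the count out of the Row and then sum the two Options. This assumes the count sits in a single integer field of each Row, taken here as field 0 (adjust the index or use a named getter for your schema):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Assumption: each Row carries the count as an Int in field 0
def rowCount(r: Row): Int = r.getInt(0)

val summed: RDD[(String, Int)] = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    val l = left.map(rowCount).getOrElse(0)   // 0 when the id only exists in rdd2
    val r = right.map(rowCount).getOrElse(0)  // 0 when the id only exists in rdd1
    (id, l + r)
}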

Related

How to find out the keywords in a text table with Spark?

I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table, composed of some text, which could be a few words or a sentence. Another table (table 2) has a text column. Every row of table 2 could contain more than one keyword from table 1. My task is to find all the matched keywords from table 1 for the text column in table 2, and to output the keyword list for every row in table 2.
The problem is that I have to iterate over every row in table 2 and table 1. If I produce a big list for table 1 and use a map function over table 2, I still have to loop over the list inside the map function, and the driver shows a JVM memory limit error, even though the loop is not large (10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}

val matched = result.map(b => ourMap(b, tagList))
Any suggestions on how to finish this task, with or without Spark?
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result:
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table may actually contain sentences and not just simple keywords:
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you have provided:
val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq(
  "row1##Spark is a fast and general engine. RDD supports two types of operations.",
  "row2##All transformations in Spark are lazy.")

val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2)
  .map(_.split("##").toList)
  .map(l => (l.head, l.tail(0)))
  .cache
We'll build an inverted index of the second data table which we will join to the first table:
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")

val df3: DataFrame = rdd2.flatMap { case (row, text) =>
  text.trim.split("""[^\p{IsAlphabetic}]+""").map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")

import org.apache.spark.sql.functions.explode

val results: RDD[(String, String)] = df3
  .join(df1, df1("value") === df3("word"))
  .drop("key")
  .drop("value")
  .withColumn("index", explode($"index"))
  .rdd
  .map { case r: Row => (r.getAs[String]("index"), r.getAs[String]("word")) }
  .groupByKey
  .mapValues(i => i.toList.mkString(","))

results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comments: the specification of the issue changed. Keywords are no longer simple keywords; they might be sentences. In that case, this approach wouldn't work; it's a different kind of problem. One way to do it is to use a Locality-Sensitive Hashing (LSH) algorithm for nearest neighbor search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
From what I could gather from your problem statement, you are trying to tag the data in Table 2 with the keywords that are present in Table 1. For this, instead of loading Table 1 as a list and then pattern matching every keyword against every row in Table 2, do this (see the sketch after these steps):
1. Load Table 1 as a HashSet.
2. Traverse Table 2 and, for each word in the text, do a lookup in that HashSet. I assume the number of words you have to look up this way is small compared to pattern matching every keyword against every row. Remember, a HashSet lookup is an O(1) operation, whereas pattern matching is not.
3. In this process you can also filter out words like "is", "are", "when", "if", etc., as they will never be used for tagging. That further reduces the number of words you need to look up in the HashSet.
The HashSet can be loaded into memory (10K keywords should not take more than a few MBs) and can be shared across executors through a broadcast variable.
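A minimal sketch of that approach, assuming Table 1 is an RDD[String] of keywords called keywordsRdd and Table 2 an RDD[(String, String)] of (rowId, text) pairs called textRdd (both names are made up for illustration):

import org.apache.spark.rdd.RDD

// Stop words we never want to tag on (illustrative list)
val stopWords = Set("is", "are", "when", "if", "a", "the")

// Broadcast the keyword set so every executor gets one read-only copy
val keywordSet = sc.broadcast(keywordsRdd.collect().toSet)

val tagged: RDD[(String, String)] = textRdd.map { case (rowId, text) =>
  val words = text.split("""[^\p{IsAlphabetic}]+""")
    .filter(w => w.nonEmpty && !stopWords.contains(w.toLowerCase))
  // O(1) lookup per word against the broadcast HashSet
  val tags = words.filter(keywordSet.value.contains).distinct
  (rowId, tags.mkString(","))
}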

Why external sorter works in spark

I wrote data processing code in Scala and Spark and somehow it's very slow. I guess it's because of 'ExternalSort'. As you can see in my code below, there is no reason to sort the data, but Spark did.
I have more than 6,000,000 rows in the RDD and try to group the data by the column 'ID' (of which there are fewer than 20 distinct values, so each ID group would have more than 300,000 rows).
I know it's pretty large data, but the other processing steps were not slow. Any idea why?
val ListByID = allData
  .map { x => (x.getAs[String]("ID"), List(x)) }
  .reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }

val goalData = ListByID.map { rowList =>
  val list = rowList._2
  val ID = rowList._1
  val SD = list.head.getAs[String]("SD")
  val ANOTHER_ID_CNT = list.map { row => row.getAs[String]("ANOTHER_ID") }.distinct.length
  Row(
    ID, ID, list.length,
    list.count { row => row.getAs[Int]("FLAGA") == 1 },
    list.count { row => row.getAs[Int]("FLAGB") == 1 },
    SD, ANOTHER_ID_CNT)
}
The following part:
allData.map{...}.reduceByKey{ (a: List[Row], b: List[Row]) => List(a, b).flatten }
is just a significantly more expensive implementation of groupByKey. Not only does it put more pressure on the GC by applying map-side aggregations, it may also create a huge number of temporary objects. If a single group doesn't fit into memory, an out-of-memory error is inevitable.
Next, you group the data and drag along all the fields when all you do later is counting. This could be handled easily with a simple aggregation (a sketch follows after these two steps):
1. Reduce by (ID, ANOTHER_ID), counting FLAGA=1 and FLAGB=1 and keeping a single SD.
2. Reduce the result of step 1 by ID: sum the FLAGA=1 counts, the FLAGB=1 counts, and 1 per record (which gives the distinct ANOTHER_ID count), keeping an arbitrary SD.
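A minimal sketch of that two-step reduction, assuming allData is the RDD[Row] from the question with the columns used above:

// Step 1: one record per (ID, ANOTHER_ID), with partial FLAGA/FLAGB counts and an SD
val byIdAndAnotherId = allData
  .map { row =>
    val key = (row.getAs[String]("ID"), row.getAs[String]("ANOTHER_ID"))
    val flagA = if (row.getAs[Int]("FLAGA") == 1) 1L else 0L
    val flagB = if (row.getAs[Int]("FLAGB") == 1) 1L else 0L
    (key, (flagA, flagB, 1L, row.getAs[String]("SD")))
  }
  .reduceByKey { case ((a1, b1, n1, sd), (a2, b2, n2, _)) => (a1 + a2, b1 + b2, n1 + n2, sd) }

// Step 2: collapse to one record per ID; the extra 1L per key yields the distinct ANOTHER_ID count
val byId = byIdAndAnotherId
  .map { case ((id, _), (flagA, flagB, n, sd)) => (id, (flagA, flagB, n, 1L, sd)) }
  .reduceByKey { case ((a1, b1, n1, d1, sd), (a2, b2, n2, d2, _)) =>
    (a1 + a2, b1 + b2, n1 + n2, d1 + d2, sd)
  }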
Finally, if you start with a DataFrame, why move the data to a less efficient format at all? In pseudocode:
df.groupBy("ID").agg(
  count($"*"),
  count(when($"FLAGA" === 1, 1)),
  count(when($"FLAGB" === 1, 1)),
  countDistinct("ANOTHER_ID"),
  first("SD")
)
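A slightly more complete, runnable version of that pseudocode (a sketch, assuming the DataFrame is called df and has the columns from the question; the column aliases are made up for readability):

import org.apache.spark.sql.functions.{col, count, countDistinct, first, when}

val goalDf = df.groupBy("ID").agg(
  count("*").as("TOTAL_CNT"),                         // rows per ID
  count(when(col("FLAGA") === 1, 1)).as("FLAGA_CNT"), // rows with FLAGA = 1 (when() is null otherwise, count ignores nulls)
  count(when(col("FLAGB") === 1, 1)).as("FLAGB_CNT"), // rows with FLAGB = 1
  countDistinct("ANOTHER_ID").as("ANOTHER_ID_CNT"),   // distinct ANOTHER_ID values per ID
  first("SD").as("SD")                                // an arbitrary SD per ID
)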

Join two RDD in spark

I have two RDDs; one RDD has just one column, the other has two columns. To join the two RDDs on keys I have added a dummy value which is 0. Is there any other, more efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me convert this question into SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we would write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL, table1 has only one column whereas table2 has two columns, and the join still works. In the same way, I want to join on keys from both RDDs in Spark.
The join operation is defined only on PairwiseRDDs, which are quite different from a relation/table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects, as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause, while the value contains the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and the value in a PairwiseRDD each have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around needing some value.
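For instance, a small sketch (reusing lines and movienames from your code) that uses the empty tuple () as the placeholder value, which at least makes the intent explicit:

// Pair RDD whose key is the movie id and whose value intentionally carries no data
val moviesKeyed = lines.map(_.split("\t")).map(x => (x(1), ()))

// The join yields (id, (name, ())); keep only what you need afterwards
val joined = movienames.join(moviesKeyed).mapValues { case (name, _) => name }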
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
  lines.map(x => x.split("\t")).map(_.head).collect.toSet)

movienames.filter { case (id, _) => moviesidBD.value contains id }
but if you really want SQL-ish joins then you should simply use SparkSQL.
val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a.head))
  .toDF("id")

val movienamesDf = movienames.toDF("id", "name")

// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
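If you also want the count(1) / group by from your SQL example, a minimal sketch on top of the DataFrames defined above could look like this (the alias cnt is made up):

import org.apache.spark.sql.functions.count

// Equivalent of: select moviename, movieid, count(1) ... group by movieid, moviename
movienamesDf
  .join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("id"), movienamesDf("name"))
  .agg(count("*").as("cnt"))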
On RDDs, the join operation is only defined for PairwiseRDDs, so you need to convert your data to pair RDDs. Below is a sample:
val rdd1 = sc.textFile("/data-001/part/")
val rdd_1 = rdd1.map(x => x.split('|')).map(x => (x(0), x(1)))

val rdd2 = sc.textFile("/data-001/partsupp/")
val rdd_2 = rdd2.map(x => x.split('|')).map(x => (x(0), x(1)))

rdd_1.join(rdd_2).take(2).foreach(println)

search rdd for value from another rdd

I am using Spark + Scala. My rdd1 has customer info, i.e. (id, [name, address]). rdd2 has only the names of high-profile customers. Now I want to find out whether each customer in rdd1 is high profile or not. How can I search one RDD using another? Joining the RDDs does not look like a good solution to me.
My code:
val result = rdd1.map { case (id, customer) =>
  customer.foreach(c =>
    rdd2.filter(_ == c._1).count() != 0)
}
Error:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations;
You have to collect one RDD and broadcast it; broadcast the smaller RDD to improve performance.
val bcastRdd = sc.broadcast(rdd2.collect.toSet)

val result = rdd1.map { case (id, (name, address)) =>
  // true when the name appears in the broadcast set of high-profile names
  (id, (name, address, bcastRdd.value.contains(name)))
}
You can use a left outer join to avoid an expensive operation such as collect (if your RDDs are big).
Also, as Daniel pointed out, a broadcast is not necessary.
Here is a snippet that obtains rdd1 with a flag denoting whether each customer is a high-profile or a low-profile customer.
val highProfileFlag = 1
val lowProfileFlag = 0

// Keying rdd1 by the name
val rdd1Keyed = rdd1.map { case (id, (name, address)) => (name, (id, address)) }

// Keying rdd2 by the name and adding a high profile flag
val rdd2Keyed = rdd2.map { case name => (name, highProfileFlag) }

// The join you are looking for is the left outer join
val rdd1HighProfileFlag = rdd1Keyed
  .leftOuterJoin(rdd2Keyed)
  .map { case (name, ((id, address), highProfileOpt)) =>
    val profileFlag = highProfileOpt.getOrElse(lowProfileFlag)
    (id, (name, address, profileFlag))
  }
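A quick usage sketch with made-up sample data (ids, names and addresses invented purely for illustration):

// Sample inputs, defined before running the snippet above
val rdd1 = sc.parallelize(Seq(
  ("1", ("Alice", "1 Main St")),
  ("2", ("Bob", "2 High St"))))
val rdd2 = sc.parallelize(Seq("Alice"))

rdd1HighProfileFlag.collect.foreach(println)
// e.g. (1,(Alice,1 Main St,1))
//      (2,(Bob,2 High St,0))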

scala operations on part of RDD

I am new to Scala and I am trying to do something for a project:
I generated an RDD:
[UserID1, Date1, Value1]
[UserID1, Date2, Value2]
[UserID1, Date3, Value3]
[UserID2, Date1, Value1]
[UserID3, Date1, Value1]
I wish to run a function on this RDD that generates an RDD like this:
[UserID1, FunctionResult1, FunctionResult2]
[UserID1, FunctionResult3, FunctionResult4]
[UserID2, FunctionResult1, FunctionResult2]
The way this function should work is:
1. Group by UserID and sort the dates in ascending order (I have already formatted the date as an Int, e.g. 20150225).
2. Take the first date and the second date and find the number of days between them.
3. Take the first value and the second value and find the difference between them.
Record these values as function results, then proceed to process the second and third dates and the second and third values.
If the input is 5 rows x 3 columns, the result should be 4 rows x 3 columns.
So far, I have tried reduceByKey on the RDD, but it only generates a single row per key in the output. So I was wondering if there is another efficient way to do this, perhaps without looping? My current code looks like this:
// x._1 is the User ID, x._2 is the Date, x._3 is the Value
def funcdiff(a: (Int, Int), b: (Int, Int)): (Int, Int) = {
  // rough number of days between the two Int-formatted (yyyyMMdd) dates
  val diffdays = (b._1 % 100 - a._1 % 100) +
    ((b._1 / 100) % 100 - (a._1 / 100) % 100) * 30 +
    ((b._1 / 10000) - (a._1 / 10000)) * 365
  val diffvalue = Math.abs(a._2 - b._2)
  (diffdays, diffvalue)
}

val basicsearchprofile = basicsearch
  .map(x => (x._1, (x._2, x._3)))
  .reduceByKey((a, b) => funcdiff(a, b))
I assume that reduceByKey applies funcdiff to the events pairwise and eventually reduces each key to a single row? Is it possible to apply funcdiff to the first and second rows, record the answer, then apply it to the second and third rows, and so on, such that the returned result is an RDD of [ID, Datediff, Valuediff]?
Thanks in advance
Spark processes rows in parallel. As you have to compute row2-row1, row3-row2, I think you cannot work fully in parallel anymore. So you'll have to forget Spark a bit, use plain Scala and process each user's whole data on a single node (each user can still be processed in parallel, though). For instance:
// First, group by user with Spark
case class Info(userId: String, date: Int, value: Int)

val infos = List(
  Info("john", 20150221, 10),
  Info("mary", 20150221, 11),
  Info("john", 20150222, 12),
  Info("mary", 20150223, 15),
  Info("john", 20150223, 14),
  Info("john", 20150224, 16),
  Info("john", 20150225, 18),
  Info("mary", 20150225, 17))

val infoRdd = sc.parallelize(infos)
val infoByIdRdd = infoRdd.map(info => (info.userId, info)).groupByKey()

// Then use plain Scala to process each user's data
def infoDeltas(infos: List[Info]) = {
  // Transform [Info1, Info2, Info3] into [(Info1,Info2),(Info2,Info3)]
  val accZero: (Option[Info], List[(Info, Info)]) = (None, List())
  def accInfo(last: Option[Info], list: List[(Info, Info)], info: Info) = {
    last match {
      case None => (Some(info), list)
      case Some(lastInfo) => (Some(info), list :+ ((lastInfo, info)))
    }
  }
  val infoIntervals = infos.foldLeft(accZero)(
    (acc, info) => accInfo(acc._1, acc._2, info)
  )._2
  // Transform [(Info1,Info2),(Info2,Info3)] into [Info2-Info1,Info3-Info2]
  infoIntervals.map { case (before, after) =>
    Info(after.userId, after.date - before.date, after.value - before.value)
  }
}
// Sort each user's records by date so the deltas follow ascending date order, as required
val infoDeltasByIdRdd = infoByIdRdd.mapValues(infos => infoDeltas(infos.toList.sortBy(_.date)))
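Equivalently, here is a minimal sketch of the pairing step written with zip instead of foldLeft (same Info case class and infoByIdRdd as above; like the answer's code, the deltas are raw differences of the Int-formatted dates), which some may find easier to read:

def infoDeltasZip(infos: List[Info]): List[Info] = {
  val sorted = infos.sortBy(_.date)
  // Pair each record with its successor: [I1, I2, I3] -> [(I1,I2), (I2,I3)]
  sorted.zip(sorted.drop(1)).map { case (before, after) =>
    Info(after.userId, after.date - before.date, after.value - before.value)
  }
}

val infoDeltasByIdRdd2 = infoByIdRdd.mapValues(infos => infoDeltasZip(infos.toList))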