Spark table transformation (ERROR: 5063) - scala

I have the following data:
val RDDApp = sc.parallelize(List("A", "B", "C"))
val RDDUser = sc.parallelize(List(1, 2, 3))
val RDDInstalled = sc.parallelize(List((1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"))).groupByKey
val RDDCart = RDDUser.cartesian(RDDApp)
I want to map this data so that I have an RDD of tuples with (userId, Boolean if the letter is given for user). I thought I found a solution with this:
val results = RDDCart.map (entry =>
(entry._1, RDDInstalled.lookup(entry._1).contains(entry._2))
)
If I call results.first, I get org.apache.spark.SparkException: SPARK-5063. I see the problem with the Action within the Mapping function but do not know how I can work around it so that I get the same result.

Just join and mapValues:
RDDCart.join(RDDInstalled).mapValues{case (x, xs) => xs.toSeq.contains(x)}
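The SPARK-5063 error arises because lookup is an action on another RDD, and RDD operations can only be invoked from the driver, not from inside a transformation such as map. The join avoids that by bringing the installed apps next to each (user, app) pair first. As a rough illustration of the same logic, here is a minimal sketch on plain Scala collections rather than Spark; the object name InstalledFlags and its fields are made up for this example. Note one difference: a Spark inner join drops users that have no entry in RDDInstalled, while this sketch keeps them with false.

```scala
object InstalledFlags {
  // Local stand-ins for the RDDs in the question (plain collections, not Spark).
  val apps: List[String] = List("A", "B", "C")
  val users: List[Int] = List(1, 2, 3)

  // Equivalent of groupByKey: user id -> set of installed apps.
  val installed: Map[Int, Set[String]] =
    List((1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"))
      .groupBy(_._1)
      .map { case (user, pairs) => user -> pairs.map(_._2).toSet }

  // Equivalent of RDDUser.cartesian(RDDApp).
  val cart: List[(Int, String)] = for (u <- users; a <- apps) yield (u, a)

  // Equivalent of join + mapValues: look up the user's apps, then test membership.
  val results: List[(Int, Boolean)] = cart.map { case (user, app) =>
    (user, installed.getOrElse(user, Set.empty[String]).contains(app))
  }
}
```

InstalledFlags.results holds one (userId, Boolean) pair per user/app combination, which is the shape the question asks for.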

Related

printing length of string of Tuples in Scala

I am a newbie to Scala.
I have a Tuple[Int, String]
((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
For the above list, I want to print all strings where the corresponding length is 4.
To print each string together with its length:
val tuples = List((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
println(tuples.map(x => (x._2, x._2.length)))
//List((alpha,5), (beta,4), (gamma,5), (zeta,4), (omega,5))
To print only the strings whose length is 4, filter first and then print:
val tuples = List((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
tuples.filter(_._2.length == 4).foreach(x => println(x._2))
This should print:
beta
zeta
You can convert your Tuple to List and then map and filter as you need:
tuple.productIterator.toList
.map{case (a,b) => b.toString}
.filter(_.length==4)
Example:
For the given input:
val tuple = ((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
tuple: ((Int, String), (Int, String), (Int, String), (Int, String), (Int, String)) = ((1,alpha),(2,beta),(3,gamma),(4,zeta),(5,omega))
Output:
List[String] = List(beta, zeta)
Let's suppose you have a list of tuples, and you need all the values whose string length is equal to 4.
You can do a filter on the list:
val filteredList = list.filter(_._2.length == 4)
And then iterate over each element to print them:
filteredList.foreach(tuple => println(tuple._2))
Here is a way to achieve this:
scala> val x = ((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
x: ((Int, String), (Int, String), (Int, String), (Int, String), (Int, String)) = ((1,alpha),(2,beta),(3,gamma),(4,zeta),(5,omega))
scala> val y = x.productIterator.toList.collect{
case (_: Int, s: String) if s.length == 4 => s // match on the element types to avoid an unchecked (erasure) warning
}
y: List[String] = List(beta, zeta)

Partial/Full-match value in one RDD to values in another RDD

I have two RDDs where the first RDD has records of the form
RDD1 = (1, 2017-2-13,"ABX-3354 gsfette"
2, 2017-3-18,"TYET-3423 asdsad"
3, 2017-2-09,"TYET-3423 rewriu"
4, 2017-2-13,"ABX-3354 42324"
5, 2017-4-01,"TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ('mfr1',"ABX-3354")
('mfr2',"TYET-3423")
I need to find all the records in RDD1 whose third column fully or partially matches the second column of a record in RDD2, and get the count of matches for each RDD2 value.
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?
I am posting a couple of solutions using Spark SQL that focus on accurately matching the search string within the given text.
1: Using CrossJoin
import org.apache.spark.sql.Row // needed for the matcher function below
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-335442324"), //changed from "ABX-3354 42324"
(5, "2017-4-01", "aerrTYET-3423") //changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).toDF("col1", "key")
//match function for filter
def matcher(row: Row): Boolean = row.getAs[String]("txt")
.contains(row.getAs[String]("key"))
val join = df1.crossJoin(df2)
import org.apache.spark.sql.functions.count
val result = join.filter(matcher _)
.groupBy("key")
.agg(count("txt").as("count"))
2: Using Broadcast variable
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "aerrTYET-3423"),
(6, "2017-4-01", "aerrYET-3423")
).toDF("id", "dt", "pattern")
//small dataset to broadcast
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).map(_._2) // considering only 2 values in pair
//Lookup to use in UDF
val lookup = spark.sparkContext.broadcast(df2)
//Udf
import org.apache.spark.sql.functions._
val matcher = udf((txt: String) => {
val matches: Seq[String] = lookup.value.filter(txt.contains(_))
matches.headOption.orNull // first matching key, or null when nothing matches
})
val result = df1.withColumn("match", matcher($"pattern"))
.filter($"match".isNotNull) // not interested in non matching records
.groupBy("match")
.agg(count("pattern").as("count"))
Both solutions produce the same counts (in the second solution the grouping column is named match instead of key):
result.show()
+---------+-----+
|      key|count|
+---------+-----+
|TYET-3423|    3|
| ABX-3354|    2|
+---------+-----+
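To make the matching rule itself easier to see, here is a small sketch on plain Scala collections (not Spark) that counts, for every key, the texts containing it as a substring. The object name PartialMatchCount is hypothetical and the sample texts are taken from the second solution. It follows the cross-join semantics: a text containing several keys would be counted once per key, whereas the UDF version assigns each text only to its first matching key.

```scala
object PartialMatchCount {
  // Sample texts from the broadcast example, including one non-matching row.
  val texts: List[String] = List(
    "ABX-3354 gsfette",
    "TYET-3423 asdsad",
    "TYET-3423 rewriu",
    "ABX-3354 42324",
    "aerrTYET-3423",
    "aerrYET-3423"
  )

  val keys: List[String] = List("ABX-3354", "TYET-3423")

  // For each key, count the texts that contain it anywhere in the string.
  val counts: Map[String, Int] =
    keys.map(key => key -> texts.count(_.contains(key))).toMap
}
```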
Here is how you can get the result
val RDD1 = spark.sparkContext.parallelize(Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "TYET-3423 aerr")
))
val RDD2 = spark.sparkContext.parallelize(Seq(
("mfr1","ABX-3354"),
("mfr2","TYET-3423")
))
RDD1.map(r =>{
(r._3.split(" ")(0), (r._1, r._2, r._3))
})
.join(RDD2.map(r => (r._2, r._1)))
.groupBy(_._1)
.map(r => (r._1, r._2.toSeq.size))
.foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
Hope this helps!

read individual elements of a tuple from a map((tuple),(tuple)) in scala

The output of reduceByKey is a ShuffledRDD in which both the key and the value are tuples of multiple fields. I need to extract all the fields and write them to a Hive table.
Below is the code which I was trying:
val USAGE_DATA = sqlContext.sql(s"select SUBS_CIRCLE_ID,SUBS_MSISDN,EVENT_START_DT,RMNG_NW_OP_KEY, ACCESS_TYPE FROM FACT.FCT_MEDIATED_USAGE_DATA")
val USAGE_DATA_Reduce = USAGE_DATA.map{ USAGE_DATA => ((USAGE_DATA.getShort(0), USAGE_DATA.getString(1),USAGE_DATA.getString(2)),
(USAGE_DATA.getInt(3), USAGE_DATA.getInt(4)))}.reduceByKey((x, y) => (math.min(x._1, y._1), math.max(x._2,y._2)))
The final output I am expecting is all five fields:
SUBS_CIRCLE_ID,SUBS_MSISDN,EVENT_START_DT, MINVAL, MAXVAL
so that it can be inserted directly into the Hive table.
If you mean:
Given a RDD[(TupleN, TupleM)], how do I map each record's elements of both key and value tuples into a single concatenated string?
Here's a simplified version; you should be able to extrapolate from it to solve your problem:
import org.apache.spark.rdd.RDD // for the RDD[String] type annotation below
val keyValueRdd = sc.parallelize(Seq(
(1, "key1") -> (10, "value1", "A"),
(2, "key2") -> (20, "value2", "B"),
(3, "key3") -> (30, "value3", "C")
))
val asStrings: RDD[String] = keyValueRdd.map {
case ((k1, k2), (v1, v2, v3)) => List(k1, k2, v1, v2, v3).mkString(",")
}
asStrings.foreach(println)
// prints (order may vary):
// 3,key3,30,value3,C
// 2,key2,20,value2,B
// 1,key1,10,value1,A
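If a flat five-field record is preferred over a concatenated string, the same pattern match can produce a tuple instead. The sketch below uses plain Scala collections in place of the RDD, with made-up sample values and a hypothetical object name FlattenReduced; the groupBy plus reduce step only stands in for reduceByKey.

```scala
object FlattenReduced {
  // Hypothetical records shaped like the question's pairs:
  // ((circleId, msisdn, startDate), (firstValue, secondValue))
  val records: List[((Short, String, String), (Int, Int))] = List(
    ((1.toShort, "111", "2017-01-01"), (5, 7)),
    ((1.toShort, "111", "2017-01-01"), (3, 9)),
    ((2.toShort, "222", "2017-01-02"), (4, 4))
  )

  // Local stand-in for reduceByKey: min of the first value, max of the second.
  val reduced: Map[(Short, String, String), (Int, Int)] =
    records.groupBy(_._1).map { case (key, rows) =>
      key -> rows.map(_._2).reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2)))
    }

  // Flatten key and value tuples into one five-field record, sorted by circle id.
  val flat: List[(Short, String, String, Int, Int)] =
    reduced.toList
      .map { case ((circle, msisdn, date), (minVal, maxVal)) => (circle, msisdn, date, minVal, maxVal) }
      .sortBy(_._1)
}
```

On a real RDD the same case pattern can be used inside map; how the flattened records are then written to Hive depends on the Spark version and is not covered here.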

Spark : anti-join two DStreams

I can do JOINs on two Spark DStreams like :
val joinStream = stream1.join(stream2)
Now, what if I need to extract all the records that weren't joined? Essentially, something like stream1.anti-join(stream2). Is this possible somehow?
Thanks and appreciate any help!
Assuming you had these:
val rdd1 = sc.parallelize(Array(
(1, "one"),
(2, "twow"),
(3, "three"),
(4, "four"),
(5, "five")
))
val rdd2 = sc.parallelize(Array(
(1, "otherOne"),
(4, "otherFour"),
(5,"otherFive"),
(6,"six"),
(7,"seven")
))
val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)
antiJoined.collect foreach println
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))
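The fullOuterJoin above keeps unmatched records from both sides. If only the records of the first input that have no partner in the second are wanted (a left anti-join), the logic can be sketched on plain Scala collections as below; the object name AntiJoinSketch is hypothetical. In Spark itself, subtractByKey on pair RDDs performs this kind of key-based subtraction, and it could presumably be applied per batch of two DStreams through transformWith, though that is an assumption not tested here.

```scala
object AntiJoinSketch {
  val left: List[(Int, String)] = List(
    (1, "one"), (2, "twow"), (3, "three"), (4, "four"), (5, "five")
  )
  val right: List[(Int, String)] = List(
    (1, "otherOne"), (4, "otherFour"), (5, "otherFive"), (6, "six"), (7, "seven")
  )

  // Keys present on the right side.
  val rightKeys: Set[Int] = right.map(_._1).toSet

  // Left anti-join: keep left records whose key never appears on the right.
  val leftAnti: List[(Int, String)] =
    left.filterNot { case (key, _) => rightKeys.contains(key) }
}
```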

Join two lists with unequal length in Scala

I have 2 lists:
val list_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val list_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
I want to replace each key in the second list with the corresponding value from the first list, resulting in a new list that looks like:
List ((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
This is my attempt:
object UniqueTest {
def main(args: Array[String]){
val l_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val l_2 = List((1, 111), (2,122), (3, 133), (4, 144), (1, 123), (2, 234))
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
print(l_3)
}
def f(i: Int, list: List[(Int, Int)]): Int = {
for(pair <- list){
if(i == pair._1){
return pair._2
}
}
return 0
}
}
This results in:
List((11,111), (12,122), (13,133), (14,144), (11,123), (12,234))
Is the program above a good way to do this? Are there built-in functions in Scala to handle this need, or another way to do this manipulation?
The only real over-complication you make is this line:
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
Your f function uses an imperative style to loop over a list to find a key. Any time you find yourself doing this, it's a good indication that what you want is a map. Running the for loop for every element multiplies the cost by the length of the list, whereas a map lets you fetch the value for a given key in effectively constant time. With a map you first convert your list of key-value pairs to a data structure that is explicit about supporting the key-value relationship.
Thus, the first thing you should do is build your map. Scala provides a really easy way to do this with toMap:
val map_1 = list_1.toMap
Then it is just a matter of 'mapping':
val result = list_2.map { case (key, value) => (map_1.getOrElse(key, 0), value) }
This takes each pair in your list_2, matches the first value (key) to a key in your map_1, retrieves that value (or the default 0) and puts it as the first value in a key-value tuple.
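The lookup approach described above can be put together as a short self-contained sketch. The object name KeyReplace and the extra pair (9, 999) are made up for illustration; the second variant shows one possible way to drop pairs whose key is missing instead of defaulting to 0.

```scala
object KeyReplace {
  val list1: List[(Int, Int)] = List((1, 11), (2, 12), (3, 13), (4, 14))
  val list2: List[(Int, Int)] = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))

  // Build the key -> value lookup once.
  val lookup: Map[Int, Int] = list1.toMap

  // Replace each key, falling back to 0 when the key is unknown.
  val withDefault: List[(Int, Int)] =
    list2.map { case (key, value) => (lookup.getOrElse(key, 0), value) }

  // Alternative: silently drop pairs whose key is not in the lookup.
  // The pair 9 -> 999 is a hypothetical entry with an unknown key.
  val dropMissing: List[(Int, Int)] =
    (list2 :+ (9 -> 999)).flatMap { case (key, value) =>
      lookup.get(key).map(newKey => (newKey, value))
    }
}
```

Whether a default or dropping is the better choice depends on what an unknown key should mean in your data.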
You can do:
val map = l_1.toMap // transform l_1 to a Map[Int, Int]
// for each (a, b) in l_2, retrieve the new value v of a and return (v, b)
val res = l_2.map { case (a, b) => (map.getOrElse(a, 0), b) }
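Zipping them together and then transforming is concise, but note that zip pairs elements by position and stops at the end of the shorter list, so for the lists above it yields only the first four pairs and ignores the keys entirely: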
(list_1 zip list_2) map { case ((k1, v1), (k2, v2)) => (v1, v2) }