How to map Key/Value pairs between two separate RDDs? - scala

Still a beginner in Scala and Spark, I think I'm just being brainless here. I have two RDDs, one of the type :-
((String, String), Int) = ((" v67430612_serv78i"," fb_201906266952256"),1)
Other of the type :-
(String, String, String) = (r316079113_serv60i,fb_100007609418328,-795000)
As can be seen, the first two columns of the two RDDs have the same format. Basically they are IDs: one is 'tid' and the other is 'uid'.
The question is this :
Is there a method by which I can compare the two RDDs in such a manner that the tid and uid are matched in both and all the data for the same matching ids is displayed in a single row without any repetitions?
Eg : If I get a match of tid and uid between the two RDDs
((String, String), Int) = ((" v67430612_serv78i"," fb_201906266952256"),1)
(String, String, String) = (" v67430612_serv78i"," fb_201906266952256",-795000)
Then the output is:-
((" v67430612_serv78i"," fb_201906266952256",-795000),1)
The IDs in the two RDDs are not in any fixed order. They are random, i.e. the same uid and tid serial number may not correspond in both RDDs.
Also, how will the solution change if the first RDD type stays the same but the second RDD changes to type :-
((String, String, String), Int) = ((daily_reward_android_5.76,fb_193055751144610,81000),1)
I have to do this without the use of Spark SQL.

I would suggest converting your RDDs to DataFrames and applying a join for simplicity.
Your first dataframe should be
+------------------+-------------------+-----+
|tid |uid |count|
+------------------+-------------------+-----+
| v67430612_serv78i| fb_201906266952256|1 |
+------------------+-------------------+-----+
The second dataframe should be
+------------------+-------------------+-------+
|tid |uid |amount |
+------------------+-------------------+-------+
| v67430612_serv78i| fb_201906266952256|-795000|
+------------------+-------------------+-------+
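For completeness, a minimal sketch of that conversion, assuming the SparkSession implicits are in scope and trimming the ID values on the way in (the column names are just my choice):

import spark.implicits._

// rdd1: RDD[((String, String), Int)], rdd2: RDD[(String, String, String)]
val df1 = rdd1.map { case ((tid, uid), count) => (tid.trim, uid.trim, count) }
  .toDF("tid", "uid", "count")
val df2 = rdd2.map { case (tid, uid, amount) => (tid.trim, uid.trim, amount) }
  .toDF("tid", "uid", "amount")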
Then getting the final output is just an inner join:
df2.join(df1, Seq("tid", "uid"))
which will give output as
+------------------+-------------------+-------+-----+
|tid |uid |amount |count|
+------------------+-------------------+-------+-----+
| v67430612_serv78i| fb_201906266952256|-795000|1 |
+------------------+-------------------+-------+-----+
Edited
If you want to do it without DataFrames/Spark SQL, there is a join for RDDs too, but you will have to reshape the data as below:
rdd2.map(x => ((x._1, x._2), x._3)).join(rdd1).map(y => ((y._1._1, y._1._2, y._2._1), y._2._2))
This will work only if you have rdd1 and rdd2 as defined in your question as ((" v67430612_serv78i"," fb_201906266952256"),1) and (" v67430612_serv78i"," fb_201906266952256",-795000) respectively.
You should have the final output as
(( v67430612_serv78i, fb_201906266952256,-795000),1)
Make sure that you trim the values to remove leading/trailing spaces. This ensures both RDDs have the same key values while joining; otherwise you might get an empty result.
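A minimal end-to-end sketch of the RDD-only version with the trimming applied before the join (the intermediate names are mine):

// Key both RDDs by the trimmed (tid, uid) pair, then join.
val left  = rdd2.map { case (tid, uid, amount) => ((tid.trim, uid.trim), amount) }
val right = rdd1.map { case ((tid, uid), count) => ((tid.trim, uid.trim), count) }

val joined = left.join(right)                // ((tid, uid), (amount, count))
  .map { case ((tid, uid), (amount, count)) => ((tid, uid, amount), count) }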

Related

How to convert a Scala Spark Dataframe to LinkedHashMap[String, String]

Below is my dataframe:
val myDF= spark.sql("select company, comp_id from my_db.my_table")
myDF: org.apache.spark.sql.DataFrame = [company: string, comp_id: string]
And the data looks like
+----------+---------+
| company |comp_id |
+----------+---------+
|macys | 101 |
|jcpenny | 102 |
|kohls | 103 |
|star bucks| 104 |
|macy's | 105 |
+----------+---------+
I'm trying to create a Map collection object (like below) in Scala from the above dataframe.
Map("macys" -> "101", "jcpenny" -> "102" ..., "macy's" -> "105")
Questions:
1) Will the sequence of the dataframe records match the sequence of the content in the original file sitting under the table?
2) If I do a collect() on the dataframe, will the sequence of the array being created match the sequence of the content in the original file?
Explanation: When I do df.collect().map(t => t(0) -> t(1)).toMap, it looks like the map collection object doesn't preserve the insertion order, which is also the default behaviour of a Scala Map:
res01: scala.collection.immutable.Map[Any,Any] = Map(kohls -> 103, jcpenny -> 102 ...)
3) So, how do I convert the dataframe into one of Scala's collection map types that actually preserves the insertion order/record sequence?
Explanation: LinkedHashMap is one of the Scala map collection types that ensures insertion order, so I'm trying to find a way to convert the dataframe into a LinkedHashMap object.
You can use LinkedHashMap. From its Scaladoc page:
"This class implements mutable maps using a hashtable. The iterator and all traversal methods of this class visit elements in the order they were inserted."
But DataFrames do not guarantee the order will always be the same.
import scala.collection.mutable.LinkedHashMap

val myMap = LinkedHashMap[String, String]()
// foreach rather than map, since we only care about the side effect of inserting each row
myDF.collect().foreach(t => myMap += (t(0).toString -> t(1).toString))
When you print myMap:
res01: scala.collection.mutable.LinkedHashMap[String,String] = Map(macys -> 101, ..)
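If you prefer to avoid mutating a map inside a loop, the same ordered map can be built in one go from the collected rows; a minimal sketch, assuming both columns are strings:

import scala.collection.mutable.LinkedHashMap

val orderedMap: LinkedHashMap[String, String] =
  LinkedHashMap(myDF.collect().map(r => r.getString(0) -> r.getString(1)): _*)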

How to map three elements of different types in Spark Shell?

After creating a RDD from a textfile, I need to use .map to create a new RDD of type [Int,String,String]...each element split by a comma delim. I don’t understand how to define a RDD with three different data types per record....
So far I have:
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map(i => i.split(,))
If I understand your question correctly, you are reading a text file to create an RDD[String], where each string is a record (line) in the file. However, these records contain an integer value, followed by two string values, with a comma delimiter. (For example, a record might be something like "5,string1,string2".)
An RDD can indeed only have a single type of record. It seems that you want to obtain an RDD[(Int, String, String)], where the type of the RDD is a tuple of an Int, a String, and a String. (This is shorthand for RDD[Tuple3[Int, String, String]], incidentally. If you're unfamiliar with Scala tuples, this link might help.)
Is that correct?
If so, map is an appropriate operation. However, the .split operation will return an Array[String], so the following will result in an RDD[Array[String]] as the type of abc2.
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map(_.split(","))
BTW, the use of the underscore, _, is a shorthand for the following:
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map(s => s.split(","))
In order to get the type you require, you should use an expression something like the following:
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map { s =>
  // Split the string into tokens, delimited by a comma, put result in an array.
  val a = s.split(",")
  // Create a tuple of the expected values, converting the first value to an integer.
  (a(0).toInt, a(1), a(2))
}
Note that this assumes you always have three elements, and that the first is an integer. You will get errors if this is not the case (and you may want to add more error handling).
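If malformed lines are possible, one way to add that error handling is to drop anything that doesn't parse; a sketch using scala.util.Try (this silently skips bad records, which may or may not be what you want):

import scala.util.Try

val abc2 = abc1.flatMap { s =>
  val a = s.split(",")
  // Keep only lines with exactly three fields whose first field parses as an Int.
  if (a.length == 3) Try((a(0).toInt, a(1), a(2))).toOption else None
}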

How to update multiple columns of Dataframe from given set of maps in Scala?

I have below dataframe
val df=Seq(("manuj","kumar","CEO","Info"),("Alice","Beb","Miniger","gogle"),("Ram","Kumar","Developer","Info Delhi")).toDF("fname","lname","designation","company")
or
+-----+-----+-----------+----------+
|fname|lname|designation| company|
+-----+-----+-----------+----------+
|manuj|kumar| CEO| Info|
|Alice| Beb| Miniger| gogle|
| Ram|Kumar| Developer|Info Delhi|
+-----+-----+-----------+----------+
Below are the given maps for the individual columns
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
I also have a list of columns which need to be updated, so my requirement is to update all the columns of the dataframe (df) that are in the given list of columns, using the given maps.
val colList=Iterator("fname","lname","designation","company")
Output must be like
+-----+-----+-----------+--------+
|fname|lname|designation| company|
+-----+-----+-----------+--------+
|Manoj|kumar| CEO|Info Ltd|
|Alice| Bob| Manager| Google|
| Ram|Kumar| Developer|Info Ltd|
+-----+-----+-----------+--------+
Edit: The dataframe may have around 1200 columns and colList will have fewer than 1200 column names, so I need to iterate over colList and update the value of each corresponding column from its corresponding map.
Since DataFrames are immutable, the replacement can be processed progressively, column by column: create a new DataFrame containing an intermediate column with the replaced values, drop the original column, rename the intermediate column to the initial name, and finally overwrite the original DataFrame.
To achieve all this, several steps will be necessary.
First, we'll need a udf that returns a replacement value if it occurs in the provided map:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def replaceValueIfMapped(mappedValues: Map[String, String]) = udf((cellValue: String) =>
  mappedValues.getOrElse(cellValue, cellValue)
)
Second, we'll need a generic function that expects a DataFrame, a column name and its replacements map. This function produces a dataframe with a temporary column, containing replaced values, drops the original column, renames the temporary one to the original name and finally returns the produced DataFrame:
def replaceColumnValues(toReplaceDf: DataFrame, column: String, mappedValues: Map[String, String]): DataFrame = {
  val replacedColumn = column + "_replaced"
  toReplaceDf.withColumn(replacedColumn, replaceValueIfMapped(mappedValues)(col(column)))
    .drop(column)
    .withColumnRenamed(replacedColumn, column)
}
Third, instead of having an Iterator on column names for replacements, we'll use a Map, where each column name is associated with a replacements map:
val colsToReplace = Map(
  "fname" -> fnameMap,
  "lname" -> lnameMap,
  "designation" -> designationMap,
  "company" -> companyMap)
Finally, we can call foldLeft on this map in order to execute all the replacements:
val replacedDf = colsToReplace.foldLeft(df) { case (alreadyReplaced, toReplace) =>
  replaceColumnValues(alreadyReplaced, toReplace._1, toReplace._2)
}
replacedDf now contains the expected result.
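If, as in the edit, only the columns named in colList should be touched, one option is to filter colsToReplace down to those names before folding. A minimal sketch, assuming colList can be materialized into a Set (the Iterator from the question is consumed by toSet):

val requestedCols = colList.toSet
val replacedSelectedDf = colsToReplace
  .filter { case (colName, _) => requestedCols.contains(colName) }
  .foldLeft(df) { case (alreadyReplaced, (colName, replacements)) =>
    replaceColumnValues(alreadyReplaced, colName, replacements)
  }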
To make the lookup dynamic at this level, you'll probably need to change the way you map your values to make them dynamically searchable. I would make a map of maps, with the keys being the names of the columns, as expected to be passed in:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
val allMaps = Map(
  "fname" -> fnameMap,
  "lname" -> lnameMap,
  "designation" -> designationMap,
  "company" -> companyMap)
This may make sense as the maps are relatively small, but you may need to consider using broadcast variables.
You can then dynamically look up based on field names.
(If you've seen that my Scala code is bad, it's because it is. So here's a Java version for you to translate.)
List<String> allColumns = Arrays.asList(df.columns());
// note: in Java, Dataset.map also expects an Encoder<Row> (e.g. RowEncoder.apply(df.schema())) as a second argument
df
  .map(row ->
    // this rewrites the row (that's a warning)
    RowFactory.create(
      allColumns.stream()
        .map(dfColumn -> {
          if (!colList.contains(dfColumn)) {
            // column not requested for mapping, use old value
            return row.get(allColumns.indexOf(dfColumn));
          } else {
            Object colValue =
              row.get(allColumns.indexOf(dfColumn));
            // in case of [2], you'd have to call:
            // row.get(colListToDFIndex.get(dfColumn))
            // Modified value
            return allMaps.get(dfColumn)
              // Assuming strings, you may need to cast
              .getOrDefault(colValue, colValue);
          }
        })
        .collect(Collectors.toList())
        .toArray()
    )
  );
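On the broadcast-variable point above, a minimal Scala sketch of broadcasting allMaps and looking each column's map up inside a UDF (assuming a SparkSession named spark, and df, colList and allMaps as defined earlier):

import org.apache.spark.sql.functions.{col, udf}

val allMapsBc = spark.sparkContext.broadcast(allMaps)

// Fold over the requested column names, rewriting each column via its own map.
val replacedViaBroadcast = colList.foldLeft(df) { (acc, colName) =>
  val replace = udf { (v: String) =>
    allMapsBc.value.get(colName).flatMap(_.get(v)).getOrElse(v)
  }
  acc.withColumn(colName, replace(col(colName)))
}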

Converting a Dataframe to a scala Mutable map doesn't produce equal number of records

I am new to Scala/Spark. I am working on a Scala/Spark application that selects a couple of columns from a hive table and then converts them into a Mutable map, with the first column being the keys and the second column being the values. For example:
+--------+--+
| c1 |c2|
+--------+--+
|Newyork |1 |
| LA |0 |
|Chicago |1 |
+--------+--+
will be converted to Scala.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the Dataframes to collections and perform some comparisons. I am also picking up data from other tables, but they are just single columns, and I have no problem converting them to ArrayBuffer[String]; the counts match.
I don't understand why I am having a problem with the testMap. Generally, the count of rows in the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records in the DF into the Map?
Any help would be appreciated. Thank you.
I believe the mismatch in counts is caused by elimination of duplicated keys (i.e. city names) in Map. By design, Map maintains unique keys by removing all duplicates. For example:
val testDF = Seq(
  ("Newyork", 1),
  ("LA", 0),
  ("Chicago", 1),
  ("Newyork", 99)
).toDF("city", "value")

val testMap = scala.collection.mutable.Map(
  testDF.rdd.map(r => (r(0).toString, r(1).toString)).
    collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
//   Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field in your Map key to make it unique. Depending on your data processing needs, you can also aggregate data into a Map-like dataframe via groupBy, like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the total of valueCount should match the original row count.
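If you'd rather keep every value per key, a driver-side sketch reusing the testDF above (assuming the data is small enough to collect):

val valuesByCity: Map[String, Seq[String]] =
  testDF.rdd
    .map(r => (r(0).toString, r(1).toString))
    .collect()                                  // pulls everything to the driver
    .groupBy { case (city, _) => city }
    .map { case (city, pairs) => city -> pairs.map(_._2).toSeq }
// e.g. Newyork maps to both "1" and "99" here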
If you add entries with a duplicate key to your map, the duplicates are automatically removed. So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)

Passing a list of tuples as a parameter to a spark udf in scala

I am trying to pass a list of tuples to a udf in scala. I am not sure how to exactly define the datatype for this. I tried to pass it as a whole row but it can't really resolve it. I need to sort the list based on the first element of the tuple and then send n number of elements back. I have tried the following definitions for the udf
def udfFilterPath = udf((id: Long, idList: Array[structType[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Array[Tuple2[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Row)
This is what the idList looks like:
[[1234,"Tony"], [2345, "Angela"]]
[[1234,"Tony"], [234545, "Ruby"], [353445, "Ria"]]
This is a dataframe with a 100 rows like the above. I call the udf as follows:
testSet.select("id", "idList").withColumn("result", udfFilterPath($"id", $"idList")).show
When I print the schema for the dataframe it reads it as an array of structs. The idList itself is generated by doing a collect_list over a column of tuples grouped by a key and stored in the dataframe. Any ideas on what I am doing wrong? Thanks!
When defining a UDF, you should use plain Scala types (e.g. Tuples, Primitives...) and not the Spark SQL types (e.g. StructType) as the output types.
As for the input types - this is where it gets tricky (and not too well documented) - an array of tuples would actually be a mutable.WrappedArray[Row]. So - you'll have to "convert" each row into a tuple first, then you can do the sorting and return the result.
Lastly, by your description it seems that id column isn't used at all, so I removed it from the UDF definition, but it can easily be added back.
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val udfFilterPath = udf { idList: mutable.WrappedArray[Row] =>
  // converts the array items into tuples, sorts by first item and returns first two tuples:
  idList.map(r => (r.getAs[Long](0), r.getAs[String](1))).sortBy(_._1).take(2)
}
df.withColumn("result", udfFilterPath($"idList")).show(false)
+------+-------------------------------------------+----------------------------+
|id |idList |result |
+------+-------------------------------------------+----------------------------+
|1234 |[[1234,Tony], [2345,Angela]] |[[1234,Tony], [2345,Angela]]|
|234545|[[1234,Tony], [2345454,Ruby], [353445,Ria]]|[[1234,Tony], [353445,Ria]] |
+------+-------------------------------------------+----------------------------+