How to convert a Scala Spark Dataframe to LinkedHashMap[String, String] - scala

Below is my dataframe:
val myDF= spark.sql("select company, comp_id from my_db.my_table")
myDF: org.apache.spark.sql.DataFrame = [company: string, comp_id: string]
And the data looks like
+----------+---------+
| company |comp_id |
+----------+---------+
|macys | 101 |
|jcpenny | 102 |
|kohls | 103 |
|star bucks| 104 |
|macy's | 105 |
+----------+---------+
I'm trying to create a Map collection object (like below) in Scala from the above dataframe.
Map("macys" -> "101", "jcpenny" -> "102" ..., "macy's" -> "105")
Questions:
1) Will the order of the dataframe records match the order of the content in the original file backing the table?
2) If I do a collect() on the dataframe, will the order of the resulting array match the order of the content in the original file?
Explanation: When I do df.collect().map(t => t(0) -> t(1)).toMap, the resulting map collection doesn't preserve insertion order, which is the default behaviour of a Scala Map:
res01: scala.collection.immutable.Map[Any,Any] = Map(kohls -> 103, jcpenny -> 102 ...)
3) So, how can I convert the dataframe into one of Scala's map collections that actually preserves the insertion order/record sequence?
Explanation: LinkedHashMap is one of the Scala map collection types that guarantees insertion order, so I'm trying to find a way to convert the dataframe into a LinkedHashMap.

You can use LinkedHashMap; from its Scaladoc page:
"This class implements mutable maps using a hashtable. The iterator and all traversal methods of this class visit elements in the order they were inserted."
But DataFrames do not guarantee the order will always be the same.

import scala.collection.mutable.LinkedHashMap

// LinkedHashMap itself is mutable, so a val is enough; use foreach since we
// only need the side effect of adding each row
val myMap = LinkedHashMap[String, String]()
myDF.collect().foreach(t => myMap += (t(0).toString -> t(1).toString))
When you print myMap, you get:
res01: scala.collection.mutable.LinkedHashMap[String,String] = Map(macys -> 101, ..)
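If you prefer to avoid the mutable accumulation step, an equivalent sketch (assuming the same myDF) builds the LinkedHashMap in one pass from the collected rows:
import scala.collection.mutable.LinkedHashMap

// Build the LinkedHashMap directly from the collected rows; the varargs
// constructor inserts entries in the order of the collected array.
val orderedMap: LinkedHashMap[String, String] =
  LinkedHashMap(myDF.collect().map(r => r.getString(0) -> r.getString(1)): _*)
Either way, the resulting order is only as good as what collect() returns, which, as noted above, Spark does not guarantee.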

Related

Convert Spark Dataframe to Scala Map of key, list of values

I have a dataframe of the form:
Abc | apple
Abc | mango
xyz | grapes
xyz | peach
I want to convert this dataframe into a Scala map of (key, list of values), e.g. (Abc -> (apple, mango), xyz -> (grapes, peach)).
My code :
concatenatedLogs.collect.map( r => {
  val key = r(0).toString
  val value = r(1).toString
  var currList = testMap.getOrElse(key, List[String]())
  currList = value :: currList
  testMap += (key -> currList)
})
It gives me Java heap space out of memory error. Is there a more efficient and easy way to do this ?
Spark is a distributed processing framework for dealing with a lot of data. Spark processes the data on a cluster, and when you call the collect function, all the data that was read on the different cores/machines is brought back to the driver. When you do this, you need to make sure you have enough memory on your driver.
What you are doing is highly inefficient, because you are collecting the entire dataframe to the driver and then doing transformations on it. Using Spark, you could do something similar with the code below:
import org.apache.spark.sql.functions.{col, collect_list}
import spark.implicits._   // for toDF and the tuple encoder

val someDF = Seq(
  ("Abc", "apple"),
  ("Abc", "mango"),
  ("xyz", "grapes"),
  ("xyz", "peach")
).toDF("group", "fruit")

// group on the executors and collect only the already-aggregated rows to the driver
val s = someDF
  .groupBy(col("group"))
  .agg(collect_list("fruit").as("fruits"))
  .as[(String, List[String])]
  .collect
  .toMap
The output of this is:
Map(Abc -> List(apple, mango), xyz -> List(grapes, peach))
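If you need to stay with the RDD API instead of the DataFrame aggregation above, a similar grouping can be pushed to the executors before collecting. This is only a sketch, assuming the same someDF:
// Group on the executors with the RDD API and collect only the already-grouped pairs.
val grouped: Map[String, List[String]] =
  someDF.rdd
    .map(r => (r.getString(0), r.getString(1))) // (group, fruit) pairs
    .groupByKey()                               // one (key, values) pair per group
    .mapValues(_.toList)
    .collectAsMap()
    .toMap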

Get Seq index from element in Spark Sql

I have a dataframe as shown below:
---------------------+------------------------
text | featured_text
---------------------+------------------------
sun | [type, move, sun]
---------------------+------------------------
I want to search the "text" column value in "featured_text" Array and get the index of the "text" value if present. In the above example, I want to search for "sun" in Array [type, move, sun] and result will be "2" (index).
Is there any spark sql function/scala function available to get the index from the element?
As far as I know, there is no function to do this directly with the Spark SQL API. However, you can use a UDF instead, as follows (I'm assuming the input dataframe is called df):
import org.apache.spark.sql.functions.udf
import spark.implicits._   // for the $"..." column syntax

val getIndex = udf((text: String, featuredText: Seq[String]) => {
  featuredText.indexOf(text)
})
val df2 = df.withColumn("index", getIndex($"text", $"featured_text"))
Which will give:
+----+-----------------+-----+
|text| featured_text|index|
+----+-----------------+-----+
| sun|[type, move, sun]| 2|
+----+-----------------+-----+
In the case where the value is not present, the index column will contain -1.
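If you are on Spark 2.4 or later, the built-in array_position function can do this without a UDF. Note it returns a 1-based position and 0 when the element is missing, so subtract 1 if you want the same 0-based / -1 convention as above (a sketch, using the same assumed df):
import org.apache.spark.sql.functions.{array_position, col}

// array_position is 1-based and returns 0 when the value is absent,
// so subtracting 1 reproduces the 0-based index / -1 convention.
val df2 = df.withColumn("index",
  array_position(col("featured_text"), col("text")) - 1)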

How to update multiple columns of Dataframe from given set of maps in Scala?

I have the below dataframe:
val df=Seq(("manuj","kumar","CEO","Info"),("Alice","Beb","Miniger","gogle"),("Ram","Kumar","Developer","Info Delhi")).toDF("fname","lname","designation","company")
which looks like:
+-----+-----+-----------+----------+
|fname|lname|designation| company|
+-----+-----+-----------+----------+
|manuj|kumar| CEO| Info|
|Alice| Beb| Miniger| gogle|
| Ram|Kumar| Developer|Info Delhi|
+-----+-----+-----------+----------+
Below are the given maps for the individual columns:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
I also have a list of columns which need to be updated, so my requirement is to update all the columns of the dataframe (df) that are in the given list of columns, using the given maps.
val colList=Iterator("fname","lname","designation","company")
The output must be like:
+-----+-----+-----------+--------+
|fname|lname|designation| company|
+-----+-----+-----------+--------+
|Manoj|kumar| CEO|Info Ltd|
|Alice| Bob| Manager| Google|
| Ram|Kumar| Developer|Info Ltd|
+-----+-----+-----------+--------+
Edit: The dataframe may have around 1200 columns, and colList will have fewer than 1200 column names, so I need to iterate over colList and update the value of each corresponding column from its corresponding map.
Since DataFrames are immutable, in this example the dataframe can be processed progressively, column by column: create a new DataFrame containing an intermediate column with the replaced values, rename this column to the initial name, and finally overwrite the original DataFrame.
To achieve all this, several steps will be necessary.
First, we'll need a udf that returns a replacement value if it occurs in the provided map:
def replaceValueIfMapped(mappedValues: Map[String, String]) = udf((cellValue: String) =>
  mappedValues.getOrElse(cellValue, cellValue)
)
Second, we'll need a generic function that expects a DataFrame, a column name and its replacements map. This function produces a dataframe with a temporary column, containing replaced values, drops the original column, renames the temporary one to the original name and finally returns the produced DataFrame:
def replaceColumnValues(toReplaceDf: DataFrame, column: String, mappedValues: Map[String, String]): DataFrame = {
  val replacedColumn = column + "_replaced"
  toReplaceDf.withColumn(replacedColumn, replaceValueIfMapped(mappedValues)(col(column)))
    .drop(column)
    .withColumnRenamed(replacedColumn, column)
}
Third, instead of having an Iterator on column names for replacements, we'll use a Map, where each column name is associated with a replacements map:
val colsToReplace = Map(
  "fname"       -> fnameMap,
  "lname"       -> lnameMap,
  "designation" -> designationMap,
  "company"     -> companyMap
)
Finally, we can call foldLeft on this map in order to execute all the replacements:
val replacedDf = colsToReplace.foldLeft(sourceDf) { case (alreadyReplaced, toReplace) =>
  replaceColumnValues(alreadyReplaced, toReplace._1, toReplace._2)
}
replacedDf now contains the expected result.
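If, per the edit in the question, you start from colList and keep the per-column maps in a single map of maps (allMaps below is an assumed name with the same shape as in the other answer), you could derive colsToReplace from the list instead of writing it out by hand; a sketch:
// Assumed: allMaps maps every column name in colList to its replacement map.
val allMaps: Map[String, Map[String, String]] = Map(
  "fname"       -> fnameMap,
  "lname"       -> lnameMap,
  "designation" -> designationMap,
  "company"     -> companyMap
)

// colList is an Iterator in the question, so it can be consumed once here.
val colsToReplace: Map[String, Map[String, String]] =
  colList.map(c => c -> allMaps(c)).toMap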
To make the lookup dynamic at this level, you'll probably need to change the way you map your values to make them dynamically searchable. I would make a map of maps, with the keys being the names of the columns, as expected to be passed in:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
val allMaps = Map(
  "fname"       -> fnameMap,
  "lname"       -> lnameMap,
  "designation" -> designationMap,
  "company"     -> companyMap
)
This may make sense as the maps are relatively small, but you may need to consider using broadcast variables.
You can then dynamically look up based on field names.
[If my Scala looks rough, that's because it is, so here's a Java version for you to translate.]
List<String> allColumns = Arrays.asList(dataFrame.columns());

df.map((MapFunction<Row, Row>) row ->
        // this rewrites the whole row (that's a warning)
        RowFactory.create(
            allColumns.stream()
                .map(dfColumn -> {
                    if (!colList.contains(dfColumn)) {
                        // column not requested for mapping, use old value
                        return row.get(allColumns.indexOf(dfColumn));
                    } else {
                        Object colValue = row.get(allColumns.indexOf(dfColumn));
                        // in case of [2], you'd have to call:
                        // row.get(colListToDFIndex.get(dfColumn))
                        // modified value; assuming strings, you may need to cast
                        return allMaps.get(dfColumn)
                                      .getOrDefault(colValue, (String) colValue);
                    }
                })
                .collect(Collectors.toList())
                .toArray()),
    // Dataset.map in Java needs an explicit Row encoder
    RowEncoder.apply(df.schema()));
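For completeness, here is a rough Scala sketch of the same single-pass idea (untested against the original data; it assumes colList has been materialised into a Set and allMaps is the map of maps above), going through the RDD API so that no explicit row encoder is needed:
import org.apache.spark.sql.Row

val allColumns = df.columns
val colSet     = colList.toSet   // assumed: the requested column names as a Set

val remappedRdd = df.rdd.map { row =>
  val values = allColumns.indices.map { idx =>
    val name     = allColumns(idx)
    val oldValue = row.get(idx)
    if (!colSet.contains(name)) oldValue                             // column not requested, keep old value
    else allMaps(name).getOrElse(String.valueOf(oldValue), oldValue) // look up replacement, fall back to old value
  }
  Row.fromSeq(values)
}

val remappedDf = spark.createDataFrame(remappedRdd, df.schema)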

Converting a Dataframe to a scala Mutable map doesn't produce equal number of records

I am new to Scala/Spark. I am working on a Scala/Spark application that selects a couple of columns from a Hive table and then converts them into a mutable Map, with the first column being the keys and the second column being the values. For example:
+--------+--+
| c1 |c2|
+--------+--+
|Newyork |1 |
| LA |0 |
|Chicago |1 |
+--------+--+
will be converted to scala.collection.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the Dataframes to collections and perform some comparisons. I am also picking up data from other tables, but they are just single columns, and I have no problem converting them to ArrayBuffer[String] - the counts match.
I don't understand why I am having a problem with testMap. Generally, the number of rows in the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records from the DF into the Map?
Any help would be appreciated. Thank you.
I believe the mismatch in counts is caused by the elimination of duplicate keys (i.e. city names) in the Map. By design, Map maintains unique keys, so when a key is inserted more than once only the last value is kept. For example:
val testDF = Seq(
  ("Newyork", 1),
  ("LA", 0),
  ("Chicago", 1),
  ("Newyork", 99)
).toDF("city", "value")

val testMap = scala.collection.mutable.Map(
  testDF.rdd.map(r => (r(0).toString, r(1).toString)).
    collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
//   Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field in your Map key to make it unique. Depending on your data processing needs, you can also aggregate the data into a Map-like dataframe via groupBy, like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the total of valueCount should match the original row count.
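To see which keys are being collapsed, a quick diagnostic (a sketch, using the same testDF) is to count rows per key and keep only the keys that occur more than once:
// Keys that appear more than once are the ones collapsed when building the Map.
testDF.groupBy("city")
  .count()
  .filter("count > 1")
  .show()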
If you add entries with a duplicate key to your map, the duplicates are automatically removed (only the last value per key is kept). So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)

How map Key/Value pairs between two separate RDDs?

Still a beginner in Scala and Spark, I think I'm just being brainless here. I have two RDDs, one of the type :-
((String, String), Int) = ((" v67430612_serv78i"," fb_201906266952256"),1)
Other of the type :-
(String, String, String) = (r316079113_serv60i,fb_100007609418328,-795000)
As can be seen, the first two columns of the two RDDs have the same format. Basically they are IDs: one is 'tid' and the other is 'uid'.
The question is this :
Is there a method by which I can compare the two RDDs in such a manner that the tid and uid are matched in both and all the data for the same matching ids is displayed in a single row without any repetitions?
Eg : If I get a match of tid and uid between the two RDDs
((String, String), Int) = ((" v67430612_serv78i"," fb_201906266952256"),1)
(String, String, String) = (" v67430612_serv78i"," fb_201906266952256",-795000)
Then the output is:-
((" v67430612_serv78i"," fb_201906266952256",-795000),1)
The IDs in the two RDDs are not in any fixed order. They are random, i.e. the same uid and tid may not appear at the same position in both RDDs.
Also, how will the solution change if the first RDD type stays the same but the second RDD changes to type :-
((String, String, String), Int) = ((daily_reward_android_5.76,fb_193055751144610,81000),1)
I have to do this without the use of Spark SQL.
I would suggest you convert your RDDs to dataframes and apply a join, for ease of use.
Your first dataframe should be
+------------------+-------------------+-----+
|tid |uid |count|
+------------------+-------------------+-----+
| v67430612_serv78i| fb_201906266952256|1 |
+------------------+-------------------+-----+
The second dataframe should be
+------------------+-------------------+-------+
|tid |uid |amount |
+------------------+-------------------+-------+
| v67430612_serv78i| fb_201906266952256|-795000|
+------------------+-------------------+-------+
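One way to build those two dataframes from the RDDs in the question (a sketch; the column names are assumptions, and trimming removes the leading spaces in the ids):
import spark.implicits._   // for toDF on RDDs of tuples

val df1 = rdd1.map { case ((tid, uid), count) => (tid.trim, uid.trim, count) }
              .toDF("tid", "uid", "count")

val df2 = rdd2.map { case (tid, uid, amount) => (tid.trim, uid.trim, amount) }
              .toDF("tid", "uid", "amount")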
Then getting the final output is just an inner join:
df2.join(df1, Seq("tid", "uid"))
which will give output as
+------------------+-------------------+-------+-----+
|tid |uid |amount |count|
+------------------+-------------------+-------+-----+
| v67430612_serv78i| fb_201906266952256|-795000|1 |
+------------------+-------------------+-------+-----+
Edited
If you want to do it without dataframes/Spark SQL, there is a join in the RDD API too, but you will have to reshape the RDDs as below:
rdd2.map(x => ((x._1, x._2), x._3))
  .join(rdd1)
  .map(y => ((y._1._1, y._1._2, y._2._1), y._2._2))
This will work only if you have rdd1 and rdd2 as defined in your question as ((" v67430612_serv78i"," fb_201906266952256"),1) and (" v67430612_serv78i"," fb_201906266952256",-795000) respectively.
You should have the final output as:
(( v67430612_serv78i, fb_201906266952256,-795000),1)
Make sure that you trim the values to remove leading/trailing spaces. This ensures that both RDDs have the same key values when joining; otherwise you might get an empty result.
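For example, a trimmed version of the RDD join above might look like this (a sketch using the same tuple shapes as in the question):
// Trim the ids before keying so that values like " v67430612_serv78i" still match.
val joined = rdd2
  .map { case (tid, uid, amount) => ((tid.trim, uid.trim), amount) }
  .join(rdd1.map { case ((tid, uid), count) => ((tid.trim, uid.trim), count) })
  .map { case ((tid, uid), (amount, count)) => ((tid, uid, amount), count) }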