How to update multiple columns of Dataframe from given set of maps in Scala? - scala

I have the dataframe below
val df=Seq(("manuj","kumar","CEO","Info"),("Alice","Beb","Miniger","gogle"),("Ram","Kumar","Developer","Info Delhi")).toDF("fname","lname","designation","company")
or
+-----+-----+-----------+----------+
|fname|lname|designation|   company|
+-----+-----+-----------+----------+
|manuj|kumar|        CEO|      Info|
|Alice|  Beb|    Miniger|     gogle|
|  Ram|Kumar|  Developer|Info Delhi|
+-----+-----+-----------+----------+
Below are the given maps for the individual columns:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
I also have a list of columns that need to be updated, so my requirement is to update all the columns of the dataframe (df) that appear in the given list of columns, using the given maps.
val colList=Iterator("fname","lname","designation","company")
Output must be like
+-----+-----+-----------+--------+
|fname|lname|designation| company|
+-----+-----+-----------+--------+
|Manoj|kumar|        CEO|Info Ltd|
|Alice|  Bob|    Manager|  Google|
|  Ram|Kumar|  Developer|Info Ltd|
+-----+-----+-----------+--------+
Edit: The dataframe may have around 1200 columns and colList will have fewer than 1200 column names, so I need to iterate over colList and update the value of each corresponding column from its corresponding map.

Since DataFrames are immutable, in this example the replacement can be done progressively, column by column: for each column we create a new DataFrame containing an intermediate column with the replaced values, rename that column back to the original name, and carry the result forward as the new DataFrame.
To achieve all this, several steps are necessary.
First, we'll need a udf that returns the replacement value if the cell value occurs in the provided map:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def replaceValueIfMapped(mappedValues: Map[String, String]) = udf((cellValue: String) =>
  mappedValues.getOrElse(cellValue, cellValue)
)
Second, we'll need a generic function that takes a DataFrame, a column name and the replacements map for that column. This function produces a dataframe with a temporary column containing the replaced values, drops the original column, renames the temporary one to the original name and returns the resulting DataFrame:
def replaceColumnValues(toReplaceDf: DataFrame, column: String, mappedValues: Map[String, String]): DataFrame = {
  val replacedColumn = column + "_replaced"
  toReplaceDf.withColumn(replacedColumn, replaceValueIfMapped(mappedValues)(col(column)))
    .drop(column)
    .withColumnRenamed(replacedColumn, column)
}
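For example (just a quick usage sketch, not part of the original answer), applying it to a single column of the question's df would look like this:
val companyFixed = replaceColumnValues(df, "company", companyMap)
companyFixed.show()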
Third, instead of having an Iterator on column names for replacements, we'll use a Map, where each column name is associated with a replacements map:
val colsToReplace = Map(
  "fname" -> fnameMap,
  "lname" -> lnameMap,
  "designation" -> designationMap,
  "company" -> companyMap)
Finally, we can call foldLeft on this map in order to execute all the replacements:
val replacedDf = colsToReplace.foldLeft(df) { case (alreadyReplaced, (column, mappings)) =>
  replaceColumnValues(alreadyReplaced, column, mappings)
}
replacedDf now contains the expected result.
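As a side note (a sketch, not part of the original answer): since withColumn overwrites an existing column when given the same name, the drop/rename step can be skipped entirely:
val replacedDf2 = colsToReplace.foldLeft(df) { case (acc, (column, mappings)) =>
  // overwrite the column in place with its replaced values
  acc.withColumn(column, replaceValueIfMapped(mappings)(col(column)))
}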

To make the lookup dynamic at this level, you'll probably need to change the way you organize your maps so they can be searched dynamically. I would build a map of maps, with the keys being the column names that are expected to be passed in:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
val allMaps = Map(
  "fname" -> fnameMap,
  "lname" -> lnameMap,
  "designation" -> designationMap,
  "company" -> companyMap)
This may make sense while the maps are relatively small, but for bigger maps you may need to consider using broadcast variables (a rough sketch follows below).
You can then dynamically look up based on field names.
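Here is a rough sketch of the broadcast-variable idea (an assumption on my part, not the original answer's code; it assumes a SparkSession named spark is in scope): the maps are shipped to the executors once and looked up inside a udf.
import org.apache.spark.sql.functions.{col, udf}

// ship the lookup maps to the executors once instead of serializing them per task
val allMapsBc = spark.sparkContext.broadcast(allMaps)

def replaceFromBroadcast(columnName: String) = udf { (cellValue: String) =>
  // fall back to the original value when the column or the value has no mapping
  allMapsBc.value.getOrElse(columnName, Map.empty[String, String]).getOrElse(cellValue, cellValue)
}

// example usage on a single column:
val companyFixed = df.withColumn("company", replaceFromBroadcast("company")(col("company")))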
(If you've noticed that my Scala is bad, it's because it is, so here's a Java version for you to translate.)
List<String> allColumns = Arrays.asList(df.columns());
// note: in the Java API, Dataset.map also expects an Encoder for the returned rows
df.map(row ->
    // this rewrites the row (that's a warning)
    RowFactory.create(
        allColumns.stream()
            .map(dfColumn -> {
                if (!colList.contains(dfColumn)) {
                    // column not requested for mapping, use old value
                    return row.get(allColumns.indexOf(dfColumn));
                } else {
                    Object colValue = row.get(allColumns.indexOf(dfColumn));
                    // in case of [2], you'd have to call:
                    // row.get(colListToDFIndex.get(dfColumn))
                    // modified value
                    return allMaps.get(dfColumn)
                        // assuming strings, you may need to cast
                        .getOrDefault(colValue, colValue);
                }
            })
            .collect(Collectors.toList())
            .toArray()
    )
);
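If the goal is simply to make the lookup dynamic over colList in Scala, a column-level sketch is also possible (this reuses the allMaps and replaceValueIfMapped definitions shown earlier, and folds over columns rather than rewriting rows as the Java version does):
val updatedDf = colList.foldLeft(df) { (acc, columnName) =>
  allMaps.get(columnName) match {
    case Some(mappings) => acc.withColumn(columnName, replaceValueIfMapped(mappings)(col(columnName)))
    case None           => acc // column has no replacement map, leave it untouched
  }
}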

Related

How to map three elements of different types in Spark Shell?

After creating an RDD from a text file, I need to use .map to create a new RDD of type [Int, String, String], with each element split by a comma delimiter. I don't understand how to define an RDD with three different data types per record.
So far I have:
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map(i => i.split(","))
If I understand your question correctly, you are reading a text file to create an RDD[String], where each string is a record (line) in the file. However, these records contain an integer value, followed by two string values, with a comma delimiter. (For example, a record might be something like "5,string1,string2".)
An RDD can indeed only have a single type of record. It seems that you want to obtain a type that is a RDD[(Int, String, String)]—where the type of the RDD is a tuple of an Int, a String, and a String. (This is a shorthand for RDD[Tuple3[Int, String, String]], incidentally. If you're unfamiliar with Scala tuples, this link might help.)
Is that correct?
If so, map is an appropriate operation. However, the .split operation will return an Array[String], so the following will result in an RDD[Array[String]] as the type of abc2.
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map(_.split(","))
BTW, the use of the underscore, _, is a shorthand for the following:
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map(s => s.split(","))
In order to get the type you require, you should use an expression something like the following:
val abc1 = sc.textFile("hi.txt")
val abc2 = abc1.map { s =>
  // Split the string into tokens, delimited by a comma; the result is an Array[String].
  val a = s.split(",")
  // Create a tuple of the expected values, converting the first value to an integer.
  (a(0).toInt, a(1), a(2))
}
Note that this assumes you always have three elements, and that the first is an integer. You will get errors if this is not the case (and you may want to add more error handling).
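If you do want a bit of error handling, one possible sketch (my own assumption about what "handling" should mean, here simply skipping malformed lines) is:
import scala.util.Try

val abc2 = abc1.flatMap { s =>
  s.split(",") match {
    // keep only lines that have exactly three fields and whose first field parses as an Int
    case Array(a, b, c) => Try(a.toInt).toOption.map(i => (i, b, c))
    case _              => None
  }
}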

Converting a Dataframe to a scala Mutable map doesn't produce equal number of records

I am new to Scala/spark. I am working on Scala/Spark application that selects a couple of columns from a hive table and then converts it into a Mutable map with the first column being the keys and second column being the values. For example:
+--------+--+
|      c1|c2|
+--------+--+
| Newyork| 1|
|      LA| 0|
| Chicago| 1|
+--------+--+
will be converted to scala.collection.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the Dataframes to collections and perform some comparisons. I am also picking up data from other tables, but they are just single columns, and I have no problem converting them to ArrayBuffer[String] - the counts match.
I don't understand why I am having a problem with the testMap. Generally, the number of rows in the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records from the DF into the Map?
Any help would be appreciated. Thank you.
I believe the mismatch in counts is caused by the elimination of duplicate keys (i.e. city names) in the Map. By design, a Map keeps its keys unique by discarding duplicates. For example:
val testDF = Seq(
  ("Newyork", 1),
  ("LA", 0),
  ("Chicago", 1),
  ("Newyork", 99)
).toDF("city", "value")

val testMap = scala.collection.mutable.Map(
  testDF.rdd.map(r => (r(0).toString, r(1).toString)).
    collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
// Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field in your Map key to make it unique. Depending on your data processing needs, you can also aggregate data into a Map-like dataframe via groupBy, like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the total of valueCount should match the original row count.
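For the first suggestion (keeping every value rather than letting duplicate keys collapse), a minimal sketch could group the collected pairs per city, reusing the example testDF above:
val valuesByCity: Map[String, Seq[String]] =
  testDF.rdd
    .map(r => (r(0).toString, r(1).toString))
    .collect()
    .groupBy(_._1)
    .map { case (city, pairs) => city -> pairs.map(_._2).toSeq }
// "Newyork" now keeps both of its values instead of only the last one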
If you add entries with duplicate keys to your map, the duplicates are automatically removed. So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)

How to filter data based on datatype?

Given data like this:
val my_data = sc.parallelize(Array(
"Key1, foobar, 10, twenty, 20",
"Key2, impt, 11, sixty, 6",
"Key3, helloworld, 110, seventy, 9"))
I would like to filter and create a key,value RDD like below:
key1, foobar
key1, twenty
key2, impt
key2, sixty
key3, helloworld
key3, seventy
What I've tried
I figured that I can just put the data in a table and let data types be inferred.
//is there a way to avoid writing to file???
my_data.coalesce(1).saveAsTextFile("/tmp/mydata.csv")
val df_mydata = sqlContext.read
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.load("/tmp/mydata.csv")
The above works, in that I get a table with the correct data types. However, I don't know how to filter on the data types and then create key/value pairs from it.
I could also use Character.isDigit instead of creating a schema, but I would still need to know how to filter the key/value pairs.
One way to solve it would be, as you mentioned, to use Character.isDigit together with a split and flatMap. Using your my_data as example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = my_data
  .map(_.split(",").map(_.trim).toList.filterNot(s => s.forall(_.isDigit)))
  .flatMap { case (key: String) :: tail => tail.map(t => (key, t)) }
  .toDF("Key", "Value")

df.show()
Which will give you something like this:
+----+----------+
| Key| Value|
+----+----------+
|Key1| foobar|
|Key1| twenty|
|Key2| impt|
|Key2| sixty|
|Key3|helloworld|
|Key3| seventy|
+----+----------+
Here I also converted it into a dataframe, but if you want an RDD you can simply skip that step. For this to work, each row must contain a key, and that key must be at the first position in the string.
Hope it helps!
EDIT:
Breakdown of the commands used.
The first map goes through each string in your rdd, for each string the following is applied (in order):
.split(",")
.map(_.trim)
.toList
.filterNot(s => s.forall(_.isDigit))
Let's use your first row as an example: "Key1, foobar, 10, twenty, 20". First the row is split by ",", which gives you an array of strings: Array("Key1", " foobar", " 10", " twenty", " 20").
Next, map(_.trim) trims (removes the whitespace before and after the words) each of the elements in the array, and the array is then converted into a list (for the case match in the flatMap later on): List("Key1", "foobar", "10", "twenty", "20").
The filterNot removes all strings in which every character is a digit; forall checks that each character fulfills this condition. This removes some elements from the list: List("Key1", "foobar", "twenty").
Now, with the filtering done only the grouping after key remains:
flatMap{case ((key: String)::tail) => tail.map(t => (key, t))}
Here the key becomes the first element of each row's list; following the example row from before, it becomes "Key1". tail is simply the rest of the list. Then, for each element that is not the key, we replace it with a tuple (key, value). In other words, each element (except the first, which is the key) becomes a tuple containing the key and itself. flatMap is used here because a plain map would give you an RDD of lists of tuples rather than an RDD of the tuples themselves.
The last step converts it into a dataframe with named columns using toDF("Key", "Value"); note that this requires the import used at the beginning (import spark.implicits._).
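To make the breakdown concrete, here is a small illustrative trace (plain Scala, not Spark, and not part of the original answer) of the first example row:
val row = "Key1, foobar, 10, twenty, 20"

val tokens = row.split(",").map(_.trim).toList.filterNot(s => s.forall(_.isDigit))
// tokens: List(Key1, foobar, twenty)

val pairs = tokens match {
  case key :: tail => tail.map(t => (key, t))
  case Nil         => Nil
}
// pairs: List((Key1,foobar), (Key1,twenty))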

Passing a list of tuples as a parameter to a spark udf in scala

I am trying to pass a list of tuples to a udf in scala. I am not sure how to exactly define the datatype for this. I tried to pass it as a whole row but it can't really resolve it. I need to sort the list based on the first element of the tuple and then send n number of elements back. I have tried the following definitions for the udf
def udfFilterPath = udf((id: Long, idList: Array[structType[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Array[Tuple2[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Row)
This is what the idList looks like:
[[1234,"Tony"], [2345, "Angela"]]
[[1234,"Tony"], [234545, "Ruby"], [353445, "Ria"]]
This is a dataframe with a 100 rows like the above. I call the udf as follows:
testSet.select("id", "idList").withColumn("result", udfFilterPath($"id", $"idList")).show
When I print the schema for the dataframe it reads it as an array of structs. The idList itself is generated by doing a collect list over a column of tuples grouped by a key and stored in the dataframe. Any ideas on what I am doing wrong? Thanks!
When defining a UDF, you should use plain Scala types (e.g. Tuples, Primitives...) and not the Spark SQL types (e.g. StructType) as the output types.
As for the input types - this is where it gets tricky (and not too well documented) - an array of tuples would actually be a mutable.WrappedArray[Row]. So - you'll have to "convert" each row into a tuple first, then you can do the sorting and return the result.
Lastly, by your description it seems that the id column isn't used at all, so I removed it from the UDF definition, but it can easily be added back (see the sketch after the output below).
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val udfFilterPath = udf { idList: mutable.WrappedArray[Row] =>
  // convert the array items into tuples, sort by the first item and return the first two tuples:
  idList.map(r => (r.getAs[Long](0), r.getAs[String](1))).sortBy(_._1).take(2)
}
df.withColumn("result", udfFilterPath($"idList")).show(false)
+------+-------------------------------------------+----------------------------+
|id |idList |result |
+------+-------------------------------------------+----------------------------+
|1234 |[[1234,Tony], [2345,Angela]] |[[1234,Tony], [2345,Angela]]|
|234545|[[1234,Tony], [2345454,Ruby], [353445,Ria]]|[[1234,Tony], [353445,Ria]] |
+------+-------------------------------------------+----------------------------+
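If the id column is needed after all, it can be added back to the UDF signature; a sketch (here the id is simply ignored inside the body, but it could be used for filtering or validation):
val udfFilterPathWithId = udf { (id: Long, idList: mutable.WrappedArray[Row]) =>
  idList.map(r => (r.getAs[Long](0), r.getAs[String](1))).sortBy(_._1).take(2)
}

df.withColumn("result", udfFilterPathWithId($"id", $"idList")).show(false)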

Spark dataframe get column value into a string variable

I am trying to extract a column value into a variable so that I can use the value somewhere else in the code. I am trying the following:
val name= test.filter(test("id").equalTo("200")).select("name").col("name")
It returns
name org.apache.spark.sql.Column = name
how to get the value?
The col("name") gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name"):
val names = test.filter(test("id").equalTo("200"))
  .select("name")
  .collectAsList() // returns a java.util.List[Row]
Then, for a given row, you can get the name as a String with:
val name = row.getString(0)
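Putting the two pieces together, a short sketch (assuming there is at least one matching row; collectAsList returns a java.util.List[Row]):
val rows = test.filter(test("id").equalTo("200")).select("name").collectAsList()
val name = rows.get(0).getString(0) // first matching row, first (and only) selected column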
val maxDate = spark.sql("select max(export_time) as export_time from tier1_spend.cost_gcp_raw").first()
val rowValue = maxDate.get(0)
With this snippet, you can extract all the values in a column into a string.
Modify the snippet with a where clause to get your desired value.
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
val col_val_df = df.select($"A").collect()
val col_val_str = col_val_df.map(x => x.get(0)).mkString(",")
/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_df: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/
The values of the entire column are stored in col_val_str:
col_val_str: String = 5,10
Let us assume you need to pick the name from the below table for a particular Id and store that value in a variable.
+-----+-------+
| id  | name  |
+-----+-------+
| 100 | Alex  |
| 200 | Bidan |
| 300 | Cary  |
+-----+-------+
SCALA
-----------
Irrelevant data is filtered out first, then the name column is selected and finally stored into the name variable.
var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")
PYTHON (PYSPARK)
-----------------------------
For simpler usage, I have created a function that returns the value when you pass it the dataframe and the desired column name (this is a Spark DataFrame, not a Pandas DataFrame). A filter is applied to the dataframe before passing it to this function, so that only the desired record remains.
def GetValueFromDataframe(_df, columnName):
    for row in _df.rdd.collect():
        return row[columnName].strip()

name = GetValueFromDataframe(df.filter(df.id == "100"), "name")
There might be a simpler approach than this in the 3.x versions of Python; the code shown above was tested with version 2.7.
Note:
You are likely to run into an out-of-memory error (driver memory) because of the collect function, so it is always recommended to apply transformations (like filter, where, etc.) before calling collect. If you still hit a driver out-of-memory issue, you can pass --conf spark.driver.maxResultSize=0 as a command-line argument to remove the limit on the size of collected results.
For anyone interested, below is a way to turn a column into an Array; in the case below we just take the first value.
val names= test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x=>x.mkString).collect
val name = names(0)
Here s is the string that accumulates the column values.
.collect() brings the rows of the DataFrame back to the driver as an Array[Row]; temp is that array.
x(n-1) retrieves the n-th column value of row x. It is of type Any by default, so it needs to be cast to String before it can be appended to the existing string.
var s = ""
// say the n-th column is the target column
val temp = test.collect() // collects the rows into an Array[Row]
temp.foreach { x =>
  s += x(n-1).asInstanceOf[String]
}
println(s)