I have a dataframe like below
+---+------------+----------------------------------------------------------------------+
|id |indexes |arrayString |
+---+------------+----------------------------------------------------------------------+
|2 |1,3 |[WrappedArray(3, Str3), WrappedArray(1, Str1)] |
|1 |2,4,3 |[WrappedArray(2, Str2), WrappedArray(3, Str3), WrappedArray(4, Str4)] |
|0 |1,2,3 |[WrappedArray(1, Str1), WrappedArray(2, Str2), WrappedArray(3, Str3)] |
+---+------------+----------------------------------------------------------------------+
I want to loop through arrayString, take the first element of each inner array as an index and the second element as a string, and then replace each value in indexes with the string corresponding to that index in arrayString. I want an output like below.
+---+---------------+
|id |replacedString |
+---+---------------+
|2 |Str1,Str3 |
|1 |Str2,Str4,Str3 |
|0 |Str1,Str2,Str3 |
+---+---------------+
I tried using the below udf function.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val itemIndexArray = itemIndex.split("\\,")
  arrayString.map(i => {
    itemIndexArray.updated(i(0).toInt, i(1))
  })
  itemIndexArray
})
This is giving me an error and I am not getting my desired output. Is there any other way to achieve this? I can't use explode and join because I want the indexes replaced with strings without losing the order.
You can create a UDF as below to get the required result: convert the array of arrays to a map and look up each index as a key in that map.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val indexList = itemIndex.split("\\,")
  // Build a lookup from index -> string out of the array of arrays
  val lookup = arrayString.map(x => x(0) -> x(1)).toMap
  // Replace every index with its string and join back with commas
  indexList map lookup mkString ","
})

dataframe.withColumn("arrayString", replaceIndex($"indexes", $"arrayString"))
  .show(false)
Output:
+---+-------+--------------+
|id |indexes|arrayString |
+---+-------+--------------+
|2 |1,3 |Str1,Str3 |
|1 |2,4,3 |Str2,Str4,Str3|
|0 |1,2,3 |Str1,Str2,Str3|
+---+-------+--------------+
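One thing to watch: if an index in indexes has no matching entry in arrayString, the map lookup above will throw an exception. As a small defensive variant (my own sketch, assuming you would rather keep the raw index than fail), you could use getOrElse:
val replaceIndexSafe = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val lookup = arrayString.map(x => x(0) -> x(1)).toMap
  // Assumption: fall back to the raw index when it has no mapping
  itemIndex.split(",").map(i => lookup.getOrElse(i, i)).mkString(",")
})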
Hope this helps!
How to implement complex transformations in a configurable manner?
I receive data in files (say CSV, Avro, etc.) whose structure will remain the same, and from these I build a dataframe.
Now I need to write different functions, each with different transformation logic, using Spark Scala, to be applied on the dataframe.
Based on a parameter we pass via a config file, the corresponding function should get executed with the required transformation.
In other words, the parameter passed through configuration picks the respective function.
Any input on how to implement this, please?
This can be done using the transform method on DataFrame in Spark.
Start by defining the functions that perform the transformations.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
// Assumes a SparkSession named spark; its implicits are needed for .toDF below
// (already in scope if you are in spark-shell).
import spark.implicits._

def addOne(df: DataFrame) = {
  df.withColumn("plusOne", df("col") + 1)
}

def addTwo(df: DataFrame) = {
  df.withColumn("plusTwo", df("col") + 2)
}
Then define a test dataframe
val test = (1 to 10).toDF("col")
test.show(3, false)
/* outputs:
+---+
|col|
+---+
|1 |
|2 |
|3 |
+---+
*/
Then use the `transform` function to apply the actual transformation based on the parameter from your config.
val parameter = 1
val result1 = parameter match {
case 1 => test.transform(addOne)
case 2 => test.transform(addTwo)
}
result1.show(3, false)
/*
+---+-------+
|col|plusOne|
+---+-------+
|1 |2 |
|2 |3 |
|3 |4 |
+---+-------+
*/
If the parameter has a different value, you can see below how it would behave.
val parameter = 2
// below code can be extracted into a function
val result2 = parameter match {
case 1 => test.transform(addOne)
case 2 => test.transform(addTwo)
}
result2.show(3, false)
/*
Outputs:
+---+-------+
|col|plusTwo|
+---+-------+
|1 |3 |
|2 |4 |
|3 |5 |
+---+-------+
*/
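To tie this back to the config-file requirement, here is a minimal sketch of picking the transformation by name instead of through a hard-coded match. The key names and the plain Map dispatch are my assumptions; the name would really come from your configuration file.
val transformations: Map[String, DataFrame => DataFrame] = Map(
  "addOne" -> addOne,
  "addTwo" -> addTwo
)
// e.g. read from your config file / properties instead of hard-coding it
val transformationName = "addOne"
val result = test.transform(transformations(transformationName))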
Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2")
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method1", col("num1") / col("num2"))
}

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method2", col("num1") * col("num2"))
}

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method3", col("num1") + col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is a more efficient way to do this?
An efficient way to do this is using select.
select is faster than foldLeft if you have very large data.
You can build the required expressions and use them inside select; check the code below.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         ($"num1" / $"num2").as("method1"),
         ($"num1" * $"num2").as("method2"),
         ($"num1" + $"num2").as("method3")
       )
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Try using higher-order functions; all three of your functions can be replaced with the single function below.
scala> def add(
         num1: Column, // maybe use variable args here instead, if you want
         num2: Column,
         f: (Column, Column) => Column
       ): Column = f(num1, num2)
For example, with varargs; when invoking this method you pass the required columns at the end.
def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
Invoking the add function:
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         add($"num1", $"num2", (_ / _)).as("method1"),
         add($"num1", $"num2", (_ * _)).as("method2"),
         add($"num1", $"num2", (_ + _)).as("method3")
       )
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
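For completeness, a small sketch of invoking the varargs variant of add (this invocation is my illustration, not part of the original answer, and assumes only the varargs definition is in scope):
scala> val colExprVarargs = Seq(
         $"num1",
         $"num2",
         add((_ / _), $"num1", $"num2").as("method1"),
         add((_ * _), $"num1", $"num2").as("method2"),
         add((_ + _), $"num1", $"num2").as("method3")
       )
scala> df.select(colExprVarargs:_*).show(false)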
I have the following dataframe where certain columns like version and dataSetName are supposedly constant. I am trying to get these constants into variables (version is of type Float and dataSetName is a String).
+---+-------+-----------+
|id |version|dataSetName|
+---+-------+-----------+
|1  |1.0    |employee   |
|2  |1.0    |employee   |
|3  |1.0    |employee   |
|4  |1.0    |employee   |
+---+-------+-----------+
Using the following approach gives me a Row:
val dataSetName = df.select("dataSetName").distinct.collect()(0)
What's the best way to get dataSetName and version into String and Float variables respectively?
Check the code below.
version
df
.select("version")
.distinct.map(_.getAs[Double](0))
.collect
.head
dataSetName
df
.select("dataSetName")
.distinct
.map(_.getAs[String](0))
.collect
.head
version & dataSetName
df
.select("version","dataSetName")
.distinct
.map(c => (c.getAs[Double](0),c.getAs[String](1)))
.collect
.head
(Double, String) = (1.0,employee) // Output
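Alternatively, a small sketch that stays closer to your collect-based approach and pulls typed fields out of the first Row; note that Spark hands version back as a Double here (as above), so the .toFloat cast is my addition for when you really need a Float:
val row = df.select("version", "dataSetName").distinct.head
val version: Float = row.getDouble(0).toFloat
val dataSetName: String = row.getString(1)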
I need to modify the values of every column of a dataframe so that they are all enclosed within double quotes after mapping, but the dataframe should still retain its original structure with the headers.
I tried mapping the values by changing the rows to sequences, but that loses the headers in the output dataframe.
With this read in as input dataframe:
+------+-------+----+
|prodid|name   |city|
+------+-------+----+
|1     |Harshit|VNS |
|2     |Mohit  |BLR |
|2     |Mohit  |RAO |
|2     |Mohit  |BTR |
|3     |Rohit  |BOM |
|4     |Shobhit|KLK |
+------+-------+----+
I tried the following code.
val columns = df.columns
df.map{ row =>
  row.toSeq.map{ col => "\"" + col + "\"" }
}.toDF(columns:_*)
But it throws an error stating there's only 1 column, i.e. value, in the mapped dataframe.
This is the actual result (if I remove .toDF(columns:_*)):
+--------------------+
|               value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|"1"   |"Harshit"|"VNS" |
|"2"   |"Mohit"  |"BLR" |
|"2"   |"Mohit"  |"RAO" |
|"2"   |"Mohit"  |"BTR" |
|"3"   |"Rohit"  |"BOM" |
|"4"   |"Shobhit"|"KLK" |
+------+---------+------+
Note: There are only 3 headers in this example, but my original data has a lot of headers, so manually typing each and every one of them is not an option in case the file header changes. How do I get this modified dataframe from that?
Edit: I need the quotes on all values except the integers, so the output should be something like:
+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|1     |"Harshit"|"VNS" |
|2     |"Mohit"  |"BLR" |
|2     |"Mohit"  |"RAO" |
|2     |"Mohit"  |"BTR" |
|3     |"Rohit"  |"BOM" |
|4     |"Shobhit"|"KLK" |
+------+---------+------+
Might be easier to use select instead:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")

df.select(df.schema.fields.map {
  case StructField(name, IntegerType, _, _) => col(name)
  case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
}:_*).show()
Output:
+------+---------+-----+
|prodid| name| city|
+------+---------+-----+
| 1|"Harshit"|"VNS"|
| 2| "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle those too, or alternatively only quote StringType columns.
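As a sketch of that alternative (my illustration of the suggestion above, using the same imports), quoting only the StringType columns and leaving every other type untouched:
df.select(df.schema.fields.map {
  case StructField(name, StringType, _, _) => format_string("\"%s\"", col(name)) as name
  case StructField(name, _, _, _) => col(name)
}:_*).show()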
I would like to write code that groups an iterator of rows, inputs: Iterator[InputRow], into unique items (by unit and eventName), i.e. eventTime should be the latest timestamp for each item in the new Iterator[T], where InputRow is defined as
case class InputRow(unit:Int, eventName: String, eventTime:java.sql.Timestamp, value: Int)
Example data before grouping:
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:12|1   |A        |2    |
|2018-06-02 16:05:13|2   |A        |2    |
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
After:
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
What is a good approach to writing the above code in Scala?
Good news: your question already contains the verbs that correspond to the functional calls to be used in the code: group by, sort by (latest timestamp).
To sort InputRow by latest timestamp, we'll need an implicit ordering:
implicit val rowSortByTimestamp: Ordering[InputRow] =
(r1: InputRow, r2: InputRow) => r1.eventTime.compareTo(r2.eventTime)
// or shorter:
// implicit val rowSortByTimestamp: Ordering[InputRow] =
// _.eventTime compareTo _.eventTime
And now, having
val input: Iterator[InputRow] = // input data
Let's group them by (unit, eventName)
val result = input.toSeq.groupBy(row => (row.unit, row.eventName))
then extract the one with the latest timestamp
.map { case (gr, rows) => rows.sorted.last }
and sort from earliest to latest
.toSeq.sorted
The result is
InputRow(2,B,2018-06-02 16:05:11.0,1)
InputRow(1,A,2018-06-02 16:05:14.0,3)
InputRow(2,A,2018-06-02 16:05:15.0,3)
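Putting those pieces together, a minimal sketch (assuming the implicit rowSortByTimestamp ordering above is in scope):
val result: Seq[InputRow] =
  input.toSeq
    .groupBy(row => (row.unit, row.eventName))
    .map { case (_, rows) => rows.sorted.last } // row with the latest eventTime per key
    .toSeq
    .sorted // earliest to latest overall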
You can use the struct built-in function to combine the eventTime and value columns into a single struct, so that when you group by unit and eventName and aggregate, max can be taken by eventTime (the latest); max on a struct compares field by field, so putting eventTime first makes it pick the latest event. This should give you your desired output:
import org.apache.spark.sql.functions._
df.withColumn("struct", struct("eventTime", "value"))
  .groupBy("unit", "eventName")
  .agg(max("struct").as("struct"))
  .select(col("struct.eventTime"), col("unit"), col("eventName"), col("struct.value"))
which gives
+-------------------+----+---------+-----+
|eventTime |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:14|1 |A |3 |
|2018-06-02 16:05:11|2 |B |1 |
|2018-06-02 16:05:15|2 |A |3 |
+-------------------+----+---------+-----+
You can accomplish that with a foldLeft and a map:
val grouped: Map[(Int, String), InputRow] =
  rows
    .foldLeft(Map.empty[(Int, String), Seq[InputRow]])({ case (acc, row) =>
      val key = (row.unit, row.eventName)
      // Get from the accumulator the Seq that already exists, or Nil if
      // this key has never been seen before
      val value = acc.getOrElse(key, Nil)
      // Update the accumulator
      acc + (key -> (value :+ row))
    })
    // Get the last element from the list of rows when grouped by unit and event.
    .map({ case (k, v) => k -> v.last })
This assumes that the eventTimes are already stored in sorted order. If this is not a safe assumption, you can define an implicit Ordering for java.sql.Timestamp and replace v.last with v.maxBy(_.eventTime).
Edit
Or use .groupBy(row => (row.unit, row.eventName)) instead of the foldLeft:
import java.sql.Timestamp
implicit val ordering: Ordering[Timestamp] = _ compareTo _
val grouped = rows.groupBy(row => (row.unit, row.eventName))
.values
.map(_.maxBy(_.eventTime))
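If you also want the final rows ordered by eventTime like the example output, you can sort the grouped values afterwards; this last line is my addition and relies on the same implicit Ordering[Timestamp]:
val output = grouped.toSeq.sortBy(_.eventTime)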