Spark dataframe - Replace tokens of a common string with column values for each row using scala - scala

I have a dataframe with 3 columns - number (Integer), Name (String), Color (String). Below is the result of df.show with repartition option.
val df = sparkSession.read.format("csv").option("header", "true").option("inferschema", "true").option("delimiter", ",").option("decoding", "utf8").load(fileName).repartition(5).toDF()
+------+------+------+
|Number| Name| Color|
+------+------+------+
| 4|Orange|Orange|
| 3| Apple| Green|
| 1| Apple| Red|
| 2|Banana|Yellow|
| 5| Apple| Red|
+------+------+------+
My objective is to create list of strings corresponding to each row by replacing the tokens in common dynamic string which I am passing as parameter to the method with the column values
For example: commonDynamicString = Column.Name with Column.Color color
In this string, my tokens are Column.Name and Column.Color. I need to replace these values for all the rows with respective values in that column. Note: this string can change dynamically hence hardcoding won’t work.
I don't want to use RDD unless no other option is available with dataframe.
Below are the approaches I tried but couldn't achieve my objective.
Option 1:
val a = df.foreach(t => {
finalValue = commonString.replace("Column.Number", t.getAs[Any]("Number").toString())
.replace("DF.Name", t.getAs("Name"))
.replace("DF.Color", t.getAs("Color"))
println ("finalValue: " +finalValue)
})
With this approach, the finalValue prints as expected. However, I cannot create a listbuffer or pass the final string from here as a list to other function as foreach returns Unit
and spark throws error.
Option 2: I am thinking about this option but would need some guidance to understand if foldleft or window or any other spark functions can be used to create a 4th column called "Final"
using withColumn option and use a UDF where I can extract all the tokens using regex pattern matching - "Column.\w+" and do replace operation for the tokens?
+------+------+------+--------------------------+
|Number| Name| Color| Final |
+------+------+------+--------------------------+
| 4|Orange|Orange|Orange with orange color |
| 3| Apple| Green|Apple with Green color |
| 1| Apple| Red|Apple with Red color |
| 2|Banana|Yellow|Banana with Yellow color |
| 5| Apple| Red|Apple with Red color |
+------+------+------+--------------------------+
Can someone help me with this problem and also to let me know if I am thinking in the right direction to use spark for handling large datasets?
Thanks!

If I understand your requirement correctly, you can create a column method, say, parseStatement which takes a String-type statement and returns a Column with the following steps:
Parse the input statement to count number of tokens
Generate a Regex pattern in the form of ^(.*?)(token1)(.*?)(token2) ... (.*?)$
Apply pattern matching to assemble a colList consisting of lit(g1), col(g2), lit(g3), col(g4), ..., where the g?s are the extracted Regex groups
Concatenate the Column-type items
Here's the sample code:
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def parseStatement(stmt: String): Column = {
val token = "Column."
val tokenPattern = """Column\.(\w+)"""
val literalPattern = "(.*?)"
val colCount = stmt.sliding(token.length).count(_ == token)
val pattern = (0 to colCount * 2).map{
case i if (i % 2 == 0) => literalPattern
case _ => tokenPattern
}.mkString
val colList = ("^" + pattern + "$").r.findAllIn(stmt).
matchData.toList.flatMap(_.subgroups).
zipWithIndex.map{
case (g, i) if (i % 2 == 0) => lit(g)
case (g, i) => col(g)
}
concat(colList: _*)
}
val df = Seq(
(4, "Orange", "Orange"),
(3, "Apple", "Green"),
(1, "Apple", "Red"),
(2, "Banana", "Yellow"),
(5, "Apple", "Red")
).toDF("Number", "Name", "Color")
val statement = "Column.Name with Column.Color color"
df.withColumn("Final", parseStatement(statement)).
show(false)
// +------+------+------+------------------------+
// |Number|Name |Color |Final |
// +------+------+------+------------------------+
// |4 |Orange|Orange|Orange with Orange color|
// |3 |Apple |Green |Apple with Green color |
// |1 |Apple |Red |Apple with Red color |
// |2 |Banana|Yellow|Banana with Yellow color|
// |5 |Apple |Red |Apple with Red color |
// +------+------+------+------------------------+
Note that concat takes column-type parameters, hence the need of col() for column values and lit() for literals.

Related

How to transform this dataset to the following dataset

Input
+------+------+------+------+
|emp_name|emp_area| dept|zip|
+------+------+------+------+
|ram|USA|"Sales"|805912|
|sham|USA|"Sales"|805912|
|ram|Canada|"Marketing"|805912|
|ram|USA|"Sales"|805912|
|sham|USA|"Marketing"|805912|
+------+------+------+------
Desired output
feature |Top1 name |Top 1 value1|Top2 name|top 2 value|
emp_name ram |3|sham |2
emp_area Usa |4|canada |1
dept sales|3|Marketing|3
zip 805912|5|NA|NA
I started with dynamically generating the count for each one of them but unable to store them in a dataset
val features=ds.columns.toList
for (e <- features) {
val ds1=ds.groupBy(e).count().sort(desc("count")).limit(5).withColumnRenamed("count", e+"_count")
}
Now how to collect all the values into one dataframe and transform to the output?
Here's a slightly verbose approach. You can map each column to a dataframe with one row, which corresponds to the row in the desired output. Add NA columns if necessary. Convert the column names to the desired ones, and finally do a unionAll to combine the dataframes (one row each).
import org.apache.spark.sql.expressions.Window
val top = 2
val result = ds.columns.map(
c => ds.groupBy(c).count()
.withColumn("rn", row_number().over(Window.orderBy(desc("count"))))
.filter(s"rn <= $top")
.groupBy().pivot("rn")
.agg(first(col(c)), first(col("count")))
.select(lit(c), col("*"))
).map(df =>
if (df.columns.size != 1 + top*2)
df.select(List(col("*")) ::: (1 to (top*2+1 - df.columns.size)).toList.map(x => lit("NA")): _*)
else df
).map(df =>
df.toDF(List("feature") ::: (1 to top).toList.flatMap(x => Seq(s"top$x name", s"top$x value")): _*)
).reduce(_ unionAll _)
result.show
+--------+---------+----------+---------+----------+
| feature|top1 name|top1 value|top2 name|top2 value|
+--------+---------+----------+---------+----------+
|emp_name| ram| 3| sham| 2|
|emp_area| USA| 4| Canada| 1|
| dept| Sales| 3|Marketing| 2|
| zip| 805912| 5| NA| NA|
+--------+---------+----------+---------+----------+

Maximum of some specific columns in a spark scala dataframe

I have a dataframe like this.
+---+---+---+---+
| M| c2| c3| d1|
+---+---+---+---+
| 1|2_1|4_3|1_2|
| 2|3_4|4_5|1_2|
+---+---+---+---+
I have to transform this df should look like below. Here, c_max = max(c2,c3) after splitting with _.ie, all the columns (c2 and c3) have to be splitted with _ and then getting the max.
In the actual scenario, I have 50 columns ie, c2,c3....c50 and need to take the max from this.
+---+---+---+---+------+
| M| c2| c3| d1|c_Max |
+---+---+---+---+------+
| 1|2_1|4_3|1_2| 4 |
| 2|3_4|4_5|1_2| 5 |
+---+---+---+---+------+
Here is one way using expr and build-in array functions for Spark >= 2.4.0:
import org.apache.spark.sql.functions.{expr, array_max, array}
val df = Seq(
(1, "2_1", "3_4", "1_2"),
(2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get max c for each c column
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}
df.withColumn("max_c", array_max(array(c_cols:_*))).show
Output:
+---+---+---+---+-----+
| M| c2| c3| d1|max_c|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
For older versions use the next code:
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
val c_ar = split(col(c), "_").cast("array<int>")
when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}
df.withColumn("max_c", greatest(c_cols:_*)).show
Use greatest function:
val df = Seq((1, "2_1", "3_4", "1_2"),(2, "3_4", "4_5", "1_2"),
).toDF("M", "c2", "c3", "d1")
// get all `c` columns and split by `_` to get the values after the underscore
val c_cols = df.columns.filter(_.startsWith("c"))
.flatMap{
c => Seq(split(col(c), "_").getItem(0).cast("int"),
split(col(c), "_").getItem(1).cast("int")
)
}
// apply greatest func
val c_max = greatest(c_cols: _*)
// add new column
df.withColumn("c_Max", c_max).show()
Gives:
+---+---+---+---+-----+
| M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
In spark >= 2.4.0, you can use the array_max function and get some code that would work even with columns containing more than 2 values. The idea is to start by concatenating all the columns (concat column). For that, I use concat_ws on an array of all the columns I want to concat, that I obtain with array(cols.map(col) :_*). Then I split the resulting string to get a big array of strings containing all the values of all the columns. I cast it to an array of ints and I call array_max on it.
val cols = (2 to 50).map("c"+_)
val result = df
.withColumn("concat", concat_ws("_", array(cols.map(col) :_*)))
.withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
.withColumn("c_max", array_max('array_of_ints))
.drop("concat", "array_of_ints")
In spark < 2.4, you can define array_max yourself like this:
val array_max = udf((s : Seq[Int]) => s.max)
The previous code does not need to be modified. Note however that UDFs can be slower than predefined spark SQL functions.

How to create a data frame in a for loop with the variable that is iterating in loop

So I have a huge data frame which is combination of individual tables, it has an identifier column at the end which specifies the table number as shown below
+----------------------------+
| col1 col2 .... table_num |
+----------------------------+
| x y 1 |
| a b 1 |
| . . . |
| . . . |
| q p 2 |
+----------------------------+
(original table)
I have to split this into multiple little dataframes based on table num. The number of tables combined to create this is pretty large so it's not feasible to individually create the disjoint subset dataframes, so I was thinking if I made a for loop iterating over min to max values of table_num I could achieve this task but I can't seem to do it, any help is appreciated.
This is what I came up with
for (x < min(table_num) to max(table_num)) {
var df(x)= spark.sql("select * from df1 where state = x")
df(x).collect()
but I don't think the declaration is right.
so essentially what I need is df's that look like this
+-----------------------------+
| col1 col2 ... table_num |
+-----------------------------+
| x y 1 |
| a b 1 |
+-----------------------------+
+------------------------------+
| col1 col2 ... table_num |
+------------------------------+
| xx xy 2 |
| aa bb 2 |
+------------------------------+
+-------------------------------+
| col1 col2 ... table_num |
+-------------------------------+
| xxy yyy 3 |
| aaa bbb 3 |
+-------------------------------+
... and so on ...
(how I would like the Dataframes split)
In Spark Arrays can be almost in data type. When made as vars you can dynamically add and remove elements from them. Below I am going to isolate the table nums into their own array, this is so I can easily iterate through them. After isolated I go through a while loop to add each table as a unique element to the DF Holder Array. To query the elements of the array use DFHolderArray(n-1) where n is the position you want to query with 0 being the first element.
//This will go and turn the distinct row nums in a queriable (this is 100% a word) array
val tableIDArray = inputDF.selectExpr("table_num").distinct.rdd.map(x=>x.mkString.toInt).collect
//Build the iterator
var iterator = 1
//holders for DF and transformation step
var tempDF = spark.sql("select 'foo' as bar")
var interimDF = tempDF
//This will be an array for dataframes
var DFHolderArray : Array[org.apache.spark.sql.DataFrame] = Array(tempDF)
//loop while the you have note reached end of array
while(iterator<=tableIDArray.length) {
//Call the table that is stored in that location of the array
tempDF = spark.sql("select * from df1 where state = '" + tableIDArray(iterator-1) + "'")
//Fluff
interimDF = tempDF.withColumn("User_Name", lit("Stack_Overflow"))
//If logic to overwrite or append the DF
DFHolderArray = if (iterator==1) {
Array(interimDF)
} else {
DFHolderArray ++ Array(interimDF)
}
iterator = iterator + 1
}
//To query the data
DFHolderArray(0).show(10,false)
DFHolderArray(1).show(10,false)
DFHolderArray(2).show(10,false)
//....
Approach is to collect all unique keys and build respective data frames. I added some functional flavor to it.
Sample dataset:
name,year,country,id
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7746
Borussia Dortmund,2014,Germany,7746
Borussia Mönchengladbach,2014,Germany,7746
Schalke 04,2014,Germany,7746
Schalke 04,2014,Germany,7753
Lazio,2014,Germany,7753
Code:
val df = spark.read.format(source = "csv")
.option("header", true)
.option("delimiter", ",")
.option("inferSchema", true)
.load("groupby.dat")
import spark.implicits._
//collect data for each key into a data frame
val uniqueIds = df.select("id").distinct().map(x => x.mkString.toInt).collect()
// List buffer to hold separate data frames
var dataframeList: ListBuffer[org.apache.spark.sql.DataFrame] = ListBuffer()
println(uniqueIds.toList)
// filter data
uniqueIds.foreach(x => {
val tempDF = df.filter(col("id") === x)
dataframeList += tempDF
})
//show individual data frames
for (tempDF1 <- dataframeList) {
tempDF1.show()
}
One approach would be to write the DataFrame as partitioned Parquet files and read them back into a Map, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("a", "b", 1), ("c", "d", 1), ("e", "f", 1),
("g", "h", 2), ("i", "j", 2)
).toDF("c1", "c2", "table_num")
val filePath = "/path/to/parquet/files"
df.write.partitionBy("table_num").parquet(filePath)
val tableNumList = df.select("table_num").distinct.map(_.getAs[Int](0)).collect
// tableNumList: Array[Int] = Array(1, 2)
val dfMap = ( for { n <- tableNumList } yield
(n, spark.read.parquet(s"$filePath/table_num=$n").withColumn("table_num", lit(n)))
).toMap
To access the individual DataFrames from the Map:
dfMap(1).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | a| b| 1|
// | c| d| 1|
// | e| f| 1|
// +---+---+---------+
dfMap(2).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | g| h| 2|
// | i| j| 2|
// +---+---+---------+

Iterate through rows in DataFrame and transform one to many

As an example in scala, I have a list and every item which matches a condition I want to appear twice (may not be the best option for this use case - but idea which counts):
l.flatMap {
case n if n % 2 == 0 => List(n, n)
case n => List(n)
}
I would like to do something similar in Spark - iterate over rows in a DataFrame and if a row matches a certain condition then I need to duplicate the row with some modifications in the copy. How can this be done?
For example, if my input is the table below:
| name | age |
|-------|-----|
| Peter | 50 |
| Paul | 60 |
| Mary | 70 |
I want to iterate through the table and test each row against multiple conditions, and for each condition that matches, an entry should be created with the name of the matched condition.
E.g. condition #1 is "age > 60" and condition #2 is "name.length <=4". This should result in the following output:
| name | age |condition|
|-------|-----|---------|
| Paul | 60 | 2 |
| Mary | 70 | 1 |
| Mary | 70 | 2 |
You can filter matching-conditions dataframes and then finally union all of them.
import org.apache.spark.sql.functions._
val condition1DF = df.filter($"age" > 60).withColumn("condition", lit(1))
val condition2DF = df.filter(length($"name") <= 4).withColumn("condition", lit(2))
val finalDF = condition1DF.union(condition2DF)
you should have your desired output as
+----+---+---------+
|name|age|condition|
+----+---+---------+
|Mary|70 |1 |
|Paul|60 |2 |
|Mary|70 |2 |
+----+---+---------+
I hope the answer is helpful
You can also use a combination of an UDF and explode(), like in the following example:
// set up example data
case class Pers1 (name:String,age:Int)
val d = Seq(Pers1("Peter",50), Pers1("Paul",60), Pers1("Mary",70))
val df = spark.createDataFrame(d)
// conditions logic - complex as you'd like
// probably should use a Set instead of Sequence but I digress..
val conditions:(String,Int)=>Seq[Int] = { (name,age) =>
(if(age > 60) Seq(1) else Seq.empty) ++
(if(name.length <=4) Seq(2) else Seq.empty)
}
// define UDF for spark
import org.apache.spark.sql.functions.udf
val conditionsUdf = udf(conditions)
// explode() works just like flatmap
val result = df.withColumn("condition",
explode(conditionsUdf(col("name"), col("age"))))
result.show
+----+---+---------+
|name|age|condition|
+----+---+---------+
|Paul| 60| 2|
|Mary| 70| 1|
|Mary| 70| 2|
+----+---+---------+
Here is one way to flatten it with rdd.flatMap:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val new_rdd = (df.rdd.flatMap(r => {
val conditions = Seq((1, r.getAs[Int](1) > 60), (2, r.getAs[String](0).length <= 4))
conditions.collect{ case (i, c) if c => Row.fromSeq(r.toSeq :+ i) }
}))
val new_schema = StructType(df.schema :+ StructField("condition", IntegerType, true))
spark.createDataFrame(new_rdd, new_schema).show
+----+---+---------+
|name|age|condition|
+----+---+---------+
|Paul| 60| 2|
|Mary| 70| 1|
|Mary| 70| 2|
+----+---+---------+

Spark (scala) - Iterate over DF column and count number of matches from a set of items

So I can now iterate over a column of strings in a dataframe and check whether any of the strings contain any items in a large dictionary (see here, thanks to #raphael-roth and #tzach-zohar). The basic udf (not including broadcasting the dict list) for that is:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The next thing I am trying to do is also COUNT the number of matches that occur from the dict set, in the most efficient way possible (i'm dealing with very large datasets and dict files).
I have been trying to use findAllMatchIn in the udf, using both count and map:
val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s))
// OR
val checkerUdf = udf { (s: String) => dict.map(_.r.findAllMatchIn(s))
But this returns a list of iterators (empty and non-empty) I get a type mismatch (found Iterator, required Boolean). I am not sure how to count the non-empty iterators (count and size and length don't work).
Any idea what i'm doing wrong? Is there a better / more efficient way to achieve what I am trying to do?
you can just change a little bit of the answers from your other question as
import org.apache.spark.sql.functions._
val checkerUdf = udf { (s: String) => dict.count(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
Given the dataframe as
+---+---------+
|id |words |
+---+---------+
|1 |foo |
|2 |barriofoo|
|3 |gitten |
|4 |baa |
+---+---------+
and dict file as
val dict = Set("foo","bar","baaad")
You should have output as
+---+---------+----------+
| id| words|word_check|
+---+---------+----------+
| 1| foo| 1|
| 2|barriofoo| 2|
| 3| gitten| 0|
| 4| baa| 0|
+---+---------+----------+
I hope the answer is helpful