Mapping functions in Spark-Scala

I'm starting with Spark with Scala and I'm wondering how can I do this:
I have a DataFrame with a column containing these distinct values (R1, R2, M1, M2, I1, I2), and I want to map those values to create a new column whose values depend on the mapped values in the first column. For example, mapping the first column should produce something like the second column:
R1 it starts with R
R1 it starts with R
R2 it starts with R
M1 it starts with M
M2 it starts with M
I1 it starts with I
Thanks

import org.apache.spark.sql.functions._
import spark.implicits._

val substring = udf((str: String) => "Please, first use search ".concat(str.substring(0, 1)))

val source = Seq("R1", "R2", "M1", "M2", "I1", "I2")
  .toDF("col1")
  .withColumn("col2", substring(col("col1")))

source.show(false)
// +----+--------------------------+
// |col1|col2 |
// +----+--------------------------+
// |R1 |Please, first use search R|
// |R2 |Please, first use search R|
// |M1 |Please, first use search M|
// |M2 |Please, first use search M|
// |I1 |Please, first use search I|
// |I2 |Please, first use search I|
// +----+--------------------------+
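Spark aside, the core of the UDF above is just prefixing the first character of each value. A minimal plain-Python sketch of that per-value mapping (no SparkSession needed; the function name is illustrative):

```python
# Plain-Python sketch of the UDF's logic: describe a code by its first character.
def describe(code: str) -> str:
    return "it starts with " + code[0]

values = ["R1", "R2", "M1", "M2", "I1", "I2"]
mapped = [describe(v) for v in values]
print(mapped)
# ['it starts with R', 'it starts with R', 'it starts with M',
#  'it starts with M', 'it starts with I', 'it starts with I']
```

In Spark, the same idea runs per row via the UDF; here it is an ordinary list comprehension.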

Related

Converting List of List to Dataframe

I'm reading in data (as shown below) into a list of lists, and I want to convert it into a dataframe with seven columns. The error I get is: requirement failed: number of columns doesn't match. Old column names (1): value, new column names (7): <list of columns>
What am I doing incorrectly and how can I fix it?
Data:
Column1, Column2, Column3, Column4, Column5, Column6, Column7
a,b,c,d,e,f,g
a2,b2,c2,d2,e2,f2,g2
Code:
val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._
val erResponse = response.body.toString.split("\\\n")
val header = erResponse(0)
val body = erResponse.drop(1).map(x => x.split(",").toList).toList
val erDf = body.toDF()
erDf.show()
You get this "number of columns doesn't match" error because your erDf dataframe contains only one column, which contains an array:
+----------------------------+
|value |
+----------------------------+
|[a, b, c, d, e, f, g] |
|[a2, b2, c2, d2, e2, f2, g2]|
+----------------------------+
You can't match this unique column with the seven columns contained in your header.
The solution here, given this erDf dataframe, is to iterate over your header column list and build the columns one by one. Your complete code thus becomes:
val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._
val erResponse = response.body.toString.split("\\\n")
val header = erResponse(0).split(", ") // build header columns list
val body = erResponse.drop(1).map(x => x.split(",").toList).toList
val erDf = header
.zipWithIndex
.foldLeft(body.toDF())((acc, elem) => acc.withColumn(elem._1, col("value")(elem._2)))
.drop("value")
That will give you the following erDf dataframe:
+-------+-------+-------+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|Column5|Column6|Column7|
+-------+-------+-------+-------+-------+-------+-------+
| a| b| c| d| e| f| g|
| a2| b2| c2| d2| e2| f2| g2|
+-------+-------+-------+-------+-------+-------+-------+
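The foldLeft above adds one named column per header entry. The same idea can be sketched in plain Python as a reduce over the header that turns the list of rows into a dict of named columns (column names mirror the example data; this is not the Spark API):

```python
from functools import reduce

header = ["Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7"]
rows = [["a", "b", "c", "d", "e", "f", "g"],
        ["a2", "b2", "c2", "d2", "e2", "f2", "g2"]]

# Fold over (index, name) pairs, adding one named column at a time,
# just like the foldLeft/withColumn chain in the Scala answer.
columns = reduce(
    lambda acc, ni: {**acc, ni[1]: [row[ni[0]] for row in rows]},
    enumerate(header),
    {},
)
print(columns["Column1"])  # ['a', 'a2']
```

Each step of the fold extends the accumulator with one more column, which is exactly what each withColumn call does to the DataFrame.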

Pyspark - groupby concat string columns by order

I have a dataframe with the following columns - User, Order, Food.
For example:
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
I would like to concatenate all of the foods into a single string, sorted by order and grouped per user.
If I run the following:
df.groupBy("User").agg(concat_ws(" $ ",collect_list("Food")).alias("Food List"))
I get a single list but the foods are not concatenated in order.
User Food List
B Salad
C Bread
A Eggs $ Water $ Peaches
What is a good way to get the food list concatenated in order?
Try using a window here:
Build the DataFrame
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import mean, pandas_udf, PandasUDFType
from pyspark.sql.types import *
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
df.show()
+----+-----+-------+
|User|Order| Food|
+----+-----+-------+
| A| 1| Eggs|
| B| 1| Salad|
| A| 2|Peaches|
| C| 1| Bread|
| A| 3| Water|
+----+-----+-------+
Create window and apply a udf to join the strings:
w = Window.partitionBy('User').orderBy('Order').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

@pandas_udf(StringType(), PandasUDFType.GROUPED_AGG)
def _udf(v):
    return ' $ '.join(v)

df = df.withColumn('Food List', _udf(df['Food']).over(w)) \
       .dropDuplicates(['User', 'Food List']) \
       .drop(*['Order', 'Food'])
df.show(truncate=False)
+----+----------------------+
|User|Food List |
+----+----------------------+
|B |Salad |
|C |Bread |
|A |Eggs $ Peaches $ Water|
+----+----------------------+
Based on the possible duplicate comment - collect_list by preserving order based on another variable, I was able to come up with a solution.
First define a sorter function. It takes a list of structs, sorts them by Order, and then returns the items joined into a single string separated by ' $ '.
# define udf
def sorter(l):
    res = sorted(l, key=lambda x: x.Order)
    return ' $ '.join([item[1] for item in res])

sort_udf = udf(sorter, StringType())
Then create the struct and run the sorter function:
SortedFoodList = (df.groupBy("User")
                    .agg(collect_list(struct("Order", "Food")).alias("food_list"))
                    .withColumn("sorted_foods", sort_udf("food_list"))
                    .drop("food_list")
                 )
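Spark aside, the sorter UDF is ordinary Python: sort the (Order, Food) pairs, then join the foods. A standalone sketch of the same group-sort-join, using the standard library (data matches the example; no Spark involved):

```python
from itertools import groupby

rows = [("A", 1, "Eggs"), ("B", 1, "Salad"), ("A", 2, "Peaches"),
        ("C", 1, "Bread"), ("A", 3, "Water")]

def food_lists(rows):
    # sorted(rows) orders tuples by (user, order), so each user's
    # group already arrives in order; join the foods per user.
    out = {}
    for user, group in groupby(sorted(rows), key=lambda r: r[0]):
        out[user] = " $ ".join(food for _, _, food in group)
    return out

print(food_lists(rows))
```

The collect_list(struct(...)) + sort + join pipeline in the answer is this same logic, run per group on Spark's side.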

Spark dataframe - Replace tokens of a common string with column values for each row using scala

I have a dataframe with 3 columns - number (Integer), Name (String), Color (String). Below is the result of df.show with repartition option.
val df = sparkSession.read.format("csv").option("header", "true").option("inferschema", "true").option("delimiter", ",").option("decoding", "utf8").load(fileName).repartition(5).toDF()
+------+------+------+
|Number| Name| Color|
+------+------+------+
| 4|Orange|Orange|
| 3| Apple| Green|
| 1| Apple| Red|
| 2|Banana|Yellow|
| 5| Apple| Red|
+------+------+------+
My objective is to create a list of strings, one per row, by replacing the tokens in a common dynamic string (which I pass as a parameter to the method) with the corresponding column values.
For example: commonDynamicString = Column.Name with Column.Color color
In this string, my tokens are Column.Name and Column.Color. I need to replace these tokens, for all rows, with the respective values in those columns. Note: this string can change dynamically, hence hardcoding won't work.
I don't want to use RDD unless no other option is available with dataframe.
Below are the approaches I tried but couldn't achieve my objective.
Option 1:
val a = df.foreach(t => {
  finalValue = commonString.replace("Column.Number", t.getAs[Any]("Number").toString())
    .replace("Column.Name", t.getAs("Name"))
    .replace("Column.Color", t.getAs("Color"))
  println("finalValue: " + finalValue)
})
With this approach, finalValue prints as expected. However, I cannot build a ListBuffer here or pass the final strings as a list to another function, because foreach returns Unit and Spark throws an error.
Option 2: I am thinking about this option but would need some guidance: can foldLeft, window, or any other Spark function be used, via withColumn, to create a fourth column called "Final", together with a UDF that extracts all the tokens using regex pattern matching ("Column.\w+") and performs the replace operation for each token?
+------+------+------+--------------------------+
|Number| Name| Color| Final |
+------+------+------+--------------------------+
| 4|Orange|Orange|Orange with orange color |
| 3| Apple| Green|Apple with Green color |
| 1| Apple| Red|Apple with Red color |
| 2|Banana|Yellow|Banana with Yellow color |
| 5| Apple| Red|Apple with Red color |
+------+------+------+--------------------------+
Can someone help me with this problem and also to let me know if I am thinking in the right direction to use spark for handling large datasets?
Thanks!
If I understand your requirement correctly, you can create a column method, say, parseStatement which takes a String-type statement and returns a Column with the following steps:
Parse the input statement to count number of tokens
Generate a Regex pattern in the form of ^(.*?)(token1)(.*?)(token2) ... (.*?)$
Apply pattern matching to assemble a colList consisting of lit(g1), col(g2), lit(g3), col(g4), ..., where the g?s are the extracted Regex groups
Concatenate the Column-type items
Here's the sample code:
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def parseStatement(stmt: String): Column = {
  val token = "Column."
  val tokenPattern = """Column\.(\w+)"""
  val literalPattern = "(.*?)"
  val colCount = stmt.sliding(token.length).count(_ == token)
  val pattern = (0 to colCount * 2).map{
    case i if (i % 2 == 0) => literalPattern
    case _ => tokenPattern
  }.mkString
  val colList = ("^" + pattern + "$").r.findAllIn(stmt).
    matchData.toList.flatMap(_.subgroups).
    zipWithIndex.map{
      case (g, i) if (i % 2 == 0) => lit(g)
      case (g, i) => col(g)
    }
  concat(colList: _*)
}
val df = Seq(
(4, "Orange", "Orange"),
(3, "Apple", "Green"),
(1, "Apple", "Red"),
(2, "Banana", "Yellow"),
(5, "Apple", "Red")
).toDF("Number", "Name", "Color")
val statement = "Column.Name with Column.Color color"
df.withColumn("Final", parseStatement(statement)).
show(false)
// +------+------+------+------------------------+
// |Number|Name |Color |Final |
// +------+------+------+------------------------+
// |4 |Orange|Orange|Orange with Orange color|
// |3 |Apple |Green |Apple with Green color |
// |1 |Apple |Red |Apple with Red color |
// |2 |Banana|Yellow|Banana with Yellow color|
// |5 |Apple |Red |Apple with Red color |
// +------+------+------+------------------------+
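The template-parsing idea in parseStatement is independent of Spark: find every "Column.<name>" token and substitute the row's value for that column. A small plain-Python sketch of the same substitution, applied to one row represented as a dict (names are from the example; this is not the Scala answer's code):

```python
import re

# Match "Column.<name>" tokens in the template string.
TOKEN = re.compile(r"Column\.(\w+)")

def render(template, row):
    # row maps column name -> value; tokens without a matching
    # column are left untouched.
    return TOKEN.sub(lambda m: str(row.get(m.group(1), m.group(0))), template)

row = {"Number": 4, "Name": "Orange", "Color": "Orange"}
print(render("Column.Name with Column.Color color", row))
# -> Orange with Orange color
```

The Scala answer does the equivalent work lazily on the DataFrame by assembling a concat of lit() literals and col() references instead of substituting strings per row.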
Note that concat takes Column-type parameters, hence the need for col() for column values and lit() for literals.

Filtering on a dataframe based on columns defined in a list

I have a dataframe -
df
+----------+----+----+-------+-------+
| WEEK|DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-02| 14|NULL| -5| 60|
|2016-04-30| 14| FR| 90| 4|
+----------+----+----+-------+-------+
I have defined a list as targetList
List(T1_diff, T2_diff)
I want to keep only the rows where both T1_diff and T2_diff are greater than 3. In this scenario the output should contain only the second row, as the first row has -5 for T1_diff. targetList can contain more columns; currently it has T1_diff and T2_diff, and if another column such as T3_diff is added, it should be handled automatically.
What is the best way to achieve this ?
Suppose you have the following List of columns which you want to filter for values greater than 3.
val lst = List("T1_diff", "T2_diff")
Then you can build a condition String from these column names and pass it to the where function.
val condition = lst.map(c => s"$c>3").mkString(" AND ")
df.where(condition).show(false)
For the above Dataframe it will output only second row.
+----------+----+----+-------+-------+
|Week |Dim1|Dim2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-30|14 |FR |90 |4 |
+----------+----+----+-------+-------+
If you have another column say T3_diff you can add it to the List and it will get added to the filter condition.
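Building that condition is plain string manipulation; a Python sketch of the same lst.map(c => s"$c>3").mkString(" AND ") step (function name and threshold parameter are illustrative):

```python
# Build a SQL-style filter condition from a list of column names,
# mirroring the Scala map/mkString one-liner in the answer.
def build_condition(cols, threshold=3):
    return " AND ".join(f"{c}>{threshold}" for c in cols)

print(build_condition(["T1_diff", "T2_diff"]))
# T1_diff>3 AND T2_diff>3
```

Adding a column such as T3_diff to the list simply extends the condition with another AND clause, which is why the approach scales with targetList.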

How to Iterate Rows and compare one row column value to next row column value in Scala?

I am new to Scala and need some immediate help.
I have an M*N Spark SQL dataframe like the one below. I need to compare each row's column values with the next row's column values: something like A1 to A2, A1 to A3, and so on up to N; B1 to B2, B1 to B3, and so on.
Could someone please guide me on how to compare rows in Spark SQL?
ID COLUMN1 Column2
1 A1 B1
2 A2 B2
3 A3 B3
Thank you in Advance
Santhosh
If I understand the question correctly - you want to compare (using some function) each value to the value of the same column in the previous record. You can do that using the lag Window Function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// some data...
val df = Seq(
  (1, "A1", "B1"),
  (2, "A2", "B2"),
  (3, "A3", "B3")
).toDF("ID", "COL1", "COL2")
// some made-up comparisons - fill in whatever you want...
def compareCol1(curr: Column, prev: Column): Column = curr > prev
def compareCol2(curr: Column, prev: Column): Column = concat(curr, prev)
// creating window - ordered by ID
val window = Window.orderBy("ID")
// using the window with lag function to compare to previous value in each column
df.withColumn("COL1-comparison", compareCol1($"COL1", lag("COL1", 1).over(window)))
.withColumn("COL2-comparison", compareCol2($"COL2", lag("COL2", 1).over(window)))
.show()
// +---+----+----+---------------+---------------+
// | ID|COL1|COL2|COL1-comparison|COL2-comparison|
// +---+----+----+---------------+---------------+
// | 1| A1| B1| null| null|
// | 2| A2| B2| true| B2B1|
// | 3| A3| B3| true| B3B2|
// +---+----+----+---------------+---------------+
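What lag(..., 1) does is pair each row with the previous row's value (null for the first row), after which the comparison is ordinary. A plain-Python sketch of that pairing, with None standing in for Spark's null (helper name is illustrative):

```python
# Pair each value with the previous one, as lag(col, 1) does over
# an ordered window; the first row has no predecessor.
def with_lag(values):
    return list(zip(values, [None] + values[:-1]))

col1 = ["A1", "A2", "A3"]
comparisons = [prev is not None and curr > prev
               for curr, prev in with_lag(col1)]
print(comparisons)  # [False, True, True]
```

In Spark the first row's comparison comes out as null rather than False, since comparing against a null lag value yields null.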