DataFrame 1 is what I have now, and I want to write a Scala function to make DataFrame 1 look like DataFrame 2.
Transfer is the big category; e-transfer and IMT are subcategories.
The logic is: for a given ID (31898), if both Transfer and e-Transfer are tagged to it, the result should be only e-Transfer; if Transfer, e-Transfer and IMT are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; and if only e-Transfer or only IMT is tagged to an ID (34193), it should just be e-Transfer or IMT.
I'm new to Scala and don't know how to write a good function to do this. Please help!
DataFrame 1 DataFrame 2
+---------+-------------+ +---------+------------------+
| ID | Category | | ID | Category |
+---------+-------------+ +---------+------------------+
| 31898 | Transfer | | 31898 | e-Transfer |
| 31898 | e-Transfer | | 32614 | e-Transfer + IMT|
| 32614 | Transfer | =====> | 33987 | Other |
| 32614 | e-Transfer | =====> | 34193 | e-Transfer |
| 32614 | IMT | +---------+------------------+
| 33987 | Transfer |
| 34193 | e-Transfer |
+---------+-------------+
You can group the DataFrame by ID and aggregate Category with collect_set to assemble an array of categories per ID, then derive the new Category column from the array contents with array_contains:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
).toDF("ID", "Category")

df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
  .withColumn("Category",
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT")
    .when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer")
    .when(size($"CategorySet") === 1 &&
        (array_contains($"CategorySet", "e-Transfer") || array_contains($"CategorySet", "IMT")),
      $"CategorySet"(0))
    .when(size($"CategorySet") === 1 && array_contains($"CategorySet", "Transfer"),
      "Other"))
  .show(false)
// +-----+---------------------------+----------------+
// |ID |CategorySet |Category |
// +-----+---------------------------+----------------+
// |33987|[Transfer] |Other |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer] |e-Transfer |
// |31898|[Transfer, e-Transfer] |e-Transfer |
// +-----+---------------------------+----------------+
Your sample data might not cover all cases (e.g. [Transfer, IMT]). The sample code above would produce a null Category for any remaining case; simply modify or expand the conditional checks as additional cases are identified.
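For instance, assuming the same big-category rule should apply to IMT (so that Transfer + IMT collapses to IMT) and that any unexpected combination should be labelled instead of left null, the chain could be simplified and extended like this (a sketch reusing df and the imports from above; branch order matters because the first matching when wins):
val categorized =
  when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"), "e-Transfer + IMT")
    .when(array_contains($"CategorySet", "e-Transfer"), "e-Transfer") // e-Transfer alone or with Transfer
    .when(array_contains($"CategorySet", "IMT"), "IMT")               // IMT alone or with Transfer
    .when(array_contains($"CategorySet", "Transfer"), "Other")        // only the big category
    .otherwise("Unknown")                                             // anything unexpected

df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
  .withColumn("Category", categorized)
  .show(false)
The explicit size checks disappear because the earlier branches already consume every set that contains a sub-category.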
Related
I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with maximum length for each session id. There are multiple keywords that would be part of the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partly.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function, partitioning per session and ordering by Timestamp:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Within each session (ordered by Timestamp), a row is a finished keyword when the
// next typed value is shorter (a new word has started) or when there is no next value.
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter(length(col("lead")) < length(col("value")) || col("lead").isNull)
  .drop("lead")
  .show()
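If the timestamps cannot be relied on for ordering, a window-free alternative is a left anti self-join: drop every row whose value is a strict prefix of a longer value in the same session. This is only a sketch and assumes a finished keyword is never itself a prefix of another keyword typed in the same session:
import org.apache.spark.sql.functions._

val a = dfM.as("a")
val b = dfM.select(col("session_id"), col("value")).as("b")

// keep rows of "a" for which no longer value in the same session starts with them
val keywords = a.join(b,
    col("a.session_id") === col("b.session_id") &&
      col("b.value").startsWith(col("a.value")) &&
      length(col("b.value")) > length(col("a.value")),
    "left_anti")
  .select("session_id", "value")

keywords.show()
On the sample data this keeps catch, partyy, fallen and Temp, the same four rows as the lead-based version.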
I have a spark data frame like:
|---------------------|------------------------------|
| Brand | Model |
|---------------------|------------------------------|
| Hyundai | Elentra,Creta |
|---------------------|------------------------------|
| Hyundai | Creta,Grand i10,Verna |
|---------------------|------------------------------|
| Maruti | Eritga,S-cross,Vitara Brezza|
|---------------------|------------------------------|
| Maruti | Celerio,Eritga,Ciaz |
|---------------------|------------------------------|
I want a data frame like this:
|---------------------|---------|--------|--------------|--------|---------|
| Brand | Model0 | Model1 | Model2 | Model3 | Model4 |
|---------------------|---------|--------|--------------|--------|---------|
| Hyundai | Elentra | Creta | Grand i10 | Verna | null |
|---------------------|---------|--------|--------------|--------|---------|
| Maruti | Ertiga | S-Cross| Vitara Brezza| Celerio| Ciaz |
|---------------------|---------|--------|--------------|--------|---------|
I have used this code:
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Brand", StringType()),
    StructField("Model", StringType())])
tempCSV = spark.read.csv("PATH\\Cars.csv", sep='|', schema=schema)
tempDF = tempCSV.select(
"Brand",
f.split("Model", ",").alias("Model"),
f.posexplode(f.split("Model", ",")).alias("pos", "val")
)\
.drop("val")\
.select(
"Brand",
f.concat(f.lit("Model"),f.col("pos").cast("string")).alias("name"),
f.expr("Model[pos]").alias("val")
)\
.groupBy("Brand").pivot("name").agg(f.first("val")).toPandas()
But I'm not getting the desired result. Instead of giving the second table, it's giving:
|---------------------|---------|--------|--------------|
| Brand | Model0 | Model1 | Model2 |
|---------------------|---------|--------|--------------|
| Hyundai | Elentra | Creta | Grand i10 |
|---------------------|---------|--------|--------------|
| Maruti | Ertiga | S-Cross| Vitara Brezza|
|---------------------|---------|--------|--------------|
Thanks in advance.
This is happening because you are pivoting on pos, which repeats within the same brand group.
You can use row_number() and then pivot your data to generate the desired result.
Here is sample code based on the data you provided.
df = sqlContext.createDataFrame(
    [("Hyundai", "Elentra,Creta"), ("Hyundai", "Creta,Grand i10,Verna"),
     ("Maruti", "Eritga,S-cross,Vitara Brezza"), ("Maruti", "Celerio,Eritga,Ciaz")],
    ("Brand", "Model"))
tmpDf = df.select("Brand", f.split("Model", ",").alias("Model"),
                  f.posexplode(f.split("Model", ",")).alias("pos", "val"))
tmpDf.createOrReplaceTempView("tbl")
seqDf = sqlContext.sql("select Brand, Model, pos, val, "
                       "row_number() over (partition by Brand order by pos) as rnk from tbl")
seqDf.groupBy("Brand").pivot("rnk").agg(f.first("val")).show()
This will generate the following result:
+-------+-------+-------+-------+---------+-------------+----+
| Brand| 1| 2| 3| 4| 5| 6|
+-------+-------+-------+-------+---------+-------------+----+
| Maruti| Eritga|Celerio|S-cross| Eritga|Vitara Brezza|Ciaz|
|Hyundai|Elentra| Creta| Creta|Grand i10| Verna|null|
+-------+-------+-------+-------+---------+-------------+----+
DF1 is what I have now, and I want to make DF1 look like DF2.
Desired Output:
DF1 DF2
+---------+-------------------+ +---------+------------------------------+
| ID | Category | | ID | Category |
+---------+-------------------+ +---------+------------------------------+
| 31898 | Transfer | | 31898 | Transfer (e-Transfer) |
| 31898 | e-Transfer | =====> | 32614 | Transfer (e-Transfer + IMT) |
| 32614 | Transfer | =====> | 33987 | Transfer (IMT) |
| 32614 | e-Transfer + IMT | +---------+------------------------------+
| 33987 | Transfer |
| 33987 | IMT |
+---------+-------------------+
Code:
val df = DF1.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
val DF2 = df.withColumn("Category", $"CategorySet"(0) ($"CategorySet"(1)))
The code is not working; how do I fix it? And if there is any better way to do the same thing, I am open to it. Thank you in advance.
You can try this:
val sliceRight = udf((array : Seq[String], from : Int) => " (" + array.takeRight(from).mkString(",") +")")
val df2 = df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
df2.withColumn("Category", concat($"CategorySet"(0),sliceRight($"CategorySet",lit(1))))
.show(false)
Output:
+-----+----------------------------+---------------------------+
|ID |CategorySet |Category |
+-----+----------------------------+---------------------------+
|33987|[Transfer, IMT] |Transfer (IMT) |
|32614|[Transfer, e-Transfer + IMT]|Transfer (e-Transfer + IMT)|
|31898|[Transfer, e-Transfer] |Transfer (e-Transfer) |
+-----+----------------------------+---------------------------+
The same logic with a slight modification, using concat and lit instead of the udf:
df.groupBy("ID")
  .agg(collect_set(col("Category")).as("Category"))
  .withColumn("Category", concat(col("Category")(0), lit(" ("), col("Category")(1), lit(")")))
  .show
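On Spark 2.4+ a udf-free variant of the same idea is also possible with slice and array_join; this is just a sketch reusing df2 from above, and it joins every element after the first one, so it keeps working if more than two categories get collected. Note that collect_set does not guarantee element order, so relying on CategorySet(0) being the big category is fragile in general:
df2.withColumn("Category",
    concat(
      $"CategorySet"(0),
      lit(" ("),
      array_join(expr("slice(CategorySet, 2, size(CategorySet) - 1)"), ","),
      lit(")")))
  .show(false)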
There are two DFs; I need to populate a new column in DF1, say Flag, based on the conditions below.
DF1
+------+-------------------+
|AMOUNT|Brand              |
+------+-------------------+
| 47.88| Parle |
| 40.92| Parle |
| 83.82| Parle |
|106.58| Parle |
| 90.51| Flipkart |
| 11.48| Flipkart |
| 18.47| Flipkart |
| 40.92| Flipkart |
| 30.0| Flipkart |
+------+-------------------+
DF2
+--------------------+-------+----------+
| Brand | P1 | P2 |
+--------------------+-------+----------+
| Parle| 37.00 | 100.15 |
| Flipkart| 10.0 | 30.0 |
+--------------------+-------+----------+
If the amount for a brand in DF1 (say Parle) is less than that brand's P1 value in DF2 (Amount < P1), the flag should be low; if P1 <= Amount <= P2, the flag should be mid; and if Amount > P2, it should be high.
Similarly for the other brands.
DF1 has very large data while DF2 is very small.
Expected output:
+------+-------------------+----------------+
|AMOUNT|Brand              | Flag           |
+------+-------------------+----------------+
| 47.88| Parle | mid |
| 40.92| Parle | mid |
| 83.82| Parle | mid |
|106.58| Parle | high |
| 90.51| Flipkart | high |
| 11.48| Flipkart | mid |
| 18.47| Flipkart | mid |
| 40.92| Flipkart | high |
| 30.0| Flipkart | mid |
+------+-------------------+----------------+
I know that I can do a join and get the results, but how should I frame the logic in Spark?
A simple left join and nested when inbuilt functions should get you your desired result:
import org.apache.spark.sql.functions._
df1.join(df2, Seq("Brand"), "left")
.withColumn("Flag", when(col("AMOUNT") < col("P1"), "low").otherwise(
when(col("AMOUNT") >= col("P1") && col("AMOUNT") <= col("P2"), "mid").otherwise(
when(col("AMOUNT") > col("P2"), "high").otherwise("unknown"))))
.select("AMOUNT", "Brand", "Flag")
.show(false)
which should give you
+------+--------+----+
|AMOUNT|Brand |Flag|
+------+--------+----+
|47.88 |Parle |mid |
|40.92 |Parle |mid |
|83.82 |Parle |mid |
|106.58|Parle |high|
|90.51 |Flipkart|high|
|11.48 |Flipkart|mid |
|18.47 |Flipkart|mid |
|40.92 |Flipkart|high|
|30.0 |Flipkart|mid |
+------+--------+----+
I hope the answer is helpful
This is also doable with a udf:
val df3 = df1.join(df2, Seq("Brand"), "left")
import org.apache.spark.sql.functions._
val mapper = udf((amount: Double, p1: Double, p2: Double) => if (amount < p1) "low" else if (amount > p2) "high" else "mid")
df3.withColumn("Flag", mapper(df3("AMOUNT"), df3("P1"), df3("P2")))
.select("AMOUNT", "Brand", "Flag")
.show(false)
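Since DF2 is very small compared to DF1, broadcasting it is worth considering so that the large side is not shuffled for the join (Spark may already do this automatically when DF2 is below the broadcast threshold). A sketch of the same logic with an explicit broadcast hint and a flattened when chain:
import org.apache.spark.sql.functions._

df1.join(broadcast(df2), Seq("Brand"), "left")
  .withColumn("Flag",
    when(col("AMOUNT") < col("P1"), "low")
      .when(col("AMOUNT") > col("P2"), "high")
      .when(col("AMOUNT").between(col("P1"), col("P2")), "mid")
      .otherwise("unknown"))
  .select("AMOUNT", "Brand", "Flag")
  .show(false)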
I have two Spark dataframes, dfA and dfB.
I want to filter dfA by each row of dfB, which means if dfB has 10000 rows, I need to filter dfA 10000 times with 10000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows, nearly one billion rows
So my expected result is:
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My stupid solution is:
def tempFunction(id: Int, dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val dfa = dfA.filter("id = " + id)
  val dfb = dfB.filter("id = " + id)
  val arr = dfb.groupBy("id")
    .agg(collect_list(struct("min_value1", "max_value1")))
    .collect()
  val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get the range array of this id
  // initialize resultDF with the first range's results
  val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
  val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
  val s = "value1 between " + min_value1 + " and " + max_value1
  var resultDF = dfa.filter(s).groupBy("id")
    .agg(collect_list("value1").as("results"),
      min("value1").as("min_value1"),
      max("value1").as("max_value1"))
  for (i <- 1 to rangArray.length - 1) {
    val temp_min_value1 = rangArray(i).get(0).asInstanceOf[Int]
    val temp_max_value1 = rangArray(i).get(1).asInstanceOf[Int]
    val query = "value1 between " + temp_min_value1 + " and " + temp_max_value1
    val tempResultDF = dfa.filter(query).groupBy("id")
      .agg(collect_list("value1").as("results"),
        min("value1").as("min_value1"),
        max("value1").as("max_value1"))
    resultDF = resultDF.union(tempResultDF)
  }
  resultDF
}
def myFunction(): DataFrame = {
  val dfA = spark.read.parquet(routeA)
  val dfB = spark.read.parquet(routeB)
  val idArrays = dfB.select("id").distinct().collect()
  // initialize the result with the first id
  var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int], dfA, dfB)
  // traverse all remaining ids
  for (i <- 1 to idArrays.length - 1) {
    val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int], dfA, dfB)
    resultDF = resultDF.union(tempDF)
  }
  resultDF
}
Maybe you don't want to read my brute-force code; its idea is:
finalResult = null;
for each id in dfB:
for query condition of this id:
tempResult = query dfA
union tempResult to finalResult
I've tried my algorithm, and it took almost 50 hours. Does anybody have a more efficient way? Many thanks.
Assuming that your dfB is a small dataset, I am suggesting the solution below.
Try using a broadcast join like this:
import org.apache.spark.sql.functions.{broadcast, col, collect_list}

dfA.as("dfA").join(broadcast(dfB).as("dfB"),
    col("dfA.id") === col("dfB.id") &&
      col("dfA.value1") >= col("dfB.min_value1") &&
      col("dfA.value1") <= col("dfB.max_value1"))
  .groupBy(col("dfB.id"), col("dfB.min_value1"), col("dfB.max_value1"))
  .agg(collect_list(col("dfA.value2")).as("results"))
A broadcast join is like a map-side join: it materializes the smaller dataset on every executor, which improves performance by avoiding the sort-and-shuffle phase of a reduce-side join.
Some things I would suggest you avoid:
Never call collect() on large data. When a collect operation is issued on an RDD or DataFrame, the whole dataset is copied to the driver.
If your data is too big you might get an out-of-memory exception.
Try using take() or takeSample() instead.
When two dataframes/datasets are involved in a calculation, a join has to be performed, so a join is a required step for you. The important question is when to join.
I would suggest aggregating and reducing the rows in the dataframes as much as possible before joining, as that reduces shuffling.
In your case you can reduce only dfA, since you need dfB exactly as it is, with a column added from the dfA rows that meet the condition.
So you can group dfA by id and aggregate it so that you get one row per id, then perform the join. After that you can use a udf function for your calculation logic.
Comments are provided for clarity and explanation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// udf function to keep, for each dfB row, only the collected value2 whose value1
// falls within the [min_value1, max_value1] range
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row]) =>
  list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue)
    .map(_.getAs[Int]("value2")))

dfA.groupBy("id")                                                   // grouping by id
  .agg(collect_list(struct("value1", "value2")).as("collection"))   // collecting all the value1 and value2 as structs
  .join(dfB, Seq("id"), "right")                                    // joining both dataframes on id
  .select(col("id"), col("min_value1"), col("max_value1"),
    selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) // calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful
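If Spark 2.4+ is available, the udf can also be avoided entirely with the built-in higher-order functions filter and transform; this is only a sketch assuming the same aggregate-then-join shape as above:
import org.apache.spark.sql.functions._

dfA.groupBy("id")
  .agg(collect_list(struct("value1", "value2")).as("collection"))
  .join(dfB, Seq("id"), "right")
  .withColumn("results",
    expr("transform(filter(collection, x -> x.value1 between min_value1 and max_value1), x -> x.value2)"))
  .select("id", "min_value1", "max_value1", "results")
  .show(false)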