Transforming rows into columns in Spark with Scala

I have a dataframe with fixed columns m1_amt to m4_amt, containing data in the format below:
+------+--------+--------+--------+--------+
|Entity| m1_amt | m2_amt | m3_amt | m4_amt |
+------+--------+--------+--------+--------+
| ISO  |      1 |      2 |      3 |      4 |
| TEST |      5 |      6 |      7 |      8 |
| Beta |      9 |     10 |     11 |     12 |
+------+--------+--------+--------+--------+
I am trying to convert each row into a new column, like this:
+--------+-----+------+------+
| Entity | ISO | TEST | Beta |
+--------+-----+------+------+
| m1_amt |   1 |    5 |    9 |
| m2_amt |   2 |    6 |   10 |
| m3_amt |   3 |    7 |   11 |
| m4_amt |   4 |    8 |   12 |
+--------+-----+------+------+
How can I achieve this in Spark and Scala?

I have tried the following approach:
scala> val df=Seq(("ISO",1,2,3,4),
| ("TEST",5,6,7,8),
| ("Beta",9,10,11,12)).toDF("Entity","m1_amt","m2_amt","m3_amt","m4_amt")
df: org.apache.spark.sql.DataFrame = [Entity: string, m1_amt: int ... 3 more fields]
scala> df.show
+------+------+------+------+------+
|Entity|m1_amt|m2_amt|m3_amt|m4_amt|
+------+------+------+------+------+
| ISO| 1| 2| 3| 4|
| TEST| 5| 6| 7| 8|
| Beta| 9| 10| 11| 12|
+------+------+------+------+------+
scala> val selectDf= df.selectExpr("Entity","stack(4,'m1_amt',m1_amt,'m2_amt',m2_amt,'m3_amt',m3_amt,'m4_amt',m4_amt)")
selectDf: org.apache.spark.sql.DataFrame = [Entity: string, col0: string ... 1 more field]
scala> selectDf.show
+------+------+----+
|Entity| col0|col1|
+------+------+----+
| ISO|m1_amt| 1|
| ISO|m2_amt| 2|
| ISO|m3_amt| 3|
| ISO|m4_amt| 4|
| TEST|m1_amt| 5|
| TEST|m2_amt| 6|
| TEST|m3_amt| 7|
| TEST|m4_amt| 8|
| Beta|m1_amt| 9|
| Beta|m2_amt| 10|
| Beta|m3_amt| 11|
| Beta|m4_amt| 12|
+------+------+----+
scala> selectDf.groupBy("col0").pivot("Entity").agg(concat_ws("",collect_list(col("col1")))).withColumnRenamed("col0","Entity").show
+------+----+---+----+
|Entity|Beta|ISO|TEST|
+------+----+---+----+
|m3_amt| 11| 3| 7|
|m4_amt| 12| 4| 8|
|m2_amt| 10| 2| 6|
|m1_amt| 9| 1| 5|
+------+----+---+----+
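This produces the expected shape, but pivot does not guarantee the row order (note that m3_amt comes out first above). If the metric rows should come back in m1_amt..m4_amt order, an orderBy can be appended after the rename, e.g.:

scala> selectDf.groupBy("col0").pivot("Entity")
  .agg(concat_ws("", collect_list(col("col1"))))
  .withColumnRenamed("col0", "Entity")
  .orderBy("Entity")
  .show()

The to_json/explode variant below arrives at the same long format before pivoting.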

scala> df.show
+------+------+------+------+------+
|Entity|m1_amt|m2_amt|m3_amt|m4_amt|
+------+------+------+------+------+
| ISO| 1| 2| 3| 4|
| TEST| 5| 6| 7| 8|
| Beta| 9| 10| 11| 12|
+------+------+------+------+------+
scala> val df1 = df.withColumn("amt", to_json(struct(col("m1_amt"),col("m2_amt"),col("m3_amt"),col("m4_amt"))))
.withColumn("amt", regexp_replace(col("amt"), """[\\{\\"\\}]""", ""))
.withColumn("amt", explode(split(col("amt"), ",")))
.withColumn("cols", split(col("amt"), ":")(0))
.withColumn("val", split(col("amt"), ":")(1))
.select("Entity","cols","val")
scala> df1.show
+------+------+---+
|Entity| cols|val|
+------+------+---+
| ISO|m1_amt| 1|
| ISO|m2_amt| 2|
| ISO|m3_amt| 3|
| ISO|m4_amt| 4|
| TEST|m1_amt| 5|
| TEST|m2_amt| 6|
| TEST|m3_amt| 7|
| TEST|m4_amt| 8|
| Beta|m1_amt| 9|
| Beta|m2_amt| 10|
| Beta|m3_amt| 11|
| Beta|m4_amt| 12|
+------+------+---+
scala> df1.groupBy(col("cols")).pivot("Entity").agg(concat_ws("",collect_set(col("val"))))
.withColumnRenamed("cols", "Entity")
.show()
+------+----+---+----+
|Entity|Beta|ISO|TEST|
+------+----+---+----+
|m3_amt| 11| 3| 7|
|m4_amt| 12| 4| 8|
|m2_amt| 10| 2| 6|
|m1_amt| 9| 1| 5|
+------+----+---+----+
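Both variants hard-code the four amount columns. If the set of m*_amt columns can change, the stack expression can be built from the schema instead. A minimal sketch, assuming all amount columns share a compatible type and using first as the pivot aggregate (metric and value are just illustrative names):

import org.apache.spark.sql.functions.first

// every column except Entity is treated as an amount column
val amtCols = df.columns.filter(_ != "Entity")          // m1_amt .. m4_amt here
// build "stack(4, 'm1_amt', m1_amt, ..., 'm4_amt', m4_amt) as (metric, value)"
val stackExpr = amtCols
  .map(c => s"'$c', $c")
  .mkString(s"stack(${amtCols.length}, ", ", ", ") as (metric, value)")

df.selectExpr("Entity", stackExpr)
  .groupBy("metric")
  .pivot("Entity")
  .agg(first("value"))
  .withColumnRenamed("metric", "Entity")
  .orderBy("Entity")
  .show()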

Related

How to solve the following issue with Apache Spark with an optimal solution

I need to solve the following problem without GraphFrames. Please help.
Input Dataframe
|-----------+-----------+--------------|
|    ID     |   prev    |     next     |
|-----------+-----------+--------------|
|     1     |     1     |      2       |
|     2     |     1     |      3       |
|     3     |     2     |     null     |
|     9     |     9     |     null     |
|-----------+-----------+--------------|
Output dataframe
|-----------+------------|
|  bill_id  |  item_id   |
|-----------+------------|
|     1     | [1, 2, 3]  |
|     9     |    [9]     |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how GraphFrames does connected components. Basically, join the dataframe with itself on the prev column until it stops getting any lower, then group.
import pyspark.sql.functions as F

df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
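Since the question is tagged Scala, here is a rough, untested Scala sketch of the same iterative idea (the unused next column is omitted, and a spark-shell style session is assumed):

import org.apache.spark.sql.functions._
import spark.implicits._

var ids = Seq((1, 1), (2, 1), (3, 2), (9, 9)).toDF("ID", "prev")
var converged = false

while (!converged) {
  // look up each row's prev in the ID column to find a (possibly) lower ancestor
  val step = ids.join(ids.selectExpr("ID as prev", "prev as lower_prev"), Seq("prev"), "left").cache()
  converged = step.where("prev != lower_prev").count() == 0
  ids = step.selectExpr("ID", "lower_prev as prev")
}

ids.groupBy("prev")
   .agg(collect_set("ID").as("item_id"))
   .withColumnRenamed("prev", "bill_id")
   .show()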

Unable to get the result from the window function

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the highest salary from the above data, which is 122391.0.
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
I also tried partitioning by Salary and ordering by id, but the result was the same.
As you can see, 122391.0 appears near the bottom, but it should come in the first position given the ordering I applied.
Can anybody help me find what is going wrong?
Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
import org.apache.spark.sql.functions.max

val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)
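And if only the single top value is needed rather than a per-row column, a plain aggregation (or a descending sort with limit) is enough. A small sketch:

import org.apache.spark.sql.functions.{col, max}

val topSalary = df1.agg(max("Salary")).first().getDouble(0)   // 122391.0 for the data shown
// or equivalently: keep just the row with the highest salary
df1.orderBy(col("Salary").desc).limit(1).show()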

How to combine dataframes with no common columns?

I have 2 data frames
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
and
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
| D| E| F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+
I need to combine the ones above to get
val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
Right now I'm creating the missing columns for all dataframes manually to get to a common structure and am then using a union. This code is specific to the dataframes and is not scalable.
I'm looking for a solution that will work with x dataframes with y columns each.
You can manually create missing columns in the two data frames and then union them:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray

val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
  (_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)
// select is needed to make sure the two data frames have the same order of columns
createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
| E| F| A| B| C| D|
+---+---+---+---+---+---+
| | | 1| 2| 3| |
| | | 4| 5| 6| |
| 22| 33| | | | 11|
| 55| 66| | | | 44|
+---+---+---+---+---+---+
A much simpler way of doing this is creating a full outer join and setting the join expression/condition to false:
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
val joined = df1.join(df2, lit(false), "full")
joined.show()
+----+----+----+----+----+----+
| A| B| C| D| E| F|
+----+----+----+----+----+----+
| 1| 2| 3|null|null|null|
| 4| 5| 6|null|null|null|
|null|null|null| 11| 22| 33|
|null|null|null| 44| 55| 66|
+----+----+----+----+----+----+
If you then want to set the null values to an empty string, you can just add:
val withEmptyString = joined.na.fill("")
withEmptyString.show()
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
So, in summary, df1.join(df2, lit(false), "full").na.fill("") should do the trick.
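For the general case of x dataframes, the same trick can be folded over a list with reduce. A sketch, where dfs and combined are illustrative names rather than anything from the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// any number of dataframes with disjoint column sets
val dfs: Seq[DataFrame] = Seq(df1, df2)
val combined = dfs.reduce((left, right) => left.join(right, lit(false), "full")).na.fill("")
combined.show()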

Scala Spark Incrementing a column based on another column in dataframe without for loops

I have a dataframe like the one below. I want a new column called cutofftype which, instead of the current monotonically increasing number, should reset to 1 every time the ID column changes.
df = df.orderBy("ID","date").withColumn("cutofftype",monotonically_increasing_id()+1)
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 5|
| 54500| 2016-05-09| 6|
| 54500| 2016-05-16| 7|
| 54500| 2016-05-23| 8|
| 54500| 2016-06-06| 9|
| 54500| 2016-06-13| 10|
+------+---------------+----------+
Target is this as below :
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 1|
| 54500| 2016-05-09| 2|
| 54500| 2016-05-16| 3|
| 54500| 2016-05-23| 4|
| 54500| 2016-06-06| 5|
| 54500| 2016-06-13| 6|
+------+---------------+----------+
I know this can be done with for loops, but I want to do it without them. Is there a way?
This is a simple partition-by problem. You should use a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy("ID").orderBy("date")
df.withColumn("cutofftype", row_number().over(w)).show()
+-----+----------+----------+
| ID| date|cutofftype|
+-----+----------+----------+
|54500|2016-05-02| 1|
|54500|2016-05-09| 2|
|54500|2016-05-16| 3|
|54500|2016-05-23| 4|
|54500|2016-06-06| 5|
|54500|2016-06-13| 6|
|54441|2016-06-20| 1|
|54441|2016-06-27| 2|
|54441|2016-07-04| 3|
|54441|2016-07-11| 4|
+-----+----------+----------+
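Note that the window's partitioning does not guarantee the display order (54500 comes out before 54441 above). If the result should come back sorted by ID and date, an explicit sort can be appended, reusing the w and row_number from the snippet above:

df.withColumn("cutofftype", row_number().over(w))
  .orderBy("ID", "date")
  .show()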

Spark Dataframe ORDER BY giving mixed combination (asc + desc)

I have a Dataframe whose column I want to sort in descending order if the count value is greater than 10.
But I'm getting a mixed combination: ascending for a couple of records, then descending, then ascending again, and so on.
I'm using the orderBy() function, which sorts the records in ascending order by default.
Since I'm new to Scala and Spark, I don't understand why this is happening.
df.groupBy("Value").count().filter("count>5.0").orderBy("Value").show(1000);
Reading the CSV:
val df = sparkSession
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/test.csv")
.toDF("Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value")````
These are the records I get by executing the above code:
+-------+-----+
| Value|count|
+-------+-----+
| 0| 225|
| 0.01| 12|
| 0.02| 13|
| 0.03| 12|
| 0.04| 15|
| 0.05| 9|
| 0.06| 11|
| 0.07| 9|
| 0.08| 6|
| 0.09| 10|
| 0.1| 66|
| 0.11| 12|
| 0.12| 9|
| 0.13| 12|
| 0.14| 8|
| 0.15| 10|
| 0.16| 14|
| 0.17| 11|
| 0.18| 14|
| 0.19| 21|
| 0.2| 78|
| 0.21| 16|
| 0.22| 15|
| 0.23| 13|
| 0.24| 7|
| 0.3| 85|
| 0.31| 7|
| 0.34| 8|
| 0.4| 71|
| 0.5| 103|
| 0.6| 102|
| 0.61| 6|
| 0.62| 9|
| 0.69| 7|
| 0.7| 98|
| 0.72| 6|
| 0.74| 8|
| 0.78| 7|
| 0.8| 71|
| 0.81| 10|
| 0.82| 9|
| 0.83| 8|
| 0.84| 6|
| 0.86| 8|
| 0.87| 10|
| 0.88| 12|
| 0.9| 95|
| 0.91| 9|
| 0.93| 6|
| 0.94| 6|
| 0.95| 8|
| 0.98| 8|
| 0.99| 6|
| 1| 254|
| 1.08| 8|
| 1.1| 80|
| 1.11| 6|
| 1.15| 9|
| 1.17| 7|
| 1.18| 6|
| 1.19| 9|
| 1.2| 94|
| 1.25| 7|
| 1.3| 91|
| 1.32| 8|
| 1.4| 215|
| 1.45| 7|
| 1.5| 320|
| 1.56| 6|
| 1.6| 280|
| 1.64| 6|
| 1.66| 10|
| 1.7| 310|
| 1.72| 7|
| 1.74| 6|
| 1.8| 253|
| 1.9| 117|
| 10| 78|
| 10.1| 45|
| 10.2| 49|
| 10.3| 30|
| 10.4| 40|
| 10.5| 38|
| 10.6| 52|
| 10.7| 35|
| 10.8| 39|
| 10.9| 42|
| 10.96| 7|------------mark
| 100| 200|
| 101.3| 7|
| 101.8| 8|
| 102| 6|
| 102.2| 6|
| 102.7| 8|
| 103.2| 6|--------------here
| 11| 93|
| 11.1| 32|
| 11.2| 38|
| 11.21| 6|
| 11.3| 42|
| 11.4| 32|
| 11.5| 34|
| 11.6| 38|
| 11.69| 6|
| 11.7| 42|
| 11.8| 25|
| 11.86| 6|
| 11.9| 39|
| 11.96| 9|
| 12| 108|
| 12.07| 7|
| 12.1| 31|
| 12.11| 6|
| 12.2| 34|
| 12.3| 28|
| 12.39| 6|
| 12.4| 32|
| 12.5| 31|
| 12.54| 7|
| 12.57| 6|
| 12.6| 18|
| 12.7| 33|
| 12.8| 20|
| 12.9| 21|
| 13| 85|
| 13.1| 25|
| 13.2| 19|
| 13.3| 30|
| 13.34| 6|
| 13.4| 32|
| 13.5| 16|
| 13.6| 15|
| 13.7| 31|
| 13.8| 8|
| 13.83| 7|
| 13.89| 7|
| 14| 46|
| 14.1| 10|
| 14.3| 10|
| 14.4| 7|
| 14.5| 15|
| 14.7| 6|
| 14.9| 11|
| 15| 52|
| 15.2| 6|
| 15.3| 9|
| 15.4| 12|
| 15.5| 21|
| 15.6| 11|
| 15.7| 14|
| 15.8| 18|
| 15.9| 18|
| 16| 44|
| 16.1| 30|
| 16.2| 26|
| 16.3| 29|
| 16.4| 26|
| 16.5| 32|
| 16.6| 42|
| 16.7| 44|
| 16.72| 6|
| 16.8| 40|
| 16.9| 54|
| 17| 58|
| 17.1| 48|
| 17.2| 51|
| 17.3| 47|
| 17.4| 57|
| 17.5| 51|
| 17.6| 51|
| 17.7| 46|
| 17.8| 33|
| 17.9| 38|---------again
|1732.04| 6|
| 18| 49|
| 18.1| 21|
| 18.2| 23|
| 18.3| 29|
| 18.4| 22|
| 18.5| 22|
| 18.6| 17|
| 18.7| 13|
| 18.8| 13|
| 18.9| 19|
| 19| 36|
| 19.1| 15|
| 19.2| 13|
| 19.3| 12|
| 19.4| 15|
| 19.5| 15|
| 19.6| 15|
| 19.7| 15|
| 19.8| 14|
| 19.9| 9|
| 2| 198|------------see after 19 again 2 came
| 2.04| 7|
| 2.09| 8|
| 2.1| 47|
| 2.16| 6|
| 2.17| 8|
| 2.2| 55|
| 2.24| 6|
| 2.26| 7|
| 2.27| 6|
| 2.29| 8|
| 2.3| 53|
| 2.4| 33|
| 2.5| 36|
| 2.54| 6|
| 2.59| 6|
Can you tell me what I'm doing wrong?
My dataframe has the columns:
"Country_Code", "Country","Data_Source","Data_File","Category","Metric","Time","Data_Cut1","Option1_Dummy","Option1_Visible","Value"
As we talked about in the comments, it seems your Value column is of type String. You can cast it to Double (for instance) to order it numerically.
These lines will cast the Value column to DoubleType:
import org.apache.spark.sql.types._
df.withColumn("Value", $"Value".cast(DoubleType))
EXAMPLE INPUT
df.show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
With Value as Strings
df.orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 10.0| b|
| 2| a|
+-----+-------+
Casting Value as Double
df.withColumn("Value", $"Value".cast(DoubleType)).orderBy($"Value").show
+-----+-------+
|Value|another|
+-----+-------+
| 2.0| a|
| 10.0| b|
+-----+-------+
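Putting the cast together with the original query, something like the following should then order the values numerically (a sketch that reuses the question's filter and orderBy):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

df.withColumn("Value", col("Value").cast(DoubleType))
  .groupBy("Value").count()
  .filter("count > 5.0")
  .orderBy("Value")
  .show(1000)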